Multivariate incremental effects for continuous treatments: Studying the health effects of environmental mixtures

Zhuochao Huang (Department of Statistics, University of Florida; corresponding author, zhuochao.huang@ufl.edu), Kejin Dong (Department of Statistics, University of Florida), Tuo Lin (Department of Biostatistics, University of Florida), Joseph Antonelli (Department of Statistics, University of Florida)

Abstract

Evaluating the causal health effects of multivariate, continuous exposures, such as air pollution mixtures, is a critical public health challenge. A primary obstacle is the frequent violation of the positivity assumption, which renders the effects of standard deterministic interventions unidentified or heavily reliant on unreliable model extrapolation. In this paper, we develop a novel causal inference framework to address this challenge. We extend exponential tilting to multivariate exposures and address the critical question of how to compare different intervention directions fairly. This establishes a systematic framework for defining and evaluating various policy-relevant causal estimands, allowing researchers to address diverse scientific questions. Our methodological contributions include efficient one-step estimation strategies, a Riemannian BFGS algorithm to solve a constrained manifold optimization problem, semiparametric efficiency bounds for causal estimands, minimax rates for estimators, and asymptotic normality results. We demonstrate our framework’s utility by applying it to a nationwide environmental health dataset to identify the optimal strategy for reducing adverse health outcomes associated with a PM$_{2.5}$ chemical mixture.

1 Introduction

Understanding the causal effects of complex air exposure mixtures is crucial for effective public health policy, but traditional statistical methods face significant challenges. A key challenge is that individuals are exposed to a complex mixture of correlated substances. Single-pollutant analyses often fall short, as they fail to account for the confounding effect of pollutants and cannot capture potential interaction effects, leading to misleading policy recommendations (Bobb et al., 2015; Antonelli and Zigler, 2024). This necessitates a causal inference framework for multivariate continuous exposures. A primary obstacle in evaluating the causal effects of multivariate continuous exposures is the positivity assumption. This assumption requires that for any given set of covariates, the exposure levels of interest have nonzero density. In environmental mixtures, this assumption is frequently violated; certain combinations of pollutants may be physically implausible or absent from observational data (Peters et al., 2012). Consequently, estimating the effect of deterministic interventions, such as setting a pollutant mixture to a specific counterfactual level, becomes impossible without resorting to unreliable extrapolation (Antonelli and Zigler, 2024; Rudolph et al., 2025). In the presence of positivity violations, there are (at least) two potential solutions. The less frequently used approach is to modify the estimation method to be robust to extrapolation, such as “extrapolation-aware” inferential methods that explicitly bound or characterize the uncertainty when moving outside the data support (Pfister and Bühlmann, 2021). The more common approach, and the one we focus on throughout, is to modify the scientific question by redefining the estimand.

One useful strategy for modifying the estimand with continuous treatments involves identifying and estimating the derivative of the exposure-response function, then integrating this derivative to recover the full curve. By focusing on local effects first and developing novel bias-corrected and Neyman-orthogonal estimators for the derivative function, this “differentiate-then-integrate” approach can circumvent the need for a global positivity assumption and has been a subject of significant recent research (Zhang et al., 2023; Kallus and Oprescu, 2022; Rothenhäusler and Yu, 2021). Another alternative is to leverage instrumental variables, where recent advances have extended these methods to continuous treatments without relying on positivity to identify local effects (Dorn and Guo, 2024; Bruns and Kallus, 2024).

While these methods are powerful, a distinct set of approaches that are flexible and policy-relevant rely on the use of stochastic interventions. These interventions generally define a policy that modifies the observed distribution of exposures rather than replacing it with a fixed value. Motivated by the goal of defining more realistic policies, early applications appeared in epidemiology, for example in interventions that truncate the exposure distribution, such as enforcing pollution levels below a certain cutoff (Taubman et al., 2009). This idea was later formalized in the statistical literature to provide a general framework for evaluating the population causal effect of shifting an entire exposure distribution (Díaz and van der Laan, 2012). Over the past decade, several distinct stochastic interventions have been developed. One approach is the shift intervention, which defines the post-intervention exposure as the natural exposure value shifted by a certain amount (Haneuse and Rotnitzky, 2013). A second, influential family of interventions is based on the natural value of treatment. These policies, often called Modified Treatment Policies (MTPs), define the counterfactual exposure as a function of the exposure that would have naturally occurred. This concept was pioneered in the context of dynamic treatment regimes (Robins et al., 2004) and later formalized for single time points (Young et al., 2011; Richardson and Robins, 2013). The framework has since been extended to handle longitudinal scenarios with continuous treatments (Díaz et al., 2023). There is ongoing research exploring the properties of generalized policies that depend on an individual’s natural value of treatment, for example using optimal transport to derive tighter bounds under unmeasured confounding (Kallus and Mbougwe, 2024). Generally speaking, this class of estimands relies on weaker positivity conditions than most deterministic estimands; however, it still relies on positivity holding for certain exposure values.

When positivity violations are a major concern and are likely to occur, a potentially more robust alternative that inherently respects the data’s support, and thereby does not rely on positivity, is the incremental intervention. This approach was initially developed for binary and longitudinal treatments through incremental shifts in propensity scores, and defines a new interventional distribution by tilting the exposure distribution with an exponential function in a specified direction (Kennedy, 2019). This concept has subsequently been generalized to univariate continuous treatments as well (Díaz and Hejazi, 2020; Schindl et al., 2024). Some recent work has criticized exponentially tilted estimands such as these (Schindl and Wasserman, 2025) due to their less intuitive parameterization, asymmetric reallocation of probability mass, and less favorable asymptotic properties compared to certain alternatives. Regardless of their form, stochastic interventions typically present their own set of challenges. One key concern is that estimands identified under positivity violations may correspond to interventions that are not directly implementable, revealing a fundamental “interpretability-implementability tradeoff” (Rudolph et al., 2024). Furthermore, when comparing the effects of different stochastic interventions, one must ensure the comparisons are fair, a concept that related work in different contexts has begun to formalize to prevent misleading conclusions (McClean et al., 2024).

Nearly all work to date has focused on univariate treatments, which is insufficient for addressing problems raised in the analysis of environmental mixtures. In this work, we build on the existing literature by using exponentially tilted estimands within the multivariate treatment context. This extension to the multivariate setting is non-trivial and introduces several new, complex questions. First, the intervention parameter becomes a vector, creating an infinite space of possible intervention directions. This raises a variety of policy questions about how to shift all exposures at once and which direction is best. Second, a fair basis for comparing these different directional shifts is needed; interventions must be constrained to be of the same size for meaningful comparisons, where the definition of fair is not unique. We propose solutions to each of these issues and explore a variety of estimands within this framework that target different policy-relevant questions of interest in environmental epidemiology. We show that these different estimands vary in terms of how efficiently they can be estimated from the data, and we provide algorithms for finding optimal policies in terms of shifts in exposures that lead to the biggest reduction in adverse health outcomes. We provide theoretical support in terms of minimax rates that show how well the estimands can be estimated and how the difficulty of estimation intrinsically depends on the conditional covariance matrix of the exposures. We also develop efficient influence function based estimators that can achieve root-$n$ convergence and asymptotic normality using complex machine learning estimators for nuisance function estimation.

2 Exposure Shifts under Exponential Tilts

2.1 Notation and Potential Outcomes under SUTVA

Let our observed data consist of $n$ independent and identically distributed samples $\{\boldsymbol{Z}_{i}\}_{i=1}^{n}$, drawn from some underlying distribution $P_{0}$. Each observation $\boldsymbol{Z}_{i}=(\boldsymbol{X}_{i},\boldsymbol{W}_{i},Y_{i})$ is composed of a $p$-dimensional vector of covariates $\boldsymbol{X}_{i}\in\mathcal{X}\subseteq\mathbb{R}^{p}$, a $q$-dimensional vector representing the environmental exposures or treatments¹ $\boldsymbol{W}_{i}=(W_{i1},\dots,W_{iq})\in\mathcal{W}\subseteq\mathbb{R}^{q}$, and a scalar outcome of interest $Y_{i}\in\mathbb{R}$. We denote the conditional density of the exposure mixture given the covariates as $f(\boldsymbol{w}\mid\boldsymbol{x})$.

¹ Note that we use the terms treatment and exposure interchangeably throughout the manuscript. Also note that environmental mixture simply refers to a vector of environmental exposures.

To formally define causal effects, we operate within the potential outcomes framework. For each individual $i$ and any potential exposure vector $\boldsymbol{w}\in\mathcal{W}$, we let $Y_{i}(\boldsymbol{w})$ denote the potential outcome that would have been observed had individual $i$ received exposure level $\boldsymbol{w}$. This notation relies on the Stable Unit Treatment Value Assumption (SUTVA; Rubin, 1980), which comprises two key principles:

1. No interference: The potential outcome for one individual is unaffected by the exposure assignments of other individuals. That is, $Y_{i}(\boldsymbol{w})$ depends only on the exposure $\boldsymbol{w}$ assigned to individual $i$.
2. No multiple versions of treatment: Treatment is well-defined in the sense that there are not two distinct treatments that lead to the same value of $\boldsymbol{W}_{i}$.

Importantly, this assumption ensures that an individual’s observed outcome corresponds to their potential outcome under their observed exposure. Formally, if an individual $i$ is observed to have exposure $\boldsymbol{W}_{i}$ , their observed outcome is $Y_{i}=Y_{i}(\boldsymbol{W}_{i})$ .

2.2 Estimands using Exponential Tilts

In this work, we extend the framework of exponential tilting incremental causal effects to the multivariate treatment setting. This type of stochastic treatment was first proposed for binary treatments (Kennedy, 2019) and later generalized to single continuous treatments (Díaz and Hejazi, 2020; Schindl et al., 2024). We adapt this formulation to define interventions on the entire $q$ -dimensional exposure vector, $\boldsymbol{W}$ . Given the conditional density of the exposure mixture, $f(\boldsymbol{w}\mid\boldsymbol{x})$ , we define an exponentially tilted interventional density, $g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})$ , indexed by a user-specified vector $\boldsymbol{\delta}\in\mathbb{R}^{q}$ :

g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})=\frac{\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})f(\boldsymbol{w}\mid\boldsymbol{x})}{\int_{\mathcal{W}}\exp(\boldsymbol{\delta}^{\top}\boldsymbol{v})f(\boldsymbol{v}\mid\boldsymbol{x})\,d\boldsymbol{v}} \tag{1}

Here, the denominator is a normalizing constant that ensures $g_{\boldsymbol{\delta}}$ integrates to one, and $\boldsymbol{\delta}$ is a vector determining both the direction and magnitude of how the natural density is shifted.

Our causal estimand of interest, the incremental effect, is the expected potential outcome under the stochastic intervention defined by the tilted density $g_{\boldsymbol{\delta}}$ . We denote this estimand as $\psi(\boldsymbol{\delta})$ :

\psi(\boldsymbol{\delta})=\mathbb{E}[Y^{g_{\boldsymbol{\delta}}}]

This represents the population average outcome if, for covariates $\boldsymbol{X}=\boldsymbol{x}$ , each individual’s exposure were a random draw from the shifted distribution $g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})$ . In order to identify this quantity from the observed data, we must make a standard no unmeasured confounding assumption: for all $\boldsymbol{w}\in\mathcal{W}$ ,

Y(\boldsymbol{w})\perp\boldsymbol{W}\mid\boldsymbol{X}.

We develop sensitivity analysis approaches to assess violations of this assumption in Section 5. Under SUTVA and the no unmeasured confounding assumption, this causal quantity is identified from the observed data distribution via:

\psi(\boldsymbol{\delta})=\int_{\mathcal{X}}\int_{\mathcal{W}}\mathbb{E}[Y\mid\boldsymbol{W}=\boldsymbol{w},\boldsymbol{X}=\boldsymbol{x}]\,g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}\,dP(\boldsymbol{x}) \tag{2}

where $P(\boldsymbol{x})$ denotes the marginal probability measure of the covariates $\boldsymbol{X}$ . Note that because the support of $g_{\boldsymbol{\delta}}$ is identical to the support of $f$ , we do not have to invoke any positivity assumptions and the intervention does not require us to estimate outcomes for exposure combinations that are never observed in the data.
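The identification formula can equivalently be written as a density-ratio-weighted mean of the observed outcome, which is the form in which the ratio $g_{\boldsymbol{\delta}}/f$ enters the efficiency discussion of Section 2.3. The short derivation below is a sketch that uses only the definitions above and iterated expectations, writing $\mu(\boldsymbol{x},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}]$:

```latex
\begin{aligned}
\psi(\boldsymbol{\delta})
&=\int_{\mathcal{X}}\int_{\mathcal{W}}\mu(\boldsymbol{x},\boldsymbol{w})\,
  \frac{g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})}\,
  f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}\,dP(\boldsymbol{x})\\
&=\mathbb{E}\left[\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}\,\mu(\boldsymbol{X},\boldsymbol{W})\right]
=\mathbb{E}\left[\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}\,Y\right].
\end{aligned}
```

The ratio is well defined wherever $f(\boldsymbol{w}\mid\boldsymbol{x})>0$, which is exactly the region where $g_{\boldsymbol{\delta}}$ places mass.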

The parameter $\boldsymbol{\delta}$ has a similar interpretation as the constant gradient of the log-likelihood ratio between the interventional and observational densities as in (Schindl et al., 2024):

\boldsymbol{\delta}=\frac{\partial}{\partial\boldsymbol{w}}\log\left(\frac{g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})}\right).

This means that each component, $\delta_{j}$ , quantifies the change in this log density ratio for an infinitesimal increase in the $j$ -th exposure component, $w_{j}$ . Intuitively, setting $\boldsymbol{\delta}=(1,0,\dots,0)^{\top}$ defines an intervention that tilts the distribution to favor higher values of the first exposure, $W_{1}$ .
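To see why the gradient is constant, it may help to write out the log density ratio implied by the definition of $g_{\boldsymbol{\delta}}$; the normalizing constant depends on $\boldsymbol{x}$ and $\boldsymbol{\delta}$ but not on $\boldsymbol{w}$:

```latex
\begin{aligned}
\log\frac{g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})}
&=\boldsymbol{\delta}^{\top}\boldsymbol{w}
 -\log\int_{\mathcal{W}}\exp(\boldsymbol{\delta}^{\top}\boldsymbol{v})f(\boldsymbol{v}\mid\boldsymbol{x})\,d\boldsymbol{v},\\
\frac{\partial}{\partial\boldsymbol{w}}\log\frac{g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})}
&=\boldsymbol{\delta}.
\end{aligned}
```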

2.3 Different Exposure Shifts and Efficiency

Let $\mu(\boldsymbol{x},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}]$ . Following the same derivation as (Schindl et al., 2024) but generalized to the multivariate setting, we get the following formula for the efficient influence function.

Proposition 1.

The efficient influence function of $\psi(\boldsymbol{\delta})$ under a nonparametric model is given by $\varphi(\boldsymbol{Z};\boldsymbol{\delta})=D_{Y}+D_{g,\mu}+D_{\psi}$ for

	$\displaystyle D_{Y}=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}\left(Y-\mu(\boldsymbol{X},\boldsymbol{W})\right)$
	$\displaystyle D_{g,\mu}=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}\left(\mu(\boldsymbol{X},\boldsymbol{W})-\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\right)$
	$\displaystyle D_{\psi}=\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]-\psi$

Note that the subscript $g_{\boldsymbol{\delta}}$ denotes expectations with respect to the tilted exposure density. Because the model is nonparametric, the influence function is unique, so the asymptotic variance of any regular and asymptotically linear (RAL) estimator of $\psi(\boldsymbol{\delta})$ is equal to the variance of the EIF, $\mathrm{Var}(\varphi(\boldsymbol{Z};\boldsymbol{\delta}))$. A critical observation from the structure of $\varphi$ is the repeated appearance of the density ratio, $g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})/f(\boldsymbol{W}\mid\boldsymbol{X})$. When this ratio exhibits high variability, it inflates the variance of the first two components of the EIF, leading to less precise estimates of the causal effect. Therefore, to improve statistical efficiency for a given intervention strength, we could select an intervention direction $\boldsymbol{\delta}$ that minimizes the variance of this density ratio, $\mathrm{Var}(g_{\boldsymbol{\delta}}/f)$. The third term, on the other hand, represents the deviation between the average post-intervention outcome for a specific covariate group and the overall average post-intervention outcome. It quantifies how much higher or lower the expected post-intervention outcome for an individual with covariates $\boldsymbol{X}$ is compared to the overall population average.
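The three pieces of the EIF can be evaluated numerically to see how the density ratio enters. The sketch below is illustrative only: it assumes a linear outcome model and Gaussian exposures with known (oracle) nuisance functions, so that the density ratio and $\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]$ have closed forms. The data-generating values and the helper name `uncentered_eif` are hypothetical, and in practice the nuisance functions would be estimated.

```python
import numpy as np

def uncentered_eif(Y, ratio, mu_xw, m_g):
    """D_Y + D_{g,mu} + E_g[mu | X]; subtracting psi recovers the EIF."""
    return ratio * (Y - mu_xw) + ratio * (mu_xw - m_g) + m_g

rng = np.random.default_rng(2)
n = 100_000
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
beta = np.array([0.5, 1.5])
delta = np.array([0.2, -0.1])

X = rng.normal(size=n)                        # scalar covariate (assumed)
mu_x = np.column_stack([0.3 * X, -0.2 * X])   # E[W | X], hypothetical
W = mu_x + rng.multivariate_normal(np.zeros(2), Sigma, size=n)
mu_xw = X + W @ beta                          # oracle outcome regression
Y = mu_xw + rng.normal(size=n)

# Gaussian density ratio g_delta / f; the normalizer is the Gaussian MGF.
ratio = np.exp((W - mu_x) @ delta - 0.5 * delta @ Sigma @ delta)
# E_g[mu | X]: the tilted conditional mean of W is mu_x + Sigma delta.
m_g = X + (mu_x + Sigma @ delta) @ beta

values = uncentered_eif(Y, ratio, mu_xw, m_g)
psi_hat = values.mean()                       # averages the uncentered EIF
se_hat = values.std(ddof=1) / np.sqrt(n)      # EIF-based standard error
psi_true = beta @ Sigma @ delta               # since E[X] = 0 in this design
```

In this design the sample mean of the uncentered EIF should be close to `psi_true`, and `se_hat` grows as the density ratio becomes more variable.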

To gain analytical insight into this variance, we can let the exposures follow a multivariate normal distribution, $f(\boldsymbol{w}\mid\boldsymbol{x})\sim\mathcal{N}(\boldsymbol{\mu_{x}},\boldsymbol{\Sigma})$ . Under this assumption, we have two key results. First, the tilted distribution $g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})$ is also normal, with a shifted mean:

g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})\sim\mathcal{N}(\boldsymbol{\mu_{x}}+\boldsymbol{\Sigma}\boldsymbol{\delta},\boldsymbol{\Sigma})

This provides a clear interpretation of the intervention: it is a shift of the mean of the exposure distribution in the direction $\boldsymbol{\Sigma}\boldsymbol{\delta}$ . Second, the variance of the density ratio has a simple, closed-form expression:

\mathrm{Var}\left(\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}\right)=\exp(\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta})-1

Minimizing this variance is equivalent to minimizing the quadratic form $\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}$ .
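Both Gaussian results can be checked by simulation. The following sketch assumes a fixed mean vector and covariance (the numerical values are arbitrary), computes the density ratio in closed form, and compares the Monte Carlo variance and reweighted mean with the expressions above.

```python
import numpy as np

def density_ratio_variance(delta, Sigma):
    """Closed-form Var(g_delta / f) under a Gaussian exposure model."""
    return np.exp(delta @ Sigma @ delta) - 1.0

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
mu = np.array([2.0, -1.0])
delta = np.array([0.4, 0.3])

W = rng.multivariate_normal(mu, Sigma, size=400_000)
# g_delta / f = exp(delta'(w - mu) - delta' Sigma delta / 2) for Gaussian f.
ratio = np.exp((W - mu) @ delta - 0.5 * delta @ Sigma @ delta)

var_mc = ratio.var()
var_cf = density_ratio_variance(delta, Sigma)
# Reweighting draws from f by the ratio mimics draws from g_delta.
mean_shift_mc = np.average(W, axis=0, weights=ratio) - mu
mean_shift_cf = Sigma @ delta
```

Up to Monte Carlo error, `var_mc` should match `var_cf` and `mean_shift_mc` should match `Sigma @ delta`.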

The question of interest thus becomes an optimization problem: which direction $\boldsymbol{\delta}$ minimizes $\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}$ subject to a constraint on the “strength” of the intervention? While various constraints are possible (e.g., fixing $\boldsymbol{\delta}^{\top}\boldsymbol{\delta}$ ), a particularly meaningful constraint is to fix the distance between the original and tilted distributions, for which the 2-Wasserstein distance provides a natural metric. We discuss the choice of this metric in the following section. For the multivariate normal case with the same covariance matrix, the squared Wasserstein distance has a simple closed-form expression as $d_{W}^{2}(f,g_{\boldsymbol{\delta}})=||(\boldsymbol{\mu_{x}}+\boldsymbol{\Sigma}\boldsymbol{\delta})-\boldsymbol{\mu_{x}}||^{2}=||\boldsymbol{\Sigma}\boldsymbol{\delta}||^{2}=\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}^{2}\boldsymbol{\delta}$ .

If our goal then is to find a shift that we can estimate with a high degree of efficiency, we can solve the following:

\min_{\boldsymbol{\delta}}\quad\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}\quad\text{subject to}\quad\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}^{2}\boldsymbol{\delta}=c^{2}

for some constant $c$ defining the size of intervention. It is straightforward to show that this expression is minimized when $\boldsymbol{\delta}$ is chosen to be proportional to the eigenvector of $\boldsymbol{\Sigma}$ corresponding to its largest eigenvalue, $\lambda_{\max}$ . Hence our finding demonstrates that for a fixed intervention strength (as measured by the Wasserstein distance), the most statistically efficient causal effect to estimate, at least in terms of this density ratio, is the one that corresponds to shifting the exposure mixture along its primary axis of variation. Note that this result only holds exactly under a multivariate normal distribution, but we have seen empirically that this choice of $\boldsymbol{\delta}$ typically leads to efficient estimates of $\psi(\boldsymbol{\delta})$ under a variety of exposure distributions.
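The eigenvector solution can be verified numerically. In the sketch below, which assumes an arbitrary positive definite $\boldsymbol{\Sigma}$, the scaling $\boldsymbol{\delta}=(c/\lambda_{\max})\boldsymbol{v}_{\max}$ satisfies the constraint and attains the objective value $c^{2}/\lambda_{\max}$, and randomly drawn feasible directions never do better. The function name `most_efficient_tilt` is hypothetical.

```python
import numpy as np

def most_efficient_tilt(Sigma, c):
    """Minimize delta' Sigma delta subject to delta' Sigma^2 delta = c^2."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    lam_max, v_max = eigvals[-1], eigvecs[:, -1]
    return (c / lam_max) * v_max, lam_max

Sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
c = 0.5
delta_eff, lam_max = most_efficient_tilt(Sigma, c)
constraint = delta_eff @ Sigma @ Sigma @ delta_eff
objective = delta_eff @ Sigma @ delta_eff

# Random feasible competitors: pick mean shifts u with ||u|| = c, set delta = Sigma^{-1} u.
rng = np.random.default_rng(3)
U = rng.normal(size=(1000, 3))
U = c * U / np.linalg.norm(U, axis=1, keepdims=True)
random_deltas = np.linalg.solve(Sigma, U.T).T
random_objectives = np.einsum("ij,jk,ik->i", random_deltas, Sigma, random_deltas)
```

Writing $\boldsymbol{u}=\boldsymbol{\Sigma}\boldsymbol{\delta}$ turns the problem into minimizing $\boldsymbol{u}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{u}$ over $\|\boldsymbol{u}\|=c$, which is why the largest eigenvalue of $\boldsymbol{\Sigma}$ appears.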

2.4 Fair Exposure Shifts under Fixed Gelbrich Distance

A primary motivation for developing our framework is to answer policy-relevant questions, such as identifying the optimal way to modify an environmental exposure mixture to achieve the greatest public health benefit. This naturally leads to an optimization problem: finding the intervention vector $\boldsymbol{\delta}$ that maximizes (or minimizes) the causal estimand $\psi(\boldsymbol{\delta})$. However, a naive comparison across all possible $\boldsymbol{\delta}$ vectors is misleading. For any intervention direction that yields a beneficial effect, one could simply increase the intervention’s strength, for example, by scaling the magnitude of $\boldsymbol{\delta}$ to produce an arbitrarily larger (or smaller) value of $\psi(\boldsymbol{\delta})$. This would invariably lead to trivial solutions that correspond to extreme, unrealistic shifts in the exposure distribution. A more relevant policy question is “for a fixed amount of interventional effort, what is the best direction to apply that effort?” To formalize this, we must first establish a fair basis for comparison, constraining our search to a set of interventions that are of the same size, where size captures the actual movement of the exposure values, not just the magnitude of the parameter $\boldsymbol{\delta}$. Note that the notion of fairness here is with respect to the size of the intervention being applied, which differs from typical notions of fairness in causal inference or related fields, such as in recent work that defines a fairness criterion based on whether an estimand preserves the ordinal ranking of effects across all covariate subgroups (McClean et al., 2024).

To establish our notion of fairness between interventions, we can use the 2-Wasserstein distance, $d_{W}(f,g_{\boldsymbol{\delta}})$, which is widely studied and serves this purpose (Panaretos and Zemel, 2019). Intuitively, the Wasserstein distance measures the minimum cost of transporting the probability mass of one distribution to match another, akin to the cost of moving a pile of dirt. By fixing the Wasserstein distance, we ensure that the total amount of shift between the old and new distributions is fixed. The optimization problem thus becomes a meaningful search for the $\boldsymbol{\delta}$ that minimizes $\psi(\boldsymbol{\delta})$ among all possible intervention directions of a comparable magnitude. This distributional fairness notion is widely used in both the operations research and statistics literatures (Mohajerin Esfahani and Kuhn, 2018; Blanchet and Murthy, 2019; Duchi and Namkoong, 2021; Gao and Kleywegt, 2023).

For analytical tractability, we rely on a well-established result providing a formula for the squared 2-Wasserstein distance $d_{W}^{2}$ based on the means ( $\boldsymbol{\mu}_{1},\boldsymbol{\mu}_{2}$ ) and covariance matrices ( $\boldsymbol{\Sigma}_{1},\boldsymbol{\Sigma}_{2}$ ) of two distributions $(P_{1},P_{2})$ , which is referred to as the Gelbrich formula (Gelbrich, 1990). Crucially, this formula serves as a general lower bound for the true squared 2-Wasserstein distance between any two probability measures on $\mathbb{R}^{q}$ with finite second moments. Specifically, we have that

d_{W}^{2}\left(P_{1},P_{2}\right)\geq\left\|\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right\|^{2}+\operatorname{tr}\left(\boldsymbol{\Sigma}_{1}+\boldsymbol{\Sigma}_{2}-2\left(\boldsymbol{\Sigma}_{1}^{1/2}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}^{1/2}\right)^{1/2}\right):=d_{G}^{2}\left(P_{1},P_{2}\right)

Furthermore, this bound becomes an exact equality for any two distributions belonging to the same family of elliptically contoured distributions, a class which notably includes all multivariate normal distributions as well as uniform distributions on ellipsoids. The tractability of the Gelbrich formula has invited various researchers to use it as a surrogate for the 2-Wasserstein distance, a method frequently employed to overcome the computational complexity associated with the Wasserstein metric (Kuhn et al., 2019; Hakobyan and Yang, 2024), since the empirical 2-Wasserstein distance lacks a closed-form formula from data and can only be approached by numerical methods (Panaretos and Zemel, 2020). Moreover, the lower bound it provides has been shown to be tight in a fairly general situation against an upper bound derived for the 2-Wasserstein distance (Biswas and Mackey, 2024; Papp and Sherlock, 2024), and is therefore generally close to the 2-Wasserstein distance (Nguyen et al., 2021; Ye et al., 2024). Therefore, we believe our use of the Gelbrich formula as an approximation formula for the true 2-Wasserstein distance is reasonable.

This provides a computationally tractable and well-justified measure of intervention size. Accordingly, we measure intervention size through the Gelbrich formula applied to the marginal baseline and tilted laws of $\boldsymbol{W}$ , and write the resulting quantity as $G(\boldsymbol{\delta})$ . Comparing interventions on the level set $G(\boldsymbol{\delta})=c^{2}$ puts them on a common scale. Under a multivariate normal model this coincides with the squared 2-Wasserstein distance, while in more general settings it remains a rigorous lower bound. The optimization problem thus becomes a search for the $\boldsymbol{\delta}$ that minimizes $\psi(\boldsymbol{\delta})$ among all possible intervention directions of a comparable size, allowing us to disentangle the direction of an intervention from its size and enabling a principled exploration of which changes to an environmental exposure mixture are most beneficial or harmful.
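The Gelbrich formula needs only means, covariances, and a symmetric matrix square root. The sketch below is one possible implementation using an eigendecomposition for the square root; the helper names `psd_sqrt` and `gelbrich_sq` are hypothetical, and the inputs are assumed to be symmetric positive semidefinite.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric PSD square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh((M + M.T) / 2.0)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gelbrich_sq(mu1, Sigma1, mu2, Sigma2):
    """Squared Gelbrich distance d_G^2 between two (mean, covariance) pairs."""
    root1 = psd_sqrt(Sigma1)
    cross = psd_sqrt(root1 @ Sigma2 @ root1)
    mean_part = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
    cov_part = np.trace(Sigma1 + Sigma2 - 2.0 * cross)
    return mean_part + cov_part

# Gaussian tilt: same covariance, mean shifted by Sigma delta.
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
mu = np.array([0.0, 1.0])
delta = np.array([0.3, -0.2])
g_tilt = gelbrich_sq(mu, Sigma, mu + Sigma @ delta, Sigma)

# One-dimensional case: (mu1 - mu2)^2 + (sigma1 - sigma2)^2.
g_1d = gelbrich_sq(np.array([0.0]), np.array([[4.0]]),
                   np.array([1.0]), np.array([[9.0]]))
```

For the Gaussian tilt the covariance term vanishes, so `g_tilt` reduces to $\|\boldsymbol{\Sigma}\boldsymbol{\delta}\|^{2}$, matching the closed form used in Section 2.3.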

2.5 Choice of Estimand

Having established a principled method for comparing interventions of equivalent magnitude, we are now positioned to define our estimands of interest. The framework of incremental effects, when extended to a multivariate setting, moves beyond simply estimating the effect of a single, pre-specified shift or simply examining how the causal effect depends on shift size. Instead, it allows us to explore the entire space of potential interventions to identify those that are most impactful.

2.5.1 Optimal Shifts

In environmental contexts, a key objective is to determine the most effective strategy for intervention given limited resources. For instance, how should regulators modify a complex mixture of air pollutants to achieve the greatest improvement in health outcomes? Our framework directly addresses this question by defining an estimand for the optimal policy shift. We define our primary estimand of interest, $\boldsymbol{\delta}^{*}_{c}$ , as the intervention direction that minimizes the causal effect $\psi(\boldsymbol{\delta})$ over the set of all “fair shifts” of a given size $c$ :

\boldsymbol{\delta}^{*}_{c}=\arg\min_{\boldsymbol{\delta}\in\mathcal{A}_{c}}\psi(\boldsymbol{\delta})

where

\mathcal{A}_{c}=\{\boldsymbol{\delta}:G(\boldsymbol{\delta})=c^{2}\}.

The corresponding value of this optimal policy is $\psi^{*}_{c}=\psi(\boldsymbol{\delta}^{*}_{c})$ . To provide some intuition for the optimal $\boldsymbol{\delta}$ , we can explore it analytically in a simplified setting. If we assume the outcome model is linear ( $\mu(\boldsymbol{x},\boldsymbol{w})=\boldsymbol{\alpha}^{\top}\boldsymbol{x}+\boldsymbol{\beta}^{\top}\boldsymbol{w}$ ) and the exposure distribution is normal ( $f\sim\mathcal{N}(\boldsymbol{\mu_{x}},\boldsymbol{\Sigma})$ ), the causal effect is minimized when $\boldsymbol{\delta}$ is proportional to $-\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}$ . This result provides some intuition: the optimal direction is a balance between the direct effect of each pollutant on the outcome (the vector $\boldsymbol{\beta}$ ) and the correlation structure of the mixture (represented by $\boldsymbol{\Sigma}^{-1}$ ). This highlights a key insight of our multivariate approach: the covariance of the exposures is critical for determining not only which shifts are easiest to estimate, but also which shifts are most impactful. While the optimal policy shift is one primary goal, our framework is flexible and allows for the definition of other estimands that can be useful in answering other scientific questions of interest in environmental epidemiology. We now detail these in the following sections.
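The linear-Gaussian intuition can be checked directly. Under those simplifying assumptions, $\psi(\boldsymbol{\delta})-\psi(\boldsymbol{0})=\boldsymbol{\beta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}$ and the size constraint is $\|\boldsymbol{\Sigma}\boldsymbol{\delta}\|=c$, so the minimizer has mean shift $-c\,\boldsymbol{\beta}/\|\boldsymbol{\beta}\|$. The sketch below (with arbitrary illustrative values and a hypothetical function name) compares it with a grid of other feasible shifts.

```python
import numpy as np

def optimal_tilt_linear_gaussian(Sigma, beta, c):
    """delta minimizing beta' Sigma delta subject to ||Sigma delta|| = c."""
    mean_shift = -c * beta / np.linalg.norm(beta)
    return np.linalg.solve(Sigma, mean_shift)   # proportional to -Sigma^{-1} beta

Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
beta = np.array([0.5, 1.5])
c = 0.5

delta_star = optimal_tilt_linear_gaussian(Sigma, beta, c)
effect_star = beta @ Sigma @ delta_star          # psi(delta*) - psi(0)

# Grid of feasible mean shifts u = Sigma delta on the circle ||u|| = c.
angles = np.linspace(0.0, 2.0 * np.pi, 720, endpoint=False)
shifts = c * np.column_stack([np.cos(angles), np.sin(angles)])
grid_effects = shifts @ beta                     # beta' Sigma delta = beta' u
```

No point on the grid should yield a smaller value than `effect_star`, which equals $-c\|\boldsymbol{\beta}\|$ in this setting.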

2.5.2 Single Exposure Shifts

A common goal in the analysis of environmental mixtures is to identify the components of the mixture that are most harmful. This has led to a wide range of statistical approaches aimed at performing exposure selection (Bobb et al., 2015; Antonelli et al., 2020; Ferrari and Dunson, 2020; Wei et al., 2020; Samanta and Antonelli, 2022). These approaches have inherently disregarded the potential impacts of positivity violations when examining which exposures are most harmful, but we can adapt our approach here to target similar questions without the need for strong positivity assumptions. A natural choice is to let $\boldsymbol{\delta}$ be a vector of zeros with a single non-zero component, e.g., $\boldsymbol{\delta}_{j}=(0,\dots,t_{j},\dots,0)$. The magnitude $t_{j}$ would be chosen to satisfy the same fairness constraint, $G(\boldsymbol{\delta}_{j})=c^{2}$. While this estimand would encourage the $j^{th}$ component of the exposures to increase more than others, in the presence of correlated exposures, it will also shift the remaining exposures, and the extent to which this occurs is less clear. A distinct approach is to define a value of $\boldsymbol{\delta}$ such that the means of the exposures in the tilted distribution are the same, except for the $j^{th}$ exposure, thereby isolating the impact of that single exposure. If the exposures follow a multivariate normal distribution, this is straightforward since the mean shift is given by $\boldsymbol{\Sigma}\boldsymbol{\delta}$. If we want to ensure that only the $j^{th}$ exposure’s mean is shifted, then we could set $\boldsymbol{\delta}_{j}=\boldsymbol{\Sigma}^{-1}\boldsymbol{e}_{j}$ where $\boldsymbol{e}_{j}=(0,\dots,t_{j},\dots,0)$ and again $t_{j}$ is chosen to ensure a fair shift. Note that this analytical form only holds when the exposure follows a multivariate normal distribution, which will not be true in general. We can instead use this value of $\boldsymbol{\delta}_{j}$ as a starting point in a numerical algorithm that searches for the exponential tilt that only shifts the $j^{th}$ component.
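The difference between the two single-exposure constructions is easy to see under the Gaussian working model, where the mean shift is $\boldsymbol{\Sigma}\boldsymbol{\delta}$ and the size constraint becomes $\|\boldsymbol{\Sigma}\boldsymbol{\delta}\|=c$. The sketch below is a minimal illustration with an arbitrary covariance matrix; the function names are hypothetical, and the positive sign of the shift is an assumption made for illustration.

```python
import numpy as np

def naive_single_tilt(Sigma, j, c):
    """delta = t * e_j, with t chosen so that ||Sigma delta|| = c."""
    delta = np.zeros(Sigma.shape[0])
    delta[j] = c / np.linalg.norm(Sigma[:, j])
    return delta

def isolating_single_tilt(Sigma, j, c):
    """delta = Sigma^{-1} (c * e_j): only the j-th mean moves."""
    target = np.zeros(Sigma.shape[0])
    target[j] = c
    return np.linalg.solve(Sigma, target)

Sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
c, j = 0.5, 0

naive_shift = Sigma @ naive_single_tilt(Sigma, j, c)        # spills onto other exposures
isolated_shift = Sigma @ isolating_single_tilt(Sigma, j, c) # moves exposure j only
```

Both shifts have the same size, but `naive_shift` also moves the correlated exposures, whereas `isolated_shift` is nonzero only in coordinate `j`. Outside the Gaussian model this value would serve only as a starting point for a numerical search.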

Once these are obtained, we can calculate $\psi(\boldsymbol{\delta}_{j})$ for all $j=1,\dots,q$ to infer which exposures have the biggest effect on the outcome. Further note that this approach can be naturally extended to explore interactions or combined effects. For example, by setting two components of $\boldsymbol{\delta}$ to be non-zero, one could investigate whether simultaneously increasing two pollutants has a synergistic effect greater than the sum of their individual impacts. We point readers to recent work examining stochastic interventions to identify interactions, as similar ideas could be applied within our framework, though we focus here on single exposure effects for now (McCoy et al., 2023).

2.5.3 Efficient Exposure Shifts

While the previous two estimands are arguably the most scientifically and policy-relevant in most applications, there are other considerations at play, such as how efficiently we can estimate the chosen estimand. Environmental applications in particular are known to have relatively small effect sizes, and therefore efficiency can be particularly important when sample sizes are not exceedingly large. As we established in Section 2.3, the statistical difficulty of estimating $\psi(\boldsymbol{\delta})$ is heavily driven by the variance of the density ratio, $\mathrm{Var}(g_{\boldsymbol{\delta}}/f)$ . For a fixed intervention size, as defined by the Wasserstein distance, we showed that the most efficient intervention direction, in terms of minimizing this variance, is proportional to the first eigenvector of the exposure covariance matrix, $\boldsymbol{\Sigma}$ . We call this direction $\boldsymbol{\delta}_{\text{eff}}$ , and it is clear that this direction depends only on the correlation structure of the observed exposures. While this direction is only the most efficient one under normality of the exposures, we proceed with this choice in general, as we have found that it leads to efficient estimates in more general settings, and deriving the most efficient direction in general is a difficult task.

However, the optimal policy direction, $\boldsymbol{\delta}^{*}_{c}$ , which maximizes the causal effect $\psi(\boldsymbol{\delta})$ , depends on both the exposure distribution and the exposure-outcome relationship, $\mu(\boldsymbol{x},\boldsymbol{w})$ . In simplified linear models, the direction of steepest ascent for the causal effect is proportional to $\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}$ . In general, there is no reason for the direction of maximal statistical efficiency (related to the eigenvectors of $\boldsymbol{\Sigma}$ ) to be the same as the direction of the maximal causal effect (related to $\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}$ ). To see this, consider an intuitive example with two highly and positively correlated pollutants, where the first principal component is approximately in the $(1,1)$ direction, which corresponds to the direction that is “easiest” to estimate with the observed data. Now, suppose that only the second pollutant has a strong causal effect on the outcome (i.e., $\boldsymbol{\beta}\approx(0,\beta_{2})$ ). The optimal policy would primarily involve shifting the second pollutant. Our framework reveals an inherent tension: the policy we most want to evaluate (shifting the second pollutant alone) is statistically difficult because it moves in a direction against the data’s strong correlation structure, leading to a high-variance density ratio and thus high uncertainty in our estimate of its effect. In general, this presents a trade-off between interpretability and statistical efficiency, and users can decide based on features of their observed data which estimand to target.
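The two-pollutant example can be made concrete with a small numerical sketch (assuming NumPy). The covariance matrix and $\boldsymbol{\beta}$ below are hypothetical values chosen to mimic the setting in the text; the code compares the first eigenvector of $\boldsymbol{\Sigma}$ with the direction proportional to $\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}$.

```python
import numpy as np

Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])   # two highly, positively correlated pollutants
beta = np.array([0.0, 1.0])      # only the second pollutant affects the outcome

# First eigenvector of Sigma (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(Sigma)
delta_eff = eigvecs[:, -1]
delta_eff = delta_eff * np.sign(delta_eff[0])  # fix the sign to point along (1, 1)

# Direction proportional to Sigma^{-1} beta, scaled to unit length for comparison.
delta_opt = np.linalg.solve(Sigma, beta)
delta_opt = delta_opt / np.linalg.norm(delta_opt)

# Cosine of the angle between the two directions.
cosine = float(delta_eff @ delta_opt)
```

In this toy configuration the two directions are nearly orthogonal, which is the tension the paragraph describes: the direction that is easiest to estimate carries almost no information about the shift of primary interest.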

3 Estimation and Inference

3.1 Estimating $\psi(\boldsymbol{\delta})$

For a fixed intervention vector $\boldsymbol{\delta}$ , the target estimand is the expected outcome under the tilted exposure distribution:

\psi(\boldsymbol{\delta})=\mathbb{E}\left[\int_{\mathcal{W}}\mu(\boldsymbol{X},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right]

where the outer expectation is over the marginal distribution of covariates $\boldsymbol{X}$ . Therefore, a direct approach to estimating $\psi(\boldsymbol{\delta})$ is through a plug-in procedure. This method involves replacing each component in the expression above with a corresponding empirical estimate, which leads to the plug-in estimator:

\hat{\psi}_{\text{plugin}}(\boldsymbol{\delta})=\frac{1}{n}\sum_{i=1}^{n}\left[\int_{\mathcal{W}}\hat{\mu}(\boldsymbol{X}_{i},\boldsymbol{w})\hat{g}_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X}_{i})\,d\boldsymbol{w}\right]

However, this estimator faces certain practical and theoretical challenges. One issue is that the estimator’s performance is highly dependent on an accurate estimate of the multivariate conditional density $\hat{f}(\boldsymbol{w}|\boldsymbol{x})$ , which can be difficult to estimate for multivariate exposures, and will likely have slower convergence rates. Furthermore, this approach does not leverage the structure of the efficient influence function and, as a result, is generally not statistically efficient. These limitations motivate alternative estimators with superior statistical properties.
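To make the inner integral concrete, here is a minimal Monte Carlo sketch (assuming NumPy). It assumes the exponential-tilt form $g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})\propto\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})f(\boldsymbol{w}\mid\boldsymbol{x})$ and that draws from a fitted conditional density are available; `tilted_mean` is a hypothetical helper name, and the self-normalized weighting is one convenient way to approximate the integral, not necessarily the numerical scheme used in the paper.

```python
import numpy as np

def tilted_mean(mu_vals, w_draws, delta):
    """Approximate the integral of mu(x, w) g_delta(w | x) dw for one x.

    mu_vals : (m,)   values mu_hat(x, w_l) at the draws
    w_draws : (m, q) draws w_l from the fitted conditional density f_hat(. | x)
    """
    log_wt = w_draws @ delta
    log_wt = log_wt - log_wt.max()   # stabilize the exponentials
    wt = np.exp(log_wt)              # unnormalized tilt weights
    return float(np.sum(wt * mu_vals) / np.sum(wt))

# Toy check: W ~ N(0, I_2), mu(x, w) = w_1, so the tilted mean equals delta_1.
rng = np.random.default_rng(0)
w_draws = rng.normal(size=(200_000, 2))
value = tilted_mean(w_draws[:, 0], w_draws, np.array([0.5, 0.0]))
```

The plug-in estimator would then average such values over the observed covariates $\boldsymbol{X}_{i}$.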

3.1.1 One-step Estimation and Cross-fitting

To overcome the limitations of the plug-in estimator, we employ an approach rooted in semiparametric efficiency theory. This method uses the efficient influence function (EIF) to construct an estimator that is consistent under weaker conditions and achieves the optimal asymptotic variance. The EIF for the estimand $\psi(\boldsymbol{\delta})$ is given in Section 2 by $\varphi(\boldsymbol{Z};\boldsymbol{\delta})=D_{Y}+D_{g,\mu}+D_{\psi}$ . In our notation, substituting the EIF components and algebraically combining the first two terms yields

\varphi(\boldsymbol{Z};\psi,\mu,f)=r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\Big\{Y-\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]\Big\}+\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\psi,

where $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})/f(\boldsymbol{w}\mid\boldsymbol{x})$ and expectations with subscript $g_{\boldsymbol{\delta}}$ are taken with respect to $g_{\boldsymbol{\delta}}(\cdot\mid\boldsymbol{X})$. The one-step procedure corrects an initial plug-in estimate by adding the empirical average of the estimated EIF, which acts as a bias-correction term:

\hat{\psi}_{\text{onestep}}(\boldsymbol{\delta})=\tilde{\psi}_{\text{plugin}}(\boldsymbol{\delta})+\frac{1}{n}\sum_{i=1}^{n}\varphi\!\big(\boldsymbol{Z}_{i};\tilde{\psi}_{\text{plugin}},\hat{\mu},\hat{f}\big).

This can be re-written to show that the one-step estimator takes the following form:

\hat{\psi}_{\text{onestep}}(\boldsymbol{\delta})=\frac{1}{n}\sum_{i=1}^{n}\hat{r}_{\boldsymbol{\delta}}(\boldsymbol{W}_{i},\boldsymbol{X}_{i})\big[Y_{i}-\mathbb{E}_{\hat{g}_{\boldsymbol{\delta}}}[\hat{\mu}\mid\boldsymbol{X}_{i}]\big]+\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\hat{g}_{\boldsymbol{\delta}}}[\hat{\mu}\mid\boldsymbol{X}_{i}].

This estimator has several key features, which we describe in subsequent sections when we study its asymptotic properties. To summarize, it allows the use of flexible machine learning methods for estimation of each of the nuisance functions, and it is asymptotically efficient given its construction based on the EIF. The practical performance of the estimator depends heavily on the estimate of the conditional density of the exposures, $f$, which can be challenging to obtain with even a moderate number of exposures. This density enters both the expectations with respect to $g_{\boldsymbol{\delta}}$ and the ratio term $r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})$. We first detail how to estimate these quantities by plugging in estimates of $f$ and $\mu$; in Section 3.1.2 we propose an alternative that estimates them directly with regression techniques, which may not have the same theoretical properties but can have good finite-sample performance when there are a moderate number of exposures.
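Once the nuisance quantities have been evaluated at each observation, the one-step formula reduces to two sample averages. The sketch below (assuming NumPy) takes those evaluations as given; `one_step_estimate` and its argument names are illustrative, and how `r_hat` and `m_hat` are obtained is left to the nuisance-estimation step.

```python
import numpy as np

def one_step_estimate(y, r_hat, m_hat):
    """One-step estimate from per-observation nuisance evaluations.

    y     : (n,) outcomes Y_i
    r_hat : (n,) estimated density ratios r_delta(W_i, X_i)
    m_hat : (n,) estimated tilted means E_{g_delta}[mu | X_i]
    """
    y, r_hat, m_hat = map(np.asarray, (y, r_hat, m_hat))
    correction = np.mean(r_hat * (y - m_hat))  # weighted-residual bias correction
    plug_in = np.mean(m_hat)                   # plug-in term
    return float(correction + plug_in)

# Small worked example with made-up numbers.
estimate = one_step_estimate([1.0, 2.0], [0.5, 1.5], [1.0, 1.0])
```

When all ratios equal one (no tilt), the estimate collapses to the sample mean of the outcomes, which is a quick sanity check on any implementation.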

To ensure the desirable asymptotic properties of our final estimator for $\psi(\boldsymbol{\delta})$, the entire estimation procedure is embedded within a cross-fitting framework. Cross-fitting is a technique designed to eliminate a key source of bias that arises when the same data are used both to train nuisance parameters and to evaluate the final parameter of interest. This prevents the overfitting inherent in data-adaptive methods (Zheng and van der Laan, 2011; Chernozhukov et al., 2018) and has been shown to improve performance for related EIF-based estimators. The procedure is implemented as follows. The data are randomly partitioned into $K$ disjoint folds of approximately equal size. For each fold $k\in\{1,\dots,K\}$, we treat it as the evaluation set and use the remaining $K-1$ folds as the training set. On the training set, we fit our nuisance models, including the outcome regression $\hat{\mu}_{-k}(\boldsymbol{x},\boldsymbol{w})$ and the exposure density $\hat{f}_{-k}(\boldsymbol{w}\mid\boldsymbol{x})$. These models, trained on data not in fold $k$, are then used to compute the components of the efficient influence function only for the observations $i$ in the evaluation fold $k$. The final estimate, $\hat{\psi}(\boldsymbol{\delta})$, is then constructed by solving the estimating equation aggregated across all $K$ folds, ensuring that the nuisance function estimates used for any given observation are always independent of that observation. This approach to nuisance function estimation has theoretical advantages, as it allows for less restrictive assumptions on the nuisance functions, and important finite-sample benefits, as it tends to reduce the bias in the estimated causal effects that is induced by overfitting.
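The fold logic can be sketched as follows (assuming NumPy). Ordinary least squares is used purely as a stand-in for a flexible learner, and `cross_fit` and `ols_fit_predict` are hypothetical names; the same pattern would apply to any nuisance model that can be trained on one index set and evaluated on another.

```python
import numpy as np

def ols_fit_predict(X_train, y_train, X_eval):
    """Stand-in learner: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_eval)), X_eval]) @ coef

def cross_fit(features, target, K=5, fit_predict=ols_fit_predict, seed=0):
    """Out-of-fold predictions: fold k is predicted by a model trained on the other K-1 folds."""
    n = len(target)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)  # K disjoint folds of near-equal size
    preds = np.empty(n)
    for k in range(K):
        eval_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        preds[eval_idx] = fit_predict(features[train_idx], target[train_idx],
                                      features[eval_idx])
    return preds, folds

# Toy data: columns play the role of (X, W); the outcome is linear plus noise.
rng = np.random.default_rng(1)
n = 500
XW = rng.normal(size=(n, 3))
Y = XW @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
mu_hat_oof, folds = cross_fit(XW, Y, K=5)
```

Each entry of `mu_hat_oof` comes from a model that never saw the corresponding observation, which is the independence property the cross-fitting argument relies on.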

3.1.2 Direct estimation using regression approaches

When there are a moderate number of exposures, estimation of $\psi(\boldsymbol{\delta})$ becomes increasingly challenging, even for the efficient one-step estimators described above, due to the inherent difficulty of estimating a multivariate conditional distribution $f(\boldsymbol{w}\mid\boldsymbol{x})$. For this reason, we also explore an approach, first described in Schindl et al. (2024), that does not require estimation of the exposure density at all. This can be advantageous in finite samples, particularly when estimation of the density ratio $r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})$ is unstable. The first key insight is that the density ratio can be written as

r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})=\frac{\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})}{\int_{\mathcal{W}}\exp(\boldsymbol{\delta}^{\top}\boldsymbol{v})f(\boldsymbol{v}\mid\boldsymbol{X})\,d\boldsymbol{v}}=\frac{\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})}{\mathbb{E}[\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})\mid\boldsymbol{X}]}=\frac{\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})},

which shows the density ratio can be estimated by estimating the conditional expectation $\nu_{\boldsymbol{\delta}}(\boldsymbol{X})$ . This does not require the conditional density of the exposures and can be carried out using flexible, univariate regression techniques. Further, the other key component of our one-step estimator can be written as

\int_{\mathcal{W}}\mu(\boldsymbol{X},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}=\frac{\int_{\mathcal{W}}\mu(\boldsymbol{X},\boldsymbol{w})\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})f(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}}{\int_{\mathcal{W}}\exp(\boldsymbol{\delta}^{\top}\boldsymbol{v})f(\boldsymbol{v}\mid\boldsymbol{X})\,d\boldsymbol{v}}=\frac{\mathbb{E}[\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})}=\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{X})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})}.

This shows that this quantity can be estimated as the ratio of two conditional expectations, each of which can be estimated using flexible, univariate regression techniques. This idea was introduced in Schindl et al. (2024), though they did not implement the estimator because they found that taking the ratio of the two estimated conditional expectations was unstable. They were working, however, in the univariate exposure setting, where estimating the conditional density of the exposure is more straightforward. In our setting, with multiple exposures, conditional density estimation can be very difficult, whereas this approach relies on univariate prediction models only. A third strategy is to still estimate the conditional density of the exposures and use it when calculating $\int_{\mathcal{W}}\mu(\boldsymbol{X},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}$, but to use the regression approach described above for the density ratio in order to improve the stability of our estimates. While potentially useful for estimation, estimators that obviate the need to estimate the exposure density will not inherit the same theoretical properties as the one-step estimator that directly uses estimates of $f$ and $\mu$, which we study theoretically in Section 4. They may, however, produce better finite-sample performance, which we examine in the simulation studies in Section 6 across a wide range of exposure distributions.
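A minimal sketch of the regression route (assuming NumPy) is given below. It regresses $\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})$ and $\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})\mu(\boldsymbol{X},\boldsymbol{W})$ on $\boldsymbol{X}$ to obtain $\hat{\nu}_{\boldsymbol{\delta}}$ and $\hat{\eta}_{\boldsymbol{\delta}}$. The names `regression_nuisances` and `ols_fit_predict` are hypothetical, ordinary least squares stands in for a flexible learner, the positivity guard on $\hat{\nu}_{\boldsymbol{\delta}}$ is an added assumption, and cross-fitting is omitted for brevity.

```python
import numpy as np

def ols_fit_predict(X_train, target, X_eval):
    """Stand-in learner: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.column_stack([np.ones(len(X_eval)), X_eval]) @ coef

def regression_nuisances(W, X, mu_vals, delta, fit_predict=ols_fit_predict):
    """Estimate the density ratio and tilted mean without a conditional density.

    mu_vals holds mu_hat(X_i, W_i) evaluated at the observed exposures.
    """
    tilt = np.exp(W @ delta)                     # exp(delta^T W_i)
    nu_hat = fit_predict(X, tilt, X)             # E[exp(delta^T W) | X]
    eta_hat = fit_predict(X, tilt * mu_vals, X)  # E[exp(delta^T W) mu(X, W) | X]
    nu_hat = np.maximum(nu_hat, 1e-8)            # guard: a linear fit need not stay positive
    return tilt / nu_hat, eta_hat / nu_hat       # r_hat, m_hat

# Toy case: exposures independent of covariates, mu(x, w) = w_1, identity covariance,
# so the true tilted mean is delta_1 = 0.2 and the true ratio averages to one.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
W = rng.normal(size=(n, 2))
delta = np.array([0.2, -0.1])
r_hat, m_hat = regression_nuisances(W, X, W[:, 0], delta)
```

The outputs `r_hat` and `m_hat` are exactly the two ingredients the one-step formula needs, so they can be passed to it in place of density-based estimates.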

3.2 Optimizing over a Manifold

Our primary estimand, the optimal policy shift $\boldsymbol{\delta}^{*}_{c}$ , is defined as the solution to

\min_{\boldsymbol{\delta}\in\mathcal{M}}\psi(\boldsymbol{\delta}),

where $\mathcal{M}:=\{\boldsymbol{\delta}\in\mathbb{R}^{d}:G(\boldsymbol{\delta})=c^{2}\}$ and $G$ is the Gelbrich quantity defined in Section 2.4. We assume $c^{2}$ is a regular value of $G$ (equivalently, $\nabla G(\boldsymbol{\delta})\neq\boldsymbol{0}$ for all $\boldsymbol{\delta}\in\mathcal{M}$ ), so that $\mathcal{M}$ is a smooth embedded hypersurface. This regularity holds generically and is verified for the Gelbrich constraint in Appendix B. Standard Euclidean optimization algorithms such as gradient descent are not directly applicable as a gradient step taken from a point on the manifold will likely lead to a point outside of the feasible set.

To properly solve this problem, we must recognize that the constraint set forms a smooth, curved space known as a Riemannian manifold. Optimization on manifolds requires specialized techniques that generalize concepts from Euclidean optimization. The core idea is to perform optimization steps within the tangent space at each point on the manifold, which is a local linear approximation of the manifold, and then map the result back onto the manifold itself. For our specific constrained problem, we employ a Riemannian Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. This is a powerful quasi-Newton method adapted for optimization on manifolds, which generally offers faster convergence than simpler first-order methods. In practice, the method is implemented with two computationally light ingredients: a projection-based retraction back to the level set, and an inexpensive projection-based vector transport between tangent spaces. The key components of this algorithm for our problem are:

1. Preliminary definitions: The tangent space can be defined through smooth curves on $\mathcal{M}$. Take any smooth curve on $\mathcal{M}$ passing through $\boldsymbol{\delta}$,

\gamma:(-\varepsilon,\varepsilon)\rightarrow\mathcal{M},\quad\gamma(0)=\boldsymbol{\delta}.

Then its velocity at $t=0$,

\gamma^{\prime}(0)\in\mathbb{R}^{d}

is a valid tangent vector. The tangent space at a point $\boldsymbol{\delta}$ is the collection of all such vectors:

T_{\boldsymbol{\delta}}\mathcal{M}=\left\{\gamma^{\prime}(0):\gamma(0)=\boldsymbol{\delta},\ \gamma(t)\in\mathcal{M}\text{ for all }t\right\}.

For a level set $\mathcal{M}=\{\boldsymbol{\delta}:G(\boldsymbol{\delta})=c^{2}\}$ , we also have the equivalent characterization $T_{\boldsymbol{\delta}}\mathcal{M}=\text{Null}(\nabla G(\boldsymbol{\delta})^{\top})$ . The BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm is a highly effective quasi-Newton method, which finds an optimum by maintaining and iteratively updating an approximation $\mathbf{B}$ of the inverse Hessian matrix. In Euclidean space $\mathbb{R}^{n}$ , the standard update formula is:

\mathbf{B}_{k+1}=\left(\mathbf{I}-\frac{\mathbf{a}_{k}\mathbf{b}_{k}^{\top}}{\mathbf{b}_{k}^{\top}\mathbf{a}_{k}}\right)\mathbf{B}_{k}\left(\mathbf{I}-\frac{\mathbf{b}_{k}\mathbf{a}_{k}^{\top}}{\mathbf{b}_{k}^{\top}\mathbf{a}_{k}}\right)+\frac{\mathbf{a}_{k}\mathbf{a}_{k}^{\top}}{\mathbf{b}_{k}^{\top}\mathbf{a}_{k}}

where:

• $\mathbf{B}_{k+1}$ is the updated approximation of the inverse Hessian, which is the target of the update.
• $\mathbf{B}_{k}$ is the current approximation of the inverse Hessian.
• $\mathbf{a}_{k}$ is the change in position (step), defined as $\mathbf{a}_{k}=\boldsymbol{\delta}_{k+1}-\boldsymbol{\delta}_{k}$.
• $\mathbf{b}_{k}$ is the change in gradient, defined as $\mathbf{b}_{k}=\nabla\psi(\boldsymbol{\delta}_{k+1})-\nabla\psi(\boldsymbol{\delta}_{k})$.

2. Riemannian Gradient: The standard (Euclidean) gradient, $\nabla\psi(\boldsymbol{\delta})$, must be projected onto the tangent space at the current point $\boldsymbol{\delta}$ to find the direction of steepest ascent along the manifold. Let $\nabla G(\boldsymbol{\delta})$ denote the Euclidean normal to the level set at $\boldsymbol{\delta}$, and let $\boldsymbol{n}(\boldsymbol{\delta})=\nabla G(\boldsymbol{\delta})/\|\nabla G(\boldsymbol{\delta})\|$ be the corresponding unit normal vector. The orthogonal projection matrix onto the tangent space is defined as $P_{\boldsymbol{\delta}}:=I-\boldsymbol{n}(\boldsymbol{\delta})\boldsymbol{n}(\boldsymbol{\delta})^{\top}$. Consequently, the Riemannian gradient under the induced Euclidean metric is given by

$\operatorname{grad}\psi(\boldsymbol{\delta})=P_{\boldsymbol{\delta}}\nabla\psi(\boldsymbol{\delta}).$
3. Retraction: Even though a step along the Riemannian gradient lies in the tangent space at the current point (which is essentially a linear approximation of the manifold at that point), taking it still moves us off the manifold. After finding a search direction, we therefore need a retraction to map the updated point back onto the manifold. For the level-set manifold $\mathcal{M}$, we take the projection retraction $R_{\boldsymbol{\delta}}(\boldsymbol{\xi})=\Pi(\boldsymbol{\delta}+\boldsymbol{\xi})$ for a tangent vector $\boldsymbol{\xi}\in T_{\boldsymbol{\delta}}\mathcal{M}$, where $\Pi$ is the orthogonal projection onto $\mathcal{M}$, which is a canonical construction for embedded manifolds. In our code, $\Pi$ is computed approximately by a single normal-direction correction, which can be interpreted as a projection-like update that preserves the local first-order accuracy required of a retraction while substantially reducing computational cost.
4. Vector Transport: Vector transport is required to compare tangent vectors that live in different tangent spaces across iterates. For our embedded level-set manifold, we use the simple projection transport:

$\widetilde{\mathcal{T}}_{\boldsymbol{\delta}\to\boldsymbol{\delta}_{+}}(\boldsymbol{\zeta}):=P_{\boldsymbol{\delta}_{+}}\boldsymbol{\zeta},$

namely, orthogonally projecting an ambient vector onto the new tangent space. Such a transport is generally not an isometry, even though it is often computationally attractive. This transport is also used to move tangent-space quantities between iterates when forming the RBFGS update.
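The geometric ingredients in items 2 to 4 can be sketched in a few lines (assuming NumPy). Since the Gelbrich quantity is defined elsewhere, the code uses a hypothetical quadratic stand-in $G(\boldsymbol{\delta})=\boldsymbol{\delta}^{\top}A\boldsymbol{\delta}$; the single normal-direction correction is implemented here as one Newton-type step along $\nabla G$ at the trial point, which is one plausible reading of the description above rather than the authors' exact code.

```python
import numpy as np

A = np.diag([1.0, 4.0])  # hypothetical stand-in: G(delta) = delta^T A delta
c2 = 1.0                 # level of the constraint, G(delta) = c^2

def G(delta):
    return float(delta @ A @ delta)

def grad_G(delta):
    return 2.0 * A @ delta

def tangent_projector(delta):
    """P_delta = I - n n^T, with n the unit normal to the level set at delta."""
    n = grad_G(delta)
    n = n / np.linalg.norm(n)
    return np.eye(len(delta)) - np.outer(n, n)

def riemannian_grad(delta, euclid_grad):
    """Project the Euclidean gradient onto the tangent space at delta."""
    return tangent_projector(delta) @ euclid_grad

def retract(delta, xi):
    """Take the tangent step, then apply one normal-direction correction."""
    y = delta + xi
    g = grad_G(y)
    return y - (G(y) - c2) / (g @ g) * g

def transport(delta_new, zeta):
    """Projection transport: project an ambient vector onto the new tangent space."""
    return tangent_projector(delta_new) @ zeta

delta = np.array([1.0, 0.0])        # on the level set, since G(delta) = 1
euclid_grad = np.array([0.3, 1.0])  # made-up Euclidean gradient of psi
rgrad = riemannian_grad(delta, euclid_grad)
xi = -0.1 * rgrad                   # small step against the Riemannian gradient
delta_new = retract(delta, xi)
```

A single correction does not land exactly on the level set, but it brings the trial point much closer than the raw tangent step does; repeating the correction would tighten this further at extra cost.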

To summarize, the iterative process of the Riemannian BFGS algorithm is outlined as follows.

Algorithm 1 Riemannian BFGS for Optimal Policy Shift

Initialize: $k=0$, choose $\boldsymbol{\delta}_{0}$ on the manifold (s.t. $G(\boldsymbol{\delta}_{0})=c^{2}$)
Compute Euclidean gradient $\nabla\psi(\boldsymbol{\delta}_{0})$ and Riemannian gradient $\operatorname{grad}\psi(\boldsymbol{\delta}_{0})=P_{\boldsymbol{\delta}_{0}}\nabla\psi(\boldsymbol{\delta}_{0})$
Initialize inverse Hessian approximation $\boldsymbol{B}_{0}=\boldsymbol{I}$
while not converged (e.g., $\|\operatorname{grad}\psi(\boldsymbol{\delta}_{k})\|>\epsilon$) do
    Compute search direction: $\boldsymbol{p}_{k}=-\boldsymbol{B}_{k}\operatorname{grad}\psi(\boldsymbol{\delta}_{k})$
    Perform a weak Wolfe line search along the retraction curve.
    Update point via retraction: $\boldsymbol{\delta}_{k+1}=R_{\boldsymbol{\delta}_{k}}(\alpha_{k}\boldsymbol{p}_{k})$
    Compute new Riemannian gradient $\operatorname{grad}\psi(\boldsymbol{\delta}_{k+1})$
    Update $\boldsymbol{B}_{k+1}$: apply the RBFGS update and transport $\boldsymbol{B}_{k}$ to the new tangent space as needed.
    $k\leftarrow k+1$
end while
return $\boldsymbol{\delta}_{k}$
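For readers who want to see the loop end to end, the following is a simplified sketch of Algorithm 1 (assuming NumPy), not the authors' implementation. Several pieces are assumptions made for illustration: the constraint is a hypothetical quadratic stand-in $G(\boldsymbol{\delta})=\boldsymbol{\delta}^{\top}A\boldsymbol{\delta}$, the objective is a toy linear function with a known constrained minimizer, a backtracking Armijo search replaces the weak Wolfe search, the retraction applies a few normal-direction corrections, and $\boldsymbol{B}$ is kept as an ambient matrix while only vectors are transported by projection.

```python
import numpy as np

A = np.diag([1.0, 4.0, 9.0])        # hypothetical stand-in: G(delta) = delta^T A delta
c2 = 1.0
beta = np.array([1.0, -2.0, 0.5])   # toy linear objective psi(delta) = beta^T delta

def psi(d): return float(beta @ d)
def grad_psi(d): return beta
def G(d): return float(d @ A @ d)
def grad_G(d): return 2.0 * A @ d

def projector(d):
    n = grad_G(d)
    n = n / np.linalg.norm(n)
    return np.eye(len(d)) - np.outer(n, n)

def retract(d, xi, n_corr=5):
    y = d + xi
    for _ in range(n_corr):          # repeated normal-direction corrections
        g = grad_G(y)
        y = y - (G(y) - c2) / (g @ g) * g
    return y

def rbfgs(delta0, tol=1e-6, max_iter=500):
    delta = np.asarray(delta0, dtype=float)
    P = projector(delta)
    grad = P @ grad_psi(delta)       # Riemannian gradient
    B = np.eye(len(delta))           # inverse-Hessian approximation
    for _ in range(max_iter):
        if np.linalg.norm(grad) <= tol:
            break
        p = -P @ (B @ grad)          # search direction, kept in the tangent space
        f0, slope, alpha = psi(delta), float(grad @ p), 1.0
        # Backtracking Armijo search along the retraction curve (simplification).
        while psi(retract(delta, alpha * p)) > f0 + 1e-4 * alpha * slope and alpha > 1e-12:
            alpha *= 0.5
        delta_new = retract(delta, alpha * p)
        P_new = projector(delta_new)
        grad_new = P_new @ grad_psi(delta_new)
        a = P_new @ (alpha * p)      # transported step
        b = grad_new - P_new @ grad  # change in gradient after transport
        if b @ a > 1e-12:            # curvature safeguard keeps B positive definite
            rho = 1.0 / (b @ a)
            V = np.eye(len(delta)) - rho * np.outer(a, b)
            B = V @ B @ V.T + rho * np.outer(a, a)
        delta, P, grad = delta_new, P_new, grad_new
    return delta

delta_hat = rbfgs(np.array([1.0, 0.0, 0.0]))
# Closed-form minimizer of beta^T delta on the ellipsoid, used only as a check.
Ainv_beta = np.linalg.solve(A, beta)
delta_star = -np.sqrt(c2) * Ainv_beta / np.sqrt(beta @ Ainv_beta)
```

The closed-form comparison is available only because the toy objective is linear; with an estimated $\psi$ one would instead monitor the norm of the Riemannian gradient and the constraint residual.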

When the objective is to minimize $\psi(\boldsymbol{\delta})$ over the Gelbrich level set, convergence guarantees for RBFGS depend on the line-search conditions, the regularity of the objective along retraction curves, and the choice of vector transport. Under the assumptions stated in Appendix B, the global convergence result in Theorem 4.2 of Huang et al. (2018) yields $\liminf_{k\to\infty}\|\operatorname{grad}\psi(\boldsymbol{\delta}_{k})\|=0$ for the RBFGS scheme equipped with a suitable isometric vector transport and weak Wolfe line search. Appendix B verifies the geometric regularity of the Gelbrich level set, establishes objective regularity along the retraction curve, and invokes the existence of a weak Wolfe step. The proof therefore uses a fully rigorous projection retraction and the scaled isometric vector transport $\mathcal{T}^{S}$ (Huang et al., 2015), together with an explicit boundary condition on $G$ that yields compactness of the relevant level set. These stricter choices, however, are made only for the convergence proof. In the actual implementation we use a computationally cheaper projection-type retraction and the orthogonal projection transport $\widetilde{\mathcal{T}}$, together with safeguard checks on the RBFGS update. This separation between the proof device and the working algorithm is standard in large-scale Riemannian optimization (Ring and Wirth, 2012; Huang et al., 2018). In our numerical work, the simplified implementation converged stably to a local solution while substantially reducing computation time.

In embedded-manifold settings, the projection transport $\boldsymbol{\zeta}\mapsto P_{\boldsymbol{\delta}_{+}}\boldsymbol{\zeta}$ can significantly reduce computational time without degrading practical convergence behavior. This phenomenon is well documented in the RBFGS experiments of Qi et al. (2010), where projection-based transport on the Stiefel manifold reduces wall-clock time from $304.0$ seconds to $24.0$ seconds and also decreases the iteration count from $175$ to $83$ in the Procrustes problem. Moreover, Huang et al. (2018) compare combinations that satisfy sufficient conditions used in global convergence theory with cheaper combinations that do not, and find that the numbers of function and gradient evaluations (as well as vector transports) are not significantly affected by the choice of retraction and transport; as a result, lower-complexity retractions and transports can be markedly faster in terms of computational time. In our applications, the simplified implementation described above converges reliably and is computationally fast.

4 Semiparametric Efficiency Theory

This section explores the fundamental limits of estimation for the multivariate incremental effect, $\psi(\boldsymbol{\delta})$. The analysis characterizes how the statistical difficulty of this problem depends not only on the sample size $n$, but also on the intervention vector $\boldsymbol{\delta}$ and its interplay with the exposure covariance structure $\boldsymbol{\Sigma}$. We generalize and further develop the theoretical analysis for univariate continuous exposures of Schindl et al. (2024) to our multivariate setting.

4.1 Minimax Lower Bound

In this section, we establish a minimax lower bound for the incremental effect

\theta(\boldsymbol{\delta})\ :=\ \psi(\boldsymbol{\delta})-\psi(\boldsymbol{0}),

under a flexible nonparametric model. Minimax lower bounds benchmark what is statistically achievable without imposing additional structure: they show that no estimator can attain uniformly smaller risk over a given model class. Before presenting the main results, we must first define a number of important terms. First, we refer to the centered version of the exposures as

\tilde{\boldsymbol{W}}:=\boldsymbol{W}-\operatorname{E}[\boldsymbol{W}\mid\boldsymbol{X}].

To analyze the asymptotic variance of the effect difference $\theta(\boldsymbol{\delta})$ , we decompose the problem into an incremental component $\boldsymbol{h}(\boldsymbol{Z})$ and a baseline influence function $\varphi_{0}(\boldsymbol{Z})$ . We define

\boldsymbol{h}(\boldsymbol{Z})\ :=\ \tilde{\boldsymbol{W}}\Big(Y-\operatorname{E}[\mu\mid\boldsymbol{X}]\Big)\;-\;\operatorname{E}\big[\operatorname{E}[\mu\tilde{\boldsymbol{W}}\mid\boldsymbol{X}]\big],\qquad\varphi_{0}(\boldsymbol{Z}):=Y-\psi(\boldsymbol{0}).

Next, we calculate the covariance structure of $\boldsymbol{h}(\boldsymbol{Z})$ . A key part of this variance arises from the outcome regression. To ensure this term represents a proper covariance (centered moment), we compute the raw second moment and subtract the outer product of the mean:

\boldsymbol{\Sigma}_{\mu,\mathrm{full}}:=\operatorname{E}\!\Big[\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top}\big(\mu-\operatorname{E}[\mu\mid\boldsymbol{X}]\big)^{2}\Big]\;-\;\operatorname{E}\big[\operatorname{E}[\mu\tilde{\boldsymbol{W}}\mid\boldsymbol{X}]\big]\operatorname{E}\big[\operatorname{E}[\mu\tilde{\boldsymbol{W}}\mid\boldsymbol{X}]\big]^{\top}.

Then, letting $\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}:=\operatorname{E}\big[\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})\,\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top}\big]$ be the residual variance component, we show in the appendix that we can write the full covariance matrix of $\boldsymbol{h}(\boldsymbol{Z})$ as

\boldsymbol{H}\ :=\ \operatorname{Cov}\!\big(\boldsymbol{h}(\boldsymbol{Z})\big)\ =\ \boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}\ +\ \boldsymbol{\Sigma}_{\mu,\mathrm{full}}.

Finally, estimating the incremental effect $\theta(\boldsymbol{\delta})$ requires distinguishing the variation in the shift direction from the variation in the baseline. Since the gradient component $\boldsymbol{h}(\boldsymbol{Z})$ and the baseline influence $\varphi_{0}(\boldsymbol{Z})$ are typically correlated, the fundamental difficulty of estimating their difference is governed by the variability of $\boldsymbol{h}$ that cannot be explained by $\varphi_{0}$ . Mathematically, this corresponds to the residual variance of $\boldsymbol{h}$ after linearly projecting it onto the space spanned by $\varphi_{0}$ , yielding the Schur–complement covariance

\boldsymbol{\Gamma}\ :=\ \boldsymbol{H}\ -\ \frac{\operatorname{Cov}(\boldsymbol{h},\varphi_{0})\,\operatorname{Cov}(\boldsymbol{h},\varphi_{0})^{\!\top}}{\operatorname{Var}(\varphi_{0})}\ \succeq\ \boldsymbol{0}.
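As a numerical companion to this definition, the sketch below (assuming NumPy) computes a sample analogue of $\boldsymbol{\Gamma}$ from evaluations of $\boldsymbol{h}(\boldsymbol{Z}_{i})$ and $\varphi_{0}(\boldsymbol{Z}_{i})$. The function name `schur_complement_cov` and the simulated inputs are hypothetical; in practice both inputs would be built from estimated nuisance functions.

```python
import numpy as np

def schur_complement_cov(h, phi0):
    """Sample version of Gamma = H - Cov(h, phi0) Cov(h, phi0)^T / Var(phi0).

    h    : (n, q) evaluations of the vector h(Z_i)
    phi0 : (n,)   evaluations of the baseline influence function
    """
    n = len(phi0)
    h_c = h - h.mean(axis=0)
    p_c = phi0 - phi0.mean()
    H = h_c.T @ h_c / n                 # Cov(h)
    cov_hp = h_c.T @ p_c / n            # Cov(h, phi0)
    return H - np.outer(cov_hp, cov_hp) / (p_c @ p_c / n)

# Simulated check: h = phi0 * loading + independent unit-variance noise,
# so the part of h not explained by phi0 has covariance close to the identity.
rng = np.random.default_rng(0)
n = 20_000
phi0 = rng.normal(size=n)
h = np.outer(phi0, [1.0, 0.5]) + rng.normal(size=(n, 2))
Gamma_hat = schur_complement_cov(h, phi0)
```

Because the sample version is itself a Schur complement of a positive semidefinite covariance matrix, its eigenvalues are non-negative up to rounding.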

The next result records how the efficiency bound depends on $\boldsymbol{\delta}$ .

Lemma 1 (Variance of the efficient influence function).

Under (A1)–(A5) defined in the appendix and for all $\|\boldsymbol{\delta}\|$ within a fixed small radius, there exist constants $0<c_{\mathrm{low}}\leq c_{\mathrm{up}}<\infty$ (depending only on the model constants and the chosen radius; explicit expressions are given in the Appendix) such that

c_{\mathrm{low}}\;\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}\ \leq\ \operatorname{Var}\!\big\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}\ \leq\ c_{\mathrm{up}}\;\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta},\qquad\boldsymbol{H}=\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}.

Lemma 1 makes explicit that the nonparametric efficiency bound (the variance of the efficient influence function) grows quadratically in the tilt magnitude, with geometry determined by the matrix $\boldsymbol{H}$ ; this geometry blends the noise component $\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}$ and the outcome–regression component $\boldsymbol{\Sigma}_{\mu,\mathrm{full}}$ . Since $\boldsymbol{H}=\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}\succeq\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}$ , the lower bound immediately implies

\operatorname{Var}\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\}\ \geq\ c_{\mathrm{low}}\ \boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}\,\boldsymbol{\delta}.

As the direction of $\boldsymbol{\delta}$ changes, the bound can increase substantially, which shows that the most efficient shifts are those proportional to the first eigenvector of $\boldsymbol{\Sigma}_{\tilde{\boldsymbol{W}}\tilde{\boldsymbol{W}}^{\top},\,\varepsilon}$. Note that this matrix is not the same as the covariance matrix of the exposures, but is closely related to it, and therefore this result confirms our findings of Section 2 about which directions are most efficient to estimate. Now, we derive minimax bounds for the estimation error of $\theta(\boldsymbol{\delta})$.

Theorem 2 (Minimax lower bound).

Assume (A1)–(A5) and let $\|\boldsymbol{\delta}\|$ lie in the finite-tilt regime specified in the Appendix. There exists a universal constant $C>0$ (independent of $n$ and $\boldsymbol{\delta}$ ) such that, for any estimator $\widehat{\theta}$ based on a sample of size $n$ ,

\inf_{\widehat{\theta}}\ \sup_{P}\ \operatorname{E}_{P}\!\big[(\widehat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big]\ \geq\ C\cdot\frac{\ \boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\,\boldsymbol{\delta}\ }{n}\,.

Theorem 2 shows that the best possible root mean–squared error obeys

\operatorname{RMSE}(\widehat{\theta})\ \gtrsim\ \sqrt{\frac{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\,\boldsymbol{\delta}}{n}}\,,

revealing an effective sample size of order $n_{\mathrm{eff}}\asymp n/(\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta})$ . Hence larger tilts and directions aligned with high–variance components of $\boldsymbol{\Gamma}$ are intrinsically harder, regardless of the estimation strategy. Conversely, if $\boldsymbol{\delta}$ lies near a low–variance direction of $\boldsymbol{\Gamma}$ , faster convergence is attainable. Together with Lemma 1, the theorem clarifies why one must account for $\boldsymbol{\delta}$ when assessing estimation difficulty and designing procedures: the geometry induced by $\boldsymbol{\Gamma}$ determines how precision scales with both the size and the direction of the tilt.
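To illustrate how the bound depends on direction, the sketch below (assuming NumPy) evaluates $\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}/n}$ for two tilts of equal Euclidean norm. The matrix $\boldsymbol{\Gamma}$, the tilt size, and the sample size are made-up values, and the unknown constant $C$ in Theorem 2 is ignored, so only the ratio between directions is meaningful.

```python
import numpy as np

def rmse_scale(delta, Gamma, n):
    """sqrt(delta^T Gamma delta / n): the scale of the lower bound, up to a constant."""
    return float(np.sqrt(delta @ Gamma @ delta / n))

Gamma = np.array([[2.0, 1.5],
                  [1.5, 2.0]])        # hypothetical Gamma with eigenvalues 0.5 and 3.5
eigvals, eigvecs = np.linalg.eigh(Gamma)
low_dir, high_dir = eigvecs[:, 0], eigvecs[:, -1]

n = 10_000
scale_low = rmse_scale(0.5 * low_dir, Gamma, n)    # tilt along the low-variance direction
scale_high = rmse_scale(0.5 * high_dir, Gamma, n)  # same norm, high-variance direction
ratio = scale_high / scale_low
```

Here the ratio equals the square root of the eigenvalue ratio, so matching the precision of the low-variance direction would require roughly seven times as many observations in the high-variance direction.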

4.2 Convergence and Normality

Under mild regularity conditions (boundedness, i.i.d. sampling, and a fixed finite tilt), we establish a finite- $\boldsymbol{\delta}$ central limit theorem for our cross-fitted one-step estimator. At first sight, the efficient influence function suggests that asymptotic linearity requires controlling the $\boldsymbol{\delta}$ -specific nuisance functions

m_{\boldsymbol{\delta}}(\boldsymbol{x})\;=\;\mathbb{E}_{g_{\boldsymbol{\delta}}}\!\big[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}\big],\qquad r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\;=\;\frac{g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})}.

Since both objects are indexed by the tilt parameter $\boldsymbol{\delta}$ , and $m_{\boldsymbol{\delta}}$ is a nonlinear functional of the observed-data law, it is not obvious how to guarantee the $L_{2}$ rates needed for one-step asymptotics.

Our analysis shows that, for any fixed finite tilt $\boldsymbol{\delta}$ , it is in fact sufficient to estimate only two familiar observed-data nuisance components:

\mu(\boldsymbol{x},\boldsymbol{w})\;=\;\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}],\qquad f(\boldsymbol{w}\mid\boldsymbol{x}),

namely the outcome regression and the (generalized) propensity score. In the Appendix we show that, under bounded support for $\boldsymbol{W}$ and the finite-tilt condition $\|\boldsymbol{\delta}\|\leq\Delta$ , the maps $(\mu,f)\mapsto r_{\boldsymbol{\delta}}$ and $(\mu,f)\mapsto m_{\boldsymbol{\delta}}$ are Lipschitz in $L_{2}(P)$ : there exist fixed finite constants $C_{1}(\Delta),C_{2}(\Delta)$ such that for any estimators $(\widehat{\mu},\widehat{f})$ ,

\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\;\leq\;C_{1}(\Delta)\,\|\widehat{f}-f\|_{2},\qquad\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}\;\leq\;C_{2}(\Delta)\big(\|\widehat{\mu}-\mu\|_{2}+\|\widehat{f}-f\|_{2}\big),

where $(\widehat{m}_{\boldsymbol{\delta}},\widehat{r}_{\boldsymbol{\delta}})$ are obtained from $(\widehat{\mu},\widehat{f})$ by the same formulas as above; see Lemma 9. Consequently, the product condition

\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/2}),

which ensures that the second-order remainder is negligible, is implied by the more interpretable requirement that both $\widehat{\mu}$ and $\widehat{f}$ converge at rate $n^{-1/4}$ in $L_{2}(P)$ ; see Corollary 3. These $L_{2}$ rates for $(\widehat{\mu},\widehat{f})$ , and hence for $(\widehat{m}_{\boldsymbol{\delta}},\widehat{r}_{\boldsymbol{\delta}})$ , can be guaranteed under standard smoothness and complexity conditions by the highly adaptive lasso (van der Laan, 2017), which attains near-optimal convergence rates for a broad nonparametric class.
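The step from the Lipschitz bounds to the product condition can be sketched in one line. This is an informal reading, assuming that "rate $n^{-1/4}$" means $\|\widehat{\mu}-\mu\|_{2}=o_{P}(n^{-1/4})$ and $\|\widehat{f}-f\|_{2}=o_{P}(n^{-1/4})$, and that $C_{1}(\Delta),C_{2}(\Delta)$ are fixed; the formal statement is the corollary cited above.

```latex
\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,
\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}
\;\leq\; C_{1}(\Delta)\,C_{2}(\Delta)\,\|\widehat{f}-f\|_{2}
\big(\|\widehat{\mu}-\mu\|_{2}+\|\widehat{f}-f\|_{2}\big)
\;=\; o_{P}(n^{-1/4})\cdot o_{P}(n^{-1/4})
\;=\; o_{P}(n^{-1/2}).
```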

Theorem 3 (Finite- $\boldsymbol{\delta}$ CLT).

Assume (A1)–(A3) and (C1)–(C5) in the Appendix. Fix $\Delta\in(0,\infty)$ and any tilt $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta$ . Then

\sqrt{n}\,\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\}\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big),

where the efficient influence function is

\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})=r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}+m_{\boldsymbol{\delta}}(\boldsymbol{X})-\psi(\boldsymbol{\delta}).

Consequently,

\sqrt{n}\,\{\widehat{\theta}(\boldsymbol{\delta})-\theta(\boldsymbol{\delta})\}\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big),\qquad\varphi_{\theta(\boldsymbol{\delta})}:=\varphi_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\mathbf{0})}.

Theorem 3 provides the basis for asymptotic inference, which proceeds in a standard influence-function manner. Let $\boldsymbol{Z}_{i}=(\boldsymbol{X}_{i},\boldsymbol{W}_{i},Y_{i})$ , and define the empirical influence values for $\psi(\boldsymbol{\delta})$ by

\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}_{i}):=\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W}_{i},\boldsymbol{X}_{i})\{Y_{i}-\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X}_{i})\}+\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X}_{i})-\widehat{\psi}(\boldsymbol{\delta}),

where $\widehat{r}_{\boldsymbol{\delta}}$ and $\widehat{m}_{\boldsymbol{\delta}}$ are constructed from the cross-fitted estimators $(\widehat{\mu},\widehat{f})$ as above. A consistent estimator of the asymptotic variance is the sample second moment

\widehat{\sigma}_{\psi(\boldsymbol{\delta})}^{\,2}:=\frac{1}{n}\sum_{i=1}^{n}\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}_{i})^{2},

and analogously for the incremental effect we use

\widehat{\varphi}_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z}_{i}):=\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}_{i})-\widehat{\varphi}_{\psi(\mathbf{0})}(\boldsymbol{Z}_{i}),\qquad\widehat{\sigma}_{\theta(\boldsymbol{\delta})}^{\,2}:=\frac{1}{n}\sum_{i=1}^{n}\widehat{\varphi}_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z}_{i})^{2}.

These yield Wald-type confidence intervals of the form

\widehat{\theta}(\boldsymbol{\delta})\ \pm\ z_{1-\alpha/2}\,\frac{\widehat{\sigma}_{\theta(\boldsymbol{\delta})}}{\sqrt{n}},

with an analogous construction for $\psi(\boldsymbol{\delta})$. When inference is required for several tilts $\boldsymbol{\delta}_{1},\ldots,\boldsymbol{\delta}_{J}$, a joint covariance estimator is given by the empirical covariance matrix of the vector

\big(\widehat{\varphi}_{\theta(\boldsymbol{\delta}_{1})}(\boldsymbol{Z}_{i}),\ldots,\widehat{\varphi}_{\theta(\boldsymbol{\delta}_{J})}(\boldsymbol{Z}_{i})\big)^{\top},\qquad i=1,\ldots,n,

which enables simultaneous confidence intervals or Wald tests via a multivariate normal approximation.
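The inference recipe above can be sketched in a few lines of Python. The arrays below are synthetic placeholders standing in for cross-fitted values of $\widehat{r}_{\boldsymbol{\delta}}$ and $\widehat{m}_{\boldsymbol{\delta}}$, and the helper names `influence_values` and `wald_interval` are hypothetical. The sketch also assumes that the point estimate is the sample mean of the uncentered influence values, which is the usual one-step form but is not restated in this section.

```python
import numpy as np
from statistics import NormalDist


def influence_values(r_hat, m_hat, y):
    """Return (psi_hat, centered influence values) from r_hat(W_i, X_i) and m_hat(X_i)."""
    r_hat, m_hat, y = (np.asarray(a, dtype=float) for a in (r_hat, m_hat, y))
    uncentered = r_hat * (y - m_hat) + m_hat
    psi_hat = uncentered.mean()  # assumption: one-step estimate = mean of uncentered values
    return psi_hat, uncentered - psi_hat


def wald_interval(estimate, phi, alpha=0.05):
    """Wald interval: estimate +/- z_{1-alpha/2} * sigma_hat / sqrt(n)."""
    sigma_hat = np.sqrt(np.mean(phi ** 2))  # root of the sample second moment
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma_hat / np.sqrt(phi.shape[0])
    return estimate - half_width, estimate + half_width


# Synthetic placeholders; in practice these come from the cross-fitted nuisances.
rng = np.random.default_rng(0)
n = 2000
y = rng.normal(size=n)
score = rng.normal(size=n)
m_0 = 0.5 * y + rng.normal(scale=0.5, size=n)           # stand-in for m_hat at delta = 0
r_1 = np.exp(0.2 * score) / np.exp(0.2 * score).mean()  # stand-in density ratio, tilt 1
r_2 = np.exp(0.4 * score) / np.exp(0.4 * score).mean()  # stand-in density ratio, tilt 2

psi_0, phi_0 = influence_values(np.ones(n), m_0, y)      # delta = 0 gives r = 1
psi_1, phi_1 = influence_values(r_1, m_0 - 0.1, y)
psi_2, phi_2 = influence_values(r_2, m_0 - 0.2, y)

# Incremental effects and their influence values are differences.
theta_1, phi_theta_1 = psi_1 - psi_0, phi_1 - phi_0
theta_2, phi_theta_2 = psi_2 - psi_0, phi_2 - phi_0

ci_1 = wald_interval(theta_1, phi_theta_1)

# Joint covariance across tilts: empirical covariance of the stacked influence values.
joint_cov = np.cov(np.column_stack([phi_theta_1, phi_theta_2]), rowvar=False)
```

Dividing `joint_cov` by `n` gives an approximate covariance for the vector of estimates, which can then feed a multivariate normal approximation.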

5 Sensitivity analysis to unmeasured confounding

In this section we develop a sensitivity analysis approach to assess the robustness of our causal estimates to the presence of unmeasured confounders. Throughout, we assume that there exist unmeasured confounders $\boldsymbol{U}$ such that

Y(\boldsymbol{w})\perp\!\!\!\!\perp\boldsymbol{W}\mid\boldsymbol{X},\boldsymbol{U},\quad\text{for all }\boldsymbol{w}.

For notational clarity, let $\boldsymbol{V}:=(\boldsymbol{X},\boldsymbol{U})$ denote the full adjustment set, reserving $\boldsymbol{Z}:=(\boldsymbol{X},\boldsymbol{W},Y)$ for the observed data throughout. We still consider incremental policies defined by exponential tilting of the observed conditional exposure density $f(\boldsymbol{w}\mid\boldsymbol{X})$ :

g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})=\frac{\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})f(\boldsymbol{w}\mid\boldsymbol{X})}{\mathbb{E}[\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})\mid\boldsymbol{X}]}.
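To build intuition for the tilt, the following sketch reweights draws from a Gaussian exposure distribution by $\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})$ and compares the reweighted mean with the closed form available in the Gaussian case, where tilting $\mathcal{N}(\boldsymbol{m},\boldsymbol{\Sigma})$ yields $\mathcal{N}(\boldsymbol{m}+\boldsymbol{\Sigma}\boldsymbol{\delta},\boldsymbol{\Sigma})$. The example is marginal (it ignores the conditioning on $\boldsymbol{X}$), and the mean, covariance, tilt, and function name `tilt_weights` are illustrative assumptions.

```python
import numpy as np


def tilt_weights(w, delta):
    """Self-normalized weights proportional to exp(delta' w_i); they sum to one."""
    scores = np.asarray(w, dtype=float) @ np.asarray(delta, dtype=float)
    scores = scores - scores.max()  # stabilize the exponentials
    weights = np.exp(scores)
    return weights / weights.sum()


rng = np.random.default_rng(0)
mean = np.array([1.0, -0.5])
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
w = rng.multivariate_normal(mean, cov, size=200_000)
delta = np.array([-0.3, 0.0])  # tilt only the first exposure

weights = tilt_weights(w, delta)
tilted_mean_mc = weights @ w            # Monte Carlo mean under the tilted law
tilted_mean_exact = mean + cov @ delta  # Gaussian closed form: shift by Sigma @ delta
```

Although only the first coordinate of `delta` is nonzero, the second coordinate of the tilted mean also moves because the two exposures are correlated.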

The full-data causal estimand is given by

\psi(\boldsymbol{\delta})=\mathbb{E}\left[\int_{\mathcal{W}}\mu(\boldsymbol{V},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right],

where $\mu(\boldsymbol{V},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{V},\boldsymbol{W}=\boldsymbol{w}]$ is referred to as the long outcome regression. We further denote the estimand obtained by applying the same identification formula while omitting $\boldsymbol{U}$ as

\widetilde{\psi}(\boldsymbol{\delta})=\mathbb{E}\left[\int_{\mathcal{W}}\widetilde{\mu}(\boldsymbol{X},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right]=\mathbb{E}\left[\int_{\mathcal{W}}[\mu(\boldsymbol{X},\boldsymbol{w})+d(\boldsymbol{X},\boldsymbol{w})]g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right],

where $\widetilde{\mu}(\boldsymbol{X},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{X},\boldsymbol{W}=\boldsymbol{w}]$ is referred to as the short outcome regression, and $d(\boldsymbol{X},\boldsymbol{w})$ captures confounding bias at exposure level $\boldsymbol{w}$ . When there is unmeasured confounding, the identified short estimand $\widetilde{\psi}(\boldsymbol{\delta})$ , which conditions only on observed covariates $\boldsymbol{X}$ , diverges from the causal target $\psi(\boldsymbol{\delta})$ that is defined under conditional exchangeability given $\boldsymbol{V}=(\boldsymbol{X},\boldsymbol{U})$ . We follow Chernozhukov et al. (2022) and leverage the geometry of Riesz representers to derive sharp bias bounds based on $L_{2}$ norms, yielding interpretable calibration of confounding strength without imposing restrictive structural assumptions on $\mu(\cdot,\cdot)$ .

5.1 The Bias Representation

Let $f(\boldsymbol{w}\mid\boldsymbol{V})$ denote the true conditional exposure density given the full adjustment set. The parameter $\psi(\boldsymbol{\delta})$ can be written as a linear functional of the long regression evaluated under the observed distribution of $(\boldsymbol{V},\boldsymbol{W})$ :

\psi(\boldsymbol{\delta})=\mathbb{E}\left[\mu(\boldsymbol{V},\boldsymbol{W})\alpha_{\boldsymbol{\delta}}(\boldsymbol{V},\boldsymbol{W})\right],\qquad\alpha_{\boldsymbol{\delta}}(\boldsymbol{V},\boldsymbol{W})=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{V})}.

Analogously, the short estimand admits the representation

\widetilde{\psi}(\boldsymbol{\delta})=\mathbb{E}\left[\widetilde{\mu}(\boldsymbol{X},\boldsymbol{W})\alpha_{s,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})\right],\qquad\alpha_{s,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}.

Under mild regularity conditions ensuring square-integrability of the relevant Riesz representers, Chernozhukov et al. (2022) show that the bias admits the exact representation

\widetilde{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})=\mathbb{E}\left[\Delta_{\mu}(\boldsymbol{V},\boldsymbol{W})\Delta_{\alpha}(\boldsymbol{V},\boldsymbol{W})\right],

where

\Delta_{\mu}(\boldsymbol{V},\boldsymbol{W})=\mu(\boldsymbol{V},\boldsymbol{W})-\widetilde{\mu}(\boldsymbol{X},\boldsymbol{W}),\qquad\Delta_{\alpha}(\boldsymbol{V},\boldsymbol{W})=\alpha_{\boldsymbol{\delta}}(\boldsymbol{V},\boldsymbol{W})-\alpha_{s,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W}).

This identity isolates two necessary sources of bias: the additional outcome variation explained by $\boldsymbol{U}$ beyond $(\boldsymbol{X},\boldsymbol{W})$ through $\Delta_{\mu}$ , and the additional information about the exposure distribution provided by $\boldsymbol{U}$ beyond $\boldsymbol{X}$ through $\Delta_{\alpha}$ . By Cauchy–Schwarz,

|\widetilde{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})|\leq\sqrt{\mathbb{E}[\Delta_{\mu}(\boldsymbol{V},\boldsymbol{W})^{2}]}\sqrt{\mathbb{E}[\Delta_{\alpha}(\boldsymbol{V},\boldsymbol{W})^{2}]}.

5.2 Sensitivity Parameters

To express the bound in terms of identifiable scale components and partial $R^{2}$-type sensitivity parameters, define the following identifiable parameters:

\sigma_{s}^{2}=\mathbb{E}\left[(Y-\widetilde{\mu}(\boldsymbol{X},\boldsymbol{W}))^{2}\right],\qquad\nu_{s}^{2}(\boldsymbol{\delta})=\mathbb{E}\left[\alpha_{s,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})^{2}\right],\qquad S(\boldsymbol{\delta})=\sigma_{s}\nu_{s}(\boldsymbol{\delta}).

We parameterize the outcome component by the nonparametric partial $R^{2}$ of $\boldsymbol{U}$ with $Y$ given $(\boldsymbol{X},\boldsymbol{W})$ ,

R^{2}_{Y\sim\boldsymbol{U}\mid\boldsymbol{X},\boldsymbol{W}}=\frac{\mathbb{E}\left[\Delta_{\mu}(\boldsymbol{V},\boldsymbol{W})^{2}\right]}{\sigma_{s}^{2}},\qquad C_{Y}=\sqrt{R^{2}_{Y\sim\boldsymbol{U}\mid\boldsymbol{X},\boldsymbol{W}}}.

For the treatment component, we use the relative increase in $L_{2}$ variation of the Riesz representer induced by conditioning on $\boldsymbol{U}$ :

C_{D}^{2}(\boldsymbol{\delta})=\frac{\mathbb{E}\left[\Delta_{\alpha}(\boldsymbol{V},\boldsymbol{W})^{2}\right]}{\nu_{s}^{2}(\boldsymbol{\delta})}.

Combining these definitions yields the sensitivity bound

|\widetilde{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})|\leq S(\boldsymbol{\delta})\cdot C_{Y}\cdot C_{D}(\boldsymbol{\delta}).

For the empirical analysis, we parameterize the strength of unmeasured confounding by

\eta_{Y}^{2}:=C_{Y}^{2},\qquad\eta_{\alpha}^{2}(\boldsymbol{\delta}):=1-R_{\alpha}^{2}(\boldsymbol{\delta}),\qquad C_{D}^{2}(\boldsymbol{\delta})=\frac{\eta_{\alpha}^{2}(\boldsymbol{\delta})}{1-\eta_{\alpha}^{2}(\boldsymbol{\delta})},

where $R_{\alpha}^{2}(\boldsymbol{\delta})$ is the squared correlation between the long and short Riesz representers (RRs). Here $\eta_{Y}^{2}$ is the fraction of residual outcome variation explainable by $\boldsymbol{U}$ given $(\boldsymbol{X},\boldsymbol{W})$, and $\eta_{\alpha}^{2}(\boldsymbol{\delta})$ is the fraction of RR variation explainable by $\boldsymbol{U}$ along the intervention path indexed by $\boldsymbol{\delta}$. These two quantities determine the width of the bias bound and, in turn, the width of the endpoint confidence bounds. The sensitivity parameters can be set subjectively or calibrated by formal benchmarking against observed covariates (Cinelli and Hazlett, 2020). The benchmarking procedure used in the empirical analysis is given in Appendix E.7.

5.3 Sensitivity parameters and confidence bounds for $\theta(\boldsymbol{\delta})$

Our empirical results focus on the incremental effect $\theta(\boldsymbol{\delta})=\psi(\boldsymbol{\delta})-\psi(\boldsymbol{0})$ . As shown in Appendix E, the same omitted-variable-bias representation applies to this contrast after replacing the short RR by the RR contrast $\alpha_{s,\theta,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})=\alpha_{s,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})-\alpha_{s,\boldsymbol{0}}(\boldsymbol{X},\boldsymbol{W})$ . Writing

\nu_{s,\theta}^{2}(\boldsymbol{\delta}):=\mathbb{E}\!\left[\alpha_{s,\theta,\boldsymbol{\delta}}(\boldsymbol{X},\boldsymbol{W})^{2}\right],

the plugin bound used in our analysis is

\widehat{B}(\boldsymbol{\delta})=\,\widehat{S}(\boldsymbol{\delta})\,\sqrt{\eta_{Y}^{2}}\,\sqrt{\frac{\eta_{\alpha}^{2}(\boldsymbol{\delta})}{1-\eta_{\alpha}^{2}(\boldsymbol{\delta})}},\qquad\widehat{S}(\boldsymbol{\delta}):=\sqrt{\widehat{\sigma}_{s}^{2}\,\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta})},

with $\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta})=n^{-1}\sum_{i=1}^{n}\widehat{\alpha}_{s,\theta,\boldsymbol{\delta}}(\boldsymbol{X}_{i},\boldsymbol{W}_{i})^{2}$. Therefore the estimated identified set is

\theta(\boldsymbol{\delta})\in\Big[\,\widehat{\theta}(\boldsymbol{\delta})-\widehat{B}(\boldsymbol{\delta}),\;\widehat{\theta}(\boldsymbol{\delta})+\widehat{B}(\boldsymbol{\delta})\Big].
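A minimal sketch of the plug-in bound follows. It assumes that $\widehat{\sigma}_{s}^{2}$ is the sample mean of squared short-regression residuals, which is one natural plug-in but is not specified in this section; the function name `bias_bound` and the toy inputs are hypothetical.

```python
import numpy as np


def bias_bound(residuals, alpha_contrast, eta_y2, eta_alpha2):
    """Plug-in bound B = S * sqrt(eta_Y^2) * sqrt(eta_alpha^2 / (1 - eta_alpha^2))."""
    if not (0.0 <= eta_y2 <= 1.0 and 0.0 <= eta_alpha2 < 1.0):
        raise ValueError("need 0 <= eta_y2 <= 1 and 0 <= eta_alpha2 < 1")
    sigma_s2 = np.mean(np.square(residuals))    # assumption: mean squared short residual
    nu_s2 = np.mean(np.square(alpha_contrast))  # mean squared RR contrast
    scale = np.sqrt(sigma_s2 * nu_s2)           # S_hat(delta)
    return float(scale * np.sqrt(eta_y2) * np.sqrt(eta_alpha2 / (1.0 - eta_alpha2)))


# Toy inputs: sigma_s^2 = 1 and nu_{s,theta}^2 = 4, so S_hat = 2.
residuals = np.array([1.0, -1.0, 1.0, -1.0])
alpha_contrast = np.array([2.0, -2.0, 2.0, -2.0])

bound = bias_bound(residuals, alpha_contrast, eta_y2=0.04, eta_alpha2=0.2)
theta_hat = -1.0  # hypothetical point estimate
identified_set = (theta_hat - bound, theta_hat + bound)
```

With these inputs the three factors are 2, 0.2, and 0.5, so the bound is 0.2 and the estimated set is roughly $[-1.2,-0.8]$.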

For formal sensitivity-adjusted inference, we construct confidence bounds for the estimated endpoints $\widehat{\theta}_{-}(\boldsymbol{\delta}):=\widehat{\theta}(\boldsymbol{\delta})-\widehat{B}(\boldsymbol{\delta})$ and $\widehat{\theta}_{+}(\boldsymbol{\delta}):=\widehat{\theta}(\boldsymbol{\delta})+\widehat{B}(\boldsymbol{\delta})$ . Appendix E gives the endpoint expansions and the resulting standard errors. The empirical results report a sensitivity-adjusted $95\%$ interval obtained from a lower confidence bound for $\widehat{\theta}_{-}(\boldsymbol{\delta})$ and an upper confidence bound for $\widehat{\theta}_{+}(\boldsymbol{\delta})$ :

\left[\widehat{\theta}_{-}(\boldsymbol{\delta})-z_{0.95}\,\widehat{\mathrm{se}}_{-}(\boldsymbol{\delta}),\;\widehat{\theta}_{+}(\boldsymbol{\delta})+z_{0.95}\,\widehat{\mathrm{se}}_{+}(\boldsymbol{\delta})\right].
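The interval above combines one-sided bounds at the two endpoints, which can be sketched as follows. The endpoint standard errors are taken as given inputs here (their derivation is in the appendix and is not reproduced), and the function name and numerical values are hypothetical.

```python
from statistics import NormalDist


def sensitivity_adjusted_interval(theta_hat, bound, se_lower, se_upper, level=0.95):
    """Lower confidence bound for theta_hat - B and upper bound for theta_hat + B."""
    z = NormalDist().inv_cdf(level)  # z_{0.95}, roughly 1.645
    lower_endpoint = theta_hat - bound
    upper_endpoint = theta_hat + bound
    return lower_endpoint - z * se_lower, upper_endpoint + z * se_upper


# Hypothetical inputs: point estimate, bias bound, and endpoint standard errors.
lo, hi = sensitivity_adjusted_interval(theta_hat=-1.0, bound=0.2, se_lower=0.1, se_upper=0.1)
excludes_zero = hi < 0.0
```

In this toy case the interval is roughly $[-1.364,-0.636]$, so the effect would remain significantly negative at the assumed confounding strength.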

6 Simulation studies

We conduct simulation studies to systematically evaluate the finite-sample performance of various nuisance-parameter estimation pipelines. Our goal is to assess how different modeling and estimation pipelines perform when the underlying exposure distribution exhibits complex features such as heavy tails, skewness, or multimodality.

6.1 Simulation design

In all simulations, we fix the sample size at $n=5,000$ and the dimensions of the covariates and exposures at $p=10$ and $q=6$ , respectively. The results are averaged over 100 independent Monte Carlo repetitions. We generate the covariate vector $\boldsymbol{X}=(X_{1},\dots,X_{p})^{\top}$ from a multivariate normal distribution with an autoregressive correlation structure. Specifically, $\boldsymbol{X}\sim\mathcal{N}(0,\boldsymbol{\Sigma}_{X})$ , where the $(j,k)$ -th entry of the covariance matrix is given by $\Sigma_{X,jk}=0.5^{|j-k|}$ . The exposure vector $\boldsymbol{W}$ is generated conditional on $\boldsymbol{X}$ using a linear location-shift model:

\displaystyle\boldsymbol{W}=\boldsymbol{B}^{\top}\boldsymbol{X}+\boldsymbol{\beta}_{0}+\boldsymbol{\mathcal{E}}

where $\boldsymbol{B}\in\mathbb{R}^{p\times q}$ is a sparse coefficient matrix with non-zero entries drawn from $\mathcal{N}(0,0.6^{2})$ (sparsity level $0.4$ ), and $\boldsymbol{\beta}_{0}$ is the intercept vector. The error term $\boldsymbol{\mathcal{E}}$ determines the distributional characteristics of the exposures. We consider three scenarios to test model robustness:

1. Scenario I (Gaussian): The errors follow a multivariate normal distribution $\boldsymbol{\mathcal{E}}\sim\mathcal{N}(0,\boldsymbol{\Sigma}_{W})$, where $\boldsymbol{\Sigma}_{W}$ has an AR(1) structure with correlation parameter $\rho=0.6$.

2. Scenario II (Skewed): The errors follow a multivariate skew-normal distribution, $\boldsymbol{\mathcal{E}}\sim\mathcal{SN}(0,\boldsymbol{\Sigma}_{W},\boldsymbol{\alpha})$, with a slant parameter vector $\boldsymbol{\alpha}=(4,\dots,4)^{\top}$. This scenario introduces substantial asymmetry and non-Gaussianity.

3. Scenario III (Truncated contaminated normal): To investigate robustness against heavy-tail-like deviations while ensuring the finiteness of the normalizing constant required for exponential tilting, we consider a truncated contaminated normal distribution. Specifically, the error distribution follows a mixture of two centered Gaussians, $(1-\pi)\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_{W})+\pi\mathcal{N}(\mathbf{0},\omega^{2}\boldsymbol{\Sigma}_{W})$, with contamination rate $\pi=0.2$ and scale inflation $\omega=1.5$. The support is restricted to the region bounded by $6$ standard deviations of the baseline component. This setting introduces heavier tails relative to the standard Gaussian assumption while maintaining the required integrability conditions.
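The exposure-generating mechanism for Scenarios I and III can be sketched as below (the skew-normal Scenario II is omitted for brevity). Several details are assumptions rather than statements from the text: the sparsity level $0.4$ is read as the share of non-zero entries of $\boldsymbol{B}$, the intercept $\boldsymbol{\beta}_{0}$ is set to zero, $\boldsymbol{\Sigma}_{W}$ has unit variances, and the truncation is applied coordinate-wise by rejection sampling. All function names are hypothetical.

```python
import numpy as np


def ar1_cov(dim, rho):
    """Matrix with (j, k) entry rho^|j - k|."""
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def truncated_contaminated_normal(rng, n, cov, pi=0.2, omega=1.5, cutoff=6.0):
    """Rejection sampler for the two-component scale mixture, truncated coordinate-wise."""
    q = cov.shape[0]
    sd = np.sqrt(np.diag(cov))
    out = np.empty((0, q))
    while out.shape[0] < n:
        draw = rng.multivariate_normal(np.zeros(q), cov, size=n)
        inflate = rng.random(n) < pi
        draw[inflate] *= omega  # contaminated component has covariance omega^2 * cov
        keep = np.all(np.abs(draw) <= cutoff * sd, axis=1)
        out = np.vstack([out, draw[keep]])
    return out[:n]


def simulate_exposures(rng, n=5000, p=10, q=6, scenario="gaussian"):
    x = rng.multivariate_normal(np.zeros(p), ar1_cov(p, 0.5), size=n)
    nonzero = rng.random((p, q)) < 0.4  # assumption: 0.4 = share of non-zero entries
    b = np.where(nonzero, rng.normal(0.0, 0.6, size=(p, q)), 0.0)
    beta0 = np.zeros(q)                 # assumption: zero intercept
    sigma_w = ar1_cov(q, 0.6)
    if scenario == "gaussian":
        eps = rng.multivariate_normal(np.zeros(q), sigma_w, size=n)
    elif scenario == "contaminated":
        eps = truncated_contaminated_normal(rng, n, sigma_w)
    else:
        raise ValueError("scenario must be 'gaussian' or 'contaminated'")
    w = x @ b + beta0 + eps  # row-wise version of W = B'X + beta_0 + E
    return x, w, b


x_g, w_g, b_g = simulate_exposures(np.random.default_rng(1), scenario="gaussian")
x_c, w_c, b_c = simulate_exposures(np.random.default_rng(2), scenario="contaminated")
```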

We generate the outcome under two scenarios of differing complexity. Crossing these with the three exposure scenarios above yields six data-generating designs.

Scenario I (Simple Linear Model): We generate a continuous outcome $Y$ using a linear model:

\displaystyle Y=\alpha_{0}+\boldsymbol{\alpha}^{\top}\boldsymbol{X}+\boldsymbol{\beta}^{\top}\boldsymbol{W}+\epsilon,\quad\epsilon\sim\mathcal{N}(0,1)

Here, the regression coefficients are generated from normal distributions: entries of $\boldsymbol{\alpha}$ are drawn from $\mathcal{N}(0.5,1)$ , and entries of $\boldsymbol{\beta}$ are drawn from $\mathcal{N}(2,1)$ , ensuring a strong but different signal for both covariates and exposures.

Scenario II (Complex Model): We evaluate a second, more complex data-generating process (DGP) for the outcome $Y$ to incorporate structural nonlinearities and interactions. The continuous outcome $Y$ is generated using the following model:

Y=\alpha_{0}+\boldsymbol{\alpha}^{\top}\boldsymbol{X}+c_{1}X_{2}^{2}+\boldsymbol{\beta}^{\top}\boldsymbol{W}+c_{2}W_{1}W_{2}+c_{3}X_{1}W_{1}+\epsilon,\quad\epsilon\sim\mathcal{N}(0,1)

Here, the baseline linear confounding coefficients $\boldsymbol{\alpha}$ and the marginal independent effect coefficients $\boldsymbol{\beta}$ are drawn from $\mathcal{N}(0.5,1)$ and $\mathcal{N}(2,1)$ , respectively. Beyond the linear main effects, the model explicitly introduces a quadratic effect for a specific covariate ( $c_{1}X_{2}^{2}$ , with $c_{1}=0.5$ ), a cross-product interaction between two exposures ( $c_{2}W_{1}W_{2}$ , with $c_{2}=1.0$ ), and a bilinear interaction between a covariate and an exposure ( $c_{3}X_{1}W_{1}$ , with $c_{3}=0.8$ ).
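The two outcome models can be sketched with a single function. The intercept $\alpha_{0}$ is not specified in the text and is set to zero here as an assumption; the function name `simulate_outcome` and the placeholder inputs are hypothetical.

```python
import numpy as np


def simulate_outcome(rng, x, w, complex_model=True, alpha0=0.0):
    """Draw coefficients and outcomes; returns (y, conditional mean)."""
    n, p = x.shape
    q = w.shape[1]
    alpha = rng.normal(0.5, 1.0, size=p)  # covariate coefficients ~ N(0.5, 1)
    beta = rng.normal(2.0, 1.0, size=q)   # exposure coefficients ~ N(2, 1)
    mean = alpha0 + x @ alpha + w @ beta
    if complex_model:
        mean = (mean
                + 0.5 * x[:, 1] ** 2         # c1 * X_2^2
                + 1.0 * w[:, 0] * w[:, 1]    # c2 * W_1 * W_2
                + 0.8 * x[:, 0] * w[:, 0])   # c3 * X_1 * W_1
    y = mean + rng.normal(0.0, 1.0, size=n)
    return y, mean


# Placeholder covariates and exposures, used only to exercise the function.
data_rng = np.random.default_rng(0)
x = data_rng.normal(size=(200, 10))
w = data_rng.normal(size=(200, 6))

# Identical seeds give identical coefficients, so the two means differ
# only by the nonlinear and interaction terms.
y_lin, mean_lin = simulate_outcome(np.random.default_rng(1), x, w, complex_model=False)
y_cx, mean_cx = simulate_outcome(np.random.default_rng(1), x, w, complex_model=True)
extra = 0.5 * x[:, 1] ** 2 + 1.0 * w[:, 0] * w[:, 1] + 0.8 * x[:, 0] * w[:, 0]
```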

This more complex setting is designed to emulate mechanisms often encountered in the health effects of air pollution mixtures. The quadratic covariate term can be viewed as mimicking the classical U-shaped meteorological confounding effect of temperature. The cross-product term $W_{1}W_{2}$ represents synergistic toxicity between two distinct pollutants, analogous to the combined effects of fine particulate matter (PM_2.5) and ozone (O₃). The interaction $X_{1}W_{1}$ captures effect modification, reflecting how a vulnerability factor such as patient age can amplify the health impact of a specific exposure.

6.2 Implemented estimators and method comparison

Our primary estimand is the tilted mean $\psi(\boldsymbol{\delta})$ , evaluated at a fixed tilting parameter $\boldsymbol{\delta}$ . We compare seven estimation pipelines that differ in how they approximate the nuisance structure induced by the tilted exposure law. For all methods, the outcome regression $\mu(\boldsymbol{X},\boldsymbol{W})=\mathbb{E}[Y\mid\boldsymbol{X},\boldsymbol{W}]$ is estimated using XGBoost with 5-fold cross-fitting. Our approaches can largely be categorized into two distinct types, which vary in how the exposure distribution is estimated.

1. Semi-parametric location-shift working models. We model the exposure distribution using a semi-parametric location-shift decomposition, $W_{j}=\mu_{j}(\boldsymbol{X})+\sigma_{j}\varepsilon_{j}$. For each exposure dimension, the conditional mean $\mu_{j}(\boldsymbol{X})$ is flexibly estimated via XGBoost, while the scale parameter $\sigma_{j}$ is estimated as the empirical standard deviation of the residuals within the corresponding training fold. This specification accommodates nonlinear covariate-dependent shifts in the exposure mean while imposing a homoscedastic error structure across the covariate space.
   We evaluate three variations of this working model based on the specified marginal distributions of the standardized residuals $\varepsilon_{j}$. In all three variations, the joint dependence structure of $\boldsymbol{\varepsilon}=(\varepsilon_{1},\dots,\varepsilon_{q})^{\top}$ is modeled using a Gaussian copula:
   - Path 2 (Gaussian): Assumes the standardized residuals follow a Gaussian distribution.
   - Path 2 ($t$): Models each marginal residual using a scaled Student’s $t$ distribution, $F_{j}=t_{df_{j}}$, where the degrees of freedom $df_{j}$ are estimated via maximum likelihood. This allows the model to accommodate heavy tails in specific exposure dimensions.
   - Path 2 (Empirical): Employs a fully nonparametric approach for the marginal residuals. The marginal cumulative distribution function $F_{j}$ for each residual dimension is flexibly estimated using log-spline density estimation.

2. Direct estimation of nuisance parameters via regression. In addition to the standard outcome regression $\mu(\boldsymbol{x},\boldsymbol{w})$, the one-step estimator depends on nuisance functions that vary with the tilting parameter $\boldsymbol{\delta}$. These are given by $\nu_{\boldsymbol{\delta}}(\boldsymbol{x})$ and $\eta_{\boldsymbol{\delta}}(\boldsymbol{x})$, which are defined in Section 3.1.2. In this pipeline, we use 5-fold cross-fitted SoftBART regression (Linero and Yang, 2018) for estimation of these functions.

Within each of the three approaches to conditional density estimation, we explore 1) estimating the density ratio using the estimate of $f$ , and 2) directly estimating the density ratio by estimating $\nu_{\boldsymbol{\delta}}(\boldsymbol{x})$ . This leads to six estimators, though we consider a seventh estimator that involves directly estimating both $\nu_{\boldsymbol{\delta}}(\boldsymbol{x})$ and $\eta_{\boldsymbol{\delta}}(\boldsymbol{x})$ without ever needing to estimate the exposure density.
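For the Gaussian residual working model, the plug-in density ratio has a closed form, which the sketch below uses. Since $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=\exp(\boldsymbol{\delta}^{\top}\boldsymbol{w})/\mathbb{E}[\exp(\boldsymbol{\delta}^{\top}\boldsymbol{W})\mid\boldsymbol{x}]$ and the Gaussian moment generating function gives the denominator, $\log r_{\boldsymbol{\delta}}=\boldsymbol{\delta}^{\top}(\boldsymbol{w}-\boldsymbol{\mu}(\boldsymbol{x}))-\tfrac{1}{2}\boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}\boldsymbol{\delta}$. The conditional mean, covariance, and tilt below are illustrative values rather than fitted quantities, and the function name is hypothetical; the $t$ and empirical variants would need numerical integration instead.

```python
import numpy as np


def gaussian_density_ratio(w, mu_x, sigma, delta):
    """Density ratio g_delta / f when W | X ~ N(mu(X), sigma)."""
    w = np.asarray(w, dtype=float)
    mu_x = np.asarray(mu_x, dtype=float)  # shape (q,) or (n, q)
    sigma = np.asarray(sigma, dtype=float)
    delta = np.asarray(delta, dtype=float)
    log_ratio = (w - mu_x) @ delta - 0.5 * float(delta @ sigma @ delta)
    return np.exp(log_ratio)


rng = np.random.default_rng(0)
idx = np.arange(3)
sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])  # illustrative residual covariance
mu_x = np.array([0.5, -1.0, 2.0])                   # illustrative conditional mean
delta = np.array([0.2, -0.1, 0.1])

w = rng.multivariate_normal(mu_x, sigma, size=100_000)
ratio = gaussian_density_ratio(w, mu_x, sigma, delta)
ratio_mean = ratio.mean()  # a density ratio averages to one under f
```

Checking that the fitted ratio averages to roughly one is a cheap diagnostic for any of the plug-in pipelines.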

6.3 Simulation Results

Figure 1 displays the estimated tilted mean, $\psi(\boldsymbol{\delta})$ , across 100 Monte Carlo repetitions for the seven pipelines under the six data-generating designs. Additionally, Table 1 complements the boxplots with numerical summaries aggregated over the six designs. We report the average signed bias together with the average absolute bias and the average RMSE.

Figure 1: Boxplots of the estimates of $\psi(\boldsymbol{\delta})$ across 100 repetitions under six data-generating designs. The dashed horizontal line marks the true value of the estimand. The labels on the horizontal axis identify, from left to right, the residual family, the method used for $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$, and the method used for estimating $m_{\boldsymbol{\delta}}(\boldsymbol{x})$. Throughout, MC represents simply plugging in estimates of $f$ directly, while SoftBART represents direct estimation as described in Section 3.1.2.

Table 1: Average performance over the six simulation designs.

Estimation pipeline	Mean bias	Mean absolute bias	Mean RMSE
Fully direct SoftBART regression	0.32	0.51	1.07
Gaussian residual model	0.78	0.82	1.05
Gaussian residual model with SoftBART estimation of $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$	0.15	0.37	0.64
$t$ residual model	0.72	0.80	1.01
$t$ residual model with SoftBART estimation of $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$	0.15	0.37	0.64
Empirical residual model	0.24	0.42	0.66
Empirical residual model with SoftBART estimation of $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$	0.08	0.31	0.62

Across the six designs, the empirical residual working model is the most reliable default among the three distributional choices. The results also make clear that estimating the density ratio directly with SoftBART outperforms plugging in the estimate of $f$: regardless of the distributional choice, direct modeling of the density ratio reduces both bias and RMSE, with the largest gains under the Gaussian and $t$ residual models. Note, however, that the approach that does not model the exposure density at all, and instead estimates all nuisance functions through regression, performs poorly, with the highest RMSE in Table 1. This highlights that taking a ratio of two estimated nuisance functions leads to additional, undesirable instability. Overall, the simulation results support a concrete practical conclusion: for the one-step estimator considered here, finite-sample performance is determined primarily by accurate estimation of the density ratio $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$, while the remaining choice among the Gaussian, $t$, and empirical residual families is secondary once that component is well estimated.
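Summaries of the kind reported in Table 1 can be computed as in the sketch below. The aggregation rule is an assumption: bias and RMSE are computed within each design across repetitions and then averaged over designs, with the absolute value taken of each per-design bias. The function name and the toy numbers are hypothetical and are not the simulation output.

```python
import numpy as np


def summarize(estimates, truths):
    """estimates: (designs, repetitions); truths: one true value per design."""
    estimates = np.asarray(estimates, dtype=float)
    truths = np.asarray(truths, dtype=float)
    errors = estimates - truths[:, None]
    bias = errors.mean(axis=1)                    # signed bias per design
    rmse = np.sqrt((errors ** 2).mean(axis=1))    # RMSE per design
    return {
        "mean_bias": float(bias.mean()),
        "mean_abs_bias": float(np.abs(bias).mean()),
        "mean_rmse": float(rmse.mean()),
    }


# Toy example: two designs with two repetitions each.
summary = summarize([[1.0, 3.0], [0.0, -2.0]], [1.0, 0.0])
```

In the toy example the per-design biases are $+1$ and $-1$, so the mean signed bias is zero while the mean absolute bias is one, which shows why both columns are worth reporting.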

7 Application: Assessing Health Impacts of PM_2.5 Component Mixtures

We now evaluate an important public health question regarding the health impacts of long-term exposure to fine particulate matter (PM_2.5) and its complex chemical mixtures. Specifically, we constructed a county-level dataset across the United States for the year 2019 to estimate the exposure-response relationship between PM_2.5 constituents and age-adjusted hospitalization rates per 10,000 people for Chronic Obstructive Pulmonary Disease (COPD). We obtained health records from the CDC Environmental Public Health Tracking Network (EPHTN) and CDC WONDER. High-resolution estimates of PM_2.5 mass and its chemical constituents were derived from the Atmospheric Composition Analysis Group (Van Donkelaar et al., 2019). To adjust for potential confounding, we integrated a broad set of sociodemographic, behavioral, and clinical covariates compiled from the U.S. Census Bureau and CDC surveillance systems. Our analysis focuses on a $q=5$ dimensional PM_2.5 component mixture: black carbon (BC), nitrates (NO₃), organic matter (OM), sulfates (SO₄), and ammonium (NH₄). All reported curves use the cross-fitted one-step estimator from Section 3 together with the empirical residual model shown to perform best in the simulation studies.

We study three types of intervention paths, corresponding to single exposure shifts, shifts of groups of exposures, and the optimal shift in terms of reducing overall hospitalization rates. For the single exposure shifts, we utilize shifts of the form $\boldsymbol{\delta}_{j}=(0,\dots,t_{j},\dots,0)$ when studying exposure $j$. The magnitude $t_{j}$ is chosen to satisfy the corresponding Gelbrich constraint. These paths are useful for comparing feasible directions on the intervention manifold, though they do not necessarily isolate single-pollutant causal effects because the tilted law remains multivariate and continues to alter the entire mixture distribution, not just exposure $j$. We take a different approach for studying groups of exposures, where we consider three groups: 1) BC+OM, 2) NO₃+SO₄+NH₄, and 3) the full five-pollutant mixture. We use a numerical algorithm to find the $\boldsymbol{\delta}$ value that shifts the means of each of the exposures within a group while holding the means of the other exposures constant. We do not apply this idea to the single pollutant shifts, because finding such a shift is not feasible in many cases due to the high correlations among the exposures. Third, for the optimal policy we run the Riemannian BFGS algorithm from 100 independent random starting values, since the optimization problem is non-convex and global optimality is not guaranteed.
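For readers unfamiliar with the Gelbrich distance that indexes these paths, the sketch below computes its standard form from two mean vectors and covariance matrices, $\sqrt{\|\boldsymbol{m}_{1}-\boldsymbol{m}_{2}\|^{2}+\mathrm{tr}\{\boldsymbol{\Sigma}_{1}+\boldsymbol{\Sigma}_{2}-2(\boldsymbol{\Sigma}_{1}^{1/2}\boldsymbol{\Sigma}_{2}\boldsymbol{\Sigma}_{1}^{1/2})^{1/2}\}}$, which coincides with the 2-Wasserstein distance when both laws are Gaussian. How the constraint is imposed on the tilted law is defined earlier in the paper and is not reproduced here; the function names and inputs are illustrative assumptions.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semi-definite matrix."""
    sym = 0.5 * (mat + mat.T)  # guard against small asymmetries
    vals, vecs = np.linalg.eigh(sym)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def gelbrich_distance(mean1, cov1, mean2, cov2):
    """Gelbrich distance between two (mean, covariance) pairs."""
    mean1 = np.asarray(mean1, dtype=float)
    mean2 = np.asarray(mean2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    root1 = psd_sqrt(cov1)
    cross = psd_sqrt(root1 @ cov2 @ root1)
    squared = np.sum((mean1 - mean2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross)
    return float(np.sqrt(max(squared, 0.0)))


# Gaussian illustration: a tilt by delta shifts the mean by cov @ delta and
# leaves the covariance unchanged, so the distance is the norm of that shift.
mean = np.zeros(2)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
delta = np.array([-0.3, 0.0])
dist_shift = gelbrich_distance(mean, cov, mean + cov @ delta, cov)

# Pure scale change: identity versus 4 * identity in two dimensions.
dist_scale = gelbrich_distance(mean, np.eye(2), mean, 4.0 * np.eye(2))
```

In the Gaussian illustration, fixing the distance therefore fixes the length of $\boldsymbol{\Sigma}\boldsymbol{\delta}$, which gives one way to see why the magnitude $t_{j}$ must differ across exposures.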

Figure 2 shows the main empirical results. Among the five single pollutant analyses, all estimated curves are negative, indicating a harmful effect of the air pollution mixture. The steepest declines are for SO₄ and NH₄, followed by NO₃ and OM, with BC producing a smaller but still negative curve. Because the single exposure shifts may also be shifting other exposures due to correlation among them, we interpret these as rankings of feasible intervention directions rather than pollutant-specific dose-response effects. Arguably, the grouped exposure analyses therefore give the more interesting and informative results. The BC+OM path remains close to zero throughout the displayed Gelbrich range, indicating that these two exposures do not strongly impact COPD hospitalizations. The NO₃+SO₄+NH₄ group and the all exposures group, however, show substantially more pronounced effects, suggesting that these three exposures are likely driving the adverse health effects seen.

As expected, the BFGS path has the largest estimated effect, showing the biggest reductions in COPD hospitalizations possible at each level of the Gelbrich constraint. To provide more intuition about the results from the optimal policy shift, Figure 3 shows the distribution of the causal effect across the 100 different starting values, and Figure 4 shows the distribution of the $\boldsymbol{\delta}$ values for each exposure in the optimal exposure shift across the different starting values. There is some heterogeneity across starting values in both figures, though certain coherent patterns do emerge. The exposures with the most negative tilts assigned to them are SO₄ and NO₃, highlighting that they are potentially the most impactful of the exposures in the air pollution mixture.

Figure 2 also summarizes the sensitivity of our results to unmeasured confounding bias. The shaded ribbons correspond to two fixed sensitivity parameter settings, while the dashed blue curves implement the formal benchmark procedure described in Appendix E.7. Intuitively, the benchmark multiplier $k_{Y}$ scales how much residual outcome variation an omitted confounder may explain relative to the strongest observed covariate, and $k_{D}$ scales how much additional RR variation it may explain relative to the strongest observed covariate for a given intervention path. In our data, the strongest outcome-side benchmark is the covariate White, whereas the RR-side benchmark is selected separately for each scenario. Under the benchmark $k_{Y}=1$ and $k_{D}=1$, the BC, OM, SO₄, NH₄, NO₃+SO₄+NH₄, all-pollutant, and BFGS curves remain significantly negative throughout the displayed Gelbrich range, indicating only moderate sensitivity to unmeasured confounding. The NO₃ path is more sensitive at the smallest Gelbrich constraints, though it becomes more robust for larger Gelbrich distances. Figure 5 assesses, for each exposure (or group of exposures) examined, how large the sensitivity parameters would have to become to make the result insignificant at some Gelbrich distance. We refer to this as the least favorable point because it is the level of confounding required to make the result insignificant at any one Gelbrich distance, not at all distances simultaneously. The NO₃ contour lies closest to the origin, indicating that comparatively small confounding would suffice to remove significance for at least one value of the Gelbrich constraint. At the other end, SO₄, the all-pollutant equal-mean path, and the BFGS path lie farthest from the origin, so they require materially stronger confounding to overturn the estimated negative effects. BC, OM, NH₄, and the NO₃+SO₄+NH₄ group fall between these extremes, still showing a modest degree of robustness to confounding.

8 Discussion

In this manuscript we developed methodology for estimating the health effects of multiple air pollutants simultaneously in a way that is robust to the presence of severe positivity violations. By examining stochastic interventions with tilted exposure distributions, we can study which exposures are most harmful without relying on model-based extrapolation. One critical issue in the multivariate setting is how to define a fair shift that corresponds to similar shifts in the exposure distribution, which we do via the 2-Wasserstein distance. We provide asymptotic theory and minimax estimation rates for our proposed estimands, and show in a national study of the health effects of air pollution that there are detrimental effects of the air pollution mixture, but that these are largely driven by nitrates and sulfates.

There are a number of directions for future work that could expand upon, and improve, the methodology presented here. For one, our estimators are applicable to any stochastic shift estimand, and future research could target shifts other than the exponentially tilted ones considered here that maintain public health relevance and could potentially be more interpretable for practitioners. Additionally, one could expand on the sensitivity analyses developed here by incorporating recent results on sensitivity analysis for multiple exposures (Zheng et al., 2021). These incorporate moderate parametric assumptions in the multiple exposure setting and allow one to produce partial identification regions that could be tighter than those seen here, and allow one to incorporate additional assumptions or sources of information, such as negative control variables. Overall, we believe the proposed framework provides analysts, particularly those involved in the analysis of air pollution mixtures, robust approaches to estimating causal effects of multivariate, continuous exposures.

References

J. Antonelli, M. Mazumdar, D. Bellinger, D. Christiani, R. Wright, and B. Coull (2020) Estimating the health effects of environmental mixtures using Bayesian semiparametric regression and sparsity inducing priors. The Annals of Applied Statistics. Cited by: §2.5.2.
J. Antonelli and C. Zigler (2024) Causal analysis of air pollution mixtures: estimands, positivity, and extrapolation. American Journal of Epidemiology 193 (10), pp. 1392–1398. Cited by: §1.
N. Biswas and L. Mackey (2024) Bounding Wasserstein distance with couplings. Journal of the American Statistical Association 119 (548), pp. 2947–2958. Cited by: §2.4.
J. Blanchet and K. Murthy (2019) Quantifying distributional model risk via optimal transport. Mathematics of Operations Research 44 (2), pp. 565–600. Cited by: §2.4.
J. F. Bobb, L. Valeri, B. Claus Henn, D. C. Christiani, R. O. Wright, M. Mazumdar, J. J. Godleski, and B. A. Coull (2015) Causal inference for mixtures. Statistical science 30 (4), pp. 514–530. Cited by: §1, §2.5.2.
C. Bruns and N. Kallus (2024) Local effects of continuous instruments without positivity. arXiv preprint arXiv:2403.06450. Cited by: §1.
V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018) Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21 (1), pp. C1–C68. Cited by: §3.1.1.
V. Chernozhukov, C. Cinelli, W. Newey, A. Sharma, and V. Syrgkanis (2022) Long story short: omitted variable bias in causal machine learning. Technical report National Bureau of Economic Research. Cited by: §E.7, Appendix E, §5.1, §5.
C. Cinelli and C. Hazlett (2020) Making sense of sensitivity: extending omitted variable bias. Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (1), pp. 39–67. Cited by: §5.2.
I. Díaz and N. S. Hejazi (2020) Causal mediation analysis for stochastic interventions. Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (3), pp. 661–683. Cited by: §1, §2.2.
I. Díaz and M. J. van der Laan (2012) Population intervention causal effects based on stochastic interventions. Biometrics 68 (2), pp. 541–549. Cited by: §1.
I. Díaz, N. Williams, K. L. Hoffman, and E. J. Schenck (2023) Nonparametric causal effects based on longitudinal modified treatment policies. Journal of the American Statistical Association 118 (542), pp. 846–857. Cited by: §1.
J. Dorn and J. Guo (2024) Nonparametric estimation of local treatment effects with continuous instruments. Journal of Business & Economic Statistics, pp. 1–14. Cited by: §1.
J. C. Duchi and H. Namkoong (2021) Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics 49 (3), pp. 1378–1406. Cited by: §2.4.
F. Ferrari and D. B. Dunson (2020) Identifying main effects and interactions among exposures using Gaussian processes. The Annals of Applied Statistics 14 (4), pp. 1743. Cited by: §2.5.2.
R. Gao and A. Kleywegt (2023) Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research 48 (2), pp. 603–655. Cited by: §2.4.
M. Gelbrich (1990) On a formula for the $L^{2}$ Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten 147 (1), pp. 185–203. Cited by: §2.4.
A. Hakobyan and I. Yang (2024) Wasserstein distributionally robust control of partially observable linear stochastic systems. IEEE Transactions on Automatic Control 69 (9), pp. 6121–6136. Cited by: §2.4.
S. Haneuse and A. Rotnitzky (2013) Estimation of the effect of interventions that modify the received treatment. Statistics in medicine 32 (30), pp. 5260–5277. Cited by: §1.
W. Huang, P. Absil, and K. A. Gallivan (2018) A Riemannian BFGS method without differentiated retraction for nonconvex optimization problems. SIAM Journal on Optimization 28 (1), pp. 470–495. Cited by: Appendix B, Appendix B, Appendix B, Appendix B, §3.2, §3.2.
W. Huang, K. A. Gallivan, and P. Absil (2015) A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM Journal on Optimization 25 (3), pp. 1660–1685. Cited by: Appendix B, Appendix B, Appendix B, Appendix B, Appendix B, §3.2.
N. Kallus and E. Mbougwe (2024) Stochastic interventions, sensitivity analysis, and optimal transport. The Annals of Statistics 52 (2), pp. 522–545. Cited by: §1.
N. Kallus and M. Oprescu (2022) Doubly robust inference on causal derivative effects for continuous treatments. arXiv preprint arXiv:2203.01878. Cited by: §1.
E. H. Kennedy (2019) Non-parametric causal effects based on incremental propensity score interventions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81 (4), pp. 719–742. Cited by: §1, §2.2.
E. H. Kennedy (2024) Semiparametric doubly robust targeted double machine learning: a review. Handbook of statistical methods for precision medicine, pp. 207–236. Cited by: Appendix D.
D. Kuhn, P. M. Esfahani, V. A. Nguyen, and S. Shafieezadeh-Abadeh (2019) Wasserstein distributionally robust optimization: theory and applications in machine learning. In Operations research & management science in the age of analytics, pp. 130–166. Cited by: §2.4.
A. R. Linero and Y. Yang (2018) Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society Series B: Statistical Methodology 80 (5), pp. 1087–1110. Cited by: item 2.
L. Malagò, L. Montrucchio, and G. Pistone (2018) Wasserstein Riemannian geometry of Gaussian densities. Information Geometry 1 (2), pp. 137–179. Cited by: Appendix B.
A. McClean, Y. Li, S. Bae, M. A. McAdams-DeMarco, I. Díaz, and W. Wu (2024) Fair comparisons of causal parameters with many treatments and positivity violations. arXiv preprint arXiv:2410.13522. Cited by: §1, §2.4.
D. B. McCoy, A. E. Hubbard, A. Schuler, and M. J. van der Laan (2023) Semiparametric discovery and estimation of interaction in mixed exposures using stochastic interventions. arXiv preprint arXiv:2305.01849. Cited by: §2.5.2.
P. Mohajerin Esfahani and D. Kuhn (2018) Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming 171 (1), pp. 115–166. Cited by: §2.4.
V. A. Nguyen, S. Shafiee, D. Filipović, and D. Kuhn (2021) Mean-covariance robust risk measurement. arXiv preprint arXiv:2112.09959. Cited by: §2.4.
V. M. Panaretos and Y. Zemel (2019) Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application 6 (1), pp. 405–431. Cited by: §2.4.
V. M. Panaretos and Y. Zemel (2020) An invitation to statistics in Wasserstein space. Springer Nature. Cited by: §2.4.
T. P. Papp and C. Sherlock (2024) Scalable couplings for the random walk metropolis algorithm. Journal of the Royal Statistical Society Series B: Statistical Methodology, pp. qkae113. Cited by: §2.4.
A. Peters, R. Lall, and F. Dominici (2012) Causal inference for observed sudden changes in the composition of a multi-pollutant mixture: application to the effects of the Utah Valley steel mill closure. Epidemiology (Cambridge, Mass.) 23 (4), pp. 559. Cited by: §1.
N. Pfister and P. Bühlmann (2021) Extrapolation-aware nonparametric statistical inference. Journal of the Royal Statistical Society Series B: Statistical Methodology 83 (5), pp. 915–941. Cited by: §1.
C. Qi, K. A. Gallivan, and P. Absil (2010) Riemannian BFGS algorithm with applications. In Recent Advances in Optimization and its Applications in Engineering: The 14th Belgian-French-German Conference on Optimization, pp. 183–192. Cited by: §3.2.
T. S. Richardson and J. M. Robins (2013) Single world intervention graphs (SWIGs): a unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Series. Working Paper 128. Cited by: §1.
W. Ring and B. Wirth (2012) Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization 22 (2), pp. 596–627. Cited by: Appendix B, Appendix B, §3.2.
J. M. Robins, M. A. Hernán, and U. Siebert (2004) Effects of multiple interventions. In Comparative quantification of health risks: Global and regional burden of disease attributable to selected major risk factors, Vol. 1, pp. 2191–2230. Cited by: §1.
D. Rothenhäusler and B. Yu (2021) Incremental causal effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 83 (3), pp. 578–605. Cited by: §1.
D. B. Rubin (1980) Randomization analysis of experimental data: the Fisher randomization test comment. Journal of the American Statistical Association 75 (371), pp. 591–593. Cited by: §2.1.
K. E. Rudolph, N. S. Hejazi, and M. J. van der Laan (2024) Propensity score weighting across counterfactual worlds: longitudinal effects under positivity violations. Statistical Methods in Medical Research 33 (1), pp. 137–154. Cited by: §1.
K. E. Rudolph, S. Inose, N. Williams, I. Diaz, L. Calderon, J. M. Torres, and M. Kioumourtzoglou (2025) Everything all at once: on choosing an estimand for multi-component environmental exposures. arXiv preprint arXiv:2509.17960. Cited by: §1.
S. Samanta and J. Antonelli (2022) Estimation and false discovery control for the analysis of environmental mixtures. Biostatistics 23 (4), pp. 1039–1055. Cited by: §2.5.2.
K. Schindl, S. Shen, and E. H. Kennedy (2024) Incremental effects for continuous exposures. arXiv preprint arXiv:2409.11967. Cited by: §1, §2.2, §2.2, §2.3, §3.1.2, §3.1.2, §4.
K. Schindl and L. Wasserman (2025) Causal geodesy: counterfactual estimation along the path between correlation and causation. arXiv preprint arXiv:2508.08499. Cited by: §1.
S. L. Taubman, J. M. Robins, M. A. Mittleman, and M. A. Hernán (2009) Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. International journal of epidemiology 38 (6), pp. 1599–1611. Cited by: §1.
M. J. van der Laan (2017) A generally efficient targeted minimum loss based estimator based on the highly adaptive lasso. The International Journal of Biostatistics 13 (2), pp. 20160075. External Links: Document Cited by: §4.2.
A. Van Donkelaar, R. V. Martin, C. Li, and R. T. Burnett (2019) Regional estimates of chemical composition of fine particulate matter using a combined geoscience-statistical method with information from satellites, models, and monitors. Environmental science & technology 53 (5), pp. 2595–2611. Cited by: §7.
R. Wei, B. J. Reich, J. A. Hoppin, and S. Ghosal (2020) Sparse Bayesian additive nonparametric regression with application to health effects of pesticides mixtures. Statistica Sinica 30 (1), pp. 55–79. Cited by: §2.5.2.
Q. Ye, G. A. Hanasusanto, and W. Xie (2024) Distributionally fair stochastic optimization using Wasserstein distance. arXiv preprint arXiv:2402.01872. Cited by: §2.4.
J. G. Young, M. A. Hernán, and J. M. Robins (2011) Identification, estimation and approximation of risk under interventions that depend on the natural value of treatment using observational data. In Statistical models and causal inference: a dialogue with the social sciences, pp. 103–128. Cited by: §1.
Y. Zhang, P. Han, X. Wu, and G. Diao (2023) Nonparametric inference on dose-response curves without the positivity condition. Biometrika 110 (1), pp. 219–236. Cited by: §1.
J. Zheng, A. D’Amour, and A. Franks (2021) Bayesian inference and partial identification in multi-treatment causal inference with unobserved confounding. arXiv preprint arXiv:2111.07973. Cited by: §8.
W. Zheng and M. J. van der Laan (2011) Cross-validated targeted minimum-loss-based estimation. In Targeted Learning, pp. 459–474. Cited by: §3.1.1.

Appendix A Neyman-orthogonality and robustness to misspecified outcome model

Setup.

Let $\mu(\boldsymbol{x},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}]$ denote the outcome regression, let $f(\boldsymbol{w}\mid\boldsymbol{x})$ be the observed exposure density, and let $g(\boldsymbol{w}\mid\boldsymbol{x})$ be the tilted density with density ratio $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=g(\boldsymbol{w}\mid\boldsymbol{x})/f(\boldsymbol{w}\mid\boldsymbol{x})$ . We write $\mathbb{E}_{f}[\cdot\mid\boldsymbol{X}]$ and $\mathbb{E}_{g}[\cdot\mid\boldsymbol{X}]$ for conditional expectations with respect to $f(\cdot\mid\boldsymbol{X})$ and $g(\cdot\mid\boldsymbol{X})$ , respectively. Assume standard regularity conditions. Consider the efficient influence function for a scalar parameter $\psi$ :

\varphi(\boldsymbol{Z};\psi,\mu,r):=r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\Big\{Y-\mathbb{E}_{g}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\Big\}+\mathbb{E}_{g}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]-\psi.

We utilize the identities:

\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=1,\qquad\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]=\mathbb{E}_{g}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}].

(a) Orthogonality with respect to $\mu$ .

Fix $r$ and perturb $\mu$ along a path $\mu_{\varepsilon}=\mu+\varepsilon h$ . Since $\mathbb{E}[rY]$ and $\psi$ do not depend on $\varepsilon$ , we have:

	$\displaystyle\frac{d}{d\varepsilon}\,\mathbb{E}\{\varphi(\boldsymbol{Z};\psi,\mu_{\varepsilon},r)\}\Big|_{\varepsilon=0}$	$\displaystyle=\mathbb{E}\!\left[-r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\,\mathbb{E}_{g}[h(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]+\mathbb{E}_{g}[h(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\right]$
		$\displaystyle=\mathbb{E}\!\left[\big\{1-\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]\big\}\,\mathbb{E}_{g}[h(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\right]$
		$\displaystyle=0.$

Hence, the influence function is Neyman-orthogonal with respect to $\mu$ .

(b) Sensitivity in $r$ .

Fix $\mu$ and perturb $r$ along a normalized path $r_{\varepsilon}=r(1+\varepsilon v)$ , where $v(\boldsymbol{w},\boldsymbol{x})$ is a measurable function satisfying the constraint $\mathbb{E}[r_{\varepsilon}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=1$ for all $\varepsilon$ . Differentiating this constraint at $\varepsilon=0$ yields $\mathbb{E}_{f}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})v(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=0$ , which is equivalent to $\mathbb{E}_{g}[v(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=0$ .

Let $m_{\varepsilon}(\boldsymbol{X}):=\mathbb{E}_{g_{\varepsilon}}[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]$ , where $g_{\varepsilon}$ is the tilted density corresponding to $r_{\varepsilon}$ . Since $g_{\varepsilon}(\boldsymbol{w}\mid\boldsymbol{x})=r_{\varepsilon}(\boldsymbol{w},\boldsymbol{x})f(\boldsymbol{w}\mid\boldsymbol{x})$ , we have:

\frac{d}{d\varepsilon}m_{\varepsilon}(\boldsymbol{X})\Big|_{\varepsilon=0}=\mathbb{E}_{g}[v(\boldsymbol{W},\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}].

Let $m^{\prime}(\boldsymbol{X}):=\mathbb{E}_{g}[v\mu\mid\boldsymbol{X}]$ . The influence function along the path is:

\varphi(\boldsymbol{Z};\psi,\mu,r_{\varepsilon})=r_{\varepsilon}(\boldsymbol{W},\boldsymbol{X})\{Y-m_{\varepsilon}(\boldsymbol{X})\}+m_{\varepsilon}(\boldsymbol{X})-\psi.

Differentiating the expected influence function yields:

	$\displaystyle\frac{d}{d\varepsilon}\,\mathbb{E}\{\varphi(\boldsymbol{Z};\psi,\mu,r_{\varepsilon})\}\Big|_{\varepsilon=0}$	$\displaystyle=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})v(\boldsymbol{W},\boldsymbol{X})\{Y-m_{0}(\boldsymbol{X})\}\big]+\mathbb{E}\big[(1-r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X}))\,m^{\prime}(\boldsymbol{X})\big]$
		$\displaystyle=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})v(\boldsymbol{W},\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\big]$
		$\displaystyle\quad-\mathbb{E}\big[m_{0}(\boldsymbol{X})\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})v(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]\big]+\mathbb{E}\big[(1-r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X}))\,m^{\prime}(\boldsymbol{X})\big]$
		$\displaystyle=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})v(\boldsymbol{W},\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\big]-\underbrace{\mathbb{E}[m_{0}(\boldsymbol{X})\cdot 0]}_{=\ 0}+\underbrace{\mathbb{E}\big[(1-r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X}))\,m^{\prime}(\boldsymbol{X})\big]}_{=\ 0}$
		$\displaystyle=\mathbb{E}\{\mathbb{E}_{g}[v(\boldsymbol{W},\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\}.$

In general, $\mathbb{E}\{\mathbb{E}_{g}[v\mu\mid\boldsymbol{X}]\}\neq 0$ . Therefore, the derivative does not vanish, implying that the influence function is not Neyman-orthogonal in the $r$ direction unless additional restrictive constraints are imposed on the perturbation $v$ .

This non-orthogonality with respect to $r$ extends directly to the exposure density $f$ . For a fixed target density $g$ , any perturbation in $f$ induces a corresponding change in the density ratio $r=g/f$ . Consider a perturbation path $f_{\varepsilon}=f(1+\varepsilon S)$ , where $S(\boldsymbol{w}\mid\boldsymbol{x})$ is a standard score function satisfying $\mathbb{E}_{f}[S(\boldsymbol{W}\mid\boldsymbol{X})\mid\boldsymbol{X}]=0$ . Specifically:

r_{\varepsilon}(\boldsymbol{w},\boldsymbol{x})=\frac{g(\boldsymbol{w}\mid\boldsymbol{x})}{f_{\varepsilon}(\boldsymbol{w}\mid\boldsymbol{x})}=\frac{g(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})(1+\varepsilon S)}=r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})(1+\varepsilon S)^{-1}.

Differentiating with respect to $\varepsilon$ at $\varepsilon=0$ :

\frac{d}{d\varepsilon}r_{\varepsilon}\Big|_{\varepsilon=0}=-r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})S(\boldsymbol{w}\mid\boldsymbol{x}).

This corresponds to the perturbation $v=-S$ in the previous derivation for $r$ . Substituting this into the derivative obtained in part (b), we get:

	$\displaystyle\frac{d}{d\varepsilon}\,\mathbb{E}\{\varphi(\boldsymbol{Z};\psi,\mu,r_{\varepsilon})\}\Big|_{\varepsilon=0}$	$\displaystyle=\mathbb{E}\big\{\mathbb{E}_{g}[-S(\boldsymbol{W}\mid\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\big\}$
		$\displaystyle=-\mathbb{E}\big\{\mathbb{E}_{g}[S(\boldsymbol{W}\mid\boldsymbol{X})\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}]\big\}.$

Since $\mathbb{E}_{g}[S\mu\mid\boldsymbol{X}]$ is generally non-zero (unless $\mu$ is constant or $S$ is orthogonal to $\mu$ under $g$ ), the derivative does not vanish. Thus, the influence function is not Neyman-orthogonal with respect to the exposure density $f$ .

Robustness to misspecified outcome model.

Let $\mu_{0}$ be the true outcome regression and $r$ be the true density ratio. Define the target parameter:

\psi_{0}:=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\,\mu_{0}(\boldsymbol{X},\boldsymbol{W})\big]=\mathbb{E}\Big[\mathbb{E}_{f}\{r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mu_{0}(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}\}\Big],

which represents the average outcome under the tilted distribution $g$ . We now show that at $(\psi_{0},r)$ , the influence function is globally robust to $\mu$ . That is, for any measurable $\tilde{\mu}$ :

\mathbb{E}\big[\varphi(\boldsymbol{Z};\psi_{0},\tilde{\mu},r)\big]=0.

Consequently, estimators based on the efficient influence function $\varphi$ , such as the one-step estimator employed in this work, remain consistent for $\psi_{0}$ even under misspecification of the outcome regression model, provided the density ratio $r_{\boldsymbol{\delta}}$ is estimated consistently.

Proof.

Fix any measurable function $\tilde{\mu}$ and define:

\tilde{m}(\boldsymbol{X}):=\mathbb{E}_{f}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\tilde{\mu}(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}\big]=\mathbb{E}_{g}[\tilde{\mu}(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}].

Since $\tilde{m}$ depends only on $\boldsymbol{X}$ and $\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=1$ , we have:

\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\,\tilde{m}(\boldsymbol{X})\big]=\mathbb{E}\Big[\tilde{m}(\boldsymbol{X})\,\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]\Big]=\mathbb{E}\big[\tilde{m}(\boldsymbol{X})\big].

Thus, at $(\psi_{0},r)$ :

	$\displaystyle\mathbb{E}\big[\varphi(\boldsymbol{Z};\psi_{0},\tilde{\mu},r)\big]$	$\displaystyle=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-\tilde{m}(\boldsymbol{X})\}\big]+\mathbb{E}[\tilde{m}(\boldsymbol{X})]-\psi_{0}$
		$\displaystyle=\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})Y]-\psi_{0}.$

By the law of iterated expectations:

\mathbb{E}[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})Y]=\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\,\mu_{0}(\boldsymbol{X},\boldsymbol{W})\big]=\psi_{0}.

Therefore, $\mathbb{E}[\varphi(\boldsymbol{Z};\psi_{0},\tilde{\mu},r)]=0$ for all $\tilde{\mu}$ . This confirms that consistent estimation of $\psi_{0}$ relies primarily on the consistency of $\widehat{r}_{\boldsymbol{\delta}}$ , rendering the estimator robust to misspecification of $\mu$ . ∎
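The robustness property can be illustrated numerically. The following is a minimal sketch under a toy model that is our own assumption and not the paper's data: a scalar covariate $X\sim N(0,1)$, a scalar exposure $W\mid X\sim N(X,1)$, and true outcome regression $\mu_{0}(x,w)=w+x$. For this Gaussian exposure the exponential tilt $e^{\delta w}$ gives $g=N(X+\delta,1)$, so the true density ratio is known in closed form and $\psi_{0}=\delta$. The function name `simulate_eif_mean` and the deliberately wrong model $\tilde{\mu}(x,w)=w^{2}$ are hypothetical choices made only for illustration.

```python
import numpy as np

def simulate_eif_mean(delta=0.5, n=200_000, seed=0):
    """Monte Carlo mean of r*(Y - m_tilde) + m_tilde under a wrong outcome model."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)              # covariate X ~ N(0, 1)
    w = x + rng.normal(size=n)          # exposure W | X ~ N(X, 1)
    y = w + x + rng.normal(size=n)      # true mu_0(x, w) = w + x

    # Tilting N(x, 1) by exp(delta * w) gives N(x + delta, 1), so the true
    # density ratio g/f is available in closed form for this toy model.
    r = np.exp(delta * (w - x) - 0.5 * delta**2)

    # Deliberately wrong outcome model mu_tilde(x, w) = w^2. Its conditional
    # mean under g = N(x + delta, 1) is (x + delta)^2 + 1.
    m_tilde = (x + delta) ** 2 + 1.0

    one_step = np.mean(r * (y - m_tilde) + m_tilde)  # EIF-based estimate
    plug_in = np.mean(m_tilde)                       # uses the wrong model only
    return one_step, plug_in

delta = 0.5
one_step, plug_in = simulate_eif_mean(delta)
# In this toy model psi_0 = E[(X + delta) + X] = delta.
print(f"psi_0 = {delta:.3f}, one-step = {one_step:.3f}, plug-in = {plug_in:.3f}")
```

With the true density ratio, the one-step average should land close to $\psi_{0}$ even though $\tilde{\mu}$ is badly wrong, whereas the plug-in average of $\tilde{m}$ alone does not.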

Appendix B Validity for Riemannian BFGS on Gelbrich Constraint for exponential tilting

The convergence theory of Riemannian optimization algorithms, including Riemannian BFGS with Wolfe line search, is established in, e.g., Ring and Wirth (2012); Huang et al. (2015, 2018). We consider the deterministic level–set constraint

\mathcal{M}=\{\boldsymbol{\delta}\in\mathbb{R}^{d}:\,G(\boldsymbol{\delta})=c^{2}\},

where $G(\boldsymbol{\delta})$ is the squared Gelbrich distance between the marginal baseline distribution of $W$ and the marginal tilted distribution of $W$ induced by $\boldsymbol{\delta}$ . The manifold $\mathcal{M}$ is equipped with the Riemannian metric induced by the Euclidean inner product on $\mathbb{R}^{d}$ . Global convergence of cautious Riemannian BFGS in the sense that $\liminf_{k\to\infty}\|\operatorname{grad}\psi(\boldsymbol{\delta}_{k})\|=0$ follows from Huang et al. (2018, Theorem 4.2) under a compact level-set condition and Lipschitz continuous differentiability with respect to the chosen transport. We establish these properties for the marginal Gelbrich constraint by showing that the level set is a smooth embedded hypersurface, that a $C^{2}$ retraction and a continuous scaled transport with the pointwise norm-preserving property used in Huang et al. (2015, 2018) are available, and that $\psi$ is regular on a neighbourhood containing the line-search trial points.

Geometry of the Gelbrich constraint.

Let $f_{W}$ denote the marginal baseline law of $W$ on $\mathbb{R}^{d}$ , and let

\boldsymbol{\mu}=\mathbb{E}_{f_{W}}[W]\in\mathbb{R}^{d},\qquad\boldsymbol{\Sigma}=\mathrm{Cov}_{f_{W}}(W)\in\mathbb{S}_{++}^{d}.

For $\boldsymbol{\delta}\in\mathcal{D}$ (an open set on which the marginal log–moment generating function is finite), define

M(\boldsymbol{\delta})=\mathbb{E}_{f_{W}}\!\left[e^{\boldsymbol{\delta}^{\top}W}\right],\qquad g_{\boldsymbol{\delta}}^{\mathrm{marg}}(w)=\exp\!\big(\boldsymbol{\delta}^{\top}w-\log M(\boldsymbol{\delta})\big)\,f_{W}(w).

Consequently, let $\boldsymbol{\mu}_{\boldsymbol{\delta}}=\mathbb{E}_{g_{\boldsymbol{\delta}}^{\mathrm{marg}}}[W]$ and $\boldsymbol{\Sigma}_{\boldsymbol{\delta}}=\mathrm{Cov}_{g_{\boldsymbol{\delta}}^{\mathrm{marg}}}(W)$ . Equivalently,

\boldsymbol{\mu}_{\boldsymbol{\delta}}=\nabla_{\boldsymbol{\delta}}\log M(\boldsymbol{\delta})\in\mathbb{R}^{d},\qquad\boldsymbol{\Sigma}_{\boldsymbol{\delta}}=\nabla_{\boldsymbol{\delta}}^{2}\log M(\boldsymbol{\delta})\in\mathbb{S}_{++}^{d}.
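As a sanity check on these cumulant identities, the sketch below replaces $f_{W}$ by the empirical distribution of a simulated sample (an assumption made purely for illustration) and compares the reweighted mean and covariance with central finite differences of the empirical log-moment generating function. All function names are hypothetical helpers, not part of the paper's software.

```python
import numpy as np

def log_mgf(delta, w):
    """Empirical log-MGF: log of the sample mean of exp(delta^T w_i)."""
    s = w @ delta
    return np.log(np.mean(np.exp(s - s.max()))) + s.max()

def tilted_moments(delta, w):
    """Mean and covariance of the sample reweighted by exp(delta^T w_i)."""
    s = w @ delta
    p = np.exp(s - s.max())
    p /= p.sum()                         # normalized tilting weights
    mean = p @ w
    centered = w - mean
    cov = (centered * p[:, None]).T @ centered
    return mean, cov

def fd_gradient(delta, w, h=1e-5):
    """Central finite-difference gradient of the empirical log-MGF."""
    return np.array([
        (log_mgf(delta + h * e, w) - log_mgf(delta - h * e, w)) / (2 * h)
        for e in np.eye(delta.size)
    ])

def fd_jacobian_of_mean(delta, w, h=1e-5):
    """Central finite-difference Jacobian of the tilted mean in delta."""
    cols = [
        (tilted_moments(delta + h * e, w)[0]
         - tilted_moments(delta - h * e, w)[0]) / (2 * h)
        for e in np.eye(delta.size)
    ]
    return np.column_stack(cols)

rng = np.random.default_rng(1)
w = rng.normal(size=(500, 3)) @ np.array([[1.0, 0.3, 0.0],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 1.0]])
delta = np.array([0.3, -0.2, 0.1])
mean, cov = tilted_moments(delta, w)
print(np.max(np.abs(fd_gradient(delta, w) - mean)))
print(np.max(np.abs(fd_jacobian_of_mean(delta, w) - cov)))
```

Both printed discrepancies should be at the level of finite-difference error, which is what the identities $\boldsymbol{\mu}_{\boldsymbol{\delta}}=\nabla\log M$ and $\boldsymbol{\Sigma}_{\boldsymbol{\delta}}=\nabla^{2}\log M$ predict for any distribution with a finite moment generating function.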

The Gelbrich function is

G(\boldsymbol{\delta}):=\|\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu}\|_{2}^{2}+\mathrm{tr}\!\Big(\boldsymbol{\Sigma}+\boldsymbol{\Sigma}_{\boldsymbol{\delta}}-2\big(\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\boldsymbol{\Sigma}^{1/2}\big)^{1/2}\Big).
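For concreteness, the Gelbrich function can be evaluated directly from two mean vectors and two covariance matrices. The sketch below is a minimal NumPy implementation; the helper names `spd_sqrt` and `gelbrich_sq` are ours. As an illustrative special case (an assumption, not the paper's setting), if the baseline $f_{W}$ is Gaussian $N(\boldsymbol{\mu},\boldsymbol{\Sigma})$, the exponentially tilted law is $N(\boldsymbol{\mu}+\boldsymbol{\Sigma}\boldsymbol{\delta},\boldsymbol{\Sigma})$, so the covariance term vanishes and $G(\boldsymbol{\delta})=\|\boldsymbol{\Sigma}\boldsymbol{\delta}\|_{2}^{2}$.

```python
import numpy as np

def spd_sqrt(a):
    """Principal square root of a symmetric positive semidefinite matrix."""
    a = (a + a.T) / 2.0                       # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gelbrich_sq(mu, sigma, mu_d, sigma_d):
    """Squared Gelbrich distance between (mu, sigma) and (mu_d, sigma_d)."""
    root = spd_sqrt(sigma)
    cross = spd_sqrt(root @ sigma_d @ root)   # (Sigma^{1/2} Sigma_d Sigma^{1/2})^{1/2}
    mean_part = np.sum((mu_d - mu) ** 2)
    cov_part = np.trace(sigma + sigma_d - 2.0 * cross)
    return float(mean_part + cov_part)

# Gaussian baseline: the tilt shifts the mean by Sigma @ delta and keeps Sigma.
mu = np.array([1.0, -0.5])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
delta = np.array([0.2, 0.1])
g_val = gelbrich_sq(mu, sigma, mu + sigma @ delta, sigma)
print(g_val, np.sum((sigma @ delta) ** 2))    # the two numbers should agree
```

For non-Gaussian baselines, $\boldsymbol{\mu}_{\boldsymbol{\delta}}$ and $\boldsymbol{\Sigma}_{\boldsymbol{\delta}}$ would instead be supplied by whatever estimate of the tilted moments is in use.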

Equip $\mathbb{R}^{d}$ with the Euclidean inner product and $\mathbb{S}^{d}$ with the Frobenius inner product $\langle\boldsymbol{A},\boldsymbol{B}\rangle=\mathrm{tr}(\boldsymbol{A}^{\top}\boldsymbol{B})$ . For each $\boldsymbol{\delta}$ , define the linear map $J_{\boldsymbol{\delta}}:\mathbb{R}^{d}\to\mathbb{S}^{d}$ by

(J_{\boldsymbol{\delta}}\boldsymbol{e}_{k})_{ij}:=\frac{\partial(\boldsymbol{\Sigma}_{\boldsymbol{\delta}})_{ij}}{\partial\delta_{k}},\qquad k=1,\dots,d,

and extend by linearity so that $J_{\boldsymbol{\delta}}\boldsymbol{v}=\sum_{k=1}^{d}v_{k}J_{\boldsymbol{\delta}}\boldsymbol{e}_{k}$ . Let $J_{\boldsymbol{\delta}}^{\!*}:\mathbb{S}^{d}\to\mathbb{R}^{d}$ be the adjoint of $J_{\boldsymbol{\delta}}$ with respect to these inner products, that is,

\langle J_{\boldsymbol{\delta}}\boldsymbol{v},\boldsymbol{H}\rangle=\boldsymbol{v}^{\top}\big(J_{\boldsymbol{\delta}}^{\!*}\boldsymbol{H}\big)\qquad(\boldsymbol{v}\in\mathbb{R}^{d},\ \boldsymbol{H}\in\mathbb{S}^{d}),

equivalently, for any $\boldsymbol{v}\in\mathbb{R}^{d}$ ,

\boldsymbol{v}^{\top}\big(J_{\boldsymbol{\delta}}^{\!*}\boldsymbol{H}\big)=\langle J_{\boldsymbol{\delta}}\boldsymbol{v},\boldsymbol{H}\rangle=\mathrm{tr}\!\big((J_{\boldsymbol{\delta}}\boldsymbol{v})^{\top}\boldsymbol{H}\big)=\sum_{k=1}^{d}v_{k}\,\mathrm{tr}\!\big((J_{\boldsymbol{\delta}}\boldsymbol{e}_{k})^{\top}\boldsymbol{H}\big).

Define

\boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}}:=\boldsymbol{\Sigma}_{\boldsymbol{\delta}}^{-1/2}\big(\boldsymbol{\Sigma}_{\boldsymbol{\delta}}^{1/2}\boldsymbol{\Sigma}\,\boldsymbol{\Sigma}_{\boldsymbol{\delta}}^{1/2}\big)^{1/2}\boldsymbol{\Sigma}_{\boldsymbol{\delta}}^{-1/2}\in\mathbb{S}_{++}^{d}.

For $\boldsymbol{A}\in\mathbb{S}_{++}^{d}$ , define the Lyapunov (Sylvester) operator $\mathcal{L}_{\boldsymbol{A}}(\boldsymbol{Y})=\boldsymbol{A}\boldsymbol{Y}+\boldsymbol{Y}\boldsymbol{A}$ , and write $\mathcal{L}_{\boldsymbol{A}}^{-1}$ for its solution operator.

Lemma 2 (Exact gradient and generic regularity of Gelbrich level sets).

Assume $\log M(\boldsymbol{\delta})$ is finite on a nonempty open set $\mathcal{D}\subset\mathbb{R}^{d}$ . Then $\log M(\boldsymbol{\delta})$ is real–analytic on $\mathcal{D}$ , hence so are $\boldsymbol{\mu}_{\boldsymbol{\delta}}$ and $\boldsymbol{\Sigma}_{\boldsymbol{\delta}}$ .

(i)

$G$ is $C^{1}$ on $\mathcal{D}$ and

\nabla G(\boldsymbol{\delta})=2\,\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\big(\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu}\big)+J_{\boldsymbol{\delta}}^{\!*}\!\Big(I-\boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}}\Big),\qquad\boldsymbol{\delta}\in\mathcal{D}.

(ii)

Let $\mathcal{C}=\{\boldsymbol{\delta}\in\mathcal{D}:\nabla G(\boldsymbol{\delta})=\boldsymbol{0}\}$ and $\mathcal{V}=G(\mathcal{C})\subset[0,\infty)$ . Then $\mathcal{V}$ has Lebesgue measure zero and empty interior. In particular, for any $c>0$ with $c^{2}\notin\mathcal{V}$ , the level set

$\mathcal{M}_{c}:=\{\boldsymbol{\delta}\in\mathcal{D}:\ G(\boldsymbol{\delta})=c^{2}\}$

is a $C^{\infty}$ (indeed real-analytic) embedded hypersurface and $\nabla G(\boldsymbol{\delta})\neq\boldsymbol{0}$ for all $\boldsymbol{\delta}\in\mathcal{M}_{c}$ .

Proof.

Write $G(\boldsymbol{\delta})=G_{\mathrm{mean}}(\boldsymbol{\delta})+G_{\mathrm{cov}}(\boldsymbol{\delta})$ with

G_{\mathrm{mean}}(\boldsymbol{\delta})=\|\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu}\|_{2}^{2},\qquad G_{\mathrm{cov}}(\boldsymbol{\delta})=\Phi(\boldsymbol{\Sigma}_{\boldsymbol{\delta}}),

where

\Phi(\boldsymbol{\Theta}):=\mathrm{tr}\!\Big(\boldsymbol{\Sigma}+\boldsymbol{\Theta}-2\big(\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Theta}\boldsymbol{\Sigma}^{1/2}\big)^{1/2}\Big),\qquad\boldsymbol{\Theta}\in\mathbb{S}_{++}^{d}.

Let $\boldsymbol{v}\in\mathbb{R}^{d}$ and denote the directional derivative by $D_{\boldsymbol{v}}$ . Since $\mathrm{D}\boldsymbol{\mu}_{\boldsymbol{\delta}}[\boldsymbol{v}]=\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\boldsymbol{v}$ and $\boldsymbol{\Sigma}_{\boldsymbol{\delta}}$ is symmetric,

D_{\boldsymbol{v}}G_{\mathrm{mean}}(\boldsymbol{\delta})=2\,\langle\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu},\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\boldsymbol{v}\rangle=2\,\boldsymbol{v}^{\top}\boldsymbol{\Sigma}_{\boldsymbol{\delta}}(\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu}),

hence $\nabla G_{\mathrm{mean}}(\boldsymbol{\delta})=2\,\boldsymbol{\Sigma}_{\boldsymbol{\delta}}(\boldsymbol{\mu}_{\boldsymbol{\delta}}-\boldsymbol{\mu})$ .

For $\boldsymbol{H}\in\mathbb{S}^{d}$ , the first Fréchet derivative of $\Phi$ is given by

\mathrm{D}\Phi(\boldsymbol{\Theta})[\boldsymbol{H}]=\langle I-\boldsymbol{T}_{\boldsymbol{\Theta}\to\boldsymbol{\Sigma}},\boldsymbol{H}\rangle,

where $\boldsymbol{T}_{\boldsymbol{\Theta}\to\boldsymbol{\Sigma}}:=\boldsymbol{\Theta}^{-1/2}(\boldsymbol{\Theta}^{1/2}\boldsymbol{\Sigma}\boldsymbol{\Theta}^{1/2})^{1/2}\boldsymbol{\Theta}^{-1/2}$ . This follows from the Fréchet derivative of the principal matrix square root, $\mathrm{D}(\boldsymbol{A}^{1/2})[\boldsymbol{V}]=\mathcal{L}_{\boldsymbol{A}^{1/2}}^{-1}(\boldsymbol{V})$ (Malagò et al., 2018, Eqs. (14)–(16)). In particular, for $\boldsymbol{A}\in\mathbb{S}_{++}^{d}$ and $\boldsymbol{V}\in\mathbb{S}^{d}$ ,

\mathrm{D}\,\mathrm{tr}(\boldsymbol{A}^{1/2})[\boldsymbol{V}]=\tfrac{1}{2}\,\mathrm{tr}(\boldsymbol{A}^{-1/2}\boldsymbol{V}).

With $\boldsymbol{A}=\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Theta}\boldsymbol{\Sigma}^{1/2}$ and $\boldsymbol{V}=\boldsymbol{\Sigma}^{1/2}\boldsymbol{H}\boldsymbol{\Sigma}^{1/2}$ ,

\mathrm{D}\Phi(\boldsymbol{\Theta})[\boldsymbol{H}]=\mathrm{tr}(\boldsymbol{H})-\mathrm{tr}\!\Big(\boldsymbol{\Sigma}^{1/2}\boldsymbol{A}^{-1/2}\boldsymbol{\Sigma}^{1/2}\boldsymbol{H}\Big)=\langle I-\boldsymbol{T}_{\boldsymbol{\Theta}\to\boldsymbol{\Sigma}},\,\boldsymbol{H}\rangle.

By the chain rule and the definition of $J_{\boldsymbol{\delta}}$ ,

D_{\boldsymbol{v}}G_{\mathrm{cov}}(\boldsymbol{\delta})=\mathrm{D}\Phi(\boldsymbol{\Sigma}_{\boldsymbol{\delta}})\!\big[J_{\boldsymbol{\delta}}\boldsymbol{v}\big]=\langle\,I-\boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}}\,,J_{\boldsymbol{\delta}}\boldsymbol{v}\rangle=\boldsymbol{v}^{\top}J_{\boldsymbol{\delta}}^{\!*}\!\big(I-\boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}}\big),

which implies $\nabla G_{\mathrm{cov}}(\boldsymbol{\delta})=J_{\boldsymbol{\delta}}^{\!*}\!\big(I-\boldsymbol{T}_{\boldsymbol{\Sigma}_{\boldsymbol{\delta}}\to\boldsymbol{\Sigma}}\big)$ . Summing these terms yields the gradient in (i).

For part (ii), $\log M(\boldsymbol{\delta})$ is real–analytic on $\mathcal{D}$ , hence so are $\boldsymbol{\mu}_{\boldsymbol{\delta}}$ and $\boldsymbol{\Sigma}_{\boldsymbol{\delta}}$ , and therefore $G$ is real–analytic on $\mathcal{D}$ . In particular $G$ is $C^{\infty}$ on $\mathcal{D}$ , so Sard’s theorem yields that the set of critical values $\mathcal{V}$ has Lebesgue measure zero and hence empty interior. For any $c>0$ with $c^{2}\notin\mathcal{V}$ , every $\boldsymbol{\delta}\in\mathcal{M}_{c}$ satisfies $\nabla G(\boldsymbol{\delta})\neq\boldsymbol{0}$ ; the regular level–set theorem then yields that $\mathcal{M}_{c}$ is a $C^{\infty}$ (indeed real-analytic) embedded hypersurface. ∎
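The trace identity $\mathrm{D}\,\mathrm{tr}(\boldsymbol{A}^{1/2})[\boldsymbol{V}]=\tfrac{1}{2}\mathrm{tr}(\boldsymbol{A}^{-1/2}\boldsymbol{V})$ used in part (i) can be sanity-checked numerically. The following is a minimal sketch, not part of the proof: it is restricted to $2\times 2$ matrices, where the principal square root has a closed form via the Cayley–Hamilton identity, and the matrices `A` and `V` as well as all function names are illustrative assumptions.

```python
import math

def sqrtm_2x2(a):
    """Principal square root of a 2x2 symmetric positive-definite matrix.

    Uses S = (A + sqrt(det A) I) / tr(S) with tr(S) = sqrt(tr A + 2 sqrt(det A)),
    which follows from the Cayley-Hamilton identity for S.
    """
    (a11, a12), (a21, a22) = a
    s = math.sqrt(a11 * a22 - a12 * a21)
    t = math.sqrt(a11 + a22 + 2.0 * s)
    return [[(a11 + s) / t, a12 / t], [a21 / t, (a22 + s) / t]]

def inv_2x2(m):
    (m11, m12), (m21, m22) = m
    det = m11 * m22 - m12 * m21
    return [[m22 / det, -m12 / det], [-m21 / det, m11 / det]]

def trace_of_product(a, b):
    """tr(A B) for 2x2 matrices."""
    return sum(a[i][k] * b[k][i] for i in range(2) for k in range(2))

def trace_sqrt(a):
    s = sqrtm_2x2(a)
    return s[0][0] + s[1][1]

def shifted(a, v, h):
    """A + h V, entrywise."""
    return [[a[i][j] + h * v[i][j] for j in range(2)] for i in range(2)]

def derivative_closed_form(a, v):
    """Right-hand side of the identity: (1/2) tr(A^{-1/2} V)."""
    return 0.5 * trace_of_product(inv_2x2(sqrtm_2x2(a)), v)

def derivative_finite_difference(a, v, h=1e-6):
    """Central difference of t -> tr((A + t V)^{1/2}) at t = 0."""
    return (trace_sqrt(shifted(a, v, h)) - trace_sqrt(shifted(a, v, -h))) / (2.0 * h)

# Illustrative inputs (assumptions): A is SPD, V is symmetric.
A = [[2.0, 0.5], [0.5, 1.0]]
V = [[0.3, -0.2], [-0.2, 0.7]]
closed = derivative_closed_form(A, V)
numeric = derivative_finite_difference(A, V)
```

The two values `closed` and `numeric` should agree up to finite-difference error; the check says nothing about dimensions larger than two.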

Projection retraction and scaled vector transport.

Fix $c>0$ such that $c^{2}\notin\mathcal{V}$ and write $\mathcal{M}_{c}=\{\boldsymbol{\delta}\in\mathcal{D}:\,G(\boldsymbol{\delta})=c^{2}\}$ . For $\boldsymbol{\delta}\in\mathcal{M}_{c}$ , let $\boldsymbol{n}(\boldsymbol{\delta}):={\nabla G(\boldsymbol{\delta})}/{\|\nabla G(\boldsymbol{\delta})\|}$ be the unit normal vector and denote the orthogonal projection onto the tangent space $T_{\boldsymbol{\delta}}\mathcal{M}_{c}$ by

P_{\boldsymbol{\delta}}:=I-\boldsymbol{n}(\boldsymbol{\delta})\boldsymbol{n}(\boldsymbol{\delta})^{\top}.

Let $\Pi:\mathcal{N}\to\mathcal{M}_{c}$ be the orthogonal projection from a tubular neighbourhood $\mathcal{N}$ of $\mathcal{M}_{c}$ . For $\boldsymbol{\delta}\in\mathcal{M}_{c}$ , define the retraction $R_{\boldsymbol{\delta}}$ on a sufficiently small ball $\mathcal{B}_{\boldsymbol{\delta}}\subset T_{\boldsymbol{\delta}}\mathcal{M}_{c}$ by

R_{\boldsymbol{\delta}}(\boldsymbol{\xi}):=\Pi(\boldsymbol{\delta}+\boldsymbol{\xi}),\qquad\boldsymbol{\xi}\in\mathcal{B}_{\boldsymbol{\delta}}.

Define the differentiated-retraction transport

T^{R}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta}):=\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{\xi})[\boldsymbol{\zeta}],\qquad\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}.

Following Huang et al. (2015, Eq. (4.3)), define the scaled transport

T^{S}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta}):=\begin{cases}\dfrac{\|\boldsymbol{\zeta}\|}{\|T^{R}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta})\|}\,T^{R}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta}),&\boldsymbol{\zeta}\neq\boldsymbol{0},\\[6.0pt] \boldsymbol{0},&\boldsymbol{\zeta}=\boldsymbol{0},\end{cases}

which is norm-preserving and satisfies the requirements (2.5)–(2.8) therein. For a step $\boldsymbol{\xi}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}$ and $\boldsymbol{\delta}_{+}:=R_{\boldsymbol{\delta}}(\boldsymbol{\xi})$ , define

\mathcal{T}_{\boldsymbol{\delta}\to\boldsymbol{\delta}_{+}}(\boldsymbol{\zeta}):=T^{S}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta}),\qquad\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}.
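To make the objects above concrete, the following sketch evaluates the tangent projection, the projection retraction, the differentiated-retraction transport, and the scaled transport on a toy constraint. It assumes $G(\boldsymbol{\delta})=\|\boldsymbol{\delta}\|^{2}$ , so that $\mathcal{M}_{c}$ is the sphere of radius $c$ and $\Pi(\boldsymbol{x})=c\,\boldsymbol{x}/\|\boldsymbol{x}\|$ is available in closed form; this is not the Gelbrich constraint of the paper, for which $\Pi$ would have to be computed numerically. All function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def tangent_project(delta, z):
    """P_delta z = z - n (n . z), with n = grad G / |grad G| = delta / |delta|."""
    n = [d / norm(delta) for d in delta]
    nz = dot(n, z)
    return [zi - nz * ni for zi, ni in zip(z, n)]

def retract(delta, xi, c):
    """R_delta(xi) = Pi(delta + xi), with Pi(x) = c x / |x| on the toy sphere."""
    x = [d + e for d, e in zip(delta, xi)]
    r = norm(x)
    return [c * xj / r for xj in x]

def differentiated_retraction(delta, xi, zeta, c):
    """T^R: D R_delta(xi)[zeta] = (c / |x|) (zeta - u (u . zeta)), u = x / |x|, x = delta + xi."""
    x = [d + e for d, e in zip(delta, xi)]
    r = norm(x)
    u = [xj / r for xj in x]
    uz = dot(u, zeta)
    return [(c / r) * (zj - uz * uj) for zj, uj in zip(zeta, u)]

def scaled_transport(delta, xi, zeta, c):
    """T^S: rescale T^R so that the transported vector keeps the norm of zeta."""
    nz = norm(zeta)
    if nz == 0.0:
        return [0.0] * len(zeta)
    t = differentiated_retraction(delta, xi, zeta, c)
    return [(nz / norm(t)) * tj for tj in t]

# Illustrative point, step, and tangent vector (assumptions).
c = 2.0
delta = [2.0, 0.0, 0.0]
xi = tangent_project(delta, [0.3, 0.4, -0.1])
zeta = tangent_project(delta, [1.0, -0.5, 0.25])
delta_plus = retract(delta, xi, c)
moved = scaled_transport(delta, xi, zeta, c)
```

In this toy case `delta_plus` lies on the sphere, and `moved` is tangent at `delta_plus` with the same norm as `zeta`, which is the norm-preservation property stated for $T^{S}$ .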

Lemma 3 (Validity of transport).

The retraction $R$ and the transport $\mathcal{T}$ defined above satisfy:

(a)

there exists an open neighbourhood $\mathcal{U}\subset T\mathcal{M}_{c}$ of the zero section such that $R_{\boldsymbol{\delta}}$ and $\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}$ are well defined for $(\boldsymbol{\delta},\boldsymbol{\xi})\in\mathcal{U}$ and are $C^{0}$ in all arguments;

(b)

$R_{\boldsymbol{\delta}}(\boldsymbol{0})=\boldsymbol{\delta}$ and $\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{0})=\mathrm{Id}_{T_{\boldsymbol{\delta}}\mathcal{M}_{c}}$ , and for every $\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}$ ,

\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}(\boldsymbol{\zeta})=T^{R}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta})+O(\|\boldsymbol{\xi}\|\,\|\boldsymbol{\zeta}\|)\quad(\boldsymbol{\xi}\to\boldsymbol{0}),

and in particular $T^{R}_{\boldsymbol{\delta},\boldsymbol{0}}=\mathrm{Id}_{T_{\boldsymbol{\delta}}\mathcal{M}_{c}}$ ;

(c)

for every $\boldsymbol{\delta}\in\mathcal{M}_{c}$ and $\boldsymbol{\xi}\in\mathcal{B}_{\boldsymbol{\delta}}$ ,

\|\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}(\boldsymbol{\zeta})\|=\|\boldsymbol{\zeta}\|\qquad(\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}),

so for each fixed $(\boldsymbol{\delta},\boldsymbol{\xi})$ , the map $\boldsymbol{\zeta}\mapsto\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}(\boldsymbol{\zeta})$ is pointwise norm-preserving on $T_{\boldsymbol{\delta}}\mathcal{M}_{c}$ , and $\mathcal{T}$ is uniformly bounded on compact subsets of $\mathcal{M}_{c}$ in the sense that

\|\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}(\boldsymbol{\zeta})\|\leq\|\boldsymbol{\zeta}\|

for all admissible $(\boldsymbol{\delta},\boldsymbol{\xi},\boldsymbol{\zeta})$ .

Proof.

Property (a) follows from the tubular neighbourhood theorem and the $C^{1}$ smoothness of $\Pi$ : $R_{\boldsymbol{\delta}}(\boldsymbol{\xi})=\Pi(\boldsymbol{\delta}+\boldsymbol{\xi})$ is well defined and $C^{1}$ for $\boldsymbol{\xi}$ small, and $\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{\xi})$ is a linear map between tangent spaces for $\|\boldsymbol{\xi}\|$ sufficiently small. The scaled transport $T^{S}$ satisfies the continuity and first-order conditions invoked in Huang et al. (2015), so $\mathcal{T}$ is continuous in all arguments.

For (b), $\Pi(\boldsymbol{\delta})=\boldsymbol{\delta}$ and $\mathrm{D}\Pi(\boldsymbol{\delta})=P_{\boldsymbol{\delta}}$ , hence $R_{\boldsymbol{\delta}}(\boldsymbol{0})=\boldsymbol{\delta}$ and $\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{0})=P_{\boldsymbol{\delta}}$ . Since $P_{\boldsymbol{\delta}}$ is the identity on $T_{\boldsymbol{\delta}}\mathcal{M}_{c}$ , $\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{0})=\mathrm{Id}_{T_{\boldsymbol{\delta}}\mathcal{M}_{c}}$ . The first–order agreement between $\mathcal{T}$ and $\mathrm{D}R_{\boldsymbol{\delta}}(\boldsymbol{0})$ follows from Huang et al. (2015, (2.5)–(2.7), Lem. 3.5).

For (c), by construction,

\|T^{S}_{\boldsymbol{\delta},\boldsymbol{\xi}}(\boldsymbol{\zeta})\|=\|\boldsymbol{\zeta}\|\qquad(\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}),

so $\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}=T^{S}_{\boldsymbol{\delta},\boldsymbol{\xi}}$ is pointwise norm-preserving:

\|\mathcal{T}_{\boldsymbol{\delta}\to R_{\boldsymbol{\delta}}(\boldsymbol{\xi})}(\boldsymbol{\zeta})\|=\|\boldsymbol{\zeta}\|\qquad(\boldsymbol{\zeta}\in T_{\boldsymbol{\delta}}\mathcal{M}_{c}).

The stated uniform boundedness on compact subsets then follows with constant $1$ . ∎

Objective regularity for the global target.

For $\boldsymbol{x}\in\mathcal{X}$ and $\boldsymbol{\delta}\in U$ , define the local tilted conditional mean

\eta_{\boldsymbol{\delta}}(\boldsymbol{x}):=\mathbb{E}_{f}[\mu(\boldsymbol{x},W)e^{\boldsymbol{\delta}^{\top}W}\mid X=\boldsymbol{x}],\qquad\nu_{\boldsymbol{\delta}}(\boldsymbol{x}):=\mathbb{E}_{f}[e^{\boldsymbol{\delta}^{\top}W}\mid X=\boldsymbol{x}],

m_{\boldsymbol{\delta}}(\boldsymbol{x}):=\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{x})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})},\qquad\psi(\boldsymbol{\delta}):=\mathbb{E}\big\{m_{\boldsymbol{\delta}}(X)\big\}.
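As an illustration of these definitions, the sketch below evaluates $m_{\boldsymbol{\delta}}(\boldsymbol{x})=\eta_{\boldsymbol{\delta}}(\boldsymbol{x})/\nu_{\boldsymbol{\delta}}(\boldsymbol{x})$ within a single covariate stratum. It assumes, purely for illustration, that $W$ given $X=\boldsymbol{x}$ has finite support, so the conditional expectations reduce to finite sums; the support points, probabilities, and outcome regression `mu` are hypothetical.

```python
import math

def tilted_mean(delta, support, probs, mu):
    """m_delta = eta_delta / nu_delta for a discrete conditional law of W.

    support: list of exposure vectors w; probs: their probabilities under f;
    mu: callable returning the outcome regression value at w.
    """
    weights = [
        p * math.exp(sum(d * wj for d, wj in zip(delta, w)))
        for w, p in zip(support, probs)
    ]
    nu = sum(weights)                                          # nu_delta(x)
    eta = sum(wt * mu(w) for wt, w in zip(weights, support))   # eta_delta(x)
    return eta / nu

# Hypothetical two-pollutant stratum (assumption, not from the paper).
support = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
probs = [0.4, 0.3, 0.2, 0.1]

def mu(w):
    return 1.0 + 2.0 * w[0] - w[1]

baseline = tilted_mean((0.0, 0.0), support, probs, mu)  # delta = 0: untilted mean
shifted = tilted_mean((0.5, 0.0), support, probs, mu)   # tilt toward larger w[0]
```

At $\boldsymbol{\delta}=\boldsymbol{0}$ the tilt is the identity, so `baseline` equals the plain conditional mean of `mu`; averaging `tilted_mean` over strata of $X$ would give the corresponding value of $\psi(\boldsymbol{\delta})$ .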

Lemma 4 (Conditional regularity of the local tilted mean).

Assume $|\mu(x,w)|\leq C(1+\|w\|^{p})$ for some finite constants $C,p>0$ . Let $U\subset\mathbb{R}^{d}$ be open and let $K\subset U$ be compact. Assume

A(X):=\sup_{\boldsymbol{\delta}\in U}\mathbb{E}_{f}\!\big[e^{\boldsymbol{\delta}^{\top}W}(1+\|W\|^{p+2})\mid X\big]<\infty\quad\text{a.s.},\qquad\mathbb{E}\big[A(X)^{2}\big]<\infty,

(3)

and

\inf_{\boldsymbol{\delta}\in K}\nu_{\boldsymbol{\delta}}(X)\geq D_{\min}>0\quad\text{a.s.}

(4)

Then for almost every $X$ , the map $\boldsymbol{\delta}\mapsto m_{\boldsymbol{\delta}}(X)$ is $C^{2}$ on $K$ . Moreover, there exist finite constants $C_{0},C_{1},C_{2}$ , depending only on $C$ , $p$ , and $D_{\min}$ , such that with

B_{j}(X):=C_{j}\{1+A(X)+A(X)^{2}\},\qquad j=0,1,2,

we have almost surely

\sup_{\boldsymbol{\delta}\in K}|m_{\boldsymbol{\delta}}(X)|\leq B_{0}(X),\qquad\sup_{\boldsymbol{\delta}\in K}\|\nabla m_{\boldsymbol{\delta}}(X)\|\leq B_{1}(X),\qquad\sup_{\boldsymbol{\delta}\in K}\|\nabla^{2}m_{\boldsymbol{\delta}}(X)\|\leq B_{2}(X),

and $\mathbb{E}[B_{j}(X)]<\infty$ for $j=0,1,2$ .

Proof.

Fix $\boldsymbol{\delta}\in U$ . By (3), the maps $\eta_{\boldsymbol{\delta}}(X)$ and $\nu_{\boldsymbol{\delta}}(X)$ are well defined almost surely, and differentiation under the conditional expectation is justified up to second order because $e^{\boldsymbol{\delta}^{\top}W}(1+\|W\|^{p+2})$ dominates the first- and second-order directional derivatives of both integrands. Thus, for almost every $X$ ,

\nabla\eta_{\boldsymbol{\delta}}(X)=\mathbb{E}_{f}[\mu(X,W)We^{\boldsymbol{\delta}^{\top}W}\mid X],\qquad\nabla^{2}\eta_{\boldsymbol{\delta}}(X)=\mathbb{E}_{f}[\mu(X,W)WW^{\top}e^{\boldsymbol{\delta}^{\top}W}\mid X],

\nabla\nu_{\boldsymbol{\delta}}(X)=\mathbb{E}_{f}[We^{\boldsymbol{\delta}^{\top}W}\mid X],\qquad\nabla^{2}\nu_{\boldsymbol{\delta}}(X)=\mathbb{E}_{f}[WW^{\top}e^{\boldsymbol{\delta}^{\top}W}\mid X].

The polynomial growth bound on $\mu$ implies that there is a finite constant $C_{\eta}$ , depending only on $C$ and $p$ , such that almost surely

\sup_{\boldsymbol{\delta}\in K}\Big(|\eta_{\boldsymbol{\delta}}(X)|+\|\nabla\eta_{\boldsymbol{\delta}}(X)\|+\|\nabla^{2}\eta_{\boldsymbol{\delta}}(X)\|+|\nu_{\boldsymbol{\delta}}(X)|+\|\nabla\nu_{\boldsymbol{\delta}}(X)\|+\|\nabla^{2}\nu_{\boldsymbol{\delta}}(X)\|\Big)\leq C_{\eta}A(X).

By (4), the quotient rule gives, for almost every $X$ ,

\nabla m_{\boldsymbol{\delta}}(X)=\frac{\nabla\eta_{\boldsymbol{\delta}}(X)}{\nu_{\boldsymbol{\delta}}(X)}-\frac{\eta_{\boldsymbol{\delta}}(X)\nabla\nu_{\boldsymbol{\delta}}(X)}{\nu_{\boldsymbol{\delta}}(X)^{2}},

\nabla^{2}m_{\boldsymbol{\delta}}(X)=\frac{\nabla^{2}\eta_{\boldsymbol{\delta}}(X)}{\nu_{\boldsymbol{\delta}}(X)}-\frac{\nabla\eta_{\boldsymbol{\delta}}(X)\nabla\nu_{\boldsymbol{\delta}}(X)^{\top}+\nabla\nu_{\boldsymbol{\delta}}(X)\nabla\eta_{\boldsymbol{\delta}}(X)^{\top}+\eta_{\boldsymbol{\delta}}(X)\nabla^{2}\nu_{\boldsymbol{\delta}}(X)}{\nu_{\boldsymbol{\delta}}(X)^{2}}+\frac{2\eta_{\boldsymbol{\delta}}(X)\nabla\nu_{\boldsymbol{\delta}}(X)\nabla\nu_{\boldsymbol{\delta}}(X)^{\top}}{\nu_{\boldsymbol{\delta}}(X)^{3}}.

Therefore there exist finite constants $C_{0},C_{1},C_{2}$ , depending only on $C$ , $p$ , and $D_{\min}$ , such that

\sup_{\boldsymbol{\delta}\in K}|m_{\boldsymbol{\delta}}(X)|\leq C_{0}\{1+A(X)\},\qquad\sup_{\boldsymbol{\delta}\in K}\|\nabla m_{\boldsymbol{\delta}}(X)\|\leq C_{1}\{1+A(X)+A(X)^{2}\},

\sup_{\boldsymbol{\delta}\in K}\|\nabla^{2}m_{\boldsymbol{\delta}}(X)\|\leq C_{2}\{1+A(X)+A(X)^{2}\}\qquad\text{a.s.}

Since $\mathbb{E}[A(X)^{2}]<\infty$ , the envelopes $B_{j}(X)=C_{j}\{1+A(X)+A(X)^{2}\}$ are integrable. This proves the claim. ∎
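As a reading aid, the quotient-rule gradient in the proof above can be rewritten as a conditional covariance. The display below is a sketch under one notational assumption not used in the lemma: $\mathbb{E}_{g_{\boldsymbol{\delta}}}[\,\cdot\mid X]$ denotes expectation under the tilted conditional density proportional to $e^{\boldsymbol{\delta}^{\top}\boldsymbol{w}}f(\boldsymbol{w}\mid X)$ . Dividing the derivative formulas by $\nu_{\boldsymbol{\delta}}(X)$ gives $\nabla\eta_{\boldsymbol{\delta}}(X)/\nu_{\boldsymbol{\delta}}(X)=\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu(X,W)W\mid X]$ , $\eta_{\boldsymbol{\delta}}(X)/\nu_{\boldsymbol{\delta}}(X)=\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu(X,W)\mid X]$ , and $\nabla\nu_{\boldsymbol{\delta}}(X)/\nu_{\boldsymbol{\delta}}(X)=\mathbb{E}_{g_{\boldsymbol{\delta}}}[W\mid X]$ , so that

\nabla m_{\boldsymbol{\delta}}(X)=\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu(X,W)W\mid X]-\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu(X,W)\mid X]\,\mathbb{E}_{g_{\boldsymbol{\delta}}}[W\mid X]=\operatorname{Cov}_{g_{\boldsymbol{\delta}}}\big(\mu(X,W),W\mid X\big).

In words, the sensitivity of the tilted mean to the tilt direction is the tilted conditional covariance between the outcome regression and the exposure.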

Corollary 1 (Objective regularity and line search on Gelbrich level sets).

Let $c>0$ satisfy $c^{2}\notin\mathcal{V}$ and set $\mathcal{M}=\{\boldsymbol{\delta}\in\mathbb{R}^{d}:\,G(\boldsymbol{\delta})=c^{2}\}$ . Assume the boundary-separation condition

G^{-}_{\infty}:=\liminf_{\|\boldsymbol{\delta}\|\to\infty,\ \boldsymbol{\delta}\in\mathcal{D}}G(\boldsymbol{\delta})>c^{2}.

(5)

Choose $R_{c}>0$ such that

G(\boldsymbol{\delta})>c^{2}\qquad(\boldsymbol{\delta}\in\mathcal{D},\ \|\boldsymbol{\delta}\|\geq R_{c}),

(6)

and define $K_{c}:=\overline{B}(\boldsymbol{0},R_{c})$ . Let $R$ and $\mathcal{T}$ be as defined above, and let the Riemannian metric be the Euclidean metric on $T_{\boldsymbol{\delta}}\mathcal{M}$ . Assume the conditions of Lemma 4 on an open neighbourhood $U$ of $K_{c}$ . Assume also the constraint regularity on $U$ :

\inf_{\boldsymbol{\delta}\in U\cap\mathcal{M}}\|\nabla G(\boldsymbol{\delta})\|\geq\epsilon_{\nabla}>0,\qquad\sup_{\boldsymbol{\delta}\in U}\|\nabla^{2}G(\boldsymbol{\delta})\|\leq M_{G}<\infty.

(7)

Then $\mathcal{M}$ is compact, and:

(a)

$\psi$ is $C^{2}$ on $K_{c}$ , with

\nabla\psi(\boldsymbol{\delta})=\mathbb{E}\big[\nabla m_{\boldsymbol{\delta}}(X)\big],\qquad\nabla^{2}\psi(\boldsymbol{\delta})=\mathbb{E}\big[\nabla^{2}m_{\boldsymbol{\delta}}(X)\big],

and

\sup_{\boldsymbol{\delta}\in K_{c}}\|\nabla^{2}\psi(\boldsymbol{\delta})\|\leq M_{\psi}<\infty.

Consequently, $\nabla\psi$ is $M_{\psi}$ –Lipschitz on $K_{c}$ .

(b)

Writing $\boldsymbol{n}(\boldsymbol{\delta})=\nabla G(\boldsymbol{\delta})/\|\nabla G(\boldsymbol{\delta})\|$ and $P(\boldsymbol{\delta})=I-\boldsymbol{n}(\boldsymbol{\delta})\boldsymbol{n}(\boldsymbol{\delta})^{\top}$ , the Riemannian gradient $\operatorname{grad}\psi(\boldsymbol{\delta})=P(\boldsymbol{\delta})\nabla\psi(\boldsymbol{\delta})$ is Lipschitz on $\mathcal{M}$ , that is, there exists $L>0$ such that

\|\operatorname{grad}\psi(\boldsymbol{\delta})-\operatorname{grad}\psi(\boldsymbol{\delta}^{\prime})\|\leq L\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|\qquad(\boldsymbol{\delta},\boldsymbol{\delta}^{\prime}\in\mathcal{M}).

(c)

Fix $0<c_{1}<c_{2}<1$ . For any $\boldsymbol{\delta}\in\mathcal{M}$ and any descent direction $\boldsymbol{\eta}\in T_{\boldsymbol{\delta}}\mathcal{M}$ with $\langle\operatorname{grad}\psi(\boldsymbol{\delta}),\boldsymbol{\eta}\rangle<0$ , there exists a step size $t_{\ast}>0$ in the domain of $\gamma(t)=R_{\boldsymbol{\delta}}(t\boldsymbol{\eta})$ such that $\gamma$ satisfies the weak Wolfe conditions at $t_{\ast}$ .

Proof.

By (6), every point of $\mathcal{M}$ lies in $K_{c}$ . Since $G$ is continuous on $U$ and $K_{c}\subset U$ , the set

\mathcal{M}=K_{c}\cap G^{-1}(\{c^{2}\})

is closed in the compact set $K_{c}$ . Hence $\mathcal{M}$ is compact.

For part (a), Lemma 4 provides an integrable envelope $B_{2}(X)$ such that

\sup_{\boldsymbol{\delta}\in K_{c}}\|\nabla^{2}m_{\boldsymbol{\delta}}(X)\|\leq B_{2}(X)\qquad\text{a.s.}

The same lemma yields analogous integrable envelopes for $m_{\boldsymbol{\delta}}(X)$ and $\nabla m_{\boldsymbol{\delta}}(X)$ . Hence dominated differentiation applies to

\psi(\boldsymbol{\delta})=\mathbb{E}\big\{m_{\boldsymbol{\delta}}(X)\big\},

which gives

\nabla\psi(\boldsymbol{\delta})=\mathbb{E}\big[\nabla m_{\boldsymbol{\delta}}(X)\big],\qquad\nabla^{2}\psi(\boldsymbol{\delta})=\mathbb{E}\big[\nabla^{2}m_{\boldsymbol{\delta}}(X)\big].

Therefore

\sup_{\boldsymbol{\delta}\in K_{c}}\|\nabla^{2}\psi(\boldsymbol{\delta})\|\leq\mathbb{E}\Big[\sup_{\boldsymbol{\delta}\in K_{c}}\|\nabla^{2}m_{\boldsymbol{\delta}}(X)\|\Big]\leq\mathbb{E}\big[B_{2}(X)\big]=:M_{\psi}<\infty.

The Lipschitz property of $\nabla\psi$ on $K_{c}$ follows from the mean value theorem.

For part (b), take $\boldsymbol{\delta},\boldsymbol{\delta}^{\prime}\in\mathcal{M}$ . Then

\operatorname{grad}\psi(\boldsymbol{\delta})-\operatorname{grad}\psi(\boldsymbol{\delta}^{\prime})=P(\boldsymbol{\delta})\big(\nabla\psi(\boldsymbol{\delta})-\nabla\psi(\boldsymbol{\delta}^{\prime})\big)+\big(P(\boldsymbol{\delta})-P(\boldsymbol{\delta}^{\prime})\big)\nabla\psi(\boldsymbol{\delta}^{\prime}).

Since $\|P(\boldsymbol{\delta})\|_{\mathrm{op}}\leq 1$ ,

\|P(\boldsymbol{\delta})\big(\nabla\psi(\boldsymbol{\delta})-\nabla\psi(\boldsymbol{\delta}^{\prime})\big)\|\leq M_{\psi}\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|.

Let $\boldsymbol{n}(\boldsymbol{\delta})=\nabla G(\boldsymbol{\delta})/\|\nabla G(\boldsymbol{\delta})\|$ . By (7), $\|\nabla G(\boldsymbol{\delta})\|\geq\epsilon_{\nabla}$ on $\mathcal{M}$ , and $\|\nabla G(\boldsymbol{\delta})-\nabla G(\boldsymbol{\delta}^{\prime})\|\leq M_{G}\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|$ on $K_{c}$ . Using

\boldsymbol{n}(\boldsymbol{\delta})-\boldsymbol{n}(\boldsymbol{\delta}^{\prime})=\frac{\nabla G(\boldsymbol{\delta})-\nabla G(\boldsymbol{\delta}^{\prime})}{\|\nabla G(\boldsymbol{\delta})\|}+\Big(\frac{1}{\|\nabla G(\boldsymbol{\delta})\|}-\frac{1}{\|\nabla G(\boldsymbol{\delta}^{\prime})\|}\Big)\nabla G(\boldsymbol{\delta}^{\prime}),

one obtains

\|\boldsymbol{n}(\boldsymbol{\delta})-\boldsymbol{n}(\boldsymbol{\delta}^{\prime})\|\leq\frac{2}{\epsilon_{\nabla}}\,\|\nabla G(\boldsymbol{\delta})-\nabla G(\boldsymbol{\delta}^{\prime})\|\leq\frac{2M_{G}}{\epsilon_{\nabla}}\,\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|.

Since $P(\boldsymbol{\delta})=I-\boldsymbol{n}(\boldsymbol{\delta})\boldsymbol{n}(\boldsymbol{\delta})^{\top}$ ,

\|P(\boldsymbol{\delta})-P(\boldsymbol{\delta}^{\prime})\|\leq\|\boldsymbol{n}(\boldsymbol{\delta})-\boldsymbol{n}(\boldsymbol{\delta}^{\prime})\|\big(\|\boldsymbol{n}(\boldsymbol{\delta})\|+\|\boldsymbol{n}(\boldsymbol{\delta}^{\prime})\|\big)\leq\frac{4M_{G}}{\epsilon_{\nabla}}\,\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|.

Let $M_{1}=\sup_{\boldsymbol{\delta}\in K_{c}}\|\nabla\psi(\boldsymbol{\delta})\|<\infty$ . Combining these bounds yields

\|\operatorname{grad}\psi(\boldsymbol{\delta})-\operatorname{grad}\psi(\boldsymbol{\delta}^{\prime})\|\leq\Big(M_{\psi}+\frac{4M_{G}}{\epsilon_{\nabla}}M_{1}\Big)\,\|\boldsymbol{\delta}-\boldsymbol{\delta}^{\prime}\|,

so part (b) holds with $L=M_{\psi}+(4M_{G}/\epsilon_{\nabla})M_{1}$ .

For part (c), define the one-dimensional line-search function $h(t)=\psi(R_{\boldsymbol{\delta}}(t\boldsymbol{\eta}))$ . By part (a) and the $C^{1}$ regularity of $R$ , the map $h$ is continuously differentiable on an interval containing $0$ . Because $R_{\boldsymbol{\delta}}(t\boldsymbol{\eta})\in\mathcal{M}$ whenever the retraction is defined, and because $\psi$ is continuous on the compact set $\mathcal{M}$ , the function $h$ is bounded below on its domain. Therefore there exists $t_{\ast}>0$ satisfying the weak Wolfe conditions along $\gamma(t)=R_{\boldsymbol{\delta}}(t\boldsymbol{\eta})$ ; see Ring and Wirth (2012, Proposition 1). ∎
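Part (c) only asserts that a weak Wolfe step exists. One standard way to find such a step in practice is a bracketing and bisection search on $h(t)=\psi(R_{\boldsymbol{\delta}}(t\boldsymbol{\eta}))$ ; the sketch below is an illustrative routine of that kind, not the line search prescribed by the cited references. The function name, the defaults for `c1` and `c2`, and the toy function `h` are assumptions.

```python
import math

def weak_wolfe_step(h, dh, c1=1e-4, c2=0.9, t=1.0, max_iter=60):
    """Search for t > 0 satisfying the weak Wolfe conditions for a 1-D function h.

    Armijo:    h(t)  <= h(0) + c1 * t * h'(0)
    Curvature: h'(t) >= c2 * h'(0)
    Requires h'(0) < 0 (descent direction) and 0 < c1 < c2 < 1.
    """
    h0, dh0 = h(0.0), dh(0.0)
    if dh0 >= 0.0:
        raise ValueError("not a descent direction")
    lo, hi = 0.0, math.inf
    for _ in range(max_iter):
        if h(t) > h0 + c1 * t * dh0:
            hi = t                      # Armijo fails: step too long
        elif dh(t) < c2 * dh0:
            lo = t                      # curvature fails: step too short
        else:
            return t
        # Bisect once an upper bracket exists; otherwise keep expanding.
        t = 0.5 * (lo + hi) if hi < math.inf else 2.0 * lo
    raise RuntimeError("no weak Wolfe step found within max_iter")

# Toy line-search function standing in for psi(R_delta(t * eta)) (assumption).
def h(t):
    return (t - 1.0) ** 2

def dh(t):
    return 2.0 * (t - 1.0)

t_long = weak_wolfe_step(h, dh, t=8.0)    # initial trial step too long
t_short = weak_wolfe_step(h, dh, t=0.01)  # initial trial step too short
```

In the manifold setting, `dh(t)` would be the directional derivative of $\psi$ along the retraction curve, which requires the differentiated retraction.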

For an accepted step $t_{\ast}$ , define

\boldsymbol{s}:=\mathcal{T}_{\boldsymbol{\delta}\to\boldsymbol{\delta}_{+}}(t_{\ast}\boldsymbol{\eta}),\qquad\boldsymbol{y}:=\operatorname{grad}\psi(\boldsymbol{\delta}_{+})-\mathcal{T}_{\boldsymbol{\delta}\to\boldsymbol{\delta}_{+}}\big(\operatorname{grad}\psi(\boldsymbol{\delta})\big),\qquad\boldsymbol{\delta}_{+}:=R_{\boldsymbol{\delta}}(t_{\ast}\boldsymbol{\eta}).

The cautious RBFGS update of Huang et al. (2018, Algorithm 1) is then applied whenever the prescribed curvature condition holds; otherwise the current approximation is transported to the new tangent space without updating the Hessian surrogate.
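The update-or-skip logic can be sketched in fixed coordinates as follows. This is a simplified illustration under stated assumptions, not the cited algorithm: it uses the standard BFGS formula for a Hessian surrogate `B`, treats `s` and `y` as coordinate vectors already expressed in the new tangent space, and uses a hypothetical cautious threshold of the form $\boldsymbol{y}^{\top}\boldsymbol{s}/\boldsymbol{s}^{\top}\boldsymbol{s}\geq\vartheta(\|\operatorname{grad}\psi\|)$ with an illustrative choice of $\vartheta$ . The transport of `B` itself is omitted.

```python
def cautious_bfgs_update(B, s, y, grad_norm, theta=lambda g: 1e-4 * g):
    """Return (B_new, updated) for a symmetric positive-definite surrogate B.

    When the curvature test y.s / s.s >= theta(grad_norm) holds, apply
        B+ = B - (B s)(B s)^T / (s^T B s) + y y^T / (y^T s);
    otherwise return a copy of B unchanged (the "skip" branch).
    The form of the test and the default theta are assumptions.
    """
    n = len(s)
    ys = sum(yi * si for yi, si in zip(y, s))
    ss = sum(si * si for si in s)
    if ss == 0.0 or ys <= 0.0 or ys / ss < theta(grad_norm):
        return [row[:] for row in B], False
    Bs = [sum(B[i][j] * s[j] for j in range(n)) for i in range(n)]
    sBs = sum(si * bi for si, bi in zip(s, Bs))
    B_new = [
        [B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys for j in range(n)]
        for i in range(n)
    ]
    return B_new, True

# Illustrative data (assumptions).
B0 = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 0.5]
B1, did_update = cautious_bfgs_update(B0, s, [2.0, 0.25], grad_norm=1.0)
B2, did_skip_update = cautious_bfgs_update(B0, s, [-1.0, 0.0], grad_norm=1.0)
```

When the update is applied, the new surrogate satisfies the secant equation $\boldsymbol{B}_{+}\boldsymbol{s}=\boldsymbol{y}$ ; when the curvature test fails, the surrogate is left as it was.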

Lemma 5 (Feasible initialization and containment).

Let $c>0$ satisfy $c^{2}\notin\mathcal{V}$ and set $\mathcal{M}=\{\boldsymbol{\delta}:G(\boldsymbol{\delta})=c^{2}\}$ . Assume the boundary-separation condition (5). Assume that $G$ satisfies the following local quadratic control near the origin: there exist $\rho>0$ , $\varepsilon\in(0,1)$ and a symmetric positive–definite matrix $\boldsymbol{H}_{G}(\boldsymbol{0})$ with smallest and largest eigenvalues $\lambda_{\min},\lambda_{\max}>0$ such that, for all $\|\boldsymbol{\delta}\|\leq\rho$ ,

\tfrac{1}{2}(1-\varepsilon)\lambda_{\min}\|\,\boldsymbol{\delta}\|^{2}\ \leq\ G(\boldsymbol{\delta})\ \leq\ \tfrac{1}{2}(1+\varepsilon)\lambda_{\max}\|\,\boldsymbol{\delta}\|^{2}.

(8)

Assume also the hypotheses of Corollary 1.

(i)

For every prescribed local radius $r_{0}\in(0,\rho]$ , there exists $c_{0}>0$ such that for any $c\in(0,c_{0})$ one can select a unit vector $\boldsymbol{v}$ and $t^{\ast}\in(0,r_{0}]$ with $G(t^{\ast}\boldsymbol{v})=c^{2}$ and $\boldsymbol{\delta}_{0}:=t^{\ast}\boldsymbol{v}\in\mathcal{M}\cap\overline{B}(\boldsymbol{0},r_{0})$ .

(ii)

Let $\mathcal{L}=\{\boldsymbol{\delta}\in\mathcal{M}:\ \psi(\boldsymbol{\delta})\leq\psi(\boldsymbol{\delta}_{0})\}$ . Then $\mathcal{M}$ is compact, $\mathcal{L}$ is compact, and for any sequence $\{\boldsymbol{\delta}_{k}\}$ generated by Riemannian BFGS on $\mathcal{M}$ using $R$ and $\mathcal{T}$ above with a weak Wolfe line search, one has $\psi(\boldsymbol{\delta}_{k+1})\leq\psi(\boldsymbol{\delta}_{k})\leq\psi(\boldsymbol{\delta}_{0})$ and hence $\boldsymbol{\delta}_{k}\in\mathcal{L}$ for all $k\geq 0$ .

Proof.

For part (i), fix $r_{0}\in(0,\rho]$ . Fix a unit vector $\boldsymbol{v}$ and define $q(t)=G(t\boldsymbol{v})$ on $[0,r_{0}]$ . Then $q$ is continuous, $q(0)=0$ , and by (8),

\tfrac{1}{2}(1-\varepsilon)\lambda_{\min}t^{2}\ \leq\ q(t)\ \leq\ \tfrac{1}{2}(1+\varepsilon)\lambda_{\max}t^{2}\qquad(0\leq t\leq r_{0}).

Let $c_{0}=\sqrt{\tfrac{1}{2}(1-\varepsilon)\lambda_{\min}}\,r_{0}$ . For any $c\in(0,c_{0})$ , the lower bound gives $q(r_{0})\geq\tfrac{1}{2}(1-\varepsilon)\lambda_{\min}r_{0}^{2}=c_{0}^{2}>c^{2}$ , while $q(0)=0<c^{2}$ . By the intermediate value theorem, there exists $t^{\ast}\in(0,r_{0}]$ with $q(t^{\ast})=c^{2}$ . Then $\boldsymbol{\delta}_{0}=t^{\ast}\boldsymbol{v}\in\mathcal{M}\cap\overline{B}(\boldsymbol{0},r_{0})$ , proving part (i).

For part (ii), condition (5) implies the existence of $R_{c}>0$ such that $G(\boldsymbol{\delta})>c^{2}$ for all $\boldsymbol{\delta}\in\mathcal{D}$ with $\|\boldsymbol{\delta}\|\geq R_{c}$ . Hence $\mathcal{M}\subset\overline{B}(\boldsymbol{0},R_{c})$ . By Corollary 1, the level set $\mathcal{M}$ is compact and $\psi$ is continuous on $\mathcal{M}$ . Hence $\mathcal{L}$ is a closed subset of the compact set $\mathcal{M}$ , so $\mathcal{L}$ is compact. Now proceed by induction. Assume $\boldsymbol{\delta}_{k}\in\mathcal{L}$ and let $\boldsymbol{\eta}_{k}\in T_{\boldsymbol{\delta}_{k}}\mathcal{M}$ be the search direction. Let $\alpha_{k}>0$ be produced by the weak Wolfe line search applied to $\gamma(t)=R_{\boldsymbol{\delta}_{k}}(t\boldsymbol{\eta}_{k})$ and set $\boldsymbol{\delta}_{k+1}=R_{\boldsymbol{\delta}_{k}}(\alpha_{k}\boldsymbol{\eta}_{k})$ . The Armijo condition yields

\psi(\boldsymbol{\delta}_{k+1})\ \leq\ \psi(\boldsymbol{\delta}_{k})+c_{1}\alpha_{k}\,\langle\operatorname{grad}\psi(\boldsymbol{\delta}_{k}),\boldsymbol{\eta}_{k}\rangle\ <\ \psi(\boldsymbol{\delta}_{k})\ \leq\ \psi(\boldsymbol{\delta}_{0}),

hence $\boldsymbol{\delta}_{k+1}\in\mathcal{L}$ . Therefore $\psi(\boldsymbol{\delta}_{k+1})\leq\psi(\boldsymbol{\delta}_{k})\leq\psi(\boldsymbol{\delta}_{0})$ and $\boldsymbol{\delta}_{k}\in\mathcal{L}$ for all $k\geq 0$ . ∎
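The existence argument in part (i) is constructive enough to suggest a simple numerical recipe: bisect $q(t)=G(t\boldsymbol{v})$ on $[0,r_{0}]$ until $q(t)=c^{2}$ . The sketch below is a minimal version, assuming only that $G$ is continuous along the ray with $G(\boldsymbol{0})=0<c^{2}\leq G(r_{0}\boldsymbol{v})$ . The quadratic `G` used here is a stand-in for illustration, not the Gelbrich-type constraint of the paper, and all names are hypothetical.

```python
def find_feasible_step(G, v, c, r0, tol=1e-12, max_iter=200):
    """Bisection for t* in (0, r0] with G(t* v) = c^2.

    Assumes G is continuous along the ray t -> t v, G(0) = 0 < c^2,
    and G(r0 v) >= c^2, so the intermediate value theorem applies.
    """
    target = c * c

    def q(t):
        return G([t * vi for vi in v])

    if q(r0) < target:
        raise ValueError("c is too large for the chosen radius r0")
    lo, hi = 0.0, r0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if q(mid) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return hi

# Toy constraint G(delta) = 0.5 * delta^T H delta with H = diag(2, 0.5) (assumption).
def G(delta):
    return 0.5 * (2.0 * delta[0] ** 2 + 0.5 * delta[1] ** 2)

v = [0.6, 0.8]   # unit direction
c = 0.1
t_star = find_feasible_step(G, v, c, r0=1.0)
delta0 = [t_star * vi for vi in v]
```

The returned `delta0` can serve as the feasible starting point $\boldsymbol{\delta}_{0}$ for the iterates considered in part (ii).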

Stationarity conclusion.

Under (5), Lemma 5 shows that the initial sublevel set $\mathcal{L}=\{\boldsymbol{\delta}\in\mathcal{M}:\psi(\boldsymbol{\delta})\leq\psi(\boldsymbol{\delta}_{0})\}$ is compact. Thus Assumption 4.1 in Huang et al. (2018) is satisfied for the RBFGS iterates started at $\boldsymbol{\delta}_{0}$ . Together with Assumption 4.2, applied here with the scaled transport $\mathcal{T}$ above, Theorem 4.2 yields

\liminf_{k\to\infty}\|\operatorname{grad}\psi(\boldsymbol{\delta}_{k})\|=0.

Since $\{\boldsymbol{\delta}_{k}\}\subset\mathcal{L}$ and $\mathcal{L}$ is compact, the sequence admits an accumulation point $\boldsymbol{\delta}_{\star}$ . Continuity of $\operatorname{grad}\psi$ implies that any accumulation point along a subsequence with $\|\operatorname{grad}\psi(\boldsymbol{\delta}_{k_{j}})\|\to 0$ is Riemannian stationary.

Appendix C Detailed Proof for Minimax Lower Bound

Standing Assumptions and Notation

Let $\boldsymbol{Z}=(\boldsymbol{X},\boldsymbol{W},Y)$ with $\boldsymbol{X}\in\mathbb{R}^{p}$ , $\boldsymbol{W}\in\mathbb{R}^{q}$ , and $Y\in\mathbb{R}$ . Write $f(\boldsymbol{w}\mid\boldsymbol{x})$ for the exposure density, and define the exponential tilt

g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x}):=\frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,f(\boldsymbol{w}\mid\boldsymbol{x})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})},\qquad\nu_{\boldsymbol{\delta}}(\boldsymbol{x}):=\mathbb{E}_{f}\big[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\mid\boldsymbol{X}=\boldsymbol{x}\big].

Let $\mu(\boldsymbol{x},\boldsymbol{w}):=\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}]$ , and denote

\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x}):=\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\mathbb{E}_{f}[\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}=\boldsymbol{x}].

All bounds are derived for a general tilt and then specialized to $\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})=\boldsymbol{w}$ .

Assumptions:

(A1) Bounded outcomes

There exists $M<\infty$ such that $|Y|\leq M$ a.s. Hence $|\mu(\boldsymbol{x},\boldsymbol{w})|\leq M$ .

(A2) Nondegenerate noise

$0<\underline{\sigma}^{2}\leq\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})\leq\overline{\sigma}^{2}<\infty$ a.s.

(A3) Bounded tilt

$\|\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\|\leq M_{s}<\infty$ for all $(\boldsymbol{w},\boldsymbol{x})$ , and the tilt parameter is restricted to $\|\boldsymbol{\delta}\|\leq\Delta$ for some finite radius $\Delta\in(0,\infty)$ . For the purpose of this proof, we assume there exists a fixed constant $C_{\delta}>0$ such that

\|\boldsymbol{\delta}\|\ \leq\ \frac{C_{\delta}}{M_{s}}\,.

(A4) Model richness

The class of distributions $\mathcal{P}$ is sufficiently rich. There exist constants $B,\epsilon_{0}>0$ such that for any $P_{0}\in\mathcal{P}$ , if $\phi\in L_{\infty}(P_{0})$ with $\operatorname{E}_{P_{0}}[\phi(\boldsymbol{Z})]=0$ and $\|\phi\|_{\infty}\leq B$ , then for all $|\epsilon|\leq\epsilon_{0}$ , the density $p_{0}(1+\epsilon\phi)$ corresponds to a distribution $P_{1}\in\mathcal{P}$ .

(A5) Pathwise differentiability

The functional $\psi(\cdot)$ is pathwise differentiable at any $P_{0}\in\mathcal{P}$ . The remainder from the von Mises expansion admits a uniform quadratic bound: there exists a constant $K<\infty$ such that for any perturbation $\phi$ with $\|\phi\|_{\infty}\leq B$ , the bound

|\psi(P_{0}(1+\epsilon\phi))-\psi(P_{0})-\epsilon\operatorname{E}_{P_{0}}[\varphi(\boldsymbol{Z};P_{0})\phi(\boldsymbol{Z})]|\leq K\epsilon^{2}

holds uniformly for all $P_{0}\in\mathcal{P}$ and all valid $\phi$ .

Identification and target. Under standard consistency and conditional exchangeability,

\psi_{P}(\boldsymbol{\delta})=\mathbb{E}_{P}\Big[\int\mu(\boldsymbol{X},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\Big],\qquad\theta_{P}(\boldsymbol{\delta}):=\psi_{P}(\boldsymbol{\delta})-\psi_{P}(\boldsymbol{0}).

C.1 Uniform second-order bounds

This subsection derives uniform second-order $L_{2}(P_{0})$ bounds for (i) the density ratio $g_{\boldsymbol{\delta}}/f$ and (ii) the tilted conditional mean $\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]$ , valid over a finite-tilt regime. The bounds provide an explicit quadratic remainder for the expansion around $\boldsymbol{\delta}=\boldsymbol{0}$ and uniform control of second-order terms in the Le Cam two-point construction underlying the minimax lower bound.

Lemma 6 (Uniform $L_{2}$ bounds with explicit constants).

Suppose (A1)–(A3). Let $\tau:=\|\boldsymbol{\delta}\|M_{s}\leq C_{\delta}$ . Then for all such $\boldsymbol{\delta}$ ,

\Big\|\frac{g_{\boldsymbol{\delta}}}{f}-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\Big\|_{L_{2}(P_{0})}\leq C_{g}\,\|\boldsymbol{\delta}\|^{2},

\big\|\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\mathbb{E}_{f}[\mu\mid\boldsymbol{X}]-\boldsymbol{\delta}^{\top}\mathbb{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big\|_{L_{2}(P_{0})}\leq C_{\mu}\,\|\boldsymbol{\delta}\|^{2},

where one can take

C_{g}:=4\,e^{2C_{\delta}}\,M_{s}^{2},\qquad C_{\mu}:=4\,e^{2C_{\delta}}\,M\,M_{s}^{2}.

Proof.

All vector norms are Euclidean, and all matrix norms are operator norms. Expectations and conditional expectations are under the baseline law $P_{0}$ unless explicitly indicated. Write $M_{s}:=\sup_{\boldsymbol{w},\boldsymbol{x}}\|\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\|<\infty$ by (A3), and $M:=\sup_{\boldsymbol{x},\boldsymbol{w}}|\mu(\boldsymbol{x},\boldsymbol{w})|<\infty$ by (A1). Let $\boldsymbol{\delta}\in\mathbb{R}^{q}$ and define

\tau:=\|\boldsymbol{\delta}\|\,M_{s}.

Under (A3),

\|\boldsymbol{\delta}\|\ \leq\ \frac{C_{\delta}}{M_{s}}\quad\Longrightarrow\quad\tau\leq C_{\delta},

which yields uniform bounds with explicit constants.

Define

r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x}):=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{x})}{f(\boldsymbol{w}\mid\boldsymbol{x})}=\frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})},\qquad\nu_{\boldsymbol{\delta}}(\boldsymbol{x}):=\mathbb{E}_{f}\big[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big].

Since $|\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})|\leq\|\boldsymbol{\delta}\|\,\|\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\|\leq\tau\leq C_{\delta}$ , we have

e^{-\tau}\ \leq\ \exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\ \leq\ e^{\tau},\qquad e^{-\tau}\ \leq\ \nu_{\boldsymbol{\delta}}(\boldsymbol{x})\ \leq\ e^{\tau}.

It follows that

0<e^{-2\tau}\ \leq\ r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\ \leq\ e^{2\tau}\ \leq\ e^{2C_{\delta}}.

Second-order expansion of $r_{\boldsymbol{\delta}}$ and a uniform Hessian bound. Introduce the log-partition function

\log\nu_{\boldsymbol{\delta}}(\boldsymbol{x})=\log\mathbb{E}_{f}\!\big[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big].

Then

r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=\exp\big\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\log\nu_{\boldsymbol{\delta}}(\boldsymbol{x})\big\}.

Differentiating with respect to $\boldsymbol{\delta}$ yields

\nabla_{\boldsymbol{\delta}}\log\nu_{\boldsymbol{\delta}}(\boldsymbol{x})=\mathbb{E}_{g_{\boldsymbol{\delta}}}\!\big[\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big]=:\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x}),\qquad\nabla_{\boldsymbol{\delta}}^{2}\log\nu_{\boldsymbol{\delta}}(\boldsymbol{x})=\operatorname{Var}_{g_{\boldsymbol{\delta}}}\!\big(\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big),

and therefore

\nabla_{\boldsymbol{\delta}}\log r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x}),\qquad\nabla_{\boldsymbol{\delta}}^{2}\log r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=-\,\operatorname{Var}_{g_{\boldsymbol{\delta}}}\!\big(\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big).

Using $\nabla F=F\,\nabla\log F$ and $\nabla^{2}F=F\big[(\nabla\log F)(\nabla\log F)^{\top}+\nabla^{2}\log F\big]$ , we obtain

\nabla_{\boldsymbol{\delta}}r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\,\big[\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x})\big],

\nabla_{\boldsymbol{\delta}}^{2}r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})=r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\,\Big(\big[\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x})\big]\big[\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x})\big]^{\top}-\operatorname{Var}_{g_{\boldsymbol{\delta}}}\!\big(\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\,\big|\,\boldsymbol{X}=\boldsymbol{x}\big)\Big).

A uniform bound for the Hessian holds on the ball $\{\boldsymbol{\delta}:\ \|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}\}$ . First,

	$\displaystyle\|\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x})\|$	$\displaystyle\leq\|\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\|+\|\boldsymbol{\mu}_{g,s}(\boldsymbol{\delta},\boldsymbol{x})\|$
		$\displaystyle\leq M_{s}+\mathbb{E}_{g_{\boldsymbol{\delta}}}[\|\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\|\mid\boldsymbol{X}=\boldsymbol{x}]\leq 2M_{s},$

so that

\big\|\big[\boldsymbol{s}-\boldsymbol{\mu}_{g,s}\big]\big[\boldsymbol{s}-\boldsymbol{\mu}_{g,s}\big]^{\top}\big\|\leq\|\boldsymbol{s}-\boldsymbol{\mu}_{g,s}\|^{2}\leq 4M_{s}^{2}.

Next, since $\|\boldsymbol{s}-\boldsymbol{\mu}_{g,s}\|\leq 2M_{s}$ ,

	$\displaystyle\big\|\operatorname{Var}_{g_{\boldsymbol{\delta}}}(\boldsymbol{s}\mid\boldsymbol{X}=\boldsymbol{x})\big\|$	$\displaystyle=\big\|\mathbb{E}_{g_{\boldsymbol{\delta}}}\big[(\boldsymbol{s}-\boldsymbol{\mu}_{g,s})(\boldsymbol{s}-\boldsymbol{\mu}_{g,s})^{\top}\mid\boldsymbol{X}=\boldsymbol{x}\big]\big\|$
		$\displaystyle\leq\mathbb{E}_{g_{\boldsymbol{\delta}}}\big[\|\boldsymbol{s}-\boldsymbol{\mu}_{g,s}\|^{2}\mid\boldsymbol{X}=\boldsymbol{x}\big]\leq 4M_{s}^{2}.$

Combining these bounds with $0<r_{\boldsymbol{\delta}}\leq e^{2\tau}\leq e^{2C_{\max}}$ yields the uniform Hessian bound

\big\|\nabla_{\boldsymbol{\delta}}^{2}r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\big\|\leq r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\,\big(4M_{s}^{2}+4M_{s}^{2}\big)\leq 8\,e^{2C_{\max}}\,M_{s}^{2},\qquad\text{for all }(\boldsymbol{w},\boldsymbol{x})\text{ and all }\|\boldsymbol{\delta}\|\leq\tfrac{C_{\delta}}{M_{s}}.\tag{9}

Proof of (i). By Taylor’s theorem with integral remainder for the scalar function $r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$ around $\boldsymbol{\delta}=\boldsymbol{0}$ ,

	$\displaystyle r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$	$\displaystyle=r_{\boldsymbol{0}}(\boldsymbol{w},\boldsymbol{x})+\nabla_{\boldsymbol{\delta}}r_{\boldsymbol{0}}(\boldsymbol{w},\boldsymbol{x})^{\top}\boldsymbol{\delta}$
		$\displaystyle\qquad+\boldsymbol{\delta}^{\top}\!\Big(\int_{0}^{1}(1-t)\,\nabla_{\boldsymbol{\delta}}^{2}r_{t\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\,dt\Big)\boldsymbol{\delta}.$

Since $\nu_{\boldsymbol{0}}(\boldsymbol{x})=1$ ,

	$\displaystyle r_{\boldsymbol{0}}(\boldsymbol{w},\boldsymbol{x})$	$\displaystyle=1,$
	$\displaystyle\nabla_{\boldsymbol{\delta}}r_{\boldsymbol{0}}(\boldsymbol{w},\boldsymbol{x})$	$\displaystyle=\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\mathbb{E}_{f}[\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}=\boldsymbol{x}]=\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x}).$

Therefore,

r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x})=\boldsymbol{\delta}^{\top}\!\Big(\int_{0}^{1}(1-t)\,\nabla_{\boldsymbol{\delta}}^{2}r_{t\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})\,dt\Big)\boldsymbol{\delta}.

Bounding the Hessian inside the integral by (9), and noting that $\int_{0}^{1}(1-t)\,dt=\tfrac{1}{2}$,

	$\displaystyle\big|r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x})\big|$	$\displaystyle\leq\Big(\int_{0}^{1}(1-t)\,dt\Big)\,\Big(\sup_{\boldsymbol{\xi}:\ \|\boldsymbol{\xi}\|\leq\|\boldsymbol{\delta}\|}\big\|\nabla_{\boldsymbol{\delta}}^{2}r_{\boldsymbol{\xi}}(\boldsymbol{w},\boldsymbol{x})\big\|\Big)\,\|\boldsymbol{\delta}\|^{2}$
		$\displaystyle\leq 4\,e^{2C_{\max}}\,M_{s}^{2}\,\|\boldsymbol{\delta}\|^{2}.$

Hence,

\Big\|r_{\boldsymbol{\delta}}-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\Big\|_{L_{2}(P_{0})}\ \leq\ 4\,e^{2C_{\max}}\,M_{s}^{2}\,\|\boldsymbol{\delta}\|^{2},

which proves part (i) with $C_{g}:=4e^{2C_{\max}}M_{s}^{2}$ .

Proof of (ii). By definition of the tilt,

\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]=\mathbb{E}_{f}\!\big[\mu(\boldsymbol{X},\boldsymbol{W})\,r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}\big].

Therefore,

	$\displaystyle\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\mathbb{E}_{f}[\mu\mid\boldsymbol{X}]$	$\displaystyle=\mathbb{E}_{f}\!\big[\mu(\boldsymbol{X},\boldsymbol{W})\,\{r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})-1\}\mid\boldsymbol{X}\big]$
		$\displaystyle=\boldsymbol{\delta}^{\top}\,\mathbb{E}_{f}\big[\mu(\boldsymbol{X},\boldsymbol{W})\,\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}\big]+\mathbb{E}_{f}\!\big[\mu(\boldsymbol{X},\boldsymbol{W})\,r_{2}(\boldsymbol{W},\boldsymbol{X};\boldsymbol{\delta})\mid\boldsymbol{X}\big],$

where

r_{2}(\boldsymbol{w},\boldsymbol{x};\boldsymbol{\delta}):=r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x}).

Define

R_{\mu}(\boldsymbol{X};\boldsymbol{\delta}):=\mathbb{E}_{f}\!\big[\mu(\boldsymbol{X},\boldsymbol{W})\,r_{2}(\boldsymbol{W},\boldsymbol{X};\boldsymbol{\delta})\mid\boldsymbol{X}\big].

By Cauchy–Schwarz and $|\mu|\leq M$ ,

	$\displaystyle\|R_{\mu}(\cdot;\boldsymbol{\delta})\|_{L_{2}(P_{0})}^{2}$	$\displaystyle=\mathbb{E}\Big[\,\big(\mathbb{E}_{f}\big[\mu\,r_{2}\mid\boldsymbol{X}\big]\big)^{2}\,\Big]\leq\mathbb{E}\Big[\,\mathbb{E}_{f}\big[\mu^{2}\,r_{2}^{2}\mid\boldsymbol{X}\big]\,\Big]$
		$\displaystyle=\mathbb{E}\big[\mu^{2}r_{2}^{2}\big]\leq M^{2}\,\|r_{2}\|_{L_{2}(P_{0})}^{2}.$

From part (i), $\|r_{2}\|_{L_{2}(P_{0})}\leq C_{g}\,\|\boldsymbol{\delta}\|^{2}$ with $C_{g}=4\,e^{2C_{\max}}\,M_{s}^{2}$ . Hence

\|R_{\mu}(\cdot;\boldsymbol{\delta})\|_{L_{2}(P_{0})}\leq M\,C_{g}\,\|\boldsymbol{\delta}\|^{2}=4\,e^{2C_{\max}}\,M\,M_{s}^{2}\,\|\boldsymbol{\delta}\|^{2}.

Thus

\Big\|\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\mathbb{E}_{f}[\mu\mid\boldsymbol{X}]-\boldsymbol{\delta}^{\top}\mathbb{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\Big\|_{L_{2}(P_{0})}\leq 4\,e^{2C_{\max}}\,M\,M_{s}^{2}\,\|\boldsymbol{\delta}\|^{2},

which proves part (ii) with $C_{\mu}:=4e^{2C_{\max}}MM_{s}^{2}$ . ∎
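
As a sanity check on parts (i) and (ii), the following minimal sketch evaluates both remainders exactly on a small discrete toy problem. Everything in it is an illustrative assumption rather than part of the paper: there are no covariates, the support points, probabilities, and outcome regression values are made up, and the tilt is taken to be $r_{\boldsymbol{\delta}}=\exp(\boldsymbol{\delta}^{\top}\boldsymbol{s})/\nu_{\boldsymbol{\delta}}$ with $\boldsymbol{s}(\boldsymbol{w})=\boldsymbol{w}$, which is consistent with the gradient formulas above. The quantity $\tau=\|\boldsymbol{\delta}\|M_{s}$ is used in place of $C_{\max}$, which can only make the bound tighter.

```python
import math

# Hypothetical toy setting: no covariates, W takes four values in R^2, s(w) = w.
support = [(-1.0, 0.5), (0.0, -1.0), (0.5, 0.5), (1.0, 1.0)]
f = [0.1, 0.4, 0.3, 0.2]          # observed exposure probabilities
mu = [0.3, -0.8, 1.0, 0.2]        # outcome regression values mu(w)

M_s = max(math.hypot(*w) for w in support)   # sup ||s(w)||
M = max(abs(m) for m in mu)                  # sup |mu(w)|


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tilt_ratio(delta):
    """r_delta(w) = exp(delta^T s(w)) / nu_delta, nu_delta = E_f[exp(delta^T s)]."""
    weights = [math.exp(dot(delta, w)) for w in support]
    nu = sum(p * e for p, e in zip(f, weights))
    return [e / nu for e in weights]


# Centered score s~(w) = s(w) - E_f[s(W)].
mean_s = [sum(p * w[j] for p, w in zip(f, support)) for j in range(2)]
s_tilde = [tuple(w[j] - mean_s[j] for j in range(2)) for w in support]


def remainders(delta):
    """Return (sup_w |r_2|, |R_mu|, C_g ||delta||^2, C_mu ||delta||^2)."""
    r = tilt_ratio(delta)
    r2 = [ri - 1.0 - dot(delta, st) for ri, st in zip(r, s_tilde)]
    R_mu = sum(p * m * x for p, m, x in zip(f, mu, r2))
    norm = math.hypot(*delta)
    tau = norm * M_s                          # stands in for C_max
    C_g = 4.0 * math.exp(2.0 * tau) * M_s ** 2
    return max(abs(x) for x in r2), abs(R_mu), C_g * norm ** 2, M * C_g * norm ** 2


for scale in (0.4, 0.2, 0.1, 0.05):
    sup_r2, abs_R_mu, bound_g, bound_mu = remainders((scale, -scale))
    print(f"scale={scale}: sup|r2|={sup_r2:.2e} (bound {bound_g:.2e}), "
          f"|R_mu|={abs_R_mu:.2e} (bound {bound_mu:.2e})")
```

Because the bound in part (i) holds pointwise, the supremum over the support dominates the $L_{2}(P_{0})$ norm, so checking the supremum is the stricter test. Halving $\boldsymbol{\delta}$ should shrink both remainders by roughly a factor of four.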

C.2 First-order expansion of the EIF and its covariance

The second-order expansions above imply a first-order expansion of the efficient influence function for $\psi(\boldsymbol{\delta})$ at $\boldsymbol{\delta}=\boldsymbol{0}$ , with leading term $\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})$ and a quadratic remainder.

Lemma 7 (EIF expansion from the three-component decomposition).

Let

\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})=D_{Y}(\boldsymbol{Z})+D_{g,\mu}(\boldsymbol{Z})+D_{\psi}(\boldsymbol{Z}).

Recall $\tilde{\boldsymbol{s}}(\boldsymbol{w},\boldsymbol{x}):=\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})-\operatorname{E}_{f}[\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}=\boldsymbol{x}]$ , and define

\boldsymbol{h}(\boldsymbol{Z}):=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\Big(Y-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\Big)\;-\;\operatorname{E}\big[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big].

Assume (A1)–(A3) and $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$ . Then, with the constants from Lemma 6,

	$\displaystyle C_{g}$	$\displaystyle=4e^{2C_{\max}}M_{s}^{2},$
	$\displaystyle C_{\mu}$	$\displaystyle=4e^{2C_{\max}}MM_{s}^{2},$

we have the expansion

\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})=\underbrace{Y-\psi_{P}(\boldsymbol{0})}_{\varphi_{\psi(\boldsymbol{0})}(\boldsymbol{Z})}+\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})+R_{\varphi}(\boldsymbol{Z};\boldsymbol{\delta}),

and an explicit quadratic $L_{2}$ bound

	$\displaystyle\|R_{\varphi}(\cdot;\boldsymbol{\delta})\|_{L_{2}(P_{0})}$	$\displaystyle\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},$
	$\displaystyle C_{\varphi}$	$\displaystyle:=4M_{s}^{2}\Big(e^{2C_{\max}}\overline{\sigma}+M\big(e^{4C_{\max}}+(4+2C_{\max})e^{2C_{\max}}+1\big)\Big).$

Consequently,

\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z}):=\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})-\varphi_{\psi(\boldsymbol{0})}(\boldsymbol{Z})=\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})+R_{\varphi}(\boldsymbol{Z};\boldsymbol{\delta}).

We have $\mathbb{E}[\boldsymbol{h}(\boldsymbol{Z})]=\boldsymbol{0}$ . Moreover, the leading covariance of the influence function satisfies

	$\displaystyle\operatorname{Cov}\big(\boldsymbol{h}(\boldsymbol{Z})\big)$	$\displaystyle=\operatorname{E}\!\Big[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})^{\top}\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})\Big]$
		$\displaystyle\quad+\operatorname{Cov}\!\Big(\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\big(\mu(\boldsymbol{X},\boldsymbol{W})-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\big)\Big)=:\boldsymbol{\Sigma}_{\varepsilon,s}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}},$

where

	$\displaystyle\boldsymbol{\Sigma}_{\varepsilon,s}$	$\displaystyle:=\operatorname{E}\big[\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})\,\tilde{\boldsymbol{s}}\tilde{\boldsymbol{s}}^{\top}\big],$
	$\displaystyle\boldsymbol{C}$	$\displaystyle:=\operatorname{E}\!\big[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big],$
	$\displaystyle\boldsymbol{\Sigma}_{\mu,\mathrm{full}}$	$\displaystyle:=\operatorname{E}\!\Big[\tilde{\boldsymbol{s}}\tilde{\boldsymbol{s}}^{\top}\big(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\big)^{2}\Big]-\boldsymbol{C}\boldsymbol{C}^{\top}.$

In particular,

\boldsymbol{\Sigma}_{\mu,\mathrm{full}}\succeq\underbrace{\operatorname{Cov}\!\big(\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big)}_{=:\ \boldsymbol{\Sigma}_{\mu}}.

Equality $\boldsymbol{\Sigma}_{\mu,\mathrm{full}}=\boldsymbol{\Sigma}_{\mu}$ holds if and only if $\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\{\mu(\boldsymbol{X},\boldsymbol{W})-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\}$ is conditionally degenerate given $\boldsymbol{X}$ .

Proof.

Preliminaries. Recall

r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X}):=\frac{g_{\boldsymbol{\delta}}(\boldsymbol{W}\mid\boldsymbol{X})}{f(\boldsymbol{W}\mid\boldsymbol{X})}.

By Lemma 6, for $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$ there exist remainders $r_{2}(\boldsymbol{W},\boldsymbol{X};\boldsymbol{\delta})$ and $r_{\mu}(\boldsymbol{X};\boldsymbol{\delta})$ (the latter is written $R_{\mu}$ in the proof of Lemma 6) such that

	$\displaystyle r_{\boldsymbol{\delta}}$	$\displaystyle=1+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}+r_{2},\qquad\|r_{2}\|_{L_{2}(P_{0})}\leq C_{g}\|\boldsymbol{\delta}\|^{2},$
	$\displaystyle\operatorname{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]$	$\displaystyle=\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]+\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]+r_{\mu},\qquad\|r_{\mu}\|_{L_{2}(P_{0})}\leq C_{\mu}\|\boldsymbol{\delta}\|^{2}.$

The tilt radius implies $0<r_{\boldsymbol{\delta}}\leq e^{2\tau}\leq e^{2C_{\max}}$ with $\tau=\|\boldsymbol{\delta}\|M_{s}\leq C_{\delta}$ , hence $\|r_{\boldsymbol{\delta}}\|_{\infty}\leq e^{2C_{\max}}$ . We also use $\|\tilde{\boldsymbol{s}}\|_{\infty}\leq 2M_{s}$ , $|\mu|\leq M$ , and $\|\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\|_{\infty}\leq 2MM_{s}$ .

Also,

	$\displaystyle\psi_{P}(\boldsymbol{\delta})$	$\displaystyle=\mathbb{E}\!\left[\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]\right]$
		$\displaystyle=\psi_{P}(\boldsymbol{0})+\boldsymbol{\delta}^{\top}\,\mathbb{E}\!\left[\mathbb{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\right]+\mathbb{E}\!\left[r_{\mu}(\boldsymbol{X};\boldsymbol{\delta})\right].$

Expansion of EIF components. Using the displays above,

	$\displaystyle D_{Y}$	$\displaystyle=r_{\boldsymbol{\delta}}\,(Y-\mu)=(1+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}+r_{2})(Y-\mu)$
		$\displaystyle=(Y-\mu)+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\,(Y-\mu)+r_{2}\,(Y-\mu),$

	$\displaystyle D_{g,\mu}$	$\displaystyle=r_{\boldsymbol{\delta}}\big(\mu-\operatorname{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]\big)$
		$\displaystyle=(1+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}+r_{2})\Big(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]-\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]-r_{\mu}\Big)$
		$\displaystyle=(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}])+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\,(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}])-\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]+R_{2},$

	$\displaystyle D_{\psi}$	$\displaystyle=\operatorname{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\psi_{P}(\boldsymbol{\delta})$
		$\displaystyle=\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]-\psi_{P}(\boldsymbol{0})+\boldsymbol{\delta}^{\top}\Big(\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]-\operatorname{E}[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]]\Big)+(r_{\mu}-\operatorname{E}[r_{\mu}]),$

where the remainder $R_{2}$ collects all quadratic and higher-order terms:

	$\displaystyle R_{2}$	$\displaystyle:=r_{2}(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}])-(1+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}+r_{2})\,r_{\mu}$
		$\displaystyle\quad-(\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}})\big(\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big)-r_{2}\,\big(\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big).$

Collection of zero and first-order terms. Adding $D_{Y}+D_{g,\mu}+D_{\psi}$ gives

	$\displaystyle\varphi_{\psi(\boldsymbol{\delta})}$	$\displaystyle=\Big[Y-\psi_{P}(\boldsymbol{0})\Big]$
		$\displaystyle\quad+\boldsymbol{\delta}^{\top}\Big[\tilde{\boldsymbol{s}}\big(Y-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\big)-\operatorname{E}\big[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big]\Big]$
		$\displaystyle\quad+\underbrace{\Big\{r_{2}(Y-\mu)+R_{2}+(r_{\mu}-\operatorname{E}[r_{\mu}])\Big\}}_{=:R_{\varphi}}.$

Hence the first-order term is $\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})$ .

Explicit $L_{2}$ bounds for remainder terms. By Cauchy–Schwarz and (A2),

\|r_{2}(Y-\mu)\|_{L_{2}}\ \leq\ \overline{\sigma}\,\|r_{2}\|_{L_{2}}\ \leq\ C_{g}\,\overline{\sigma}\,\|\boldsymbol{\delta}\|^{2}.

Since $\|\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\|_{\infty}\leq 2M$ ,

\|r_{2}(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}])\|_{L_{2}}\ \leq\ 2M\,\|r_{2}\|_{L_{2}}\ \leq\ 2M\,C_{g}\,\|\boldsymbol{\delta}\|^{2}.

Using $\|r_{\boldsymbol{\delta}}\|_{\infty}\leq e^{2C_{\max}}$ ,

\|(1+\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}+r_{2})\,r_{\mu}\|_{L_{2}}=\|r_{\boldsymbol{\delta}}r_{\mu}\|_{L_{2}}\leq e^{2C_{\max}}\,C_{\mu}\,\|\boldsymbol{\delta}\|^{2}.

Moreover,

\|r_{\mu}-\operatorname{E}[r_{\mu}]\|_{L_{2}}\ \leq\ \|r_{\mu}\|_{L_{2}}+\|\operatorname{E}[r_{\mu}]\|_{L_{2}}\ \leq\ 2C_{\mu}\,\|\boldsymbol{\delta}\|^{2}.

For the quadratic product,

	$\displaystyle\big\|(\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}})\big(\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big)\big\|_{L_{2}}$	$\displaystyle\leq\|\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\|_{\infty}\,\|\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\|_{L_{2}}$
		$\displaystyle\leq(2MM_{s}\|\boldsymbol{\delta}\|)\,(2M_{s}\|\boldsymbol{\delta}\|)=4MM_{s}^{2}\,\|\boldsymbol{\delta}\|^{2}.$

Finally,

	$\displaystyle\big\|r_{2}\,(\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}])\big\|_{L_{2}}$	$\displaystyle\leq\|r_{2}\|_{L_{2}}\,\|\boldsymbol{\delta}^{\top}\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\|_{\infty}$
		$\displaystyle\leq C_{g}\|\boldsymbol{\delta}\|^{2}\cdot(2MM_{s}\|\boldsymbol{\delta}\|)$
		$\displaystyle\leq 2C_{\max}\,M\,C_{g}\,\|\boldsymbol{\delta}\|^{2},$

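where we used $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$, so that $M_{s}\|\boldsymbol{\delta}\|\leq C_{\delta}\leq C_{\max}$ as in the bound on $r_{\boldsymbol{\delta}}$ above. Combining all displays yields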

	$\displaystyle\|R_{\varphi}\|_{L_{2}}$	$\displaystyle\leq C_{g}\,\overline{\sigma}\,\|\boldsymbol{\delta}\|^{2}+2M\,C_{g}\,\|\boldsymbol{\delta}\|^{2}+e^{2C_{\max}}\,C_{\mu}\,\|\boldsymbol{\delta}\|^{2}+2C_{\mu}\,\|\boldsymbol{\delta}\|^{2}$
		$\displaystyle\quad+4MM_{s}^{2}\,\|\boldsymbol{\delta}\|^{2}+2C_{\max}\,M\,C_{g}\,\|\boldsymbol{\delta}\|^{2}\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},$

with $C_{\varphi}$ as stated.

Covariance of the gradient. Write

	$\displaystyle m(\boldsymbol{X})$	$\displaystyle=\operatorname{E}_{f}[\mu\mid\boldsymbol{X}],$
	$\displaystyle\boldsymbol{C}$	$\displaystyle=\operatorname{E}\!\big[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big],$
	$\displaystyle\boldsymbol{h}(\boldsymbol{Z})$	$\displaystyle=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\big(Y-m(\boldsymbol{X})\big)-\boldsymbol{C}.$

Set

	$\displaystyle\boldsymbol{U}$	$\displaystyle:=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\big(Y-\mu(\boldsymbol{X},\boldsymbol{W})\big),$
	$\displaystyle\boldsymbol{V}$	$\displaystyle:=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\big(\mu(\boldsymbol{X},\boldsymbol{W})-m(\boldsymbol{X})\big),$

so that $\boldsymbol{h}=\boldsymbol{U}+(\boldsymbol{V}-\boldsymbol{C})$ . Because $\mathbb{E}[\boldsymbol{U}\mid\boldsymbol{X},\boldsymbol{W}]=\boldsymbol{0}$ and $\boldsymbol{V}-\boldsymbol{C}$ is measurable with respect to $(\boldsymbol{X},\boldsymbol{W})$ , $\mathrm{Cov}(\boldsymbol{U},\boldsymbol{V}-\boldsymbol{C})=\boldsymbol{0}$ , hence

\operatorname{Cov}(\boldsymbol{h})=\operatorname{Cov}(\boldsymbol{U})+\operatorname{Cov}(\boldsymbol{V}-\boldsymbol{C}).

Moreover, conditioning on $(\boldsymbol{X},\boldsymbol{W})$ gives

\operatorname{Cov}(\boldsymbol{U})=\operatorname{E}\!\Big[\tilde{\boldsymbol{s}}\tilde{\boldsymbol{s}}^{\top}\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})\Big]=:\boldsymbol{\Sigma}_{\varepsilon,s},

and

	$\displaystyle\operatorname{Cov}(\boldsymbol{V}-\boldsymbol{C})$	$\displaystyle=\operatorname{Cov}(\boldsymbol{V})=\operatorname{E}[\boldsymbol{V}\boldsymbol{V}^{\top}]-\boldsymbol{C}\boldsymbol{C}^{\top}$
		$\displaystyle=\operatorname{E}\!\Big[\tilde{\boldsymbol{s}}\tilde{\boldsymbol{s}}^{\top}\big(\mu-\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]\big)^{2}\Big]-\boldsymbol{C}\boldsymbol{C}^{\top}=:\boldsymbol{\Sigma}_{\mu,\mathrm{full}}.$

Thus $\operatorname{Cov}(\boldsymbol{h})=\boldsymbol{\Sigma}_{\varepsilon,s}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}$ .

Finally, the law of total covariance yields

	$\displaystyle\boldsymbol{\Sigma}_{\mu,\mathrm{full}}$	$\displaystyle=\operatorname{E}\big[\operatorname{Var}(\boldsymbol{V}\mid\boldsymbol{X})\big]+\operatorname{Cov}\big(\operatorname{E}[\boldsymbol{V}\mid\boldsymbol{X}]\big)\succeq\operatorname{Cov}\big(\operatorname{E}[\boldsymbol{V}\mid\boldsymbol{X}]\big)$
		$\displaystyle=\operatorname{Cov}\!\big(\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big)=:\boldsymbol{\Sigma}_{\mu}.$

Mean zero property of $\boldsymbol{h}$ . Let $m(\boldsymbol{X})=\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]$ and $\boldsymbol{C}=\operatorname{E}[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]]$ . Using the definition of $\boldsymbol{h}$ ,

	$\displaystyle\boldsymbol{h}(\boldsymbol{Z})$	$\displaystyle=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,\big(Y-m(\boldsymbol{X})\big)-\boldsymbol{C}$
		$\displaystyle=\underbrace{\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,\big(Y-\mu(\boldsymbol{X},\boldsymbol{W})\big)}_{=:\boldsymbol{U}}+\underbrace{\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,\big(\mu(\boldsymbol{X},\boldsymbol{W})-m(\boldsymbol{X})\big)}_{=:\boldsymbol{V}}-\boldsymbol{C}.$

The expectation decomposes as $\mathbb{E}[\boldsymbol{h}]=\mathbb{E}[\boldsymbol{U}]+\mathbb{E}[\boldsymbol{V}]-\boldsymbol{C}$ .

First,

	$\displaystyle\mathbb{E}[\boldsymbol{U}]$	$\displaystyle=\mathbb{E}\!\left[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,\mathbb{E}\big[Y-\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X},\boldsymbol{W}\big]\right]$
		$\displaystyle=\mathbb{E}\!\left[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\cdot 0\right]=\boldsymbol{0}.$

Second, for $\boldsymbol{V}$ ,

\mathbb{E}[\boldsymbol{V}]=\mathbb{E}\!\left[\mathbb{E}\big[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}\big]\right]-\mathbb{E}\!\left[m(\boldsymbol{X})\,\mathbb{E}\big[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}\big]\right].

By definition $\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})=\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})-\mathbb{E}_{f}[\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]$, hence

\mathbb{E}\big[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}\big]=\boldsymbol{0},

so the second term vanishes and

\mathbb{E}[\boldsymbol{V}]=\mathbb{E}\!\left[\mathbb{E}_{f}[\mu\,\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\right]=\boldsymbol{C}.

Putting the pieces together,

\mathbb{E}[\boldsymbol{h}]=\mathbb{E}[\boldsymbol{U}]+\mathbb{E}[\boldsymbol{V}]-\boldsymbol{C}=\boldsymbol{0}+\boldsymbol{C}-\boldsymbol{C}=\boldsymbol{0}.

∎
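
The mean-zero property and the covariance decomposition of Lemma 7 can be checked by exact enumeration on a finite toy model. The sketch below is illustrative only: the binary covariate, the three-point scalar exposure, the functions `mu` and `sigma`, and the symmetric two-point noise are all hypothetical choices, with $\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})=w$ so that every covariance is a scalar.

```python
# Hypothetical finite model: X in {0, 1}, scalar W in {-1, 0, 1}, s(w, x) = w,
# and Y = mu(X, W) + sigma(X, W) * xi with xi = +/-1 equally likely.
p_x = {0: 0.6, 1: 0.4}
f_w = {0: {-1.0: 0.5, 0.0: 0.3, 1.0: 0.2},
       1: {-1.0: 0.2, 0.0: 0.3, 1.0: 0.5}}


def mu(x, w):
    return 1.0 + 0.5 * x + w + 0.3 * x * w * w


def sigma(x, w):
    # Conditional standard deviation of Y given (X, W) = (x, w).
    return 0.5 + 0.25 * abs(w) + 0.1 * x


def cond_mean(g, x):
    """E_f[g(X, W) | X = x]."""
    return sum(p * g(x, w) for w, p in f_w[x].items())


m = {x: cond_mean(mu, x) for x in p_x}                       # E_f[mu | X]
w_bar = {x: cond_mean(lambda x_, w: w, x) for x in p_x}      # E_f[s | X]


def s_tilde(x, w):
    return w - w_bar[x]


# k[x] = E_f[mu * s~ | X = x], and C = E[k(X)].
k = {x: cond_mean(lambda x_, w: mu(x_, w) * s_tilde(x_, w), x) for x in p_x}
C = sum(p_x[x] * k[x] for x in p_x)

# Every atom (probability, x, w, y) of the joint law of Z = (X, W, Y).
atoms = [(p_x[x] * f_w[x][w] * 0.5, x, w, mu(x, w) + xi * sigma(x, w))
         for x in p_x for w in f_w[x] for xi in (-1.0, 1.0)]


def expect(g):
    return sum(p * g(x, w, y) for p, x, w, y in atoms)


def h(x, w, y):
    return s_tilde(x, w) * (y - m[x]) - C


mean_h = expect(h)
var_h = expect(lambda x, w, y: h(x, w, y) ** 2) - mean_h ** 2
sigma_eps = expect(lambda x, w, y: sigma(x, w) ** 2 * s_tilde(x, w) ** 2)
sigma_mu_full = expect(lambda x, w, y: (s_tilde(x, w) * (mu(x, w) - m[x])) ** 2) - C ** 2
sigma_mu = sum(p_x[x] * k[x] ** 2 for x in p_x) - C ** 2

print("E[h] =", mean_h)
print("Var(h) =", var_h, " Sigma_eps + Sigma_mu_full =", sigma_eps + sigma_mu_full)
print("Sigma_mu_full =", sigma_mu_full, " Sigma_mu =", sigma_mu)
```

In this toy model the exposure is not degenerate given the covariate, so the inequality $\boldsymbol{\Sigma}_{\mu,\mathrm{full}}\succeq\boldsymbol{\Sigma}_{\mu}$ should be strict.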

Corollary 2.

Specializing Lemma 7 to $\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})=\boldsymbol{w}$ (so $\tilde{\boldsymbol{s}}=\tilde{\boldsymbol{W}}$ ), let

\boldsymbol{H}:=\operatorname{Cov}\!\big(\boldsymbol{h}(\boldsymbol{Z})\big)=\boldsymbol{\Sigma}_{\varepsilon,s}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}\succeq\boldsymbol{0}.

Then, for all $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$ ,

\displaystyle\big(\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}}-C_{\varphi}\|\boldsymbol{\delta}\|^{2}\big)_{+}^{\,2}\ \leq\ \operatorname{Var}\!\big\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}\ \leq\ \big(\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}}+C_{\varphi}\|\boldsymbol{\delta}\|^{2}\big)^{2},

where $(x)_{+}=\max\{x,0\}$ and $C_{\varphi}$ is the constant in Lemma 7. Moreover, if $\lambda_{\min}(\boldsymbol{H})>0$ and $\|\boldsymbol{\delta}\|\leq r_{0}$ for some

r_{0}\ <\ \frac{\sqrt{\lambda_{\min}(\boldsymbol{H})}}{C_{\varphi}}\,,

then there exist constants

	$\displaystyle c_{\mathrm{low}}$	$\displaystyle:=\Big(1-\frac{C_{\varphi}\,r_{0}}{\sqrt{\lambda_{\min}(\boldsymbol{H})}}\Big)^{2},$
	$\displaystyle c_{\mathrm{up}}$	$\displaystyle:=\Big(1+\frac{C_{\varphi}\,r_{0}}{\sqrt{\lambda_{\min}(\boldsymbol{H})}}\Big)^{2},$

depending only on $(\boldsymbol{H},C_{\varphi},r_{0})$ but not on $\boldsymbol{\delta}$ , such that

c_{\mathrm{low}}\ \boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}\ \leq\ \operatorname{Var}\!\big\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}\ \leq\ c_{\mathrm{up}}\ \boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}\qquad\text{for all }\ \|\boldsymbol{\delta}\|\leq r_{0}.

Since

\boldsymbol{H}=\boldsymbol{\Sigma}_{\varepsilon,s}+\boldsymbol{\Sigma}_{\mu,\mathrm{full}}\succeq\boldsymbol{\Sigma}_{\varepsilon,s},

the lower bound immediately implies

\operatorname{Var}\!\big\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}\ \geq\ c_{\mathrm{low}}\ \boldsymbol{\delta}^{\top}\boldsymbol{\Sigma}_{\varepsilon,s}\,\boldsymbol{\delta}.

Proof.

By Lemma 7,

\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})=\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})+R_{\varphi}(\boldsymbol{Z};\boldsymbol{\delta}),\qquad\|R_{\varphi}(\cdot;\boldsymbol{\delta})\|_{L_{2}}\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},\qquad\operatorname{E}\{\boldsymbol{h}(\boldsymbol{Z})\}=\boldsymbol{0}.

Write $A:=\boldsymbol{\delta}^{\top}\boldsymbol{h}$ and $R:=R_{\varphi}(\cdot;\boldsymbol{\delta})$ , and center the remainder by $R_{c}:=R-\operatorname{E}[R]$ . Then

\operatorname{Var}\!\big\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}=\operatorname{Var}(A+R)=\operatorname{Var}(A+R_{c})=\|A+R_{c}\|_{2}^{2}.

Because $\operatorname{E}[\boldsymbol{h}]=\boldsymbol{0}$, we have $\|A\|_{2}^{2}=\operatorname{Var}(A)=\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}$. Also $\|R_{c}\|_{2}=\sqrt{\operatorname{Var}(R)}\leq\|R\|_{2}\leq C_{\varphi}\|\boldsymbol{\delta}\|^{2}$, since $\operatorname{Var}(R)=\operatorname{E}[R^{2}]-(\operatorname{E}[R])^{2}\leq\operatorname{E}[R^{2}]$.

By the triangle inequality and its reverse in $L_{2}$ ,

\big|\ \|A\|_{2}-\|R_{c}\|_{2}\ \big|\ \leq\ \|A+R_{c}\|_{2}\ \leq\ \|A\|_{2}+\|R_{c}\|_{2}.

Using $\|A\|_{2}=\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{H}\boldsymbol{\delta}}$ and $\|R_{c}\|_{2}\leq C_{\varphi}\|\boldsymbol{\delta}\|^{2}$ and squaring both sides yields the additive bracket.

Suppose $\lambda_{\min}(\boldsymbol{H})>0$ . Then, for all $\boldsymbol{\delta}$ ,

\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}}\ \geq\ \sqrt{\lambda_{\min}(\boldsymbol{H})}\,\|\boldsymbol{\delta}\|.

Hence, for any $\|\boldsymbol{\delta}\|\leq r_{0}$ ,

	$\displaystyle\frac{\|R_{c}\|_{2}}{\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{H}\,\boldsymbol{\delta}}}$	$\displaystyle\leq\frac{C_{\varphi}\|\boldsymbol{\delta}\|^{2}}{\sqrt{\lambda_{\min}(\boldsymbol{H})}\,\|\boldsymbol{\delta}\|}$
		$\displaystyle\leq\frac{C_{\varphi}\,r_{0}}{\sqrt{\lambda_{\min}(\boldsymbol{H})}}=:\ r_{*}\ <\ 1.$

Dividing the additive bracket by $\boldsymbol{\delta}^{\top}\boldsymbol{H}\boldsymbol{\delta}$ yields the multiplicative squeeze with

c_{\mathrm{low}}=(1-r_{*})^{2},\qquad c_{\mathrm{up}}=(1+r_{*})^{2}.

Finally, because $\boldsymbol{H}\succeq\boldsymbol{\Sigma}_{\varepsilon,s}$ , the coarser lower bound follows. ∎
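
The step from the additive bracket to the multiplicative squeeze can be illustrated numerically. The sketch below uses a hypothetical $2\times 2$ matrix `H` and hypothetical values of `C_phi` and `r0`; it computes the additive bracket and the constants $c_{\mathrm{low}}$, $c_{\mathrm{up}}$ and checks that the bracket sits inside $[c_{\mathrm{low}}\,\boldsymbol{\delta}^{\top}\boldsymbol{H}\boldsymbol{\delta},\ c_{\mathrm{up}}\,\boldsymbol{\delta}^{\top}\boldsymbol{H}\boldsymbol{\delta}]$ over a grid of directions and radii up to $r_{0}$.

```python
import math


def quad_form(H, delta):
    """delta^T H delta for a 2 x 2 matrix H."""
    return sum(delta[i] * H[i][j] * delta[j] for i in range(2) for j in range(2))


def lambda_min_2x2(H):
    """Smallest eigenvalue of a symmetric 2 x 2 matrix, in closed form."""
    a, b, d = H[0][0], H[0][1], H[1][1]
    return 0.5 * (a + d) - math.sqrt(0.25 * (a - d) ** 2 + b * b)


def additive_bracket(H, delta, C_phi):
    """Lower and upper bounds (sqrt(q) -/+ C_phi ||delta||^2)^2, lower one clipped at 0."""
    root = math.sqrt(quad_form(H, delta))
    rem = C_phi * (delta[0] ** 2 + delta[1] ** 2)
    return max(root - rem, 0.0) ** 2, (root + rem) ** 2


def squeeze_constants(H, C_phi, r0):
    """c_low = (1 - r*)^2 and c_up = (1 + r*)^2 with r* = C_phi r0 / sqrt(lambda_min)."""
    r_star = C_phi * r0 / math.sqrt(lambda_min_2x2(H))
    if not r_star < 1.0:
        raise ValueError("r0 must be below sqrt(lambda_min(H)) / C_phi")
    return (1.0 - r_star) ** 2, (1.0 + r_star) ** 2


# Hypothetical inputs chosen so that r0 < sqrt(lambda_min(H)) / C_phi.
H = [[2.0, 0.6], [0.6, 1.0]]
C_phi, r0 = 3.0, 0.2
c_low, c_up = squeeze_constants(H, C_phi, r0)

checks = []
for step in range(12):
    angle = 2.0 * math.pi * step / 12
    for radius in (0.05, 0.1, r0):
        delta = (radius * math.cos(angle), radius * math.sin(angle))
        q = quad_form(H, delta)
        lo, hi = additive_bracket(H, delta, C_phi)
        # Small tolerance guards against floating-point rounding at radius == r0.
        checks.append(c_low * q <= lo + 1e-12 and hi <= c_up * q + 1e-12)

print("c_low =", c_low, "c_up =", c_up, "all checks passed:", all(checks))
```

The constants degrade as $r_{0}$ approaches $\sqrt{\lambda_{\min}(\boldsymbol{H})}/C_{\varphi}$: $c_{\mathrm{low}}$ tends to zero, which is why the corollary requires a strict inequality.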

Lemma 8 (Hardest direction after orthogonalization).

Let $\varphi_{0}(\boldsymbol{Z}):=\varphi_{\psi(\boldsymbol{0})}(\boldsymbol{Z})=Y-\psi(\boldsymbol{0})$ and

	$\displaystyle\tilde{\phi}_{0}(\boldsymbol{Z})$	$\displaystyle:=\frac{\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})}{\sqrt{\operatorname{Var}\!\big(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})\big)}},$
	$\displaystyle\sigma_{h}^{2}$	$\displaystyle:=\operatorname{Var}\!\big(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})\big).$

Define

	$\displaystyle\rho$	$\displaystyle:=\text{Corr}(\tilde{\phi}_{0},\varphi_{0})=\frac{\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]}{\sqrt{\operatorname{Var}(\varphi_{0})}},$
	$\displaystyle\phi(\boldsymbol{Z})$	$\displaystyle:=\frac{\tilde{\phi}_{0}(\boldsymbol{Z})-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]\cdot\frac{\varphi_{0}(\boldsymbol{Z})}{\operatorname{Var}(\varphi_{0})}}{\sqrt{\,1-\frac{\big(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]\big)^{2}}{\operatorname{Var}(\varphi_{0})}\,}}.$

Then $\mathbb{E}[\phi]=0$ , $\operatorname{Var}(\phi)=1$ , and $\mathbb{E}[\phi\,\varphi_{0}]=0$ . Let Assumption (A4) hold with constants $(B,\varepsilon_{0})$ . Consider $P_{1}$ with density $p_{1}=p_{0}(1+\varepsilon\phi)$ , where $|\varepsilon|\leq\min\{\varepsilon_{0},\tfrac{1}{2B}\}$ . Then $p_{1}\in\mathcal{P}$ , $\int p_{1}=1$ , and $p_{1}\geq\tfrac{1}{2}p_{0}$ . Moreover, for $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$ ,

	$\displaystyle\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})$	$\displaystyle=\varepsilon\,\mathbb{E}\big[\phi\,\varphi_{\theta(\boldsymbol{\delta})}\big]+r_{\mathrm{A5}}(\varepsilon)$
		$\displaystyle=\varepsilon\Big(\sqrt{\operatorname{Var}(\boldsymbol{\delta}^{\top}\boldsymbol{h})}\cdot\sqrt{1-\rho^{2}}\Big)+\varepsilon\,\mathbb{E}[\phi\,R_{\varphi}(\cdot;\boldsymbol{\delta})]+r_{\mathrm{A5}}(\varepsilon),$

where $R_{\varphi}(\cdot;\boldsymbol{\delta})$ is the remainder in the linear expansion $\varphi_{\theta(\boldsymbol{\delta})}=\boldsymbol{\delta}^{\top}\boldsymbol{h}+R_{\varphi}(\cdot;\boldsymbol{\delta})$ , and the remainders satisfy

	$\displaystyle\big|\mathbb{E}[\phi\,R_{\varphi}(\cdot;\boldsymbol{\delta})]\big|$	$\displaystyle\leq\|R_{\varphi}(\cdot;\boldsymbol{\delta})\|_{L_{2}}\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},$
	$\displaystyle|r_{\mathrm{A5}}(\varepsilon)|$	$\displaystyle\leq 2K\,\varepsilon^{2},$

with $C_{\varphi}$ as in Lemma 7 and $K$ the differentiability modulus from (A5). Finally,

\chi^{2}(P_{1},P_{0})=\varepsilon^{2}\qquad\text{and}\qquad D_{\mathrm{KL}}(P_{1}\|P_{0})\leq\log(1+\varepsilon^{2})\leq\varepsilon^{2}.

Proof.

Properties of $\phi$ . Write $\sigma_{0}^{2}:=\operatorname{Var}(\varphi_{0})$ , so $\rho=\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}$ . By construction,

\phi=\frac{\tilde{\phi}_{0}-(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}^{2})\,\varphi_{0}}{\sqrt{\,1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}\,}}.

Since $\mathbb{E}[\tilde{\phi}_{0}]=0$ and $\mathbb{E}[\varphi_{0}]=0$ , we have $\mathbb{E}[\phi]=0$ . Next,

	$\displaystyle\operatorname{Var}(\phi)$	$\displaystyle=\frac{\operatorname{Var}(\tilde{\phi}_{0})+(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}^{2})^{2}\operatorname{Var}(\varphi_{0})-2(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}^{2})\operatorname{Cov}(\tilde{\phi}_{0},\varphi_{0})}{1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}}$
		$\displaystyle=\frac{1+\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}-2\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}}{1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}}=1.$

Also,

\mathbb{E}[\phi\,\varphi_{0}]=\frac{\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]-(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}^{2})\,\sigma_{0}^{2}}{\sqrt{\,1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}\,}}=0.

Finally,

\mathbb{E}[\phi\,\tilde{\phi}_{0}]=\frac{\mathbb{E}[\tilde{\phi}_{0}^{2}]-(\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]/\sigma_{0}^{2})\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]}{\sqrt{\,1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}\,}}=\sqrt{\,1-\mathbb{E}[\tilde{\phi}_{0}\varphi_{0}]^{2}/\sigma_{0}^{2}\,}=\sqrt{1-\rho^{2}}.

Because $\tilde{\phi}_{0}=\sigma_{h}^{-1}\boldsymbol{\delta}^{\top}\boldsymbol{h}$ , we obtain

\mathbb{E}\big[\phi\,\boldsymbol{\delta}^{\top}\boldsymbol{h}\big]=\sigma_{h}\,\mathbb{E}[\phi\,\tilde{\phi}_{0}]=\sigma_{h}\sqrt{1-\rho^{2}}.

Validity of $p_{1}$ . By the definition of $\varepsilon_{0}$ in (A4), for every $\phi\in L_{\infty}$ with $\mathbb{E}[\phi]=0$ and $\|\phi\|_{\infty}\leq B$ , the measure with density $p_{0}(1+\varepsilon\phi)$ lies in $\mathcal{P}$ for all $|\varepsilon|\leq\varepsilon_{0}$ . Moreover, $\int p_{1}=\int p_{0}(1+\varepsilon\phi)=1+\varepsilon\,\mathbb{E}[\phi]=1$ . If $|\varepsilon|\leq(2B)^{-1}$ then $1+\varepsilon\phi\geq 1-|\varepsilon|\,\|\phi\|_{\infty}\geq 1/2$ , so $p_{1}\geq\tfrac{1}{2}p_{0}$ .

First-order expansion of $\theta$ via (A5). By (A5), for $|\varepsilon|\leq\varepsilon_{0}$ ,

	$\displaystyle\psi_{P_{0}(1+\varepsilon\phi)}(\boldsymbol{\delta})-\psi_{P_{0}}(\boldsymbol{\delta})$	$\displaystyle=\varepsilon\,\mathbb{E}_{P_{0}}\!\big[\varphi_{\psi(\boldsymbol{\delta})}\,\phi\big]+r_{1}(\varepsilon),\qquad|r_{1}(\varepsilon)|\leq K\varepsilon^{2},$
	$\displaystyle\psi_{P_{0}(1+\varepsilon\phi)}(\boldsymbol{0})-\psi_{P_{0}}(\boldsymbol{0})$	$\displaystyle=\varepsilon\,\mathbb{E}_{P_{0}}\!\big[\varphi_{\psi(\boldsymbol{0})}\,\phi\big]+r_{0}(\varepsilon),\qquad|r_{0}(\varepsilon)|\leq K\varepsilon^{2}.$

Subtracting gives

\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})=\varepsilon\,\mathbb{E}\big[\phi\,\varphi_{\theta(\boldsymbol{\delta})}\big]+r_{\mathrm{A5}}(\varepsilon),\qquad|r_{\mathrm{A5}}(\varepsilon)|\leq 2K\varepsilon^{2}.

Evaluation of the leading inner product and remainder bound. By the linearization,

\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})=\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})+R_{\varphi}(\boldsymbol{Z};\boldsymbol{\delta}),\qquad\|R_{\varphi}(\cdot;\boldsymbol{\delta})\|_{L_{2}}\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},

for all $\|\boldsymbol{\delta}\|\leq C_{\delta}/M_{s}$ . Hence

	$\displaystyle\mathbb{E}\big[\phi\,\varphi_{\theta(\boldsymbol{\delta})}\big]$	$\displaystyle=\mathbb{E}[\phi\,\boldsymbol{\delta}^{\top}\boldsymbol{h}]+\mathbb{E}[\phi\,R_{\varphi}]$
		$\displaystyle=\sigma_{h}\sqrt{1-\rho^{2}}+\mathbb{E}[\phi\,R_{\varphi}],\qquad|\mathbb{E}[\phi\,R_{\varphi}]|\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2}.$

∎

C.3 Minimax Lower Bound

Recall

	$\displaystyle\boldsymbol{H}$	$\displaystyle:=\operatorname{Cov}\!\big(\boldsymbol{h}(\boldsymbol{Z})\big),$
	$\displaystyle\sigma_{0}^{2}$	$\displaystyle:=\operatorname{Var}(\varphi_{0})=\operatorname{Var}(Y-\psi(\boldsymbol{0})).$

Define the projected covariance (Schur complement) of the linear regression of $\boldsymbol{h}$ onto the one-dimensional span of $\varphi_{0}$ :

\displaystyle\boldsymbol{\Gamma}

\displaystyle:=\boldsymbol{H}-\frac{\operatorname{Cov}\big(\boldsymbol{h},\varphi_{0}\big)\,\operatorname{Cov}\big(\boldsymbol{h},\varphi_{0}\big)^{\top}}{\sigma_{0}^{2}}\succeq\boldsymbol{0}.

Theorem 4 (Minimax lower bound).

Assume (A1)–(A5). Fix some $\kappa\in(0,1)$ and assume

\displaystyle\|\boldsymbol{\delta}\|\ \leq\ \min\Big\{\frac{C_{\delta}}{M_{s}},\ (1-\kappa)\frac{\sqrt{\lambda_{\min}(\boldsymbol{\Gamma})}}{C_{\varphi}}\Big\}.

If $\boldsymbol{\Gamma}$ is positive definite, then there exists a constant $C>0$ that depends only on $\kappa$ and the constants in (A1)–(A5), but not on $\boldsymbol{\delta}$ , such that

\liminf_{n\to\infty}\ \inf_{\hat{\theta}}\ \sup_{P\in\mathcal{P}}\ \frac{n\,\mathbb{E}_{P}\big[(\hat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big]}{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\,\boldsymbol{\delta}}\ \geq\ C.

In particular, the lower bound is expressed in terms of $\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}$ and generally cannot be strengthened by replacing $\boldsymbol{\Gamma}$ with a larger matrix.

Proof of Theorem 4.

Notation.

Let $\sigma_{h}^{2}:=\operatorname{Var}(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z}))$ , $\sigma_{0}^{2}:=\operatorname{Var}(\varphi_{0}(\boldsymbol{Z}))$ , and $\tilde{\phi}_{0}:=\frac{\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})}{\sigma_{h}}$ .

Define $\rho:=\text{Corr}(\tilde{\phi}_{0},\varphi_{0}(\boldsymbol{Z}))=\frac{\operatorname{Cov}(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z}),\varphi_{0}(\boldsymbol{Z}))}{\sigma_{h}\,\sigma_{0}}$ .

Two-point construction. Fix $P_{0}\in\mathcal{P}$ and let $\phi$ be as in Lemma 8. For a given $n$ , choose

	$\displaystyle\alpha_{0}$	$\displaystyle:=\min\Big\{\varepsilon_{0},\ \frac{1}{2B},\ \frac{1}{2}\Big\},$
	$\displaystyle\varepsilon$	$\displaystyle:=\frac{\alpha_{0}}{\sqrt{n}}.$

Then $|\varepsilon|\leq\min\{\varepsilon_{0},(2B)^{-1}\}$ , and the perturbed model $P_{1}$ defined by $p_{1}=p_{0}(1+\varepsilon\phi)$ lies in $\mathcal{P}$ , integrates to $1$ , and satisfies $p_{1}\geq\tfrac{1}{2}p_{0}$ .

Parameter gap and the Schur complement. By Lemma 8,

\displaystyle\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})

\displaystyle=\varepsilon\Big(\sigma_{h}\sqrt{1-\rho^{2}}\Big)+\varepsilon\,\mathbb{E}[\phi\,R_{\varphi}(\cdot;\boldsymbol{\delta})]+r_{\mathrm{A5}}(\varepsilon),

with

|\mathbb{E}[\phi\,R_{\varphi}(\cdot;\boldsymbol{\delta})]|\leq C_{\varphi}\|\boldsymbol{\delta}\|^{2},\qquad|r_{\mathrm{A5}}(\varepsilon)|\leq 2K\varepsilon^{2}.

Moreover,

\displaystyle\sigma_{h}\sqrt{1-\rho^{2}}

\displaystyle=\sqrt{\operatorname{Var}(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z}))-\frac{\operatorname{Cov}(\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z}),\varphi_{0}(\boldsymbol{Z}))^{2}}{\operatorname{Var}(\varphi_{0}(\boldsymbol{Z}))}}=\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\,\boldsymbol{\delta}}.

Therefore

\displaystyle\Big|\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})\Big|

\displaystyle\geq|\varepsilon|\Big(\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}-C_{\varphi}\|\boldsymbol{\delta}\|^{2}\Big)-2K\varepsilon^{2}.

Using $\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}\geq\sqrt{\lambda_{\min}(\boldsymbol{\Gamma})}\|\boldsymbol{\delta}\|$ and the assumption $\|\boldsymbol{\delta}\|\leq(1-\kappa)\sqrt{\lambda_{\min}(\boldsymbol{\Gamma})}/C_{\varphi}$ , we have

\displaystyle\frac{C_{\varphi}\|\boldsymbol{\delta}\|^{2}}{\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}}

\displaystyle\leq\frac{C_{\varphi}\|\boldsymbol{\delta}\|}{\sqrt{\lambda_{\min}(\boldsymbol{\Gamma})}}\leq 1-\kappa,

hence

\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}-C_{\varphi}\|\boldsymbol{\delta}\|^{2}\geq\kappa\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}.

Thus

\displaystyle\Big|\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})\Big|

\displaystyle\geq|\varepsilon|\,\kappa\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}-2K\varepsilon^{2}.

With $\varepsilon=\alpha_{0}/\sqrt{n}$ ,

\Big|\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})\Big|\geq\frac{\alpha_{0}}{\sqrt{n}}\,\kappa\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}-\frac{2K\alpha_{0}^{2}}{n}.

KL and TV control. From Lemma 8,

	$\displaystyle D_{\mathrm{KL}}(P_{1}\|P_{0})$	$\displaystyle\leq\varepsilon^{2}=\frac{\alpha_{0}^{2}}{n},$
	$\displaystyle D_{\mathrm{KL}}(P_{1}^{\otimes n}\|P_{0}^{\otimes n})$	$\displaystyle=nD_{\mathrm{KL}}(P_{1}\|P_{0})\leq\alpha_{0}^{2}.$

By Pinsker’s inequality,

\displaystyle\mathrm{TV}(P_{0}^{\otimes n},P_{1}^{\otimes n})

\displaystyle\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}(P_{1}^{\otimes n}\|P_{0}^{\otimes n})}\leq\frac{\alpha_{0}}{\sqrt{2}},

\displaystyle 1-\mathrm{TV}(P_{0}^{\otimes n},P_{1}^{\otimes n})

\displaystyle\geq 1-\frac{\alpha_{0}}{\sqrt{2}}=:\ c_{\mathrm{TV}}>0.

Le Cam two-point inequality and asymptotic lower bound. Le Cam’s two-point inequality yields, for any estimator $\hat{\theta}$ ,

\displaystyle\sup_{P\in\{P_{0},P_{1}\}}\mathbb{E}_{P}\!\big[(\hat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big]

\displaystyle\geq\frac{1}{8}\Big(\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})\Big)^{2}\Big(1-\mathrm{TV}(P_{0}^{\otimes n},P_{1}^{\otimes n})\Big).

Using the TV bound above,

\displaystyle\sup_{P\in\{P_{0},P_{1}\}}\mathbb{E}_{P}\!\big[(\hat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big]

\displaystyle\geq\frac{c_{\mathrm{TV}}}{8}\Big(\theta_{P_{1}}(\boldsymbol{\delta})-\theta_{P_{0}}(\boldsymbol{\delta})\Big)^{2}.

Now apply the parameter gap bound and divide by $\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}/n$ :

\displaystyle\frac{n}{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}\sup_{P\in\{P_{0},P_{1}\}}\mathbb{E}_{P}\!\big[(\hat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big]

\displaystyle\geq\frac{c_{\mathrm{TV}}}{8}\cdot\frac{n}{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}\Big(\frac{\alpha_{0}}{\sqrt{n}}\kappa\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}-\frac{2K\alpha_{0}^{2}}{n}\Big)^{2}.

The term $2K\alpha_{0}^{2}/n$ is $o(n^{-1/2})$ and does not contribute to the limit after normalization by $\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}/n$ . Therefore,

\displaystyle\liminf_{n\to\infty}\frac{n}{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}\sup_{P\in\{P_{0},P_{1}\}}\mathbb{E}_{P}\!\big[(\hat{\theta}-\theta_{P}(\boldsymbol{\delta}))^{2}\big]

\displaystyle\geq\frac{c_{\mathrm{TV}}}{8}\cdot\alpha_{0}^{2}\kappa^{2}.

Since $\{P_{0},P_{1}\}\subset\mathcal{P}$ , taking $\inf_{\hat{\theta}}$ and $\sup_{P\in\mathcal{P}}$ yields the same lower bound, proving the theorem with $C=(c_{\mathrm{TV}}/8)\alpha_{0}^{2}\kappa^{2}$ . ∎
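As a numerical illustration of the constants appearing in this proof, the following Python sketch evaluates $\alpha_{0}$, $c_{\mathrm{TV}}$, and $C=(c_{\mathrm{TV}}/8)\alpha_{0}^{2}\kappa^{2}$, together with the finite-$n$ normalized bound from the display preceding the limit. The function names and all numerical inputs ($\varepsilon_{0}$, $B$, $K$, $\kappa$, and the value of $\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}$) are hypothetical choices made only for illustration; they are not taken from the paper.

```python
import math

def le_cam_constants(eps0, B, kappa):
    """Constants of the two-point construction (illustrative inputs only)."""
    alpha0 = min(eps0, 1.0 / (2.0 * B), 0.5)
    c_tv = 1.0 - alpha0 / math.sqrt(2.0)       # lower bound on 1 - TV
    C = (c_tv / 8.0) * alpha0**2 * kappa**2    # limiting constant in Theorem 4
    return alpha0, c_tv, C

def normalized_bound(n, quad_form, eps0, B, K, kappa):
    """Finite-n lower bound on n * MSE / (delta' Gamma delta)."""
    alpha0, c_tv, _ = le_cam_constants(eps0, B, kappa)
    gap = (alpha0 / math.sqrt(n)) * kappa * math.sqrt(quad_form) \
        - 2.0 * K * alpha0**2 / n
    gap = max(gap, 0.0)  # the gap bound is only informative once it is positive
    return (c_tv / 8.0) * (n / quad_form) * gap**2

# Hypothetical values: eps0 = 0.3, B = 2, K = 1, kappa = 0.5, delta' Gamma delta = 0.04.
alpha0, c_tv, C = le_cam_constants(eps0=0.3, B=2.0, kappa=0.5)
for n in (10**2, 10**4, 10**8):
    b = normalized_bound(n, quad_form=0.04, eps0=0.3, B=2.0, K=1.0, kappa=0.5)
    print(n, b, C)
```

In this toy configuration the finite-$n$ bound increases with $n$ and approaches $C$, reflecting that the $2K\alpha_{0}^{2}/n$ term is asymptotically negligible.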

On positive definiteness of $\boldsymbol{\Gamma}$ .

	$\displaystyle\boldsymbol{H}$	$\displaystyle=\operatorname{Cov}(\boldsymbol{h}),$
	$\displaystyle\boldsymbol{\Gamma}$	$\displaystyle:=\boldsymbol{H}-\frac{\operatorname{Cov}(\boldsymbol{h},\varphi_{0})\,\operatorname{Cov}(\boldsymbol{h},\varphi_{0})^{\top}}{\sigma_{0}^{2}},$
	$\displaystyle\sigma_{0}^{2}$	$\displaystyle:=\operatorname{Var}(\varphi_{0}).$

Let

	$\displaystyle\boldsymbol{\beta}$	$\displaystyle:=\frac{\operatorname{Cov}(\boldsymbol{h},\varphi_{0})}{\sigma_{0}^{2}},$
	$\displaystyle\boldsymbol{h}_{\perp}$	$\displaystyle:=\boldsymbol{h}-\boldsymbol{\beta}\,\varphi_{0}.$

Then

\displaystyle\boldsymbol{\Gamma}

\displaystyle=\operatorname{Cov}(\boldsymbol{h}_{\perp})=\operatorname{Cov}\!\big(\boldsymbol{h}-\boldsymbol{\beta}\,\varphi_{0}\big)\succeq\boldsymbol{0}.

The matrix $\boldsymbol{\Gamma}$ is positive definite if and only if the residual $\boldsymbol{h}_{\perp}$ is nondegenerate, that is

\displaystyle\forall\,\boldsymbol{\delta}\neq\boldsymbol{0}:\quad\operatorname{Var}\!\big(\boldsymbol{\delta}^{\top}\boldsymbol{h}_{\perp}\big)

\displaystyle=\operatorname{Var}(\boldsymbol{\delta}^{\top}\boldsymbol{h})-\frac{\operatorname{Cov}(\boldsymbol{\delta}^{\top}\boldsymbol{h},\varphi_{0})^{2}}{\operatorname{Var}(\varphi_{0})}\;>\;0,

equivalently $\text{Corr}(\boldsymbol{\delta}^{\top}\boldsymbol{h},\varphi_{0})\neq\pm 1$ for every nonzero $\boldsymbol{\delta}$ .

If $\text{Corr}(\boldsymbol{\delta}^{\top}\boldsymbol{h},\varphi_{0})=\pm 1$ for some $\boldsymbol{\delta}\neq\boldsymbol{0}$ , then there exist scalars $a,b$ such that

\boldsymbol{\delta}^{\top}\boldsymbol{h}(\boldsymbol{Z})=a\,\varphi_{0}(\boldsymbol{Z})+b\quad\text{a.s.}

Recall $\varphi_{0}(\boldsymbol{Z})=Y-\operatorname{E}[Y]$ and $\boldsymbol{h}(\boldsymbol{Z})=\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\big(Y-m(\boldsymbol{X})\big)-\boldsymbol{C}$ with $m(\boldsymbol{X})=\operatorname{E}_{f}[\mu\mid\boldsymbol{X}]$ and $\boldsymbol{C}=\operatorname{E}[\operatorname{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]]$ . Collecting coefficients of $Y$ yields

\displaystyle\big(\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})-a\big)\,Y

\displaystyle=\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\,m(\boldsymbol{X})+\boldsymbol{\delta}^{\top}\boldsymbol{C}-a\,\operatorname{E}[Y]+b.

By (A2), $\operatorname{Var}(Y\mid\boldsymbol{X},\boldsymbol{W})>0$ a.s., hence

\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})-a\;=\;0\quad\text{a.s.}

Taking conditional expectation given $\boldsymbol{X}$ and using $\operatorname{E}[\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\mid\boldsymbol{X}]=\boldsymbol{0}$ gives $a=0$ and thus

\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})\equiv 0\quad\text{a.s.}

Thus, failure of $\boldsymbol{\Gamma}\succ 0$ requires a nontrivial direction $\boldsymbol{\delta}$ along which $\tilde{\boldsymbol{s}}(\boldsymbol{W},\boldsymbol{X})$ is almost surely degenerate.
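The two algebraic facts used above, namely $\boldsymbol{\Gamma}=\operatorname{Cov}(\boldsymbol{h}_{\perp})$ and $\sigma_{h}\sqrt{1-\rho^{2}}=\sqrt{\boldsymbol{\delta}^{\top}\boldsymbol{\Gamma}\boldsymbol{\delta}}$, can be checked numerically with sample covariances. The sketch below is only a sanity check under assumed inputs: the simulated arrays `h` and `phi0` are hypothetical stand-ins for $\boldsymbol{h}(\boldsymbol{Z})$ and $\varphi_{0}(\boldsymbol{Z})$, not the influence-function components of any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 2000, 3

# Hypothetical stand-ins for h(Z) and varphi_0(Z).
A = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
h = rng.normal(size=(n, q)) @ A
phi0 = h @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)

# Sample versions of H, Cov(h, varphi_0), and sigma_0^2.
full = np.cov(np.column_stack([h, phi0]), rowvar=False)
H, c, s0 = full[:q, :q], full[:q, q], full[q, q]
Gamma = H - np.outer(c, c) / s0            # Schur complement

# Residual of regressing h on varphi_0: h_perp = h - beta * varphi_0.
beta = c / s0
h_perp = h - np.outer(phi0, beta)
Gamma_resid = np.cov(h_perp, rowvar=False)

# Check sigma_h * sqrt(1 - rho^2) = sqrt(delta' Gamma delta) for one direction.
delta = np.array([0.3, -0.1, 0.2])
sigma_h = np.sqrt(delta @ H @ delta)
rho = (delta @ c) / (sigma_h * np.sqrt(s0))
lhs = sigma_h * np.sqrt(1.0 - rho**2)
rhs = np.sqrt(delta @ Gamma @ delta)

# Positive definiteness corresponds to a strictly positive smallest eigenvalue.
lam_min = np.linalg.eigvalsh(Gamma).min()
print(np.allclose(Gamma, Gamma_resid), np.isclose(lhs, rhs), lam_min > 0)
```

Because sample covariances are bilinear, both identities hold up to floating-point error for any data set, while the sign of the smallest eigenvalue depends on the data.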

Linear Incremental Effect

Specializing the general tilt function to the linear form $\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})=\boldsymbol{w}$ , the projected covariance $\boldsymbol{\Gamma}$ recovers the geometric structure defined in Section 4.1, which yields the minimax lower bound for the linear incremental effect $\theta(\boldsymbol{\delta})$ studied in the main text.

Appendix D On convergence and normality

We first establish a key auxiliary result under the same assumptions and notation used in the proof of the minimax lower bound. We clarify how our rate conditions on the tilted nuisances $(m_{\boldsymbol{\delta}},r_{\boldsymbol{\delta}})$ translate into standard $L_{2}$ rates for the outcome regression $\mu$ and the exposure density $f$ .

Lemma 9 (Reduction to outcome regression and exposure density).

Assume (A1)–(A3), and let $f(\boldsymbol{w}\mid\boldsymbol{x})$ denote the conditional density of $\boldsymbol{W}$ given $\boldsymbol{X}=\boldsymbol{x}$ , with support $\mathcal{W}\subset\mathbb{R}^{q}$ of finite Lebesgue measure $|\mathcal{W}|<\infty$ . Suppose there exist constants $0<f_{\min}\leq f_{\max}<\infty$ such that

f_{\min}\;\leq\;f(\boldsymbol{w}\mid\boldsymbol{x})\;\leq\;f_{\max}\qquad\text{for all $(\boldsymbol{x},\boldsymbol{w})$ in the support of $(\boldsymbol{X},\boldsymbol{W})$.}

Let $\mu(\boldsymbol{x},\boldsymbol{w})=\mathbb{E}[Y\mid\boldsymbol{X}=\boldsymbol{x},\boldsymbol{W}=\boldsymbol{w}]$ , and fix $\Delta<\infty$ . For any $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta$ define

	$\displaystyle\nu_{\boldsymbol{\delta}}(\boldsymbol{x})$	$\displaystyle:=\int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w},$
	$\displaystyle\eta_{\boldsymbol{\delta}}(\boldsymbol{x})$	$\displaystyle:=\int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,\mu(\boldsymbol{x},\boldsymbol{w})\,f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w},$

and

	$\displaystyle m_{\boldsymbol{\delta}}(\boldsymbol{x})$	$\displaystyle:=\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{x})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})},$
	$\displaystyle r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$	$\displaystyle:=\frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})}.$

Let $\widehat{\mu},\widehat{f}$ be any estimators of $\mu,f$ , and construct

	$\displaystyle\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x})$	$\displaystyle:=\int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w},$
	$\displaystyle\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{x})$	$\displaystyle:=\int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\,\widehat{\mu}(\boldsymbol{x},\boldsymbol{w})\,\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}.$

To ensure that the denominators in $\widehat{m}_{\boldsymbol{\delta}}$ and $\widehat{r}_{\boldsymbol{\delta}}$ are bounded away from zero, we define the truncated estimator $\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}$ below, where $\tau_{\Delta}<\infty$ is the constant bounding the exponential tilt that is introduced at the start of the proof. We later show that this truncation enforces bounds consistent with the true $\nu_{\boldsymbol{\delta}}$ without compromising the $L_{2}$ convergence rates inherited from the nuisance estimators.

	$\displaystyle\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})$	$\displaystyle:=\min\Big\{\max\big(\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x}),\ e^{-\tau_{\Delta}}/2\big),\ 2e^{\tau_{\Delta}}\Big\},$
	$\displaystyle\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{x})$	$\displaystyle:=\frac{\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{x})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})},$
	$\displaystyle\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$	$\displaystyle:=\frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}.$

Then there exist finite constants $C_{1}(\Delta),C_{2}(\Delta)<\infty$ , depending only on $(\Delta,f_{\min},f_{\max},|\mathcal{W}|,M)$ and the bound on $\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})$ , such that

	$\displaystyle\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}$	$\displaystyle\leq C_{1}(\Delta)\,\|\widehat{f}-f\|_{2},$		(10)
	$\displaystyle\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}$	$\displaystyle\leq C_{2}(\Delta)\,\big(\|\widehat{\mu}-\mu\|_{2}+\|\widehat{f}-f\|_{2}\big),$		(11)

where $\|\cdot\|_{2}$ denotes the $L_{2}(P)$ –norm with respect to the law of $(\boldsymbol{X},\boldsymbol{W})$ .

Proof.

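Since $\|\boldsymbol{\delta}\|\leq\Delta$ and $\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})$ is bounded, there exists $\tau_{\Delta}<\infty$ such that $|\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})|\leq\tau_{\Delta}$ almost surely, so that $e^{-\tau_{\Delta}}\leq\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\leq e^{\tau_{\Delta}}$ . Because $f(\cdot\mid\boldsymbol{x})$ integrates to one, it follows that, for all such $\boldsymbol{\delta}$ ,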

e^{-\tau_{\Delta}}\;\leq\;\nu_{\boldsymbol{\delta}}(\boldsymbol{x})\;\leq\;e^{\tau_{\Delta}}\qquad\text{for all $\boldsymbol{x}$.}

In particular, $\nu_{\boldsymbol{\delta}}$ is bounded away from $0$ and $\infty$ ; the same holds for $\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}$ by construction.

Control of $\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}$ . By definition,

\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x})-\nu_{\boldsymbol{\delta}}(\boldsymbol{x})=\int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\big\{\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big\}\,d\boldsymbol{w}.

Taking absolute values and using the bound on the exponential tilt gives

|\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x})-\nu_{\boldsymbol{\delta}}(\boldsymbol{x})|\leq e^{\tau_{\Delta}}\int_{\mathcal{W}}\big|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big|\,d\boldsymbol{w}.

By Cauchy–Schwarz and finiteness of $|\mathcal{W}|$ ,

\int_{\mathcal{W}}\big|\widehat{f}-f\big|\,d\boldsymbol{w}\leq|\mathcal{W}|^{1/2}\Big(\int_{\mathcal{W}}\big|\widehat{f}-f\big|^{2}\,d\boldsymbol{w}\Big)^{1/2}.

Squaring both sides and integrating over $\boldsymbol{x}$ with respect to $P_{\boldsymbol{X}}$ gives

\displaystyle\int\{\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x})-\nu_{\boldsymbol{\delta}}(\boldsymbol{x})\}^{2}\,dP_{\boldsymbol{X}}(\boldsymbol{x})

\displaystyle\leq e^{2\tau_{\Delta}}|\mathcal{W}|\int\!\!\int_{\mathcal{W}}\big|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big|^{2}\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x}).

By definition,

\|\widehat{f}-f\|_{2}^{2}=\int\!\!\int_{\mathcal{W}}\big|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big|^{2}f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x}).

Under the boundedness assumption on the exposure density, there exists $f_{\min}>0$ such that $f(\boldsymbol{w}\mid\boldsymbol{x})\geq f_{\min}$ for all $(\boldsymbol{x},\boldsymbol{w})$ . Hence, for any non-negative integrand $h(\boldsymbol{x},\boldsymbol{w})$ we have

	$\displaystyle\int\!\!\int_{\mathcal{W}}h(\boldsymbol{x},\boldsymbol{w})\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x})$	$\displaystyle=\int\!\!\int_{\mathcal{W}}\frac{h(\boldsymbol{x},\boldsymbol{w})}{f(\boldsymbol{w}\mid\boldsymbol{x})}\,f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x})$
		$\displaystyle\leq f_{\min}^{-1}\int\!\!\int_{\mathcal{W}}h(\boldsymbol{x},\boldsymbol{w})\,f(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x}).$

Applying this with $h(\boldsymbol{x},\boldsymbol{w})=\big|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big|^{2}$ yields

\int\!\!\int_{\mathcal{W}}\big|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big|^{2}\,d\boldsymbol{w}\,dP_{\boldsymbol{X}}(\boldsymbol{x})\leq f_{\min}^{-1}\,\|\widehat{f}-f\|_{2}^{2}.

Substituting into the previous display, and noting that for any function depending only on $\boldsymbol{X}$ we have

\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}^{2}=\int\{\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x})-\nu_{\boldsymbol{\delta}}(\boldsymbol{x})\}^{2}\,dP_{\boldsymbol{X}}(\boldsymbol{x}),

we obtain

\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}^{2}\leq e^{2\tau_{\Delta}}|\mathcal{W}|\,f_{\min}^{-1}\,\|\widehat{f}-f\|_{2}^{2},

and hence

\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}\leq e^{\tau_{\Delta}}|\mathcal{W}|^{1/2}f_{\min}^{-1/2}\,\|\widehat{f}-f\|_{2}.

Moreover, since $\nu_{\boldsymbol{\delta}}(\boldsymbol{x})\in[e^{-\tau_{\Delta}},e^{\tau_{\Delta}}]\subset[e^{-\tau_{\Delta}}/2,2e^{\tau_{\Delta}}]$ for all $\boldsymbol{x}$ and the truncation map is $1$ –Lipschitz, we have

\|\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}-\nu_{\boldsymbol{\delta}}\|_{2}\leq\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}.

Control of $\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}$ . We have

	$\displaystyle\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})-r_{\boldsymbol{\delta}}(\boldsymbol{w},\boldsymbol{x})$	$\displaystyle=\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\Big\{\frac{1}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}-\frac{1}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})}\Big\}$
		$\displaystyle=\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\frac{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})-\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})\nu_{\boldsymbol{\delta}}(\boldsymbol{x})}.$

Using the finite- $\Delta$ bounds on the numerator and denominators, there exists a constant $C_{r}(\Delta)$ such that

|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}|\leq C_{r}(\Delta)\,|\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}-\nu_{\boldsymbol{\delta}}|,

so that

	$\displaystyle\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}$	$\displaystyle\leq C_{r}(\Delta)\,\|\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}-\nu_{\boldsymbol{\delta}}\|_{2}\leq C_{r}(\Delta)\,\|\widehat{\nu}_{\boldsymbol{\delta}}-\nu_{\boldsymbol{\delta}}\|_{2}$
		$\displaystyle\leq C_{1}(\Delta)\,\|\widehat{f}-f\|_{2},$

with $C_{1}(\Delta):=C_{r}(\Delta)e^{\tau_{\Delta}}|\mathcal{W}|^{1/2}f_{\min}^{-1/2}$ , which proves (10).

Control of $\widehat{\eta}_{\boldsymbol{\delta}}-\eta_{\boldsymbol{\delta}}$ . Using the product expansion $\widehat{\mu}\widehat{f}-\mu f=(\widehat{\mu}-\mu)\widehat{f}+\mu(\widehat{f}-f)$ , we find

\displaystyle\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{x})-\eta_{\boldsymbol{\delta}}(\boldsymbol{x})

\displaystyle=\int_{\mathcal{W}}\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{w},\boldsymbol{x})\}\big[(\widehat{\mu}-\mu)(\boldsymbol{x},\boldsymbol{w})\,\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})+\mu(\boldsymbol{x},\boldsymbol{w})\big\{\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})\big\}\big]\,d\boldsymbol{w}.

Assumption (A1) implies $|\mu(\boldsymbol{x},\boldsymbol{w})|\leq M<\infty$ , and the bounds on $f,\widehat{f}$ give $0\leq\widehat{f}\leq f_{\max}+o_{P}(1)$ . Hence

\displaystyle|\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{x})-\eta_{\boldsymbol{\delta}}(\boldsymbol{x})|

\displaystyle\leq e^{\tau_{\Delta}}\Big[\int_{\mathcal{W}}|(\widehat{\mu}-\mu)(\boldsymbol{x},\boldsymbol{w})|\,\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})\,d\boldsymbol{w}+M\int_{\mathcal{W}}|\widehat{f}(\boldsymbol{w}\mid\boldsymbol{x})-f(\boldsymbol{w}\mid\boldsymbol{x})|\,d\boldsymbol{w}\Big].

On an event whose probability tends to one we have $0\leq\widehat{f}\leq f_{\max}+1$ , and another application of Cauchy–Schwarz yields

\|\widehat{\eta}_{\boldsymbol{\delta}}-\eta_{\boldsymbol{\delta}}\|_{2}\leq C_{\eta}(\Delta)\big(\|\widehat{\mu}-\mu\|_{2}+\|\widehat{f}-f\|_{2}\big)

for some finite constant $C_{\eta}(\Delta)$ .

Control of $\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}$ . By the algebraic identity $a/b-c/d=(a-c)/b+c(1/b-1/d)$ ,

\displaystyle\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{x})-m_{\boldsymbol{\delta}}(\boldsymbol{x})

\displaystyle=\frac{\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{x})-\eta_{\boldsymbol{\delta}}(\boldsymbol{x})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}+\eta_{\boldsymbol{\delta}}(\boldsymbol{x})\Big\{\frac{1}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})}-\frac{1}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})}\Big\}.

Using again the uniform bounds on $\eta_{\boldsymbol{\delta}},\nu_{\boldsymbol{\delta}},\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}$ implied by (A1)–(A3), we obtain

|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}|\leq C_{m}(\Delta)\Big(|\widehat{\eta}_{\boldsymbol{\delta}}-\eta_{\boldsymbol{\delta}}|+|\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}-\nu_{\boldsymbol{\delta}}|\Big)

for some finite constant $C_{m}(\Delta)$ . Taking $L_{2}(P)$ –norms and combining the bounds above gives (11) with

C_{2}(\Delta):=C_{m}(\Delta)\big(C_{\eta}(\Delta)+e^{\tau_{\Delta}}|\mathcal{W}|^{1/2}f_{\min}^{-1/2}\big).

∎
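The construction in Lemma 9 can be sketched numerically for a single covariate value. The example below is a minimal illustration under assumed inputs: a one-dimensional exposure on $\mathcal{W}=[0,1]$ , the linear tilt $s(w,x)=w$ , a midpoint-rule approximation of the integrals, and hypothetical choices of $\mu$ , $f$ , and a deliberately poor $\widehat{f}$ . The helper `tilted_nuisances` is not from the paper.

```python
import numpy as np

def tilted_nuisances(delta, s_vals, mu_vals, f_vals, dw, tau):
    """Midpoint-rule versions of nu, nu^dagger, m, and r at one covariate value."""
    tilt = np.exp(delta * s_vals)
    nu = float(np.sum(tilt * f_vals) * dw)
    eta = float(np.sum(tilt * mu_vals * f_vals) * dw)
    # Truncate nu to [e^{-tau}/2, 2 e^{tau}] so the denominators stay bounded.
    nu_dag = float(np.clip(nu, np.exp(-tau) / 2.0, 2.0 * np.exp(tau)))
    return nu, nu_dag, eta / nu_dag, tilt / nu_dag

# Assumed toy setting: W = [0, 1], s(w, x) = w, Delta = 1, hence tau = 1.
k = 1000
w = (np.arange(k) + 0.5) / k      # midpoints of a uniform grid on [0, 1]
dw = 1.0 / k
delta, tau = 1.0, 1.0
f_true = np.ones(k)               # uniform exposure density
mu_true = 2.0 * w                 # hypothetical outcome regression

nu, nu_dag, m, r = tilted_nuisances(delta, w, mu_true, f_true, dw, tau)

# A poor density estimate pushes nu_hat below the floor e^{-tau}/2.
f_bad = np.full(k, 0.05)
nu_hat, nu_hat_dag, m_hat, r_hat = tilted_nuisances(delta, w, mu_true, f_bad, dw, tau)

print(nu, nu_hat, nu_hat_dag)
# Truncation is 1-Lipschitz and nu lies in the truncation range,
# so it cannot increase the error.
print(abs(nu_hat_dag - nu) <= abs(nu_hat - nu))
```

Here the true $\nu_{\boldsymbol{\delta}}$ lies inside the truncation interval and is left unchanged, whereas the poor estimate is lifted to the floor $e^{-\tau_{\Delta}}/2$ , which moves it closer to the truth.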

Corollary 3 (Rate condition in terms of $(\mu,f)$ ).

Under the conditions of Lemma 9, if

\|\widehat{\mu}-\mu\|_{2}=o_{P}(n^{-1/4})\qquad\text{and}\qquad\|\widehat{f}-f\|_{2}=o_{P}(n^{-1/4}),

then, for every fixed $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta$ ,

\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/4}),\qquad\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/4}),

and hence the product condition $\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/2})$ required in Theorem 5 holds.

We now turn to the von Mises expansion and the second-order remainder bound that underpin the finite- $\boldsymbol{\delta}$ central limit theorem in the main text. The results in this subsection are stated in terms of the tilted nuisances $(m_{\boldsymbol{\delta}},r_{\boldsymbol{\delta}})$ ; together with Corollary 3, they imply Theorem 3.

We work with the cross-fitted one-step estimator written directly in density-ratio form,

\widehat{\psi}(\boldsymbol{\delta})=P_{n}\!\big[\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})\}\big]\;+\;P_{n}\!\big[\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})\big],\qquad\widehat{\theta}(\boldsymbol{\delta}):=\widehat{\psi}(\boldsymbol{\delta})-\widehat{\psi}(\boldsymbol{0}),

(12)

where, for a fixed tilt $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta$ ,

	$\displaystyle m_{\boldsymbol{\delta}}(\boldsymbol{x})$	$\displaystyle:=\mathbb{E}_{g_{\boldsymbol{\delta}}}\!\left[\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}\right]$
		$\displaystyle=\frac{\mathbb{E}\!\left[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}\right]}{\mathbb{E}\!\left[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\mid\boldsymbol{X}=\boldsymbol{x}\right]}$
		$\displaystyle=\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{x})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{x})},$

and

	$\displaystyle\nu_{\boldsymbol{\delta}}(\boldsymbol{x})$	$\displaystyle:=\mathbb{E}\!\left[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\mid\boldsymbol{X}=\boldsymbol{x}\right],$
	$\displaystyle\eta_{\boldsymbol{\delta}}(\boldsymbol{x})$	$\displaystyle:=\mathbb{E}\!\left[\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\mu(\boldsymbol{X},\boldsymbol{W})\mid\boldsymbol{X}=\boldsymbol{x}\right],$

	$\displaystyle\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{x})$	$\displaystyle:=\min\Big\{\max\big(\widehat{\nu}_{\boldsymbol{\delta}}(\boldsymbol{x}),\ e^{-\tau_{\Delta}}/2\big),\ 2e^{\tau_{\Delta}}\Big\},$
	$\displaystyle r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})$	$\displaystyle:=\frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})},$
	$\displaystyle\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})$	$\displaystyle:=\frac{\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{X})},$
	$\displaystyle\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})$	$\displaystyle:=\frac{\widehat{\eta}_{\boldsymbol{\delta}}(\boldsymbol{X})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{X})}.$

Here $\widehat{\nu}_{\boldsymbol{\delta}}$ and $\widehat{\eta}_{\boldsymbol{\delta}}$ are obtained from cross-fitted regressions on $\boldsymbol{X}$ of the transformed outcomes $\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}$ and $\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\widehat{\mu}(\boldsymbol{X},\boldsymbol{W})$ , respectively.
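The form of the estimator in (12) can be illustrated with a short simulation. The sketch below makes strong simplifying assumptions that are not part of the paper: a toy model with $W\mid X\sim\mathrm{Uniform}(0,1)$ , $s(w,x)=w$ , and $\mu(x,w)=x+w$ , and closed-form (oracle) nuisances in place of cross-fitted estimates. The helper `one_step` is a hypothetical name.

```python
import numpy as np

def one_step(y, r_hat, m_hat):
    """psi_hat = P_n[r_hat * (Y - m_hat)] + P_n[m_hat], as in (12)."""
    return float(np.mean(r_hat * (y - m_hat)) + np.mean(m_hat))

rng = np.random.default_rng(1)
n, delta = 200_000, 1.0
x = rng.normal(size=n)
w = rng.uniform(size=n)                  # W | X ~ Uniform(0, 1), s(w, x) = w
y = x + w + rng.normal(size=n)           # mu(x, w) = x + w

# Closed-form nuisances for this toy model (oracle values, not estimates).
nu = (np.exp(delta) - 1.0) / delta       # E[exp(delta * W) | X]
mean_w_tilted = (np.exp(delta) * (delta - 1.0) + 1.0) / (delta**2 * nu)
r = np.exp(delta * w) / nu               # density ratio r_delta(W, X)
m = x + mean_w_tilted                    # m_delta(X)

psi_delta = one_step(y, r, m)
psi_zero = one_step(y, np.ones(n), x + 0.5)   # delta = 0: r = 1, m = x + E[W]
theta_hat = psi_delta - psi_zero
print(psi_delta, psi_zero, theta_hat)
```

In this toy model $\psi(\boldsymbol{0})=1/2$ and, for $\delta=1$ , $\psi(\delta)=1/(e-1)$ , so the printed values should be close to these targets up to Monte Carlo error.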

From the von Mises decomposition (see, e.g., Kennedy, 2024) and by adding and subtracting $P\varphi_{\psi(\boldsymbol{\delta})}$ , $P_{n}\varphi_{\psi(\boldsymbol{\delta})}$ , and $P_{n}\widehat{\varphi}_{\psi(\boldsymbol{\delta})}$ on the right-hand side, we obtain

\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})=(P_{n}-P)\{\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\}+(P_{n}-P)\!\big\{\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})-\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\big\}+R_{2}(\widehat{P},P;\boldsymbol{\delta}),

(13)

where

	$\displaystyle\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})$	$\displaystyle:=r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}+m_{\boldsymbol{\delta}}(\boldsymbol{X})-\psi(\boldsymbol{\delta}),$
	$\displaystyle\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})$	$\displaystyle:=\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})\}+\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})-\widehat{\psi}(\boldsymbol{\delta}),$

and the second-order remainder is

R_{2}(\widehat{P},P;\boldsymbol{\delta}):=\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})+P\!\left[\widehat{\varphi}_{\psi(\boldsymbol{\delta})}\right].

Lemma 10 (Second-order remainder).

For any fixed $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta$ ,

\big|R_{2}(\widehat{P},P;\boldsymbol{\delta})\big|\ \leq\ \|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}.

Proof.

Starting from the estimator representation (12), the definition of $R_{2}$ gives

	$\displaystyle R_{2}$	$\displaystyle=\mathbb{E}\big[\widehat{r}_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})\}+\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})\big]$
		$\displaystyle\quad-\mathbb{E}\big[r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\{Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}+m_{\boldsymbol{\delta}}(\boldsymbol{X})\big]$
		$\displaystyle=\mathbb{E}\big[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}\big]-\mathbb{E}\big[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{\widehat{m}_{\boldsymbol{\delta}}(\boldsymbol{X})-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}\big].$

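The first term vanishes. Writing $Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})=\{Y-\mu(\boldsymbol{X},\boldsymbol{W})\}+\{\mu(\boldsymbol{X},\boldsymbol{W})-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}$ , each piece contributes zero by iterated expectation: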

\mathbb{E}\!\big[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{Y-\mu(\boldsymbol{X},\boldsymbol{W})\}\big]=\mathbb{E}\!\Big[\,\mathbb{E}\big[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{Y-\mu(\boldsymbol{X},\boldsymbol{W})\}\mid\boldsymbol{X},\boldsymbol{W}\big]\Big]=0,

and

	$\displaystyle\mathbb{E}\big[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{\mu(\boldsymbol{X},\boldsymbol{W})-m_{\boldsymbol{\delta}}(\boldsymbol{X})\}\mid\boldsymbol{X}\big]$	$\displaystyle=\mathbb{E}[\widehat{r}_{\boldsymbol{\delta}}\mu\mid\boldsymbol{X}]-m_{\boldsymbol{\delta}}(\boldsymbol{X})\mathbb{E}[\widehat{r}_{\boldsymbol{\delta}}\mid\boldsymbol{X}]$
		$\displaystyle\quad-\mathbb{E}[r_{\boldsymbol{\delta}}\mu\mid\boldsymbol{X}]+m_{\boldsymbol{\delta}}(\boldsymbol{X})\mathbb{E}[r_{\boldsymbol{\delta}}\mid\boldsymbol{X}]$
		$\displaystyle=\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{X})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{X})}-\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{X})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})}$
		$\displaystyle\quad-\frac{\eta_{\boldsymbol{\delta}}(\boldsymbol{X})}{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})}\Big(\frac{\nu_{\boldsymbol{\delta}}(\boldsymbol{X})}{\widehat{\nu}_{\boldsymbol{\delta}}^{\dagger}(\boldsymbol{X})}-1\Big)$
		$\displaystyle=0.$

Hence $R_{2}=-\mathbb{E}[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})(\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}})]$ , and the Cauchy–Schwarz inequality yields

\big|R_{2}(\widehat{P},P;\boldsymbol{\delta})\big|\leq\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}.

∎

Theorem 5 (Asymptotic normality at $\sqrt{n}$ rate).

Assume (A1)–(A3). Let $\boldsymbol{Z}:=(\boldsymbol{X},\boldsymbol{W},Y)$ and suppose $\{\boldsymbol{Z}_{i}\}_{i=1}^{n}$ are i.i.d. draws from $P$ . Fix $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\|\leq\Delta<\infty$ , and use $K$ -fold cross-fitting to obtain the nuisance estimators $(\widehat{r}_{\boldsymbol{\delta}},\widehat{m}_{\boldsymbol{\delta}})$ . Suppose

\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}=o_{P}(1),\qquad\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(1),

and the product condition

\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/2}).

A convenient sufficient condition is

\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/4}),\qquad\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}=o_{P}(n^{-1/4}),

which can be achieved by suitable cross-fitted learners under standard regularity conditions. Then

\sqrt{n}\,\big\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\big\}\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big).

Consequently, for the incremental effect $\widehat{\theta}(\boldsymbol{\delta})=\widehat{\psi}(\boldsymbol{\delta})-\widehat{\psi}(\boldsymbol{0})$ ,

\sqrt{n}\,\big\{\widehat{\theta}(\boldsymbol{\delta})-\theta(\boldsymbol{\delta})\big\}\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big),\qquad\varphi_{\theta(\boldsymbol{\delta})}:=\varphi_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\boldsymbol{0})}.

Proof.

Control of the leading empirical process term. For any $\|\boldsymbol{\delta}\|\leq\Delta$ , boundedness of $\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})$ implies there exists $\tau_{\Delta}<\infty$ such that

e^{-\tau_{\Delta}}\leq\exp\{\boldsymbol{\delta}^{\top}\boldsymbol{s}(\boldsymbol{W},\boldsymbol{X})\}\leq e^{\tau_{\Delta}},\qquad e^{-\tau_{\Delta}}\leq\nu_{\boldsymbol{\delta}}(\boldsymbol{X})\leq e^{\tau_{\Delta}},

so that $e^{-2\tau_{\Delta}}\leq r_{\boldsymbol{\delta}}(\boldsymbol{W},\boldsymbol{X})\leq e^{2\tau_{\Delta}}$ . Under (A1), $|Y|\leq M$ almost surely, and hence $|m_{\boldsymbol{\delta}}(\boldsymbol{X})|\leq M$ as well. Therefore

	$\displaystyle\|\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\|$	$\displaystyle\leq e^{2\tau_{\Delta}}\,\|Y-m_{\boldsymbol{\delta}}(\boldsymbol{X})\|+\|m_{\boldsymbol{\delta}}(\boldsymbol{X})-\psi(\boldsymbol{\delta})\|$
		$\displaystyle\leq 2Me^{2\tau_{\Delta}}+2M<\infty.$

Since $\{\boldsymbol{Z}_{i}\}_{i=1}^{n}$ are i.i.d., the classical Lindeberg–Feller CLT applies and yields

\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}_{i})\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big).

Control of the estimated influence term. From the definitions,

	$\displaystyle\widehat{\varphi}_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})$	$\displaystyle=\widehat{r}_{\boldsymbol{\delta}}\{Y-\widehat{m}_{\boldsymbol{\delta}}\}+\widehat{m}_{\boldsymbol{\delta}}-\widehat{\psi}(\boldsymbol{\delta}),$
	$\displaystyle\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})$	$\displaystyle=r_{\boldsymbol{\delta}}\{Y-m_{\boldsymbol{\delta}}\}+m_{\boldsymbol{\delta}}-\psi(\boldsymbol{\delta}),$

so that

\widehat{\varphi}_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\boldsymbol{\delta})}=(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{Y-m_{\boldsymbol{\delta}}\}+\big(1-\widehat{r}_{\boldsymbol{\delta}}\big)\{\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\}-\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\}.

Taking $P_{n}-P$ of both sides and noting that $(P_{n}-P)\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\}=0$ , we obtain

(P_{n}-P)\big\{\widehat{\varphi}_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\boldsymbol{\delta})}\big\}=(P_{n}-P)\!\left[(\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}})\{Y-m_{\boldsymbol{\delta}}\}+\big(1-\widehat{r}_{\boldsymbol{\delta}}\big)\{\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\}\right].

Using $|Y|\leq M$ , the finite- $\Delta$ bounds on $r_{\boldsymbol{\delta}}$ and $\widehat{r}_{\boldsymbol{\delta}}$ , and the Cauchy–Schwarz inequality, it follows that

(P_{n}-P)\big\{\widehat{\varphi}_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\boldsymbol{\delta})}\big\}=O_{P}\!\left(\frac{\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}+\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}}{\sqrt{n}}\right)=o_{P}(n^{-1/2}),

under the assumed $L_{2}$ rates. In particular,

\sqrt{n}\,(P_{n}-P)\big\{\widehat{\varphi}_{\psi(\boldsymbol{\delta})}-\varphi_{\psi(\boldsymbol{\delta})}\big\}=O_{P}\!\Big(\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}+\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}\Big)=o_{P}(1).

Control of the remainder. By Lemma 10,

\big|R_{2}(\widehat{P},P;\boldsymbol{\delta})\big|\leq\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}\,\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2},

and the product condition implies $\sqrt{n}\,R_{2}(\widehat{P},P;\boldsymbol{\delta})=o_{P}(1)$ .

Conclusion. Combining (13) with the three displays above and applying Slutsky’s theorem yields

\sqrt{n}\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\}\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big).

Moreover, applying the same argument at $\boldsymbol{\delta}=\boldsymbol{0}$ and using the multivariate Lindeberg–Feller CLT gives the joint convergence

\sqrt{n}\Big(\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta}),\ \widehat{\psi}(\boldsymbol{0})-\psi(\boldsymbol{0})\Big)^{\top}\ \rightsquigarrow\ \mathcal{N}\!\Big(\boldsymbol{0},\ \operatorname{Var}\big\{\big(\varphi_{\psi(\boldsymbol{\delta})}(\boldsymbol{Z}),\ \varphi_{\psi(\boldsymbol{0})}(\boldsymbol{Z})\big)^{\top}\big\}\Big).

Therefore, by the continuous mapping theorem,

	$\displaystyle\sqrt{n}\big\{\widehat{\theta}(\boldsymbol{\delta})-\theta(\boldsymbol{\delta})\big\}$	$\displaystyle=\sqrt{n}\Big(\big\{\widehat{\psi}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\big\}-\big\{\widehat{\psi}(\boldsymbol{0})-\psi(\boldsymbol{0})\big\}\Big)$
		$\displaystyle\ \rightsquigarrow\ \mathcal{N}\!\big(0,\ \mathrm{Var}\{\varphi_{\theta(\boldsymbol{\delta})}(\boldsymbol{Z})\}\big).$

∎

Combining Theorem 5 with Corollary 3 immediately yields the finite- $\boldsymbol{\delta}$ CLT stated in Theorem 3 in the main text.
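In practice, the limiting distribution in Theorem 5 is typically used to form Wald-type intervals from the estimated influence-function values. The following is a minimal sketch of that step, assuming the cross-fitted values $\widehat{\varphi}(\boldsymbol{Z}_{i})$ have already been computed; the function name `wald_interval` and the numbers are hypothetical and serve only as an illustration.

```python
import math
from statistics import NormalDist


def wald_interval(estimate, eif_values, level=0.95):
    """Two-sided Wald interval: estimate +/- z * sqrt(sum(phi_i^2)) / n."""
    n = len(eif_values)
    # Plug-in standard error based on the second moment of the EIF values.
    se = math.sqrt(sum(v * v for v in eif_values)) / n
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se, se


# Hypothetical EIF contributions (illustration only, not from the paper).
phi = [0.4, -0.1, 0.3, -0.5, 0.2, -0.3]
lower, upper, se = wald_interval(0.12, phi)
```

The same function applies to $\widehat{\theta}(\boldsymbol{\delta})$ by passing the values of $\widehat{\varphi}_{\theta(\boldsymbol{\delta})}$ instead.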

Appendix E Sensitivity analysis for unmeasured confounding

This section assesses the robustness of our incremental policy estimands to violations of the no-unmeasured-confounding assumption, following the framework of Chernozhukov et al. (2022).

E.1 Setup: long vs. short worlds

Let the observed data be $\mathbf{Z}_{i}=(\mathbf{X}_{i},\mathbf{W}_{i},Y_{i})$ drawn i.i.d. from $P_{0}$ . Assume there exists an unobserved confounder $U$ such that, with $\mathbf{V}:=(\mathbf{X},U)$ , the potential outcomes $\{Y(\mathbf{w}):\mathbf{w}\in\mathcal{W}\}$ satisfy

Y(\mathbf{w})\ \perp\!\!\!\perp\ \mathbf{W}\ \big|\ \mathbf{V},\qquad\forall\mathbf{w}\in\mathcal{W},

(14)

together with SUTVA/consistency.

Define the long and short outcome regressions

\mu(\mathbf{v},\mathbf{w}):=\operatorname{E}[Y\mid\mathbf{V}=\mathbf{v},\mathbf{W}=\mathbf{w}],\qquad\mu_{s}(\mathbf{x},\mathbf{w}):=\operatorname{E}[Y\mid\mathbf{X}=\mathbf{x},\mathbf{W}=\mathbf{w}].

Let $f(\mathbf{w}\mid\mathbf{v})$ denote the long conditional density of $\mathbf{W}$ given $\mathbf{V}$ , and $f(\mathbf{w}\mid\mathbf{x})$ the short conditional density of $\mathbf{W}$ given $\mathbf{X}$ (i.e. the observed conditional law of the exposure mixture given $\mathbf{X}$ ).

Stochastic intervention fixed from observed data.

We keep the intervention rule identical to Section 2: for any fixed $\boldsymbol{\delta}\in\mathbb{R}^{q}$ ,

g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{x})=\frac{\exp(\boldsymbol{\delta}^{\top}\mathbf{w})\,f(\mathbf{w}\mid\mathbf{x})}{\int_{\mathcal{W}}\exp(\boldsymbol{\delta}^{\top}\mathbf{w}^{\prime})\,f(\mathbf{w}^{\prime}\mid\mathbf{x})\,d\mathbf{w}^{\prime}}.

(15)

Importantly, $g_{\boldsymbol{\delta}}(\cdot\mid\mathbf{x})$ depends only on $\mathbf{x}$ and is therefore implementable; it is well-defined regardless of whether $U$ exists.
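To build intuition for the tilt in (15), the sketch below applies it to a density tabulated on a one-dimensional grid. It assumes $q=1$ and a standard normal $f(\cdot\mid\mathbf{x})$, for which exponential tilting by $\delta$ is known to give a normal density with mean $\delta$ and unit variance. The helper names `tilt_density` and `grid_mean` are hypothetical.

```python
import math


def tilt_density(grid, density, delta):
    """Exponentially tilt a density tabulated on an equally spaced 1-D grid."""
    step = grid[1] - grid[0]
    unnorm = [math.exp(delta * w) * f for w, f in zip(grid, density)]
    # Riemann-sum approximation of the normalizing integral in (15).
    norm = sum(unnorm) * step
    return [u / norm for u in unnorm]


def grid_mean(grid, density):
    """Riemann-sum approximation of the mean of a tabulated density."""
    step = grid[1] - grid[0]
    return sum(w * f for w, f in zip(grid, density)) * step


# Illustrative case: f(w | x) is standard normal on [-10, 10].
grid = [-10.0 + 0.01 * k for k in range(2001)]
f = [math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi) for w in grid]
g = tilt_density(grid, f, delta=0.5)
shifted_mean = grid_mean(grid, g)  # close to 0.5 for this Gaussian case
```

The tilted density stays within the support of $f(\cdot\mid\mathbf{x})$, which is the property that lets the intervention sidestep positivity violations.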

Target (long) estimand.

Let $Y^{g_{\boldsymbol{\delta}}}$ denote the counterfactual outcome under the intervention that draws $\mathbf{W}$ from $g_{\boldsymbol{\delta}}(\cdot\mid\mathbf{X})$ (independently of $U$ given $\mathbf{X}$ ). Under (14), the causal estimand is

\psi(\boldsymbol{\delta}):=\operatorname{E}\big[Y^{g_{\boldsymbol{\delta}}}\big]=\operatorname{E}\Big[\int_{\mathcal{W}}\mu(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\Big].

(16)

Identified (short) estimand under ignorability given $\mathbf{X}$ .

If one incorrectly assumes ignorability given $\mathbf{X}$ only, the same g-formula yields the identified estimand

\psi_{s}(\boldsymbol{\delta}):=\operatorname{E}\Big[\int_{\mathcal{W}}\mu_{s}(\mathbf{X},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\Big],

(17)

which is the estimand targeted by our estimators.

Our sensitivity analysis bounds the discrepancy $\psi_{s}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})$ as a function of interpretable sensitivity parameters.

E.2 Linear functional form and Riesz representers

For a fixed $\boldsymbol{\delta}$ , define the functional on square-integrable functions $h(\mathbf{v},\mathbf{w})\in L_{2}(P_{\mathbf{V},\mathbf{W}})$ :

\mathcal{T}_{\boldsymbol{\delta}}(h):=\operatorname{E}\Big[\int_{\mathcal{W}}h(\mathbf{V},\mathbf{w})\,g_{\boldsymbol{\delta}}(\mathbf{w}\mid\mathbf{X})\,d\mathbf{w}\Big].

(18)

Then $\psi(\boldsymbol{\delta})=\mathcal{T}_{\boldsymbol{\delta}}(\mu)$ and $\psi_{s}(\boldsymbol{\delta})=\mathcal{T}_{\boldsymbol{\delta}}(\mu_{s})$ .

Proposition 6.

The functional $\mathcal{T}_{\boldsymbol{\delta}}(h)$ is a linear functional.

Proof.

To establish linearity, we must verify that $\mathcal{T}_{\boldsymbol{\delta}}$ satisfies both additivity and homogeneity. Let $h_{1},h_{2}\in L_{2}(P_{\mathbf{V},\mathbf{W}})$ be arbitrary functions and let $c\in\mathbb{R}$ be a scalar constant.

1. Additivity.

By the linearity of the Lebesgue integral and the linearity of the expectation operator, we have:

	$\displaystyle\mathcal{T}_{\boldsymbol{\delta}}(h_{1}+h_{2})$	$\displaystyle=\mathbb{E}\left[\int_{\mathcal{W}}(h_{1}+h_{2})(\boldsymbol{V},\boldsymbol{w})\,g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right]$
		$\displaystyle=\mathbb{E}\left[\int_{\mathcal{W}}\left(h_{1}(\boldsymbol{V},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})+h_{2}(\boldsymbol{V},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\right)\,d\boldsymbol{w}\right]$
		$\displaystyle=\mathbb{E}\left[\int_{\mathcal{W}}h_{1}(\boldsymbol{V},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}+\int_{\mathcal{W}}h_{2}(\boldsymbol{V},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right]$
		$\displaystyle=\mathbb{E}\left[\int_{\mathcal{W}}h_{1}(\boldsymbol{V},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right]+\mathbb{E}\left[\int_{\mathcal{W}}h_{2}(\boldsymbol{V},\boldsymbol{w})g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right]$
		$\displaystyle=\mathcal{T}_{\boldsymbol{\delta}}(h_{1})+\mathcal{T}_{\boldsymbol{\delta}}(h_{2}).$

2. Homogeneity.

	$\displaystyle\mathcal{T}_{\boldsymbol{\delta}}(c\cdot h)$	$\displaystyle=\mathbb{E}\left[\int_{\mathcal{W}}c\cdot h(\boldsymbol{V},\boldsymbol{w})\,g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right]$
		$\displaystyle=\mathbb{E}\left[c\int_{\mathcal{W}}h(\boldsymbol{V},\boldsymbol{w})\,g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right]$
		$\displaystyle=c\,\mathbb{E}\left[\int_{\mathcal{W}}h(\boldsymbol{V},\boldsymbol{w})\,g_{\boldsymbol{\delta}}(\boldsymbol{w}\mid\boldsymbol{X})\,d\boldsymbol{w}\right]$
		$\displaystyle=c\cdot\mathcal{T}_{\boldsymbol{\delta}}(h).$

Since both conditions are satisfied, $\mathcal{T}_{\boldsymbol{\delta}}$ is a linear functional. ∎

Weak overlap / continuity condition.

Assume $\mathcal{T}_{\boldsymbol{\delta}}$ is continuous on $L_{2}(P_{\mathbf{V},\mathbf{W}})$ ; a sufficient condition is the “weak overlap” requirement

\operatorname{E}\!\left[\left(\frac{g_{\boldsymbol{\delta}}(\mathbf{W}\mid\mathbf{X})}{f(\mathbf{W}\mid\mathbf{V})}\right)^{2}\right]<\infty.

(19)

This condition is untestable without $U$ , but it is the standard integrability requirement needed for the Riesz representation.

Riesz representation.

Under (19), by the Riesz–Fréchet representation theorem there exists a unique (long) Riesz representer $\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})\in L_{2}(P_{\mathbf{V},\mathbf{W}})$ such that

\mathcal{T}_{\boldsymbol{\delta}}(h)=\operatorname{E}\big[h(\mathbf{V},\mathbf{W})\,\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})\big],\quad\forall h\in L_{2}(P_{\mathbf{V},\mathbf{W}}).

Moreover, the representer has the Radon–Nikodym form

\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})=\frac{g_{\boldsymbol{\delta}}(\mathbf{W}\mid\mathbf{X})}{f(\mathbf{W}\mid\mathbf{V})}.

(20)

Likewise, the short Riesz representer for $\mathcal{T}_{\boldsymbol{\delta}}$ over $L_{2}(P_{\mathbf{X},\mathbf{W}})$ is

\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})=\frac{g_{\boldsymbol{\delta}}(\mathbf{W}\mid\mathbf{X})}{f(\mathbf{W}\mid\mathbf{X})}=\operatorname{E}\big[\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})\mid\mathbf{X},\mathbf{W}\big].

(21)

Under the exponential tilt (15), $\alpha_{s,\boldsymbol{\delta}}$ simplifies to

\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})=\frac{\exp(\boldsymbol{\delta}^{\top}\mathbf{W})}{\nu_{\boldsymbol{\delta}}(\mathbf{X})},\qquad\nu_{\boldsymbol{\delta}}(\mathbf{x}):=\operatorname{E}\big[\exp(\boldsymbol{\delta}^{\top}\mathbf{W})\mid\mathbf{X}=\mathbf{x}\big].

(22)

Thus our policy estimand is a continuous linear functional of the long regression $\mu$ with a well-defined Riesz representer (RR).
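The closed form (22) can be evaluated directly on a sample once $\nu_{\boldsymbol{\delta}}(\mathbf{X})$ is estimated. The sketch below is a simplified illustration that assumes $\mathbf{X}$ is a discrete stratum label, so that $\nu_{\boldsymbol{\delta}}$ can be estimated by within-stratum means; with continuous covariates a regression estimator would take its place. The function `short_rr` and the data are hypothetical.

```python
import math
from collections import defaultdict


def short_rr(w_rows, strata, delta):
    """alpha_s = exp(delta'W) / nu_delta(X), with nu_delta from stratum means."""
    tilt = [math.exp(sum(d * w for d, w in zip(delta, row))) for row in w_rows]
    totals, counts = defaultdict(float), defaultdict(int)
    for t, s in zip(tilt, strata):
        totals[s] += t
        counts[s] += 1
    # Within-stratum average of exp(delta'W) estimates nu_delta(x).
    nu = {s: totals[s] / counts[s] for s in totals}
    return [t / nu[s] for t, s in zip(tilt, strata)]


# Hypothetical two-component exposures and a binary stratum label.
W = [(0.2, 1.0), (0.5, 0.8), (0.1, 1.4), (0.9, 0.3), (0.7, 0.6), (0.4, 1.1)]
X = ["a", "a", "a", "b", "b", "b"]
alpha = short_rr(W, X, delta=(0.3, -0.2))
```

By construction the estimated weights average to one within each stratum, mirroring $\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})\mid\mathbf{X}]=1$ .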

E.3 Exact OVB identity and sharp bound

Define the outcome regression error and the RR error:

\Delta_{\mu}(\mathbf{V},\mathbf{W}):=\mu(\mathbf{V},\mathbf{W})-\mu_{s}(\mathbf{X},\mathbf{W}),\qquad\Delta_{\alpha}(\mathbf{V},\mathbf{W}):=\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})-\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W}).

Then the omitted-variable bias satisfies the exact identity

\psi_{s}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})=-\operatorname{E}\big[\Delta_{\mu}(\mathbf{V},\mathbf{W})\,\Delta_{\alpha}(\mathbf{V},\mathbf{W})\big].

(23)

By Cauchy–Schwarz,

\big|\psi_{s}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\big|\leq B(\boldsymbol{\delta}):=\sqrt{\operatorname{E}[\Delta_{\mu}^{2}]}\;\sqrt{\operatorname{E}[\Delta_{\alpha}^{2}]}.

(24)

It is often useful to isolate the “degree of adversity”

\varrho(\boldsymbol{\delta}):=\operatorname{Corr}\!\big(\Delta_{\mu}(\mathbf{V},\mathbf{W}),\Delta_{\alpha}(\mathbf{V},\mathbf{W})\big)\in[-1,1],

so that (23) yields $|\psi_{s}-\psi|^{2}=\varrho(\boldsymbol{\delta})^{2}\,B(\boldsymbol{\delta})^{2}\leq B(\boldsymbol{\delta})^{2}.$ In our primary analysis and reported bounds, we focus on the worst-case scenario by considering adversarial confounding, which implicitly sets $|\varrho(\boldsymbol{\delta})|=1$ . This allows us to establish a conservative bound and subsequently omit the correlation term from our final operational formulas.
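The bias identity and the bound (24) can be checked numerically in a toy model. The sketch below assumes a deliberately simplified setting that is not part of the paper: no observed covariates, a binary latent $U$ , and a binary exposure $W$ , so that integrals over $\mathcal{W}$ become sums. It verifies that the magnitude of $\psi_{s}-\psi$ equals $|\operatorname{E}[\Delta_{\mu}\Delta_{\alpha}]|$ and does not exceed $B(\boldsymbol{\delta})$ .

```python
import math

p_u = {0: 0.5, 1: 0.5}             # law of the latent U (hypothetical)
p_w1_given_u = {0: 0.3, 1: 0.7}    # P(W = 1 | U = u) (hypothetical)
delta = 0.5


def f_long(w, u):
    return p_w1_given_u[u] if w == 1 else 1.0 - p_w1_given_u[u]


def f_short(w):
    return sum(p_u[u] * f_long(w, u) for u in p_u)


def mu_long(u, w):
    return 1.0 + 2.0 * w + 3.0 * u


def mu_short(w):
    # E[mu(U, w) | W = w], averaging over the law of U given W = w.
    return sum(p_u[u] * f_long(w, u) * mu_long(u, w) for u in p_u) / f_short(w)


norm = sum(math.exp(delta * w) * f_short(w) for w in (0, 1))


def g(w):
    # Exponential tilt of the short (observed) exposure law.
    return math.exp(delta * w) * f_short(w) / norm


psi = sum(p_u[u] * g(w) * mu_long(u, w) for u in p_u for w in (0, 1))
psi_s = sum(g(w) * mu_short(w) for w in (0, 1))

cross = d_mu_sq = d_alpha_sq = 0.0
for u in p_u:
    for w in (0, 1):
        prob = p_u[u] * f_long(w, u)
        d_mu = mu_long(u, w) - mu_short(w)
        d_alpha = g(w) / f_long(w, u) - g(w) / f_short(w)
        cross += prob * d_mu * d_alpha
        d_mu_sq += prob * d_mu ** 2
        d_alpha_sq += prob * d_alpha ** 2

gap = abs(psi_s - psi)
bound = math.sqrt(d_mu_sq * d_alpha_sq)
```

In this example the gap is strictly smaller than the bound, which corresponds to $|\varrho(\boldsymbol{\delta})|<1$ .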

E.4 Reparameterization by interpretable partial $R^{2}$

Following the general theory, the bound can be rewritten as a product of an identifiable scale and two sensitivity parameters with partial- $R^{2}$ interpretations.

Identifiable scale.

Let $\sigma_{s}^{2}:=\operatorname{E}\big[(Y-\mu_{s}(\mathbf{X},\mathbf{W}))^{2}\big]$ and $\nu_{s}^{2}(\boldsymbol{\delta}):=\operatorname{E}\big[\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})^{2}\big]$ . Define

S(\boldsymbol{\delta})^{2}:=\sigma_{s}^{2}\;\nu_{s}^{2}(\boldsymbol{\delta}),

(25)

which depends only on the observed-data law of $(Y,\mathbf{W},\mathbf{X})$ and is therefore estimable.

Outcome-side sensitivity (partial $R^{2}$ ).

Define

C_{Y}^{2}:=\frac{\operatorname{E}[\Delta_{\mu}(\mathbf{V},\mathbf{W})^{2}]}{\operatorname{E}[(Y-\mu_{s}(\mathbf{X},\mathbf{W}))^{2}]}=\frac{\operatorname{Var}\!\big(\operatorname{E}[Y\mid\mathbf{X},\mathbf{W},U]\big)-\operatorname{Var}\!\big(\operatorname{E}[Y\mid\mathbf{X},\mathbf{W}]\big)}{\operatorname{Var}(Y)-\operatorname{Var}\!\big(\operatorname{E}[Y\mid\mathbf{X},\mathbf{W}]\big)}\in[0,1],

(26)

i.e. the nonparametric partial $R^{2}$ of $U$ with $Y$ given $(\mathbf{X},\mathbf{W})$ .

Exposure/RR-side sensitivity.

Because $\alpha_{s,\boldsymbol{\delta}}$ is the $L_{2}$ projection of $\alpha_{\boldsymbol{\delta}}$ onto $(\mathbf{X},\mathbf{W})$ , we have $\operatorname{E}[\Delta_{\alpha}^{2}]=\operatorname{E}[\alpha_{\boldsymbol{\delta}}^{2}]-\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}^{2}].$ Define the (RR-space) $R^{2}$ :

R_{\alpha}^{2}(\boldsymbol{\delta}):=\operatorname{Corr}^{2}\!\big(\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W}),\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})\big)=\frac{\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}^{2}]}{\operatorname{E}[\alpha_{\boldsymbol{\delta}}^{2}]}\in(0,1],

and let

C_{D}^{2}(\boldsymbol{\delta}):=\frac{\operatorname{E}[\alpha_{\boldsymbol{\delta}}^{2}]-\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}^{2}]}{\operatorname{E}[\alpha_{s,\boldsymbol{\delta}}^{2}]}=\frac{1-R_{\alpha}^{2}(\boldsymbol{\delta})}{R_{\alpha}^{2}(\boldsymbol{\delta})}.

(27)

Thus $1-R_{\alpha}^{2}(\boldsymbol{\delta})$ is the fraction of RR variation generated by the latent confounder.

Final bound in $R^{2}$ form.

Combining (24)–(27) yields

B(\boldsymbol{\delta})^{2}=S(\boldsymbol{\delta})^{2}\,C_{Y}^{2}\,C_{D}^{2}(\boldsymbol{\delta}),\qquad\big|\psi_{s}(\boldsymbol{\delta})-\psi(\boldsymbol{\delta})\big|\leq\,S(\boldsymbol{\delta})\,C_{Y}\,C_{D}(\boldsymbol{\delta}).

(28)

Therefore, for any posited $(C_{Y}^{2},R_{\alpha}^{2}(\boldsymbol{\delta}))$ , we obtain the sensitivity interval

\psi(\boldsymbol{\delta})\in\Big[\;\psi_{s}(\boldsymbol{\delta})-S(\boldsymbol{\delta})C_{Y}C_{D}(\boldsymbol{\delta}),\;\psi_{s}(\boldsymbol{\delta})+S(\boldsymbol{\delta})C_{Y}C_{D}(\boldsymbol{\delta})\;\Big].

(29)

E.5 Incremental effect

If the scientific target is the incremental effect contrast $\theta(\boldsymbol{\delta}):=\psi(\boldsymbol{\delta})-\psi(\mathbf{0})$ , then the difference of continuous linear functionals is again a continuous linear functional. The same analysis applies with the RR replaced by the RR contrast:

\alpha_{\theta,\boldsymbol{\delta}}(\mathbf{V},\mathbf{W}):=\alpha_{\boldsymbol{\delta}}(\mathbf{V},\mathbf{W})-\alpha_{\mathbf{0}}(\mathbf{V},\mathbf{W}),\qquad\alpha_{s,\theta,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W}):=\alpha_{s,\boldsymbol{\delta}}(\mathbf{X},\mathbf{W})-\alpha_{s,\mathbf{0}}(\mathbf{X},\mathbf{W}),

and $\mu$ unchanged. One obtains an interval for $\theta(\boldsymbol{\delta})$ by applying (29) to $\theta$ .

E.6 Implementation

For each fixed $\boldsymbol{\delta}$ considered in the paper, the sensitivity analysis requires three estimated quantities:

1. $\widehat{\theta}(\boldsymbol{\delta})$ : our EIF-based (cross-fitted) estimator of the incremental effect from Section 3.

2. $\widehat{\sigma}_{s}^{2}:=n^{-1}\sum_{i=1}^{n}\{Y_{i}-\widehat{\mu}_{s}(\mathbf{X}_{i},\mathbf{W}_{i})\}^{2}$ , using the same cross-fitted $\widehat{\mu}_{s}$ as in the main estimator.

3. $\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta}):=n^{-1}\sum_{i=1}^{n}\widehat{\alpha}_{s,\theta,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})^{2}$ , where $\widehat{\alpha}_{s,\theta,\boldsymbol{\delta}}=\widehat{\alpha}_{s,\boldsymbol{\delta}}-\widehat{\alpha}_{s,\boldsymbol{0}}=\widehat{\alpha}_{s,\boldsymbol{\delta}}-1$ , and $\widehat{\alpha}_{s,\boldsymbol{\delta}}$ is obtained either as the estimated density ratio $\widehat{g}_{\boldsymbol{\delta}}/\widehat{f}$ or via the closed form (22) by estimating $\nu_{\boldsymbol{\delta}}(\mathbf{X})=\operatorname{E}[\exp(\boldsymbol{\delta}^{\top}\mathbf{W})\mid\mathbf{X}]$ with regression.

Then $\widehat{S}(\boldsymbol{\delta}):=\sqrt{\widehat{\sigma}_{s}^{2}\,\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta})}$ .

Finally, for any posited sensitivity parameters

\eta_{Y}^{2}:=C_{Y}^{2}\in[0,1],\qquad\eta_{\alpha}^{2}(\boldsymbol{\delta}):=1-R_{\alpha}^{2}(\boldsymbol{\delta})\in[0,1),

the bias bound becomes

\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2})=\widehat{S}(\boldsymbol{\delta})\,\sqrt{\eta_{Y}^{2}}\,\sqrt{\frac{\eta_{\alpha}^{2}(\boldsymbol{\delta})}{1-\eta_{\alpha}^{2}(\boldsymbol{\delta})}}.

Accordingly, the estimated identified set (point bounds) for the incremental effect is

\theta(\boldsymbol{\delta})\in\Big[\,\widehat{\theta}(\boldsymbol{\delta})-\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2}),\;\widehat{\theta}(\boldsymbol{\delta})+\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2})\Big].
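The bias bound and the resulting point bounds are simple functions of $\widehat{S}(\boldsymbol{\delta})$ and the two sensitivity parameters. A minimal sketch follows; the function names and the numerical inputs are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def bias_bound(s_hat, eta_y_sq, eta_alpha_sq):
    """B = S * sqrt(eta_Y^2) * sqrt(eta_alpha^2 / (1 - eta_alpha^2))."""
    if not (0.0 <= eta_y_sq <= 1.0 and 0.0 <= eta_alpha_sq < 1.0):
        raise ValueError("need eta_Y^2 in [0, 1] and eta_alpha^2 in [0, 1)")
    return s_hat * math.sqrt(eta_y_sq) * math.sqrt(eta_alpha_sq / (1.0 - eta_alpha_sq))


def sensitivity_set(theta_hat, s_hat, eta_y_sq, eta_alpha_sq):
    """Point bounds theta_hat -/+ B for the incremental effect."""
    b = bias_bound(s_hat, eta_y_sq, eta_alpha_sq)
    return theta_hat - b, theta_hat + b


# Hypothetical inputs (illustration only).
lower, upper = sensitivity_set(theta_hat=-0.8, s_hat=2.0, eta_y_sq=0.25, eta_alpha_sq=0.5)
```

Because the bound diverges as $\eta_{\alpha}^{2}(\boldsymbol{\delta})\to 1$ , the sketch rejects $\eta_{\alpha}^{2}(\boldsymbol{\delta})=1$ rather than returning an infinite interval.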

When $\widehat{\theta}(\boldsymbol{\delta})\neq 0$ , the ratio

\frac{\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2})}{|\widehat{\theta}(\boldsymbol{\delta})|}

compares the benchmarked bias half-width with the magnitude of the estimated incremental effect. Table 2 reports this ratio for the main-text benchmark with $k_{Y}=1$ and $k_{D}=1$ over the 10 positive Gelbrich targets on each of the 9 intervention paths. Values below one indicate that the benchmarked point bound is narrower than the estimated effect in magnitude, whereas values above one indicate that the benchmarked bias half-width exceeds the estimated effect itself.

Table 2: Ratio $\widehat{B}(\boldsymbol{\delta})/|\widehat{\theta}(\boldsymbol{\delta})|$ under the formal benchmark with $k_{Y}=1$ and $k_{D}=1$ . Columns correspond to the positive Gelbrich targets used in the application.

Scenario	0.05	0.10	0.15	0.20	0.25	0.30	0.35	0.40	0.45	0.50
BC	0.22	0.23	0.25	0.27	0.28	0.30	0.32	0.33	0.35	0.37
NO₃	0.42	0.62	0.60	0.53	0.47	0.41	0.36	0.32	0.29	0.26
OM	0.21	0.24	0.26	0.28	0.30	0.33	0.35	0.37	0.39	0.41
SO₄	0.10	0.10	0.09	0.08	0.07	0.07	0.06	0.06	0.05	0.05
NH₄	0.13	0.18	0.21	0.21	0.20	0.19	0.17	0.16	0.15	0.14
BC+OM	0.84	0.98	1.24	1.72	2.67	3.73	7.09	50.85	12.16	12.16
NO₃+SO₄+NH₄	0.17	0.19	0.22	0.27	0.32	0.35	0.36	0.35	0.32	0.32
All	0.07	0.08	0.10	0.12	0.14	0.17	0.22	0.24	0.25	0.25
BFGS	0.03	0.21	0.23	0.20	0.16	0.09	0.08	0.07	0.07	0.07

Confidence bounds for the estimated endpoints.

The empirical results in the main text report confidence bounds for the two estimated endpoints above. Let $\widehat{\varphi}_{\theta(\boldsymbol{\delta})}(\mathbf{Z}_{i})$ denote the cross-fitted EIF contribution of $\widehat{\theta}(\boldsymbol{\delta})$ . Define the centered plug-in signals

	$\displaystyle\widehat{\varphi}_{\sigma,i}$	$\displaystyle:=\{Y_{i}-\widehat{\mu}_{s}(\mathbf{X}_{i},\mathbf{W}_{i})\}^{2}-\widehat{\sigma}_{s}^{2},$
	$\displaystyle\widehat{\varphi}_{\nu,i}(\boldsymbol{\delta})$	$\displaystyle:=\widehat{\alpha}_{s,\theta,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})^{2}-\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta}),$

and write

\lambda(\boldsymbol{\delta}):=\,\sqrt{\eta_{Y}^{2}}\,\sqrt{\frac{\eta_{\alpha}^{2}(\boldsymbol{\delta})}{1-\eta_{\alpha}^{2}(\boldsymbol{\delta})}}.

The delta method gives

	$\displaystyle\widehat{\varphi}_{S,i}(\boldsymbol{\delta})$	$\displaystyle:=\frac{\widehat{\nu}_{s,\theta}^{2}(\boldsymbol{\delta})\,\widehat{\varphi}_{\sigma,i}+\widehat{\sigma}_{s}^{2}\,\widehat{\varphi}_{\nu,i}(\boldsymbol{\delta})}{2\,\widehat{S}(\boldsymbol{\delta})},$
	$\displaystyle\widehat{\varphi}_{-,i}(\boldsymbol{\delta})$	$\displaystyle:=\widehat{\varphi}_{\theta(\boldsymbol{\delta})}(\mathbf{Z}_{i})-\lambda(\boldsymbol{\delta})\widehat{\varphi}_{S,i}(\boldsymbol{\delta}),$
	$\displaystyle\widehat{\varphi}_{+,i}(\boldsymbol{\delta})$	$\displaystyle:=\widehat{\varphi}_{\theta(\boldsymbol{\delta})}(\mathbf{Z}_{i})+\lambda(\boldsymbol{\delta})\widehat{\varphi}_{S,i}(\boldsymbol{\delta}).$

Writing

	$\displaystyle\widehat{\theta}_{-}(\boldsymbol{\delta})$	$\displaystyle:=\widehat{\theta}(\boldsymbol{\delta})-\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2}),$
	$\displaystyle\widehat{\theta}_{+}(\boldsymbol{\delta})$	$\displaystyle:=\widehat{\theta}(\boldsymbol{\delta})+\widehat{B}(\boldsymbol{\delta};\eta_{Y}^{2},\eta_{\alpha}^{2}),$

the corresponding standard errors are

	$\displaystyle\widehat{\mathrm{se}}_{-}(\boldsymbol{\delta})$	$\displaystyle:=\sqrt{\frac{1}{n^{2}}\sum_{i=1}^{n}\widehat{\varphi}_{-,i}(\boldsymbol{\delta})^{2}},$
	$\displaystyle\widehat{\mathrm{se}}_{+}(\boldsymbol{\delta})$	$\displaystyle:=\sqrt{\frac{1}{n^{2}}\sum_{i=1}^{n}\widehat{\varphi}_{+,i}(\boldsymbol{\delta})^{2}}.$

The sensitivity-adjusted $95\%$ interval is obtained by combining a lower confidence bound for $\widehat{\theta}_{-}(\boldsymbol{\delta})$ and an upper confidence bound for $\widehat{\theta}_{+}(\boldsymbol{\delta})$ :

\left[\widehat{\theta}_{-}(\boldsymbol{\delta})-z_{0.95}\widehat{\mathrm{se}}_{-}(\boldsymbol{\delta}),\;\widehat{\theta}_{+}(\boldsymbol{\delta})+z_{0.95}\widehat{\mathrm{se}}_{+}(\boldsymbol{\delta})\right].
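The delta-method steps above can be collected into a single routine. The sketch below assumes the per-observation inputs (EIF values of $\widehat{\theta}(\boldsymbol{\delta})$ , squared outcome residuals, and squared RR contrasts) are already available; the function name `endpoint_interval` and the data are hypothetical.

```python
import math
from statistics import NormalDist


def endpoint_interval(theta_hat, phi_theta, sq_resid, sq_alpha, lam, level=0.95):
    """Sensitivity-adjusted interval from the endpoint influence values."""
    n = len(phi_theta)
    sigma2 = sum(sq_resid) / n   # estimate of sigma_s^2
    nu2 = sum(sq_alpha) / n      # estimate of nu_{s,theta}^2(delta)
    s_hat = math.sqrt(sigma2 * nu2)
    z = NormalDist().inv_cdf(level)  # one-sided quantile, z_{0.95} by default
    ends = []
    for sign in (-1.0, 1.0):
        total = 0.0
        for p, r, a in zip(phi_theta, sq_resid, sq_alpha):
            # Delta-method influence value of S = sqrt(sigma^2 * nu^2);
            # when lam == 0 it is multiplied by zero and has no effect.
            phi_s = (nu2 * (r - sigma2) + sigma2 * (a - nu2)) / (2.0 * s_hat)
            total += (p + sign * lam * phi_s) ** 2
        se = math.sqrt(total / n ** 2)
        ends.append(theta_hat + sign * (lam * s_hat + z * se))
    return ends[0], ends[1]


# Hypothetical inputs (illustration only).
phi = [0.4, -0.1, 0.3, -0.5, 0.2, -0.3]
sq_resid = [1.0, 0.4, 0.9, 1.6, 0.2, 0.7]
sq_alpha = [0.04, 0.01, 0.09, 0.16, 0.02, 0.05]
lower, upper = endpoint_interval(0.12, phi, sq_resid, sq_alpha, lam=0.3)
```

Setting `lam=0` recovers the interval based on $\widehat{\varphi}_{\theta(\boldsymbol{\delta})}$ alone, which is a useful sanity check on an implementation.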

Selection of sensitivity parameters.

The parameters $\eta_{Y}^{2}$ and $\eta_{\alpha}^{2}(\boldsymbol{\delta})$ have direct interpretations: $\eta_{Y}^{2}$ is the maximal fraction of residual outcome variance explainable by $U$ given $(\mathbf{X},\mathbf{W})$ , and $\eta_{\alpha}^{2}(\boldsymbol{\delta})$ is the maximal fraction of RR variation explainable by $U$ for the policy indexed by $\boldsymbol{\delta}$ . These can be calibrated by subject-matter knowledge and by benchmarking against observed covariates, and then used in the bound and the endpoint confidence bounds above.

E.7 Formal benchmarking and calibration

Following Chernozhukov et al. (2022), we calibrate the sensitivity parameters on the $f^{2}$ scale induced by nested linear projections. This construction translates the observed contribution of a single covariate $X_{j}$ into a benchmark that is commensurate with the omitted-variable calibration in the bias bound. We carry out this comparison for each of the 22 observed covariates.

Outcome-side benchmark.

For each observed covariate $X_{j}$ , let

	$\displaystyle\widehat{\sigma}_{s}^{2}$	$\displaystyle:=\min_{a,\boldsymbol{b},\boldsymbol{c}}\frac{1}{n}\sum_{i=1}^{n}\Big\{Y_{i}-a-\boldsymbol{b}^{\top}\mathbf{W}_{i}-\boldsymbol{c}^{\top}\mathbf{X}_{i}\Big\}^{2},$
	$\displaystyle\widehat{\sigma}_{s,-j}^{2}$	$\displaystyle:=\min_{a,\boldsymbol{b},\boldsymbol{c}}\frac{1}{n}\sum_{i=1}^{n}\Big\{Y_{i}-a-\boldsymbol{b}^{\top}\mathbf{W}_{i}-\boldsymbol{c}^{\top}\mathbf{X}_{i,-j}\Big\}^{2},$

where $\mathbf{X}_{i,-j}$ denotes the observed covariate vector with $X_{ij}$ removed. The associated benchmark statistics are

	$\displaystyle\widehat{\eta}_{Y,j}^{2}$	$\displaystyle:=\frac{\widehat{\sigma}_{s,-j}^{2}-\widehat{\sigma}_{s}^{2}}{\widehat{\sigma}_{s,-j}^{2}},$
	$\displaystyle\widehat{f}_{Y,j}^{2}$	$\displaystyle:=\frac{\widehat{\eta}_{Y,j}^{2}}{1-\widehat{\eta}_{Y,j}^{2}}=\frac{\widehat{\sigma}_{s,-j}^{2}-\widehat{\sigma}_{s}^{2}}{\widehat{\sigma}_{s}^{2}}.$

Thus $\widehat{\eta}_{Y,j}^{2}$ is the observed partial $R^{2}$ for $X_{j}$ after adjusting for $(\boldsymbol{W},\boldsymbol{X}_{-j})$ , and $\widehat{f}_{Y,j}^{2}$ expresses the same gain relative to the residual variation in the full projection. Table 3 reports these quantities for all 22 covariates. The largest outcome-side benchmark is White, with $\widehat{\eta}_{Y,j}^{2}=0.0642$ and $\widehat{f}_{Y,j}^{2}=0.0686$ .
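The two benchmark statistics are related by a one-to-one transform, which the following sketch makes explicit. The function names are hypothetical; the only numerical input taken from the paper is the reported partial $R^{2}$ for White.

```python
def partial_r2(resid_var_reduced, resid_var_full):
    """Observed partial R^2 from the reduced and full residual variances."""
    return (resid_var_reduced - resid_var_full) / resid_var_reduced


def f2_from_partial_r2(eta_sq):
    """Cohen-style f^2 transform: eta^2 / (1 - eta^2)."""
    return eta_sq / (1.0 - eta_sq)


# Reported outcome-side partial R^2 for White.
f2_white = f2_from_partial_r2(0.0642)
```

Rounded to four decimals, `f2_white` agrees with the value 0.0686 reported for White.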

RR-side benchmark.

For the RR-side, we first evaluate the fitted short RR $\widehat{\alpha}_{s,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})$ at every intervention target retained in the application analysis. For each $X_{j}$ and each target $\boldsymbol{\delta}$ , we then compute the nested linear projections

	$\displaystyle\widehat{r}_{\alpha,\mathrm{full},j}^{2}(\boldsymbol{\delta})$	$\displaystyle:=\min_{a,\boldsymbol{b}}\frac{1}{n}\sum_{i=1}^{n}\Big\{\widehat{\alpha}_{s,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})-a-\boldsymbol{b}^{\top}\mathbf{X}_{i}\Big\}^{2},$
	$\displaystyle\widehat{r}_{\alpha,\mathrm{red},j}^{2}(\boldsymbol{\delta})$	$\displaystyle:=\min_{a,\boldsymbol{b}}\frac{1}{n}\sum_{i=1}^{n}\Big\{\widehat{\alpha}_{s,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})-a-\boldsymbol{b}^{\top}\mathbf{X}_{i,-j}\Big\}^{2}.$

The corresponding pointwise benchmark statistics are

	$\displaystyle\widehat{\eta}_{\alpha,j}^{2}(\boldsymbol{\delta})$	$\displaystyle:=\frac{\widehat{r}_{\alpha,\mathrm{red},j}^{2}(\boldsymbol{\delta})-\widehat{r}_{\alpha,\mathrm{full},j}^{2}(\boldsymbol{\delta})}{\widehat{r}_{\alpha,\mathrm{red},j}^{2}(\boldsymbol{\delta})},$
	$\displaystyle\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})$	$\displaystyle:=\frac{\widehat{\eta}_{\alpha,j}^{2}(\boldsymbol{\delta})}{1-\widehat{\eta}_{\alpha,j}^{2}(\boldsymbol{\delta})}=\frac{\widehat{r}_{\alpha,\mathrm{red},j}^{2}(\boldsymbol{\delta})-\widehat{r}_{\alpha,\mathrm{full},j}^{2}(\boldsymbol{\delta})}{\widehat{r}_{\alpha,\mathrm{full},j}^{2}(\boldsymbol{\delta})}.$

This is the direct analogue of the outcome-side construction, with the fitted short RR taking the place of the observed outcome.

The benchmark reported in the main text is attached to an intervention path rather than to a single target. For each scenario, we therefore rank the 22 covariates by the average of $\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})$ over the positive Gelbrich targets on that path. This average is the RR-side score used in the formal benchmark. It summarizes how much adding $X_{j}$ improves the linear approximation to the fitted short RR along the displayed path, and it keeps the benchmark tied to an intervention curve that is actually reported in the application. Table 4 reports these scenario-level mean $\widehat{f}_{\alpha,j}^{2}$ values for all 22 covariates.

With this aggregation, the selected RR-side benchmark covariates are Housing More People Units for BC, OM, and BC+OM; Cigarette Smoking for NO₃, NH₄, NO₃+SO₄+NH₄, and All; Households Smartphone for SO₄; and Housing Vacant for BFGS.
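The path-level aggregation amounts to averaging the pointwise $\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})$ over the targets on a path and taking the top-ranked covariate. A minimal sketch with placeholder covariate names and made-up values (not the paper's data):

```python
def select_benchmark(f2_by_covariate):
    """Average f^2 over targets per covariate and return the top-ranked one."""
    means = {name: sum(vals) / len(vals) for name, vals in f2_by_covariate.items()}
    best = max(means, key=means.get)
    return best, means


# Hypothetical f^2 values at three targets on one intervention path.
scores = {
    "covariate_A": [0.010, 0.020, 0.030],
    "covariate_B": [0.015, 0.015, 0.045],
    "covariate_C": [0.001, 0.002, 0.003],
}
best, means = select_benchmark(scores)
```

Averaging over the path means that a covariate with a single large pointwise value is not automatically selected.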

Benchmark calibration.

For the selected outcome-side covariate,

	$\displaystyle\eta_{Y}^{2}=\frac{k_{Y}\,\widehat{f}_{Y,j}^{2}}{1+k_{Y}\,\widehat{f}_{Y,j}^{2}}.$

For the selected RR-side covariate in a given scenario,

	$\displaystyle\eta_{\alpha}^{2}(\boldsymbol{\delta})=\frac{k_{D}\,\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})}{1+k_{D}\,\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})}.$

These calibrated values are inserted directly into the point bounds and the endpoint confidence bounds above. The main application figure uses $k_{Y}=1$ and $k_{D}=1$, so the omitted confounder is calibrated to the observed benchmark strength on both the outcome side and the RR side.
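The calibration map has the same form on both sides, so a single helper covers $k_{Y}$ and $k_{D}$. The following Python sketch is illustrative only: the function name `calibrate_eta2` is hypothetical, and the input 0.0686 is the $\widehat{f}_{Y,j}^{2}$ value in the White row of Table 3, used purely as an example and not as a statement about which covariate is selected.

```python
def calibrate_eta2(f2, k=1.0):
    """Map a benchmark f^2 and a multiplier k to a calibrated partial R^2.

    Implements eta^2 = k * f^2 / (1 + k * f^2), the form shared by the
    outcome-side (k_Y) and RR-side (k_D) calibration formulas.
    """
    if f2 < 0 or k < 0:
        raise ValueError("f2 and k must be nonnegative")
    scaled = k * f2
    return scaled / (1.0 + scaled)


# Illustrative input: f^2 = 0.0686 (White row of Table 3).
eta2_k1 = calibrate_eta2(0.0686, k=1.0)  # confounder as strong as the benchmark
eta2_k2 = calibrate_eta2(0.0686, k=2.0)  # confounder twice as strong (hypothetical)
```

With $k=1$ the map inverts $f^{2}=\eta^{2}/(1-\eta^{2})$, so the calibrated value returns to the observed partial $R^{2}$ (about 0.0642 here, matching the tabulated $\widehat{\eta}_{Y,j}^{2}$ up to rounding). Doubling $k$ gives roughly 0.1206, which is less than twice the $k=1$ value because the map is concave in $k$.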

Table 3: Outcome-side benchmarking statistics for the 22 observed covariates. The table reports the observed partial $R^{2}$ and its $f^{2}$ transform from the nested linear projections of $Y$ on $(\boldsymbol{W},\boldsymbol{X})$ and $(\boldsymbol{W},\boldsymbol{X}_{-j})$.

Covariate	$\widehat{\eta}_{Y,j}^{2}$	$\widehat{f}_{Y,j}^{2}$
White	0.0642	0.0686
Poverty	0.0341	0.0353
Physical Activity	0.0322	0.0333
Housing More People Units	0.0151	0.0154
Binge Drinking	0.0132	0.0134
Households No Internet	0.0123	0.0124
Housing No Vehicle	0.0112	0.0113
Housing 10 Units	0.0087	0.0087
Median Income	0.0082	0.0083
Cigarette Smoking	0.0052	0.0052
Low Education Computer No Internet	0.0042	0.0042
Percentage No Insurance	0.0037	0.0037
Male	0.0032	0.0032
Households Smartphone	0.0023	0.0023
Housing Renter	0.0011	0.0011
Housing Vacant	0.0010	0.0010
HS Higher	0.0006	0.0006
Housing Mobile	0.0001	0.0001
Obesity	0.0001	0.0001
Unemployed	0.0001	0.0001
Households Low Income No Internet	0.0001	0.0001
Households Only Smartphone	0.0000	0.0000

Table 4: RR-side benchmarking statistics for the 22 observed covariates. Entries are the scenario-level means of $\widehat{f}_{\alpha,j}^{2}(\boldsymbol{\delta})$ over the positive Gelbrich grid within each scenario, computed from nested linear projections of $\widehat{\alpha}_{s,\boldsymbol{\delta}}(\mathbf{X}_{i},\mathbf{W}_{i})$ on $\boldsymbol{X}$ and $\boldsymbol{X}_{-j}$. The main-text benchmark with $k_{D}=1$ selects the largest entry within each scenario.

Covariate	BC	NO₃	OM	SO₄	NH₄	BC+OM	NO₃+SO₄+NH₄	All	BFGS
White	0.0003	0.0049	0.0029	0.0017	0.0005	0.0002	0.0012	0.0008	0.0024
Poverty	0.0011	0.0171	0.0005	0.0015	0.0081	0.0012	0.0023	0.0026	0.0089
Physical Activity	0.0014	0.0018	0.0011	0.0003	0.0007	0.0014	0.0014	0.0001	0.0002
Housing More People Units	0.0089	0.0017	0.0124	0.0026	0.0007	0.0043	0.0007	0.0003	0.0043
Binge Drinking	0.0003	0.0002	0.0027	0.0002	0.0005	0.0004	0.0017	0.0002	0.0021
Households No Internet	0.0003	0.0008	0.0003	0.0002	0.0009	0.0017	0.0003	0.0000	0.0004
Housing No Vehicle	0.0021	0.0006	0.0054	0.0001	0.0006	0.0027	0.0125	0.0022	0.0006
Housing 10 Units	0.0004	0.0004	0.0023	0.0002	0.0001	0.0003	0.0010	0.0003	0.0003
Median Income	0.0018	0.0210	0.0005	0.0016	0.0058	0.0020	0.0025	0.0028	0.0071
Cigarette Smoking	0.0005	0.0259	0.0015	0.0022	0.0167	0.0014	0.0174	0.0079	0.0036
Low Education Computer No Internet	0.0001	0.0001	0.0003	0.0004	0.0008	0.0005	0.0009	0.0009	0.0038
Percentage No Insurance	0.0001	0.0015	0.0004	0.0000	0.0016	0.0002	0.0003	0.0002	0.0015
Male	0.0000	0.0008	0.0020	0.0008	0.0007	0.0001	0.0002	0.0006	0.0017
Households Smartphone	0.0009	0.0042	0.0004	0.0037	0.0029	0.0004	0.0009	0.0015	0.0006
Housing Renter	0.0026	0.0004	0.0001	0.0001	0.0001	0.0013	0.0008	0.0000	0.0002
Housing Vacant	0.0007	0.0224	0.0006	0.0010	0.0035	0.0011	0.0157	0.0064	0.0138
HS Higher	0.0003	0.0005	0.0007	0.0006	0.0000	0.0004	0.0004	0.0005	0.0025
Housing Mobile	0.0008	0.0060	0.0000	0.0006	0.0002	0.0015	0.0145	0.0072	0.0031
Obesity	0.0003	0.0001	0.0012	0.0000	0.0011	0.0009	0.0001	0.0003	0.0006
Unemployed	0.0002	0.0026	0.0001	0.0005	0.0003	0.0000	0.0008	0.0006	0.0029
Households Low Income No Internet	0.0002	0.0010	0.0000	0.0013	0.0002	0.0000	0.0016	0.0009	0.0005
Households Only Smartphone	0.0002	0.0004	0.0003	0.0009	0.0003	0.0001	0.0003	0.0002	0.0016

	$\displaystyle\Big\|\frac{g_{\boldsymbol{\delta}}}{f}-1-\boldsymbol{\delta}^{\top}\tilde{\boldsymbol{s}}\Big\|_{L_{2}(P_{0})}$	$\displaystyle\leq C_{g}\,\|\boldsymbol{\delta}\|^{2},$
	$\displaystyle\big\|\mathbb{E}_{g_{\boldsymbol{\delta}}}[\mu\mid\boldsymbol{X}]-\mathbb{E}_{f}[\mu\mid\boldsymbol{X}]-\boldsymbol{\delta}^{\top}\mathbb{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\big\|_{L_{2}(P_{0})}$	$\displaystyle\leq C_{\mu}\,\|\boldsymbol{\delta}\|^{2},$

	$\displaystyle\big\|r_{2}\,(\boldsymbol{\delta}^{\top}\mathbb{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}])\big\|_{L_{2}}$	$\displaystyle\leq\|r_{2}\|_{L_{2}}\,\|\boldsymbol{\delta}^{\top}\mathbb{E}_{f}[\mu\tilde{\boldsymbol{s}}\mid\boldsymbol{X}]\|_{\infty}$
		$\displaystyle\leq C_{g}\|\boldsymbol{\delta}\|^{2}\cdot(2MM_{s}\|\boldsymbol{\delta}\|)$
		$\displaystyle\leq 2C_{\max}\,M\,C_{g}\,\|\boldsymbol{\delta}\|^{2},$

	$\displaystyle\big\|\mathbb{E}[\phi\,R_{\varphi}(\cdot;\boldsymbol{\delta})]\big\|$	$\displaystyle\leq\|R_{\varphi}(\cdot;\boldsymbol{\delta})\|_{L_{2}}\leq C_{\varphi}\,\|\boldsymbol{\delta}\|^{2},$
	$\displaystyle\|r_{\mathrm{A5}}(\varepsilon)\|$	$\displaystyle\leq 2K\,\varepsilon^{2},$

	$\displaystyle D_{\mathrm{KL}}(P_{1}\|P_{0})$	$\displaystyle\leq\varepsilon^{2}=\frac{\alpha_{0}^{2}}{n},$
	$\displaystyle D_{\mathrm{KL}}(P_{1}^{\otimes n}\|P_{0}^{\otimes n})$	$\displaystyle=nD_{\mathrm{KL}}(P_{1}\|P_{0})\leq\alpha_{0}^{2}.$

	$\displaystyle\|\widehat{r}_{\boldsymbol{\delta}}-r_{\boldsymbol{\delta}}\|_{2}$	$\displaystyle\leq C_{1}(\Delta)\,\|\widehat{f}-f\|_{2},$		(10)
	$\displaystyle\|\widehat{m}_{\boldsymbol{\delta}}-m_{\boldsymbol{\delta}}\|_{2}$	$\displaystyle\leq C_{2}(\Delta)\,\big(\|\widehat{\mu}-\mu\|_{2}+\|\widehat{f}-f\|_{2}\big),$		(11)
