Multivariate incremental effects for continuous treatments: Studying the health effects of environmental mixtures
Abstract
Evaluating the causal health effects of multivariate, continuous exposures, such as air pollution mixtures, is a critical public health challenge. A primary obstacle is the frequent violation of the positivity assumption, which renders the effects of standard deterministic interventions unidentified or heavily reliant on unreliable model extrapolation. In this paper, we develop a novel causal inference framework to address this challenge. We extend exponential tilting to multivariate exposures and address the critical question of how to compare different intervention directions fairly. This establishes a systematic framework for defining and evaluating various policy-relevant causal estimands, allowing researchers to address diverse scientific questions. We develop numerous methodological advancements, including efficient one-step estimation strategies, a Riemannian BFGS algorithm to solve a constrained manifold optimization problem, semiparametric efficiency bounds for causal estimands, minimax rates for estimators, and asymptotic normality results. We demonstrate our framework’s utility by applying it to a nationwide environmental health dataset to identify the optimal strategy for reducing adverse health outcomes associated with a PM2.5 chemical mixture.
1 Introduction
Understanding the causal effects of complex air exposure mixtures is crucial for effective public health policy, but traditional statistical methods face significant challenges. A key challenge is that individuals are exposed to a complex mixture of correlated substances. Single-pollutant analyses often fall short, as they fail to account for the confounding effect of pollutants and cannot capture potential interaction effects, leading to misleading policy recommendations (Bobb et al., 2015; Antonelli and Zigler, 2024). This necessitates a causal inference framework for multivariate continuous exposures. A primary obstacle in evaluating the causal effects of multivariate continuous exposures is the positivity assumption. This assumption requires that for any given set of covariates, the exposure levels of interest have nonzero density. In environmental mixtures, this assumption is frequently violated; certain combinations of pollutants may be physically implausible or absent from observational data (Peters et al., 2012). Consequently, estimating the effect of deterministic interventions, such as setting a pollutant mixture to a specific counterfactual level, becomes impossible without resorting to unreliable extrapolation (Antonelli and Zigler, 2024; Rudolph et al., 2025). In the presence of positivity violations, there are (at least) two potential solutions. The less frequently used approach is to modify the estimation method to be robust to extrapolation, such as “extrapolation-aware” inferential methods that explicitly bound or characterize the uncertainty when moving outside the data support (Pfister and Bühlmann, 2021). The more common approach, and the one we focus on throughout, is to modify the scientific question by redefining the estimand.
One useful strategy for modifying the estimand with continuous treatments involves identifying and estimating the derivative of the exposure-response function, then integrating this derivative to recover the full curve. By focusing on local effects first and developing novel bias-corrected and Neyman Orthogonal estimators for the derivative function, this “differentiate-then-integrate” approach can circumvent the need for a global positivity assumption and has been a subject of significant recent research (Zhang et al., 2023; Kallus and Oprescu, 2022; Rothenhäusler and Yu, 2021). Another alternative is to leverage instrumental variables, where recent advances have extended these methods to continuous treatments without relying on positivity to identify local effects (Dorn and Guo, 2024; Bruns and Kallus, 2024).
While these methods are powerful, a distinct set of approaches that are flexible and policy-relevant rely on the use of stochastic interventions. These interventions generally define a policy that modifies the observed distribution of exposures rather than replacing it with a fixed value. Motivated by defining more realistic policies, early applications have been seen in epidemiology. For example, considering interventions that would truncate the exposure distribution, such as enforcing pollution levels below a certain cutoff (Taubman et al., 2009). This idea was later formalized in the statistical literature to provide a general framework for evaluating the population causal effect of shifting an entire exposure distribution (Díaz and van der Laan, 2012). Over the past decade, several distinct stochastic interventions have been developed. One approach is the shift intervention, which defines the post-intervention exposure as the natural exposure value shifted by a certain amount (Haneuse and Rotnitzky, 2013). A second, influential family of interventions is based on the natural value of treatment. These policies, often called Modified Treatment Policies (MTPs), define the counterfactual exposure as a function of the exposure that would have naturally occurred. This concept was pioneered in the context of dynamic treatment regimes (Robins et al., 2004) and later formalized for single time points (Young et al., 2011; Richardson and Robins, 2013). The framework has since been extended to handle longitudinal scenarios with continuous treatments (Díaz et al., 2023). There is ongoing research exploring the properties of generalized policies that depend on an individual’s natural value of treatment, for example using optimal transport to derive tighter bounds under unmeasured confounding (Kallus and Mbougwe, 2024). Generally speaking, this class of estimands relies on weaker positivity conditions than most deterministic estimands, however, they still rely on positivity holding for certain exposure values.
When positivity violations are a big concern and are likely to occur, a potentially more robust alternative that inherently respects the data’s support and thereby does not rely on positivity is the incremental intervention. This approach was initially developed for binary and longitudinal treatments through incremental shifts in propensity scores, and defines a new interventional distribution by tilting the exposure distribution with an exponential function in a specified direction (Kennedy, 2019). This concept has subsequently been generalized to univariate continuous treatments as well (Díaz and Hejazi, 2020; Schindl et al., 2024). Some recent work has criticized exponential tilted estimands such as these (Schindl and Wasserman, 2025) due to their less intuitive parameterization, asymmetric reallocation of probability mass, and for possessing less favorable asymptotic properties compared to certain alternatives. Regardless of their form, stochastic interventions, typically present their own set of challenges. One key concern is that estimands identified under positivity violations may correspond to interventions that are not directly implementable, revealing a fundamental “interpretability-implementability tradeoff” (Rudolph et al., 2024). Furthermore, when comparing the effects of different stochastic interventions, one must ensure the comparisons are fair, a concept that related work in different contexts has begun to formalize to prevent misleading conclusions (McClean et al., 2024).
Nearly all work to date has focused on univariate treatments, which is insufficient for addressing problems raised in the analysis of environmental mixtures. In this work, we build on the existing literature by using exponentially tilted estimands within the multivariate treatment context. This extension to the multivariate setting is non-trivial and introduces several new, complex questions. First, the intervention parameter becomes a vector, creating an infinite space of possible intervention directions. This raises a variety of policy questions about how to shift all exposures at once and which direction is best. Second, a fair basis for comparing these different directional shifts is needed; interventions must be constrained to be of the same size for meaningful comparisons, where the definition of fair is not unique. We propose solutions to each of these issues and explore a variety of estimands within this framework that target different policy-relevant questions of interest in environmental epidemiology. We show that these different estimands vary in terms of how efficiently they can be estimated from the data, and we provide algorithms for finding optimal policies in terms of shifts in exposures that lead to the biggest reduction in adverse health outcomes. We provide theoretical support in terms of minimax rates that show how well the estimands can be estimated and how the difficulty of estimation intrinsically depends on the conditional covariance matrix of the exposures. We also develop efficient influence function based estimators that can achieve root- convergence and asymptotic normality using complex, machine learning estimators for nuisance function estimation.
2 Exposure Shifts under Exponential Tilts
2.1 Notation and Potential Outcomes under SUTVA
Let our observed data consist of independent and identically distributed samples , drawn from some underlying distribution . Each observation is composed of a -dimensional vector of covariates , a -dimensional vector representing the environmental exposures or treatments111Note that we use the terms treatment and exposure interchangeably throughout the manuscript. Also note that environmental mixture simply refers to a vector of environmental exposures. , and a scalar outcome of interest . We denote the conditional density of the exposure mixture given the covariates as .
To formally define causal effects, we operate within the potential outcomes framework. For each individual and any potential exposure vector , we let denote the potential outcome that would have been observed had individual received exposure level . This notation relies on the Stable Unit Treatment Value Assumption (SUTVA, (Rubin, 1980)), which comprises two key principles:
-
1.
No Interference: The potential outcome for one individual is unaffected by the exposure assignments of other individuals. That is, depends only on the exposure assigned to individual .
-
2.
No multiple versions of treatment: Treatment is well-defined in the sense that there are not two distinct treatments that lead to the same value of .
Importantly, this assumption ensures that an individual’s observed outcome corresponds to their potential outcome under their observed exposure. Formally, if an individual is observed to have exposure , their observed outcome is .
2.2 Estimands using Exponential Tilts
In this work, we extend the framework of exponential tilting incremental causal effects to the multivariate treatment setting. This type of stochastic treatment was first proposed for binary treatments (Kennedy, 2019) and later generalized to single continuous treatments (Díaz and Hejazi, 2020; Schindl et al., 2024). We adapt this formulation to define interventions on the entire -dimensional exposure vector, . Given the conditional density of the exposure mixture, , we define an exponentially tilted interventional density, , indexed by a user-specified vector :
| (1) |
Here, the denominator is a normalizing constant that ensures integrates to one, and is a vector determining both the direction and magnitude of how the natural density is shifted.
Our causal estimand of interest, the incremental effect, is the expected potential outcome under the stochastic intervention defined by the tilted density . We denote this estimand as :
This represents the population average outcome if, for covariates , each individual’s exposure were a random draw from the shifted distribution . In order to identify this quantity from the observed data, we must make a standard no unmeasured confounding assumption: for all ,
We develop sensitivity analysis approaches to assess violations of this assumption in Section 5. Under SUTVA and the no unmeasured confounding assumption, this causal quantity is identified from the observed data distribution via:
| (2) |
where denotes the marginal probability measure of the covariates . Note that because the support of is identical to the support of , we do not have to invoke any positivity assumptions and the intervention does not require us to estimate outcomes for exposure combinations that are never observed in the data.
The parameter has a similar interpretation as the constant gradient of the log-likelihood ratio between the interventional and observational densities as in (Schindl et al., 2024):
This means that each component, , quantifies the change in this log density ratio for an infinitesimal increase in the -th exposure component, . Intuitively, setting defines an intervention that tilts the distribution to favor higher values of the first exposure, .
2.3 Different Exposure Shifts and Efficiency
Let . Following the same derivation as (Schindl et al., 2024) but generalized to the multivariate setting, we get the following formula for the efficient influence function.
Proposition 1.
The efficient influence function of under a nonparametric model is given by for
Note the subscript denotes expectations with respect to the tilted exposure density. The asymptotic variance of any regular and asymptotically linear (RAL) estimator for is equal to the variance of its EIF, . A critical observation from the structure of is the repeated appearance of the density ratio, . When this ratio exhibits high variability, it inflates the variance of the first two components of the EIF, leading to less precise estimates of the causal effect. Therefore, to improve statistical efficiency for a given intervention strength, we could select an intervention direction that minimizes the variance of this density ratio, . The third term, on the other hand, represents the deviation between the average post-intervention effect for a specific covariate group and the overall average post-intervention effect. It quantifies how much higher or lower the expected intervention effect for an individual with covariates is compared to the overall population average.
To gain analytical insight into this variance, we can let the exposures follow a multivariate normal distribution, . Under this assumption, we have two key results. First, the tilted distribution is also normal, with a shifted mean:
This provides a clear interpretation of the intervention: it is a shift of the mean of the exposure distribution in the direction . Second, the variance of the density ratio has a simple, closed-form expression:
Minimizing this variance is equivalent to minimizing the quadratic form .
The question of interest thus becomes an optimization problem: which direction minimizes subject to a constraint on the “strength” of the intervention? While various constraints are possible (e.g., fixing ), a particularly meaningful constraint is to fix the distance between the original and tilted distributions, for which the 2-Wasserstein distance provides a natural metric. We discuss the choice of this metric in the following section. For the multivariate normal case with the same covariance matrix, the squared Wasserstein distance has a simple closed-form expression as .
If our goal then is to find a shift that we can estimate with a high degree of efficiency, we can solve the following:
for some constant defining the size of intervention. It is straightforward to show that this expression is minimized when is chosen to be proportional to the eigenvector of corresponding to its largest eigenvalue, . Hence our finding demonstrates that for a fixed intervention strength (as measured by the Wasserstein distance), the most statistically efficient causal effect to estimate, at least in terms of this density ratio, is the one that corresponds to shifting the exposure mixture along its primary axis of variation. Note that this result only holds exactly under a multivariate normal distribution, but we have seen empirically that this choice of typically leads to efficient estimates of under a variety of exposure distributions.
2.4 Fair Exposure Shifts under Fixed Gelbrich Distance
A primary motivation for developing our framework is to answer policy-relevant questions, such as identifying the optimal way to modify an environmental exposure mixture to achieve the greatest public health benefit. This naturally leads to an optimization problem: finding the intervention vector that maximizes (or minimizes) the causal estimand . However, a naive comparison across all possible vectors is misleading. For any intervention direction that yields a beneficial effect, one could simply increase the intervention’s strength, for example, by scaling the magnitude of to produce an arbitrarily larger (or smaller) value of . This would invariably lead to trivial solutions that correspond to extreme, unrealistic shifts in the exposure distribution. A more relevant policy question is “for a fixed amount of interventional effort, what is the best direction to apply that effort?”. To formalize this, we must first establish a fair basis for comparison, constraining our search to a set of interventions that are of the same size, which captures the actual movement of the exposure values, not just the magnitude of the parameter . Note that the notion of fairness here is with respect to the size of the intervention being applied, which differs from typical notions of fairness in causal inference or related fields, such as in recent work that defines a fairness criterion based on whether an estimand preserves the ordinal ranking of effects across all covariate subgroups.(McClean et al., 2024)
In order to establish our notion of fairness between interventions, we can use the 2-Wasserstein distance, , which is widely studied and serves this purpose (Panaretos and Zemel, 2019). Intuitively, the Wasserstein distance measures the minimum cost of transporting the probability mass of one distribution to match another, akin to the cost of moving a pile of dirt. By fixing the Wasserstein distance, we ensure that the total amount of shift between the old and new distributions is fixed. The optimization problem thus becomes a meaningful search for the that minimizes among all possible intervention directions of a comparable magnitude. This distributional fairness notion is widely used in both the operations and statistics research literatures. (Mohajerin Esfahani and Kuhn, 2018; Blanchet and Murthy, 2019; Duchi and Namkoong, 2021; Gao and Kleywegt, 2023)
For analytical tractability, we rely on a well-established result providing a formula for the squared 2-Wasserstein distance based on the means () and covariance matrices () of two distributions , which is referred to as the Gelbrich formula (Gelbrich, 1990). Crucially, this formula serves as a general lower bound for the true squared 2-Wasserstein distance between any two probability measures on with finite second moments. Specifically, we have that
Furthermore, this bound becomes an exact equality for any two distributions belonging to the same family of elliptically contoured distributions, a class which notably includes all multivariate normal distributions as well as uniform distributions on ellipsoids. The tractability of the Gelbrich formula has invited various researchers to use it as a surrogate for the 2-Wasserstein distance, a method frequently employed to overcome the computational complexity associated with the Wasserstein metric (Kuhn et al., 2019; Hakobyan and Yang, 2024), since the empirical 2-Wasserstein distance lacks a closed-form formula from data and can only be approached by numerical methods (Panaretos and Zemel, 2020). Moreover, the lower bound it provides has been shown to be tight in a fairly general situation against an upper bound derived for the 2-Wasserstein distance (Biswas and Mackey, 2024; Papp and Sherlock, 2024), and is therefore generally close to the 2-Wasserstein distance (Nguyen et al., 2021; Ye et al., 2024). Therefore, we believe our use of the Gelbrich formula as an approximation formula for the true 2-Wasserstein distance is reasonable.
This provides a computationally tractable and well-justified measure of intervention size. Accordingly, we measure intervention size through the Gelbrich formula applied to the marginal baseline and tilted laws of , and write the resulting quantity as . Comparing interventions on the level set puts them on a common scale. Under a multivariate normal model this coincides with the squared 2-Wasserstein distance, while in more general settings it remains a rigorous lower bound. The optimization problem thus becomes a search for the that minimizes among all possible intervention directions of a comparable size, allowing us to disentangle the direction of an intervention from its size and enabling a principled exploration of which changes to an environmental exposure mixture are most beneficial or harmful.
2.5 Choice of Estimand
Having established a principled method for comparing interventions of equivalent magnitude, we are now positioned to define our estimands of interest. The framework of incremental effects, when extended to a multivariate setting, moves beyond simply estimating the effect of a single, pre-specified shift or simply examining how the causal effect depends on shift size. Instead, it allows us to explore the entire space of potential interventions to identify those that are most impactful.
2.5.1 Optimal Shifts
In environmental contexts, a key objective is to determine the most effective strategy for intervention given limited resources. For instance, how should regulators modify a complex mixture of air pollutants to achieve the greatest improvement in health outcomes? Our framework directly addresses this question by defining an estimand for the optimal policy shift. We define our primary estimand of interest, , as the intervention direction that minimizes the causal effect over the set of all “fair shifts” of a given size :
where
The corresponding value of this optimal policy is . To provide some intuition for the optimal , we can explore it analytically in a simplified setting. If we assume the outcome model is linear () and the exposure distribution is normal (), the causal effect is minimized when is proportional to . This result provides some intuition: the optimal direction is a balance between the direct effect of each pollutant on the outcome (the vector ) and the correlation structure of the mixture (represented by ). This highlights a key insight of our multivariate approach: the covariance of the exposures is critical for determining not only which shifts are easiest to estimate, but also which shifts are most impactful. While the optimal policy shift is one primary goal, our framework is flexible and allows for the definition of other estimands that can be useful in answering other scientific questions of interest in environmental epidemiology. We now detail these in the following sections.
2.5.2 Single Exposure Shifts
A common goal in the analysis of environmental mixtures is to identify the components of the mixture that are most harmful. This has led to a wide range of statistical approaches aimed at performing exposure selection (Bobb et al., 2015; Antonelli et al., 2020; Ferrari and Dunson, 2020; Wei et al., 2020; Samanta and Antonelli, 2022). These approaches have inherently disregarded the potential impacts of positivity violations when examining which exposures are most harmful, but we can adapt our approach here to target similar questions without the need for strong positivity assumptions. A natural choice is to let be a vector of zeros with a single non-zero component, e.g., . The magnitude would be chosen to satisfy the same fairness constraint, . While this estimand would encourage the component of the exposures to increase more than others, in the presence of correlated exposures, it will also shift the remaining exposures, and the extent to which this occurs is less clear. A distinct approach is to define a value of such that the means of the exposures in the tilted distribution are the same, except for the exposure, thereby isolating the impact of that single exposure. If the exposures follow a multivariate normal distribution, this is straightforward since the mean shift is given by . If we want to ensure that only the exposure’s mean is shifted, then we could set where and again is chosen to ensure a fair shift. Note that this analytical form only holds when the exposure follows a multivariate normal distribution, which won’t be true in general. We can instead use this value of as a starting point in a numerical algorithm that searches for the exponential tilt that only shifts the component.
Once these are obtained, we can calculate for all to infer which exposures have the biggest effect on the outcome. Further note that this approach can be naturally extended to explore interactions or combined effects. For example, by setting two components of to be non-zero, one could investigate whether simultaneously increasing two pollutants has a synergistic effect greater than the sum of their individual impacts. We point readers to recent work examining stochastic interventions to identify interactions, as similar ideas could be applied within our framework, though we focus here on single exposure effects for now (McCoy et al., 2023).
2.5.3 Efficient Exposure Shifts
While the previous two estimands are arguably the most scientifically and policy-relevant in most applications, there are other considerations at play, such as how efficiently we can estimate the chosen estimand. Environmental applications in particular are known to have relatively small effect sizes, and therefore efficiency can be particularly important when sample sizes are not exceedingly large. As we established in Section 2.3, the statistical difficulty of estimating is heavily driven by the variance of the density ratio, . For a fixed intervention size, as defined by the Wasserstein distance, we showed that the most efficient intervention direction, in terms of minimizing this variance, is proportional to the first eigenvector of the exposure covariance matrix, . We call this direction , and it is clear that this direction depends only on the correlation structure of the observed exposures. While this direction is only the most efficient one under normality of the exposures, we proceed with this choice in general, as we have found that it leads to efficient estimates in more general settings, and deriving the most efficient direction in general is a difficult task.
However, the optimal policy direction, , which maximizes the causal effect , depends on both the exposure distribution and the exposure-outcome relationship, . In simplified linear models, the direction of steepest ascent for the causal effect is proportional to . In general, there is no reason for the direction of maximal statistical efficiency (related to the eigenvectors of ) to be the same as the direction of the maximal causal effect (related to ). To see this, consider an intuitive example with two highly and positively correlated pollutants, where the first principal component is approximately in the direction, which corresponds to the direction that is “easiest” to estimate with the observed data. Now, suppose that only the second pollutant has a strong causal effect on the outcome (i.e., ). The optimal policy would primarily involve shifting the second pollutant. Our framework reveals an inherent tension: the policy we most want to evaluate (shifting the second pollutant alone) is statistically difficult because it moves in a direction against the data’s strong correlation structure, leading to a high-variance density ratio and thus high uncertainty in our estimate of its effect. In general, this presents a trade-off between interpretability and statistical efficiency, and users can decide based on features of their observed data which estimand to target.
3 Estimation and Inference
3.1 Estimating
For a fixed intervention vector , the target estimand is the expected outcome under the tilted exposure distribution:
where the outer expectation is over the marginal distribution of covariates . Therefore, a direct approach to estimating is through a plug-in procedure. This method involves replacing each component in the expression above with a corresponding empirical estimate, which leads to the plug-in estimator:
However, this estimator faces certain practical and theoretical challenges. One issue is that the estimator’s performance is highly dependent on an accurate estimate of the multivariate conditional density , which can be difficult to estimate for multivariate exposures, and will likely have slower convergence rates. Furthermore, this approach does not leverage the structure of the efficient influence function and, as a result, is generally not statistically efficient. These limitations motivate alternative estimators with superior statistical properties.
3.1.1 One-step Estimation and Cross-fitting
To overcome the limitations of the plug-in estimator, we employ an approach rooted in semiparametric efficiency theory. This method uses the efficient influence function (EIF) to construct an estimator that is consistent under weaker conditions and achieves the optimal asymptotic variance. The EIF for the estimand is given in Section 2 by . In our notation, substituting the EIF components and algebraically combining the first two terms yields
where and expectations with subscript are taken with respect to . The one-step estimation procedure utilizes this EIF to correct an initial parameter estimate by adding the empirical average of the EIF to the initial plug-in estimate, which serves as a bias-correction term:
This can be re-written to show that the one-step estimator takes the following form:
This estimator has a number of key features that we will describe in subsequent sections when we study the asymptotic properties of this estimator. To summarize, it allows the use of flexible machine learning methods for estimation of each of the nuisance functions, and it is asymptotically efficient given its construction based on the EIF. The practical performance of the estimator is highly dependent on an estimate of the conditional density of the exposures, given by , which can be challenging with a moderate number of exposures. This density shows up in the expectations with respect to , but also in the ratio term, denoted by . First, we detail how to estimate this quantity directly using estimates of and , though in Section 3.1.2 we propose an approach to directly estimating this quantity using regression techniques that may not have the same theoretical properties, but can have good finite-sample performance when there are a moderate number of exposures.
To ensure the desirable asymptotic properties of our final estimator for , the entire estimation procedure is embedded within a cross-fitting framework. Cross-fitting is a technique designed to eliminate a key source of bias that arises when the same data is used both to train nuisance parameters and to evaluate the final parameter of interest. This prevents overfitting inherent in data-adaptive methods (Zheng and van der Laan, 2011; Chernozhukov et al., 2018) and has been shown to improve performance for related EIF based estimators. The procedure is implemented as follows. The data is randomly partitioned into disjoint folds of approximately equal size. For each fold , we treat it as the evaluation set and use the remaining folds as the training set. On the training set, we fit our nuisance models, including the outcome regression and the exposure density . These models, trained on data not in fold , are then used to compute the components of the efficient influence function for every observation only within the evaluation fold . The final estimate, , is then constructed by solving the estimating equation aggregated across all folds, ensuring that the nuisance function estimates for any given observation are always independent of that observation itself. This approach to nuisance function estimation has theoretical advantages by allowing for less restrictive assumptions on the nuisance functions, and important finite-sample properties as it tends to reduce bias in the estimated causal effects that is induced from overfitting.
3.1.2 Direct estimation using regression approaches
When there are a moderate number of exposures, estimation of becomes increasingly challenging, even for the efficient one-step estimators described above due to the inherent difficulty of estimating a multivariate, conditional distribution . For this reason we also explore an approach, first described in Schindl et al. (2024), that does not require estimation of the exposure density at all. This can be advantageous in finite samples, particularly when estimation of the density ratio is unstable. The first key insight is that the density ratio can be written as
which shows the density ratio can be estimated by estimating the conditional expectation . This does not require the conditional density of the exposures and can be carried out using flexible, univariate regression techniques. Further, the other key component of our one-step estimator can be written as
This shows that this quantity can be estimated by taking the ratio of two quantities, each of which can be estimated using flexible, univariate regression techniques. This was introduced in Schindl et al. (2024), though they did not implement this estimator as they found it to be unstable by taking the ratio of the estimates for these two conditional expectations. They were working, however, in the univariate exposure setting where estimating the conditional density of the exposures is more straightforward. In our setting, with multiple exposures, conditional density estimation can be very difficult, whereas this approach relies on univariate prediction models only. Additionally note, that one can take a third strategy, which is to still estimate the conditional density of the exposures and use it whenever calculating , but then use the regression approach described above for estimating the density ratio to improve stability of our estimates. While potentially useful for estimation, these estimators that obviate the need to estimate the exposure density will not inherit the same theoretical properties as the one-step estimator that directly uses estimates of and , which we study theoretically in Section 4. They may, however, produce better finite sample performance, which we study in the simulation studies in Section 6 across a wide range of exposure distributions.
3.2 Optimizing over a Manifold
Our primary estimand, the optimal policy shift , is defined as the solution to
where and is the Gelbrich quantity defined in Section 2.4. We assume is a regular value of (equivalently, for all ), so that is a smooth embedded hypersurface. This regularity holds generically and is verified for the Gelbrich constraint in Appendix B. Standard Euclidean optimization algorithms such as gradient descent are not directly applicable as a gradient step taken from a point on the manifold will likely lead to a point outside of the feasible set.
To properly solve this problem, we must recognize that the constraint set forms a smooth, curved space known as a Riemannian manifold. Optimization on manifolds requires specialized techniques that generalize concepts from Euclidean optimization. The core idea is to perform optimization steps within the tangent space at each point on the manifold, which is a local linear approximation of the manifold, and then map the result back onto the manifold itself. For our specific constrained problem, we employ a Riemannian Broyden Fletcher Goldfarb Shanno (BFGS) algorithm. This is a powerful quasi-Newton method adapted for optimization on manifolds, which generally offers faster convergence than simpler first-order methods. In practice, the method is implemented with two computationally light ingredients: a projection-based retraction back to the level set, and an inexpensive projection-based vector transport between tangent spaces. The key components of this algorithm for our problem are:
-
1.
Preliminary definitions: A definition of the tangent space is to take any small smooth curve on
Then
is a valid tangent vector. The tangent space at a point is the collection of all such vectors:
For a level set , we also have the equivalent characterization . The BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm is a highly effective quasi-Newton method, which finds an optimum by maintaining and iteratively updating an approximation of the inverse Hessian matrix. In Euclidean space , the standard update formula is:
Where:
-
•
is the updated approximation of the inverse Hessian, which is the target of the update.
-
•
is the current approximation of the inverse Hessian.
-
•
is the change in position (step), defined as .
-
•
is the change in gradient, defined as
-
•
-
2.
Riemannian Gradient: The standard (Euclidean) gradient, , must be projected onto the tangent space at the current point to find the direction of steepest ascent along the manifold. Let denote the Euclidean normal to the level set at , and let be the corresponding unit normal vector. The orthogonal projection matrix onto the tangent space is defined as . Consequently, the Riemannian gradient under the induced Euclidean metric is given by
-
3.
Retraction: Even though we are moving based on a Riemannian gradient step that is in the current point’s tangent space (which is essentially a linear approximation of the manifold at the current point), we are still technically moving out of the manifold. After finding a search direction , we need a retraction to map the point from the tangent space back onto the manifold. For the level-set manifold , we take the projection retraction , where is the orthogonal projection onto , which is a canonical construction for embedded manifolds. In our code, is computed approximately by a single normal-direction correction, which can be interpreted as a projection-like update that preserves the local first-order accuracy required of a retraction while substantially reducing computational cost.
-
4.
Vector Transport: Vector transport is required to compare tangent vectors that live in different tangent spaces across iterates. For our embedded level-set manifold, we use the simple projection transport:
namely, orthogonally projecting an ambient vector onto the new tangent space. Such a transport is generally not an isometry, even though it is often computationally attractive. This transport is also used to move tangent-space quantities between iterates when forming the RBFGS update.
To summarize, the iterative process of the Riemannian BFGS algorithm is outlined as follows.
When the objective is to minimize over the Gelbrich level set, convergence guarantees for RBFGS depend on the line-search conditions, the regularity of the objective along retraction curves, and the choice of vector transport. Under the assumptions stated in Appendix B, the global convergence result in Theorem 4.2 of Huang et al. (2018) yields for the RBFGS scheme equipped with a suitable isometric vector transport and weak Wolfe line search. Appendix B verifies the geometric regularity of the Gelbrich level set, establishes objective regularity along the retraction curve, and invokes the existence of a weak Wolfe step. The proof therefore uses a fully rigorous projection retraction and the scaled isometric vector transport (Huang et al., 2015) together with an explicit boundary condition on that yields compactness of the relevant level set. Generally, however, these stricter choices are made for the convergence proof. In the actual implementation we use a computationally cheaper projection-type retraction and the orthogonal projection transport , together with safeguard checks on the RBFGS update. This separation between the proof device and the working algorithm is standard in large-scale Riemannian optimization (Ring and Wirth, 2012; Huang et al., 2018). In our numerical work, the simplified implementation converged stably to a local solution while substantially reducing computation time.
In embedded-manifold settings, the projection transport can significantly reduce computational time without degrading practical convergence behavior. This phenomenon is well documented in the RBFGS experiments of Qi et al. (2010), where projection-based transport on the Stiefel manifold reduces wall-clock time from seconds to seconds and also decreases the iteration count from to in the Procrustes problem. Moreover, Huang et al. (2018) compare combinations that satisfy sufficient conditions used in global convergence theory with cheaper combinations that do not, and find that the numbers of function and gradient evaluations (as well as vector transports) are not significantly affected by the choice of retraction and transport; as a result, lower-complexity retractions and transports can be markedly faster in terms of computational time. In our applications, the simplified implementation described above converges reliably and is computationally fast.
4 Semiparametric Efficiency Theory
This section explores the fundamental limits of estimation for the multivariate incremental effect, . The analysis characterizes how the statistical difficulty of this problem depends not only on the sample size , but on the intervention vector and its interplay with the exposure covariance structure . We generalize and further develop the theoretical analysis for univariate continuous exposures from (Schindl et al., 2024) to our multivariate setting.
4.1 Minimax Lower Bound
In this section, we establish a minimax lower bound for the incremental effect
under a flexible nonparametric model. Minimax lower bounds benchmark what is statistically achievable without imposing additional structure: they show that no estimator can attain uniformly smaller risk over a given model class. Before presenting the main results, we must first define a number of important terms. First, we refer to the centered version of the exposures as
To analyze the asymptotic variance of the effect difference , we decompose the problem into an incremental component and a baseline influence function . We define
Next, we calculate the covariance structure of . A key part of this variance arises from the outcome regression. To ensure this term represents a proper covariance (centered moment), we compute the raw second moment and subtract the outer product of the mean:
Then, letting be the residual variance component, we show in the appendix that we can write the full covariance matrix of as
Finally, estimating the incremental effect requires distinguishing the variation in the shift direction from the variation in the baseline. Since the gradient component and the baseline influence are typically correlated, the fundamental difficulty of estimating their difference is governed by the variability of that cannot be explained by . Mathematically, this corresponds to the residual variance of after linearly projecting it onto the space spanned by , yielding the Schur–complement covariance
The next result records how the efficiency bound depends on .
Lemma 1 (Variance of the efficient influence function).
Under (A1)–(A5) defined in the appendix and for all within a fixed small radius, there exist constants (depending only on the model constants and the chosen radius; explicit expressions are given in the Appendix) such that
Lemma 1 makes explicit that the nonparametric efficiency bound (the variance of the efficient influence function) grows quadratically in the tilt magnitude, with geometry determined by the matrix ; this geometry blends the noise component and the outcome–regression component . Since , the lower bound immediately implies
As the direction of changes, the bound can increase substantially showing how the most efficient shifts are those proportional to the first eigenvector of . Note that this matrix is not the same as the covariance matrix of the exposures, but is closely related to it, and therefore this result confirms our findings of Section 2 about which directions are most efficient to estimate. Now, we derive minimax bounds for the estimation error of .
Theorem 2 (Minimax lower bound).
Assume (A1)–(A5) and let lie in the finite-tilt regime specified in the Appendix. There exists a universal constant (independent of and ) such that, for any estimator based on a sample of size ,
Theorem 2 shows that the best possible root mean–squared error obeys
revealing an effective sample size of order . Hence larger tilts and directions aligned with high–variance components of are intrinsically harder, regardless of the estimation strategy. Conversely, if lies near a low–variance direction of , faster convergence is attainable. Together with Lemma 1, the theorem clarifies why one must account for when assessing estimation difficulty and designing procedures: the geometry induced by determines how precision scales with both the size and the direction of the tilt.
4.2 Convergence and Normality
Under mild regularity conditions (boundedness, i.i.d. sampling, and a fixed finite tilt), we establish a finite- central limit theorem for our cross-fitted one-step estimator. At first sight, the efficient influence function suggests that asymptotic linearity requires controlling the -specific nuisance functions
Since both objects are indexed by the tilt parameter , and is a nonlinear functional of the observed-data law, it is not obvious how to guarantee the rates needed for one-step asymptotics.
Our analysis shows that, for any fixed finite tilt , it is in fact sufficient to estimate only two familiar observed-data nuisance components:
namely the outcome regression and the (generalized) propensity score. In the Appendix we show that, under bounded support for and the finite-tilt condition , the maps and are Lipschitz in : there exist fixed finite constants such that for any estimators ,
where are obtained from by the same formulas as above; see Lemma 9. Consequently, the product condition
which ensures that the second-order remainder is negligible, is implied by the more interpretable requirement that both and converge at rate in ; see Corollary 3. These rates for , and hence for , can be guaranteed under standard smoothness and complexity conditions by the highly adaptive lasso (van der Laan, 2017), which attains near-optimal convergence rates for a broad nonparametric class.
Theorem 3 (Finite- CLT).
Assume (A1)–(A3) and (C1)–(C5) in the Appendix. Fix and any tilt with . Then
where the efficient influence function is
Consequently,
Theorem 3 provides the basis for asymptotic inference, which proceeds in a standard influence-function manner. Let , and define the empirical influence values for by
where and are constructed from the cross-fitted estimators as above. A consistent estimator of the asymptotic variance is the sample second moment
and analogously for the incremental effect we use
These yield Wald-type confidence intervals of the form
with a similar process for . When inference is required for several tilts , a joint covariance estimator is obtained by the empirical covariance matrix of the vector
which enables simultaneous confidence intervals or Wald tests via a multivariate normal approximation.
5 Sensitivity analysis to unmeasured confounding
In this section we develop a sensitivity analysis approach to assess the robustness of our causal estimates to the presence of unmeasured confounders. Throughout, we assume that there exist unmeasured confounders such that
For notational clarity, let denote the full adjustment set, reserving for the observed data throughout. We still consider incremental policies defined by exponential tilting of the observed conditional exposure density :
The full-data causal estimand is given by
where is referred to as the long outcome regression. We further denote the estimand obtained by applying the same identification formula while omitting as
where is referred to as the short outcome regression, and captures confounding bias at exposure level . When there is unmeasured confounding, the identified short estimand , which conditions only on observed covariates , diverges from the causal target that is defined under conditional exchangeability given . We follow Chernozhukov et al. (2022) and leverage the geometry of Riesz representers to derive sharp bias bounds based on norms, yielding interpretable calibration of confounding strength without imposing restrictive structural assumptions on .
5.1 The Bias Representation
Let denote the true conditional exposure density given the full adjustment set. The parameter can be written as a linear functional of the long regression evaluated under the observed distribution of :
Analogously, the short estimand admits the representation
Under mild regularity conditions ensuring square-integrability of the relevant Riesz representers, Chernozhukov et al. (2022) shows that the bias admits the exact representation
where
This identity isolates two necessary sources of bias: the additional outcome variation explained by beyond through , and the additional information about the exposure distribution provided by beyond through . By Cauchy–Schwarz,
5.2 Sensitivity Parameters
To express the bound in terms of identifiable scale components and partial -type sensitivity parameters, define the following identifiable parameters
We parameterize the outcome component by the nonparametric partial of with given ,
For the treatment component, we use the relative increase in variation of the Riesz representer induced by conditioning on :
Combining these definitions yields the sensitivity bound
For the empirical analysis, we parameterize the strength of unmeasured confounding by
where is the squared correlation between the long and short Riesz representers. Here is the fraction of residual outcome variation explainable by given , and is the fraction of RR variation explainable by along the intervention path indexed by . These two quantities determine the width of the bias bound and, in turn, the width of the endpoint confidence bounds. The sensitivity parameters can be set subjectively or calibrated by formal benchmarking against observed covariates (Cinelli and Hazlett, 2020). The benchmarking procedure used in the empirical analysis is given in Appendix E.7.
5.3 Sensitivity parameters and confidence bounds for
Our empirical results focus on the incremental effect . As shown in Appendix E, the same omitted-variable-bias representation applies to this contrast after replacing the short RR by the RR contrast . Writing
the plugin bound used in our analysis is
with . Therefore the estimated point identified set is
For formal sensitivity-adjusted inference, we construct confidence bounds for the estimated endpoints and . Appendix E gives the endpoint expansions and the resulting standard errors. The empirical results report a sensitivity-adjusted interval obtained from a lower confidence bound for and an upper confidence bound for :
6 Simulation studies
We conduct simulation studies to systematically evaluate the finite-sample performance of various nuisance-parameter estimation pipelines. Our goal is to assess how different modeling and estimation pipelines perform when the underlying exposure distribution exhibits complex features such as heavy tails, skewness, or multimodality.
6.1 Simulation design
In all simulations, we fix the sample size at and the dimensions of the covariates and exposures at and , respectively. The results are averaged over 100 independent Monte Carlo repetitions. We generate the covariate vector from a multivariate normal distribution with an autoregressive correlation structure. Specifically, , where the -th entry of the covariance matrix is given by . The exposure vector is generated conditional on using a linear location-shift model:
where is a sparse coefficient matrix with non-zero entries drawn from (sparsity level ), and is the intercept vector. The error term determines the distributional characteristics of the exposures. We consider three scenarios to test model robustness:
-
1.
Scenario I (Gaussian): The errors follow a multivariate normal distribution , where has an AR(1) structure with correlation parameter .
-
2.
Scenario II (Skewed): The errors follow a multivariate skew-normal distribution, , with a slant parameter vector . This scenario introduces substantial asymmetry and non-Gaussianity.
-
3.
Scenario III (Truncated contaminated normal): To investigate robustness against heavy-tail-like deviations while ensuring the finiteness of the normalizing constant required for exponential tilting, we consider a truncated contaminated normal distribution. Specifically, the error distribution follows a mixture of two centered Gaussians, , with contamination rate and scale inflation . The support is restricted to the region bounded by standard deviations of the baseline component. This setting introduces heavier tails relative to the standard Gaussian assumption while maintaining the required integrability conditions.
We generate the outcome under two different scenarios that comprise differing levels of complexity.
-
1.
Scenario I (Simple Linear Model): We generate a continuous outcome using a linear model:
Here, the regression coefficients are generated from normal distributions: entries of are drawn from , and entries of are drawn from , ensuring a strong but different signal for both covariates and exposures.
-
2.
Scenario II (Complex Model): We evaluate a second, more complex DGP for the outcome to incorporate structural nonlinearities and interactions. The continuous outcome is generated using the following model:
Here, the baseline linear confounding coefficients and the marginal independent effect coefficients are drawn from and , respectively. Beyond the linear main effects, the model explicitly introduces a quadratic effect for a specific covariate (, with ), a cross-product interaction between two exposures (, with ), and a bilinear interaction between a covariate and an exposure (, with ).
This more complex setting is designed to emulate mechanisms often encountered in the health effects of air pollution mixtures. The quadratic covariate term can be viewed as mimicking the classical U-shaped meteorological confounding effect of temperature. The cross-product term represents synergistic toxicity between two distinct pollutants, analogous to the combined effects of fine particulate matter (PM2.5) and ozone (O3). The interaction captures effect modification, reflecting how a vulnerability factor such as patient age can amplify the health impact of a specific exposure.
6.2 Implemented estimators and method comparison
Our primary estimand is the tilted mean , evaluated at a fixed tilting parameter . We compare seven estimation pipelines that differ in how they approximate the nuisance structure induced by the tilted exposure law. For all methods, the outcome regression is estimated using XGBoost with 5-fold cross-fitting. Our approaches can largely be categorized into two distinct types, which vary in how the exposure distribution is estimated.
-
1.
Semi-parametric location-shift working models.
We model the exposure distribution using a semi-parametric location-shift decomposition, . For each exposure dimension, the conditional mean is flexibly estimated via XGBoost, while the scale parameter is estimated as the empirical standard deviation of the residuals within the corresponding training fold. This specification accommodates nonlinear covariate-dependent shifts in the exposure mean while imposing a homoscedastic error structure across the covariate space.We evaluate three variations of this working model based on the specified marginal distributions of the standardized residuals . In all three variations, the joint dependence structure of is modeled using a Gaussian copula:
-
•
Path 2 (Gaussian): Assumes the standardized residuals follow a Gaussian distribution.
-
•
Path 2 (): Models each marginal residual using a scaled Student’s distribution, , where the degrees of freedom are estimated via maximum likelihood. This allows the model to accommodate heavy tails in specific exposure dimensions.
-
•
Path 2 (Empirical): Employs a fully nonparametric approach for the marginal residuals. The marginal cumulative distribution function for each residual dimension is flexibly estimated using log-spline density estimation.
-
•
-
2.
Direct estimation of nuisance parameters via regression.
In addition to the standard outcome regression , the one-step estimator depends on nuisance functions that vary with the tilting parameter . These are given by and , which are defined in Section 3.1.2. In this pipeline, we use 5-fold cross-fitted SoftBART regression (Linero and Yang, 2018) for estimation of these functions.
Within each of the three approaches to conditional density estimation, we explore 1) estimating the density ratio using the estimate of , and 2) directly estimating the density ratio by estimating . This leads to six estimators, though we consider a seventh estimator that involves directly estimating both and without ever needing to estimate the exposure density.
6.3 Simulation Results
Figure 1 displays the estimated tilted mean, , across 100 Monte Carlo repetitions for the seven pipelines under the six data-generating designs. Additionally, Table 1 complements the boxplots with numerical summaries aggregated over the six designs. We report the average signed bias together with the average absolute bias and the average RMSE.
| Mean bias | Mean absolute bias | Mean RMSE | |
|---|---|---|---|
| Fully direct SoftBART regression | 0.32 | 0.51 | 1.07 |
| Gaussian residual model | 0.78 | 0.82 | 1.05 |
| Gaussian residual model with SoftBART estimation of | 0.15 | 0.37 | 0.64 |
| residual model | 0.72 | 0.80 | 1.01 |
| residual model with SoftBART estimation of | 0.15 | 0.37 | 0.64 |
| Empirical residual model | 0.24 | 0.42 | 0.66 |
| Empirical residual model with SoftBART estimation of | 0.08 | 0.31 | 0.62 |
Across the six designs, the empirical residual working model is the most reliable default among the three distributional choices. Additionally, the results make it clear that estimating the density ratio directly using SoftBART outperforms estimation of the density ratio by simply plugging in the estimate of . Regardless of the distributional choice, direct modeling of the density ratio drastically improves both bias and RMSE. Note, however, that the approach that does not model the exposure density at all, and directly estimates all nuisance functions directly through regression, performs poorly with a high RMSE. This highlights that taking a ratio of two estimated nuisance functions leads to additional, undesirable instability. Overall, the simulation results support a concrete practical conclusion: for the one-step estimator considered here, finite-sample performance is determined primarily by accurate estimation of density ratio while the remaining choice among the Gaussian, , and empirical residual families is secondary once that component is well estimated.
7 Application: Assessing Health Impacts of PM2.5 Component Mixtures
We now evaluate an important public health question regarding the health impacts of long-term exposure to fine particulate matter (PM2.5) and its complex chemical mixtures. Specifically, we constructed a county-level dataset across the United States for the year 2019 to estimate the exposure-response relationship between PM2.5 constituents and age-adjusted hospitalization rates per 10,000 people for Chronic Obstructive Pulmonary Disease (COPD). Our analysis obtained health records from the CDC Environmental Public Health Tracking Network (EPHTN) and CDC WONDER. High-resolution estimates of PM2.5 mass and its chemical constituents are derived from the Atmospheric Composition Analysis Group (Van Donkelaar et al., 2019). To adjust for potential confounding, we integrated a broad set of sociodemographic, behavioral, and clinical covariates compiled from the U.S. Census Bureau and CDC surveillance systems. Our analysis focuses on a dimensional PM2.5 component mixture: black carbon (BC), nitrates (NO3), organic matter (OM), sulfates (SO4), and ammonium (NH4). All reported curves use the cross-fitted one-step estimator from Section 3 together with the empirical residual model shown to perform best in the simulation studies.
We study three types of intervention paths corresponding to single exposure shifts, shifting groups of exposures, and finding the optimal shift in terms of reducing overall hospitalization rates. For the single exposure shifts, we utilize shifts of the form when studying exposure . The magnitude is chosen to satisfy the corresponding Gelbrich constraint. These paths are useful for comparing feasible directions on the intervention manifold, though they do not necessarily isolate single-pollutant causal effects because the tilted law remains multivariate and continues to alter the entire mixture distribution, not just exposure . We take a different approach for studying groups of exposures, where we consider three groups: 1) BC+OM, 2) NO3+SO4+NH4, and 3) the full five-pollutant mixture. We solve a numerical algorithm for finding the value that shifts the means of each of the exposures within a group, while holding the means of the other exposures constant. We do not apply this idea to the single pollutant shifts, because finding such a shift is not feasible in many cases due to the high correlations among the exposures. Third, for the optimal policy we run the Riemannian BFGS algorithm, though we do so from 100 independent random starting values since the optimization problem is non-convex and global optimality is not guaranteed.
Figure 2 shows the main empirical results. Among the five single pollutant analyses, all estimated curves are negative indicating a harmful effect of the air pollution mixture. The steepest declines are for SO4 and NH4, followed by NO3 and OM, with BC producing a smaller but still negative curve. Because the single exposure shifts may also be shifting other exposures due to correlation among them, we interpret these as rankings of feasible intervention directions rather than pollutant-specific dose-response effects. Arguably the more interesting and informative results therefore are given by the grouped exposure analyses. The BC+OM path remains close to zero throughout the displayed Gelbrich range, indicating that these two exposures do not strongly impact COPD hospitalizations. The NO3+SO4+NH4 group and the all exposures group, however, show significantly more pronounced effects showing that these three exposures are likely driving the adverse health effects seen.
As expected, the BFGS path has the largest estimated effect showing the biggest reductions in COPD hospitalizations possible at each level of the Gelbrich constraint. To provide more intuition about the results from the optimal policy shift, Figure 3 shows the distribution of the causal effect across the 100 different starting values, and Figure 4 shows the distribution of the values for each exposure in the optimal exposure shift across the different starting values. We can see there is some heterogeneity across starting values in both figures, though certain coherent patterns do emerge. The exposures with the most negative tilts assigned to them are SO4 and NO3, highlighting that they are potentially the most impactful of the exposures in the air pollution mixture.
Figure 2 also summarizes the sensitivity of our results to unmeasured confounding bias. The shaded ribbons correspond to two fixed sensitivity parameter settings, while the dashed blue curves implement the formal benchmark procedure described in Appendix E.7. Intuitively, scales how much residual outcome variation an omitted confounder may explain relative to the strongest observed covariate, and scales how much additional RR variation it may explain relative to the strongest observed covariate for a given intervention path. In our data, the strongest outcome-side benchmark is the covariate White, whereas the RR-side benchmark is selected separately for each scenario. Under the benchmark and , the BC, OM, SO4, NH4, NO3+SO4+NH4, all-pollutant, and BFGS curves remain significantly negative throughout the displayed Gelbrich range showing moderate sensitivity to unmeasured confounding. The NO3 path is more sensitive at the smallest Gelbrich constraints, though it becomes more robust for larger Gelbrich distances. Figure 5 provides an assessment of how large the sensitivity parameters would have to become in order to make any of the results insignificant for each exposure (or group of exposures) examined. For this reason, we refer to this as the least favorable point, because it is the level of confounding required simply to make the result insignificant at any Gelbrich distance, not all distances simultaneously. The NO3 contour lies closest to the origin, indicating that comparatively small confounding would suffice to remove significance for at least one value of the Gelbrich constraint. At the other end, SO4, the all-pollutant equal-mean path, and the BFGS path lie farthest from the origin, so they require materially stronger confounding to overturn the estimated negative effects. BC, OM, NH4, and the NO3+SO4+NH4 group fall between these extremes, still showing a modest degree of robustness to confounding.
8 Discussion
In this manuscript we developed methodology for estimating the health effects of multiple air pollutants simultaneously in a way that is robust to the presence of severe positivity violations. By examining stochastic interventions with tilted exposure distributions, we can study which exposures are most harmful without relying on model-based extrapolation. One critical issue in the multivariate setting is how to define a fair shift that corresponds to similar shifts in the exposure distribution, which we do via the 2-Wasserstein distance. We provide asymptotic theory and minimax estimation rates for our proposed estimands, and show in a national study of the health effects of air pollution that there are detrimental effects of the air pollution mixture, but that these are largely driven by nitrates and sulfates.
There are a number of directions for future work that could expand upon, and improve, the methodology seen here. For one, our estimators are applicable to any stochastic shift estimand, and future research could target different shifts other than the exponentially tilted ones seen here, which maintain public health relevance and could be potentially more interpretable for practitioners. Additionally, one could expand on the sensitivity analyses developed here by incorporating recent results on sensitivity analysis for multiple exposures (Zheng et al., 2021). These incorporate moderate parametric assumptions in the multiple exposure setting and allow one to produce partial identification regions that could be tighter than those seen here, and allow one to incorporate additional assumptions or sources of information, such as negative control variables. Overall, we believe the proposed framework provides analysts, particularly those involved in the analysis of air pollution mixtures, robust approaches to estimating causal effects of multivariate, continuous exposures.
References
- Estimating the health effects of environmental mixtures using bayesian semiparametric regression and sparsity inducing priors. The Annals of Applied Statistics. Cited by: §2.5.2.
- Causal analysis of air pollution mixtures: estimands, positivity, and extrapolation. American Journal of Epidemiology 193 (10), pp. 1392–1398. Cited by: §1.
- Bounding wasserstein distance with couplings. Journal of the American Statistical Association 119 (548), pp. 2947–2958. Cited by: §2.4.
- Quantifying distributional model risk via optimal transport. Mathematics of Operations Research 44 (2), pp. 565–600. Cited by: §2.4.
- Causal inference for mixtures. Statistical science 30 (4), pp. 514–530. Cited by: §1, §2.5.2.
- Local effects of continuous instruments without positivity. arXiv preprint arXiv:2403.06450. Cited by: §1.
- Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21 (1), pp. C1–C68. Cited by: §3.1.1.
- Long story short: omitted variable bias in causal machine learning. Technical report National Bureau of Economic Research. Cited by: §E.7, Appendix E, §5.1, §5.
- Making sense of sensitivity: extending omitted variable bias. Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (1), pp. 39–67. Cited by: §5.2.
- Causal mediation analysis for stochastic interventions. Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (3), pp. 661–683. Cited by: §1, §2.2.
- Population intervention causal effects based on stochastic interventions. Biometrics 68 (2), pp. 541–549. Cited by: §1.
- Nonparametric causal effects based on longitudinal modified treatment policies. Journal of the American Statistical Association 118 (542), pp. 846–857. Cited by: §1.
- Nonparametric estimation of local treatment effects with continuous instruments. Journal of Business & Economic Statistics, pp. 1–14. Cited by: §1.
- Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics 49 (3), pp. 1378–1406. Cited by: §2.4.
- Identifying main effects and interactions among exposures using gaussian processes. The annals of applied statistics 14 (4), pp. 1743. Cited by: §2.5.2.
- Distributionally robust stochastic optimization with wasserstein distance. Mathematics of Operations Research 48 (2), pp. 603–655. Cited by: §2.4.
- On a formula for the wasserstein metric between measures on euclidean and hilbert spaces. Mathematische Nachrichten 147 (1), pp. 185–203. Cited by: §2.4.
- Wasserstein distributionally robust control of partially observable linear stochastic systems. IEEE Transactions on Automatic Control 69 (9), pp. 6121–6136. Cited by: §2.4.
- Estimation of the effect of interventions that modify the received treatment. Statistics in medicine 32 (30), pp. 5260–5277. Cited by: §1.
- A riemannian bfgs method without differentiated retraction for nonconvex optimization problems. SIAM Journal on Optimization 28 (1), pp. 470–495. Cited by: Appendix B, Appendix B, Appendix B, Appendix B, §3.2, §3.2.
- A broyden class of quasi-newton methods for riemannian optimization. SIAM Journal on Optimization 25 (3), pp. 1660–1685. Cited by: Appendix B, Appendix B, Appendix B, Appendix B, Appendix B, §3.2.
- Stochastic interventions, sensitivity analysis, and optimal transport. The Annals of Statistics 52 (2), pp. 522–545. Cited by: §1.
- Doubly robust inference on causal derivative effects for continuous treatments. arXiv preprint arXiv:2203.01878. Cited by: §1.
- Non-parametric causal effects based on incremental propensity score interventions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81 (4), pp. 719–742. Cited by: §1, §2.2.
- Semiparametric doubly robust targeted double machine learning: a review. Handbook of statistical methods for precision medicine, pp. 207–236. Cited by: Appendix D.
- Wasserstein distributionally robust optimization: theory and applications in machine learning. In Operations research & management science in the age of analytics, pp. 130–166. Cited by: §2.4.
- Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society Series B: Statistical Methodology 80 (5), pp. 1087–1110. Cited by: item 2.
- Wasserstein riemannian geometry of gaussian densities. Information Geometry 1 (2), pp. 137–179. Cited by: Appendix B.
- Fair comparisons of causal parameters with many treatments and positivity violations. arXiv preprint arXiv:2410.13522. Cited by: §1, §2.4.
- Semiparametric discovery and estimation of interaction in mixed exposures using stochastic interventions. arXiv preprint arXiv:2305.01849. Cited by: §2.5.2.
- Data-driven distributionally robust optimization using the wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming 171 (1), pp. 115–166. Cited by: §2.4.
- Mean-covariance robust risk measurement. arXiv preprint arXiv:2112.09959. Cited by: §2.4.
- Statistical aspects of wasserstein distances. Annual review of statistics and its application 6 (1), pp. 405–431. Cited by: §2.4.
- An invitation to statistics in wasserstein space. Springer Nature. Cited by: §2.4.
- Scalable couplings for the random walk metropolis algorithm. Journal of the Royal Statistical Society Series B: Statistical Methodology, pp. qkae113. Cited by: §2.4.
- Causal inference for observed sudden changes in the composition of a multi-pollutant mixture: application to the effects of the utah valley steel mill closure. Epidemiology (Cambridge, Mass.) 23 (4), pp. 559. Cited by: §1.
- Extrapolation-aware nonparametric statistical inference. Journal of the Royal Statistical Society Series B: Statistical Methodology 83 (5), pp. 915–941. Cited by: §1.
- Riemannian bfgs algorithm with applications. In Recent Advances in Optimization and its Applications in Engineering: The 14th Belgian-French-German Conference on Optimization, pp. 183–192. Cited by: §3.2.
- Single world intervention graphs (swigs): a unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Series. Working Paper 128. Cited by: §1.
- Optimization methods on riemannian manifolds and their application to shape space. SIAM Journal on Optimization 22 (2), pp. 596–627. Cited by: Appendix B, Appendix B, §3.2.
- Effects of multiple interventions. In Comparative quantification of health risks: Global and regional burden of disease attributable to selected major risk factors, Vol. 1, pp. 2191–2230. Cited by: §1.
- Incremental causal effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 83 (3), pp. 578–605. Cited by: §1.
- Randomization analysis of experimental data: the fisher randomization test comment. Journal of the American statistical association 75 (371), pp. 591–593. Cited by: §2.1.
- Propensity score weighting across counterfactual worlds: longitudinal effects under positivity violations. Statistical Methods in Medical Research 33 (1), pp. 137–154. Cited by: §1.
- Everything all at once: on choosing an estimand for multi-component environmental exposures. arXiv preprint arXiv:2509.17960. Cited by: §1.
- Estimation and false discovery control for the analysis of environmental mixtures. Biostatistics 23 (4), pp. 1039–1055. Cited by: §2.5.2.
- Incremental effects for continuous exposures. arXiv preprint arXiv:2409.11967. Cited by: §1, §2.2, §2.2, §2.3, §3.1.2, §3.1.2, §4.
- Causal geodesy: counterfactual estimation along the path between correlation and causation. arXiv preprint arXiv:2508.08499. Cited by: §1.
- Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. International journal of epidemiology 38 (6), pp. 1599–1611. Cited by: §1.
- A generally efficient targeted minimum loss based estimator based on the highly adaptive lasso. The International Journal of Biostatistics 13 (2), pp. 20160075. External Links: Document Cited by: §4.2.
- Regional estimates of chemical composition of fine particulate matter using a combined geoscience-statistical method with information from satellites, models, and monitors. Environmental science & technology 53 (5), pp. 2595–2611. Cited by: §7.
- Sparse bayesian additive nonparametric regression with application to health effects of pesticides mixtures. Statistica Sinica 30 (1), pp. 55–79. Cited by: §2.5.2.
- Distributionally fair stochastic optimization using wasserstein distance. arXiv preprint arXiv:2402.01872. Cited by: §2.4.
- Identification, estimation and approximation of risk under interventions that depend on the natural value of treatment using observational data. In Statistical models and causal inference: a dialogue with the social sciences, pp. 103–128. Cited by: §1.
- Nonparametric inference on dose-response curves without the positivity condition. Biometrika 110 (1), pp. 219–236. Cited by: §1.
- Bayesian inference and partial identification in multi-treatment causal inference with unobserved confounding. arXiv preprint arXiv:2111.07973. Cited by: §8.
- Cross-validated targeted minimum-loss-based estimation. In Targeted Learning, pp. 459–474. Cited by: §3.1.1.
Appendix A Neyman-orthogonality and robustness to misspecified outcome model
Setup.
Let denote the outcome regression, let be the observed exposure density, and let be the tilted density with density ratio . We write and for conditional expectations with respect to and , respectively. Assume standard regularity conditions. Consider the efficient influence function for a scalar parameter :
We utilize the identities:
(a) Orthogonality with respect to .
Fix and perturb along a path . Since and do not depend on , we have:
Hence, the influence function is Neyman-orthogonal with respect to .
(b) Sensitivity in .
Fix and perturb along a normalized path , where is a measurable function satisfying the constraint for all . Differentiating this constraint at yields , which is equivalent to .
Let , where is the tilted density corresponding to . Since , we have:
Let . The influence function along the path is:
Differentiating the expected influence function yields:
In general, . Therefore, the derivative does not vanish, implying that the influence function is not Neyman-orthogonal in the direction unless additional restrictive constraints are imposed on the perturbation .
This non-orthogonality with respect to extends directly to the exposure density . For a fixed target density , any perturbation in induces a corresponding change in the density ratio . Consider a perturbation path , where is a standard score function satisfying . Specifically:
Differentiating with respect to at :
This corresponds to the perturbation in the previous derivation for . Substituting this into the derivative obtained in part (b), we get:
Since is generally non-zero (unless is constant or is orthogonal to under ), the derivative does not vanish. Thus, the influence function is not Neyman-orthogonal with respect to the exposure density .
Robustness to misspecified outcome model.
Let be the true outcome regression and be the true density ratio. Define the target parameter:
which represents the average outcome under the tilted distribution . We now show that at , the influence function is globally robust to . That is, for any measurable :
Consequently, estimators based on the efficient influence function , such as the one-step estimator employed in this work, remain consistent for even under misspecification of the outcome regression model.
Proof.
Fix any measurable function and define:
Since depends only on and , we have:
Thus, at :
By the law of iterated expectations:
Therefore, for all . This confirms that consistent estimation of relies primarily on the consistency of , rendering the estimator robust to misspecification of . ∎
Appendix B Validity for Riemannian BFGS on Gelbrich Constraint for exponential tilting
The convergence theory of Riemannian optimization algorithms, including Riemannian BFGS with Wolfe line search, is established in, e.g., Ring and Wirth (2012); Huang et al. (2015, 2018). We consider the deterministic level–set constraint
where is the squared Gelbrich distance between the marginal baseline distribution of and the marginal tilted distribution of induced by . The manifold is equipped with the Riemannian metric induced by the Euclidean inner product on . Global convergence of cautious Riemannian BFGS in the sense that follows from Huang et al. (2018, Theorem 4.2) under a compact level-set condition and Lipschitz continuous differentiability with respect to the chosen transport. We establish these properties for the marginal Gelbrich constraint by showing that the level set is a smooth embedded hypersurface, that a retraction and a continuous scaled transport with the pointwise norm-preserving property used in Huang et al. (2015, 2018) are available, and that is regular on a neighbourhood containing the line-search trial points.
Geometry of the Gelbrich constraint.
Let denote the marginal baseline law of on , and let
For (an open set on which the marginal log–moment generating function is finite), define
Consequently, let and . Equivalently,
The Gelbrich function is
Equip with the Euclidean inner product and with the Frobenius inner product . For each , define the linear map by
and extend by linearity so that . Let be the adjoint of with respect to these inner products, that is,
equivalently, for any ,
Define
For , define the Lyapunov (Sylvester) operator , and write for its solution operator.
Lemma 2 (Exact gradient and generic regularity of Gelbrich level sets).
Assume is finite on a nonempty open set . Then is real–analytic on , hence so are and .
-
(i)
is on and
-
(ii)
Let and . Then has Lebesgue measure zero and empty interior. In particular, for any with , the level set
is a (indeed real-analytic) embedded hypersurface and for all .
Proof.
Write with
where
Let and denote the directional derivative by . Since and is symmetric,
hence .
For , the first Fréchet derivative of is given by
where . This follows from the Fréchet derivative of the principal matrix square root, (Malagò et al., 2018, Eqs. (14)–(16)). In particular, for and ,
With and ,
By the chain rule and the definition of ,
which implies . Summing these terms yields the gradient in (i).
For part (ii), is real–analytic on , hence so are and , and therefore is on . Since is real–analytic on , it is on , and Sard’s theorem yields that the set of critical values has Lebesgue measure zero. Since has Lebesgue measure zero, it has empty interior. For any with , every satisfies ; the regular level–set theorem yields that is a (indeed real-analytic) embedded hypersurface. ∎
Projection retraction and scaled vector transport.
Fix such that and write . For , let be the unit normal vector and denote the orthogonal projection onto the tangent space by
Let be the orthogonal projection from a tubular neighbourhood of . For , define the retraction on a sufficiently small ball by
Define the differentiated-retraction transport
Following Huang et al. (2015, Eq. (4.3)), define the scaled transport
which is norm-preserving and satisfies the requirements (2.5)–(2.8) therein. For a step and , define
Lemma 3 (Validity of transport).
The retraction and the transport defined above satisfy:
-
(a)
there exists an open neighbourhood of the zero section such that and are well defined for and are in all arguments;
-
(b)
and , and for every ,
and in particular ;
-
(c)
for every and ,
so for each fixed , the map is pointwise norm-preserving on , and is uniformly bounded on compact subsets of in the sense that
for all admissible .
Proof.
Property (a) follows from the tubular neighbourhood theorem and the smoothness of : is well defined and for small, and is a linear map between tangent spaces for sufficiently small. The scaled transport satisfies the continuity and first-order conditions invoked in Huang et al. (2015), so is continuous in all arguments.
For (b), and , hence and . Since is the identity on , . The first–order agreement between and follows from Huang et al. (2015, (2.5)–(2.7), Lem. 3.5).
For (c), by construction,
so is pointwise norm-preserving:
The stated uniform boundedness on compact subsets then follows with constant . ∎
Objective regularity for the global target.
For and , define the local tilted conditional mean
Lemma 4 (Conditional regularity of the local tilted mean).
Assume for some finite constants . Let be open and let be compact. Assume
| (3) |
and
| (4) |
Then for almost every , the map is on . Moreover, there exist finite constants , depending only on , , and , such that with
we have almost surely
and for .
Proof.
Fix . By (3), the maps and are well defined almost surely, and differentiation under the conditional expectation is justified up to second order because dominates the first- and second-order directional derivatives of both integrands. Thus, for almost every ,
The polynomial growth bound on implies that there is a finite constant , depending only on and , such that almost surely
By (4), the quotient rule gives, for almost every ,
Therefore there exist finite constants , depending only on , , and , such that
Since , the envelopes are integrable. This proves the claim. ∎
Corollary 1 (Objective regularity and line search on Gelbrich level sets).
Let satisfy and set . Assume the boundary-separation condition
| (5) |
Choose such that
| (6) |
and define . Let and be as defined above, and let the Riemannian metric be the Euclidean metric on . Assume the conditions of Lemma 4 on an open neighbourhood of . Assume also the constraint regularity on :
| (7) |
Then is compact, and:
-
(a)
is on , with
and
Consequently, is –Lipschitz on .
-
(b)
Writing and , the Riemannian gradient is Lipschitz on , that is, there exists such that
-
(c)
Fix . For any and any descent direction with , there exists a step size in the domain of such that satisfies the weak Wolfe conditions at .
Proof.
By (6), every point of lies in . Since is continuous on and , the set
is closed in the compact set . Hence is compact.
For part (a), Lemma 4 provides an integrable envelope such that
The same lemma yields analogous integrable envelopes for and . Hence dominated differentiation applies to
which gives
Therefore
The Lipschitz property of on follows from the mean value theorem.
For ,
Since ,
Let . By (7), on , and on . Using
one obtains
Since ,
Let . Combining these bounds yields
so part (b) holds with .
For part (c), define the one-dimensional line-search function . By part (a) and the regularity of , the map is continuously differentiable on an interval containing . Because whenever the retraction is defined, and because is continuous on the compact set , the function is bounded below on its domain. Therefore there exists satisfying the weak Wolfe conditions along ; see Ring and Wirth (2012, Proposition 1). ∎
For an accepted step , define
The cautious RBFGS update of Huang et al. (2018, Algorithm 1) is then applied whenever the prescribed curvature condition holds; otherwise the current approximation is transported to the new tangent space without updating the Hessian surrogate.
Lemma 5 (Feasible initialization and containment).
Let satisfy and set . Assume the boundary-separation condition (5). Assume that satisfies the local quadratic control near the origin: there exist , and a symmetric positive–definite matrix with eigenvalues such that, for all ,
| (8) |
Assume also the hypotheses of Corollary 1.
-
(i)
For every prescribed local radius , there exists such that for any one can select a unit vector and with and .
-
(ii)
Let . Then is compact, is compact, and for any sequence generated by Riemannian BFGS on using and above with a weak Wolfe line search, one has and hence for all .
Proof.
For part (i), fix . Fix a unit vector and define on . Then is continuous, , and by (8),
Let . For any , there exists with by the intermediate value theorem. Then , proving part (i).
For part (ii), condition (5) implies the existence of such that for all with . Hence . By Corollary 1, the level set is compact and is continuous on . Hence is a closed subset of the compact set , so is compact. Now proceed by induction. Assume and let be the search direction. Let be produced by the weak Wolfe line search applied to and set . The Armijo condition yields
hence . Therefore and for all . ∎
Stationarity conclusion.
Under (5), Lemma 5 shows that the initial sublevel set is compact. Thus Assumption 4.1 in Huang et al. (2018) is satisfied for the RBFGS iterates started at . Together with Assumption 4.2, applied here with the scaled transport above, Theorem 4.2 yields
Since and is compact, the sequence admits an accumulation point . Continuity of implies that any accumulation point along a subsequence with is Riemannian stationary.
Appendix C Detailed Proof for Minimax Lower Bound
Standing Assumptions and Notation
Let with , , and . Write for the exposure density, and define the exponential tilt
Let , and denote
All bounds are derived for a general tilt and then specialized to .
Assumptions:
- (A1) Bounded outcomes
-
There exists such that a.s. Hence .
- (A2) Nondegenerate noise
-
a.s.
- (A3) Bounded tilt
-
and we require for some finite radius that . For the purpose of this proof, we assume there exists a fixed constant such that
- (A4) Model richness
-
The class of distributions is sufficiently rich. There exist constants such that for any , if with and , then for all , the density corresponds to a distribution .
- (A5) Pathwise differentiability
-
The functional is pathwise differentiable at any . The remainder from the von Mises expansion admits a uniform quadratic bound: there exists a constant such that for any perturbation with , the bound
holds uniformly for all and all valid .
Identification and target. Under standard consistency and conditional exchangeability,
C.1 Uniform second-order bounds
This subsection derives uniform second-order bounds for (i) the density ratio and (ii) the tilted conditional mean , valid over a finite-tilt regime. The bounds provide an explicit quadratic remainder for the expansion around and uniform control of second-order terms in the Le Cam two-point construction underlying the minimax lower bound.
Lemma 6 (Uniform bounds with explicit constants).
Suppose (A1)–(A3). Let . Then for all such ,
where one can take
Proof.
All vector norms are Euclidean, and all matrix norms are operator norms. Expectations and conditional expectations are under the baseline law unless explicitly indicated. Write by (A3), and by (A1). Let and define
Under (A3),
which yields uniform bounds with explicit constants.
Define
Since , we have
It follows that
Second-order expansion of and a uniform Hessian bound. Introduce the log-partition function
Then
Differentiating with respect to yields
and therefore
Using and , we obtain
A uniform bound for the Hessian holds on the ball . First,
so
Next, since ,
Combining these bounds with yields the uniform Hessian bound
| (9) |
Proof of (i). By Taylor’s theorem with integral remainder for the scalar function around ,
Since ,
Therefore,
By Taylor’s theorem with integral remainder and (9),
Hence,
which proves part (i) with .
Proof of (ii). By definition of the tilt,
Therefore,
where
Define
By Cauchy–Schwarz and ,
From part (i), with . Hence
Thus
which proves part (ii) with . ∎
C.2 First-order expansion of the EIF and its covariance
The second-order expansions above imply a first-order expansion of the efficient influence function for at , with leading term and a quadratic remainder.
Lemma 7 (EIF expansion from the three-component decomposition).
Let
Recall , and define
Assume (A1)–(A3) and . Then, with the constants from Lemma 6,
we have the expansion
and an explicit quadratic bound
Consequently,
We have . Moreover, the leading covariance of the influence function satisfies
where
In particular,
Equality holds if and only if is conditionally degenerate given .
Proof.
Preliminaries. Recall
By Lemma 6, for there exist remainders and such that
The tilt radius implies with , hence . We also use , , and .
Also,
Expansion of EIF components. Using the displays above,
where the remainder collects all quadratic and higher-order terms:
Collection of zero and first-order terms. Adding gives
Hence the first-order term is .
Explicit bounds for remainder terms. By Cauchy–Schwarz and (A2),
Since ,
Using ,
Moreover,
For the quadratic product,
Finally,
where we used . Combining all displays yields
with as stated.
Covariance of the gradient. Write
Set
so that . Because and is measurable with respect to , , hence
Moreover, conditioning on gives
and
Thus .
Finally, the law of total covariance yields
Mean zero property of . Let and . Using the definition of ,
The expectation decomposes as .
First,
Second, for ,
By definition , hence
so
Putting the pieces together,
∎
Corollary 2.
Proof.
By Lemma 7,
Write and , and center the remainder by . Then
Because , we have . Also by Cauchy–Schwarz.
By the triangle inequality and its reverse in ,
Using and and squaring both sides yields the additive bracket.
Suppose . Then, for all ,
Hence, for any ,
Dividing the additive bracket by yields the multiplicative squeeze with
Finally, because , the coarser lower bound follows. ∎
Lemma 8 (Hardest direction after orthogonalization).
Let and
Define
Then , , and . Let Assumption (A4) hold with constants . Consider with density , where . Then , , and . Moreover, for ,
where is the remainder in the linear expansion , and the remainders satisfy
with as in Lemma 7 and the differentiability modulus from (A5). Finally,
Proof.
Properties of . Write , so . By construction,
Since and , we have . Next,
Also,
Finally,
Because , we obtain
Validity of . By the definition of in (A4), for every with and , the measure with density lies in for all . Moreover, . If then , so .
First-order expansion of via (A5). By (A5), for ,
Subtracting gives
Evaluation of the leading inner product and remainder bound. By the linearization,
for all . Hence
∎
C.3 Minimax Lower Bound
Recall
Define the projected covariance (Schur complement) of the linear regression of onto the one-dimensional span of :
Theorem 4 (Minimax lower bound).
Assume (A1)–(A5). Fix some and assume
If is positive definite, then there exists a constant that depends only on and the constants in (A1)–(A5), but not on , such that
In particular, the lower bound is expressed in terms of and generally cannot be strengthened by replacing with a larger matrix.
Proof of Theorem 4.
Notation.
Let , , and .
Define .
Two-point construction. Fix and let be as in Lemma 8. For a given , choose
Then , and the perturbed model defined by lies in , integrates to , and satisfies .
Parameter gap and the Schur complement. By Lemma 8,
with
Moreover,
Therefore
Using and the assumption , we have
hence
Thus
With ,
Le Cam two-point inequality and asymptotic lower bound. Le Cam’s two-point inequality yields, for any estimator ,
Using the TV bound above,
Now apply the parameter gap bound and divide by :
The term is and does not contribute to the limit after normalization by . Therefore,
Since , taking and yields the same lower bound, proving the theorem with . ∎
On positive definiteness of .
Let
Then
The matrix is positive definite if and only if the residual is nondegenerate, that is
equivalently for every nonzero .
If for some , then there exist scalars such that
Recall and with and . Collecting coefficients of yields
By (A2), a.s., hence
Taking conditional expectation given and using gives and thus
Thus, failure of requires a nontrivial direction along which is almost surely degenerate.
Linear Incremental Effect
By explicitly specializing the general tilt function to the linear form , the projected covariance recovers the exact geometric structure defined in Section 4.1, thereby completing the minimax lower bound for the linear incremental effect evaluated in the main text.
Appendix D On convergence and normality
We first establish a key auxiliary result under the same assumptions and notation used in the proof of the minimax lower bound. We clarify how our rate conditions on the tilted nuisances translate into standard rates for the outcome regression and the exposure density .
Lemma 9 (Reduction to outcome regression and exposure density).
Assume (A1)–(A3), and let denote the conditional density of given , with support of finite Lebesgue measure . Suppose there exist constants such that
Let , and fix . For any with define
and
Let be any estimators of , and construct
To ensure the denominators in and are strictly bounded away from zero, we define the truncated estimator below. Later we show this regularization enforces positivity constraints consistent with the true without compromising the convergence rates inherited from the nuisance estimators.
Then there exist finite constants , depending only on and the bound on , such that
| (10) | ||||
| (11) |
where denotes the –norm with respect to the law of .
Proof.
Since and is bounded, there exists such that almost surely. Hence, for all such ,
In particular, is bounded away from and ; the same holds for by construction.
Control of . By definition,
Taking absolute values and using the bound on the exponential tilt gives
By Cauchy–Schwarz and finiteness of ,
Squaring both sides and integrating over with respect to gives
By definition,
Under the boundedness assumption on the exposure density, there exists such that for all . Hence, for any non-negative integrand we have
Applying this with yields
Substituting into the previous display, and noting that for any function depending only on we have
we obtain
and hence
Moreover, since for all and the truncation map is –Lipschitz, we have
Control of . We have
Using the finite- bounds on the numerator and denominators, there exists a constant such that
so that
with , which proves (10).
Control of . Using the product expansion , we find
Assumption (A1) implies , and the bounds on give . Hence
On an event whose probability tends to one we have , and another application of Cauchy–Schwarz yields
for some finite constant .
Control of . By the algebraic identity ,
Using again the uniform bounds on implied by (A1)–(A3), we obtain
for some finite constant . Taking –norms and combining the bounds above gives (11) with
∎
Corollary 3 (Rate condition in terms of ).
We now turn to the von Mises expansion and the second-order remainder bound that underpin the finite- central limit theorem in the main text. The results in this subsection are stated in terms of the tilted nuisances ; together with Corollary 3, they imply Theorem 3.
We work with the cross-fitted one-step estimator written directly in density-ratio form,
| (12) |
where, for a fixed tilt with ,
and
Here and are obtained from cross-fitted regressions of the transformed outcomes and on , respectively.
From the von Mises decomposition (see, e.g., Kennedy, 2024) and by adding and subtracting , , and on the right-hand side, we obtain
| (13) |
where
and the second-order remainder is
Lemma 10 (Second-order remainder).
For any fixed with ,
Proof.
Starting from the estimator representation (12), the definition of gives
The first term vanishes by iterated expectation:
and
Hence , and the Cauchy–Schwarz inequality yields
∎
Theorem 5 (Asymptotic normality at rate).
Assume (A1)–(A3). Let and suppose are i.i.d. draws from . Fix with , and use -fold cross-fitting to obtain the nuisance estimators . Suppose
and the product condition
A convenient sufficient condition is
which can be achieved by suitable cross-fitted learners under standard regularity conditions. Then
Consequently, for the incremental effect ,
Proof.
Control of the leading empirical process term. For any , boundedness of implies there exists such that
so that . Under (A1), almost surely, and hence as well. Therefore
Since are i.i.d., the classical Lindeberg–Feller CLT applies and yields
Control of the estimated influence term. From the definitions,
so that
Taking of both sides and noting that , we obtain
Using , the finite- bounds on and , and the Cauchy–Schwarz inequality, it follows that
under the assumed rates. In particular,
Conclusion. Combining (13) with the three displays above and applying Slutsky’s theorem yields
Moreover, applying the same argument at and using the multivariate Lindeberg–Feller CLT gives the joint convergence
Therefore, by the continuous mapping theorem,
∎
Appendix E Sensitivity analysis for unmeasured confounding
This section assesses the robustness of our incremental policy estimands to violations of the no-unmeasured-confounding assumption and follows the framework of (Chernozhukov et al., 2022).
E.1 Setup: long vs. short worlds
Let the observed data be drawn i.i.d. from . Assume there exists an unobserved confounder such that, with , the potential outcomes satisfy
| (14) |
together with SUTVA/consistency.
Define the long and short outcome regressions
Let denote the long conditional density of given , and the short conditional density of given (i.e. the observed conditional law of the exposure mixture given ).
Stochastic intervention fixed from observed data.
We keep the intervention rule identical to Section 2: for any fixed ,
| (15) |
Importantly, depends only on and is therefore implementable; it is well-defined regardless of whether exists.
Target (long) estimand.
Let denote the counterfactual outcome under the intervention that draws from (independently of given ). Under (14), the causal estimand is
| (16) |
Identified (short) estimand under ignorability given .
If one incorrectly assumes ignorability given only, the same g-formula yields the identified estimand
| (17) |
which is the estimand targeted by our estimators.
Our sensitivity analysis bounds the discrepancy as a function of interpretable sensitivity parameters.
E.2 Linear functional form and Riesz representers
For a fixed , define the functional on square-integrable functions :
| (18) |
Then and .
Proposition 6.
The functional is a linear functional.
Proof.
To establish linearity, we must verify that satisfies both the property of additivity and homogeneity. Let be arbitrary functions and let be a scalar constant.
1. Additivity.
By the linearity of the Lebesgue integral and the linearity of the expectation operator, we have:
2. Homogeneity.
Since both conditions are satisfied, is a linear functional. ∎
Weak overlap / continuity condition.
Assume is continuous on ; a sufficient condition required here is the “weak overlap” requirement
| (19) |
This condition is untestable without , but it is the standard integrability requirement needed for the Riesz representation.
Riesz representation.
Under (19), by the Riesz–Fréchet representation theorem there exists a unique (long) Riesz representer such that
Moreover, the representer has the Radon–Nikodym form
| (20) |
Likewise, the short Riesz representer for over is
| (21) |
Under the exponential tilt (15), simplifies to
| (22) |
Thus our policy estimand is a continuous linear functional of the long regression with a well-defined RR.
E.3 Exact OVB identity and sharp bound
Define the outcome regression error and the RR error:
Then the omitted-variable bias satisfies the exact identity
| (23) |
By Cauchy–Schwarz,
| (24) |
It is often useful to isolate the “degree of adversity”
so that (23) yields In our primary analysis and reported bounds, we focus on the worst-case scenario by considering adversarial confounding, which implicitly sets . This allows us to establish a conservative bound and subsequently omit the correlation term from our final operational formulas.
E.4 Reparameterization by interpretable partial
Following the general theory, the bound can be rewritten as a product of an identifiable scale and two sensitivity parameters with partial- interpretations.
Identifiable scale.
Let and . Define
| (25) |
which depends only on the observed-data law of and is therefore estimable.
Outcome-side sensitivity (partial ).
Define
| (26) |
i.e. the nonparametric partial of with given .
Exposure/RR-side sensitivity.
Because is the projection of onto , we have Define the (RR-space) :
and let
| (27) |
Thus is the fraction of RR variation generated by the latent confounder.
Final bound in form.
E.5 Incremental effect
If the scientific target is the incremental effect contrast , then the difference of continuous linear functionals is again a continuous linear functional. The same analysis applies with the RR replaced by the RR contrast:
and unchanged. One obtains an interval for by applying (29) to .
E.6 Implementation
For each fixed considered in the paper, the sensitivity analysis requires three estimated quantities:
-
1.
: our EIF-based (cross-fitted) estimator of the incremental effect from Section 3.
-
2.
using the same cross-fitted as in the main estimator.
-
3.
, where , and is obtained either as the estimated density ratio or via the closed form (22) by estimating with regression.
Then .
Finally, for any posited sensitivity parameters
the bias bound becomes
Accordingly, the estimated point identified set for the incremental effect is
When , the ratio
compares the benchmarked bias half-width with the magnitude of the estimated incremental effect. Table 2 reports this ratio for the main-text benchmark with and over the 10 positive Gelbrich targets on each of the 9 intervention paths. Values below one indicate that the benchmarked point bound is narrower than the estimated effect in magnitude, whereas values above one indicate that the benchmarked bias half-width exceeds the estimated effect itself.
| Scenario | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50 |
|---|---|---|---|---|---|---|---|---|---|---|
| BC | 0.22 | 0.23 | 0.25 | 0.27 | 0.28 | 0.30 | 0.32 | 0.33 | 0.35 | 0.37 |
| NO3 | 0.42 | 0.62 | 0.60 | 0.53 | 0.47 | 0.41 | 0.36 | 0.32 | 0.29 | 0.26 |
| OM | 0.21 | 0.24 | 0.26 | 0.28 | 0.30 | 0.33 | 0.35 | 0.37 | 0.39 | 0.41 |
| SO4 | 0.10 | 0.10 | 0.09 | 0.08 | 0.07 | 0.07 | 0.06 | 0.06 | 0.05 | 0.05 |
| NH4 | 0.13 | 0.18 | 0.21 | 0.21 | 0.20 | 0.19 | 0.17 | 0.16 | 0.15 | 0.14 |
| BC+OM | 0.84 | 0.98 | 1.24 | 1.72 | 2.67 | 3.73 | 7.09 | 50.85 | 12.16 | 12.16 |
| NO3+SO4+NH4 | 0.17 | 0.19 | 0.22 | 0.27 | 0.32 | 0.35 | 0.36 | 0.35 | 0.32 | 0.32 |
| All | 0.07 | 0.08 | 0.10 | 0.12 | 0.14 | 0.17 | 0.22 | 0.24 | 0.25 | 0.25 |
| BFGS | 0.03 | 0.21 | 0.23 | 0.20 | 0.16 | 0.09 | 0.08 | 0.07 | 0.07 | 0.07 |
Confidence bounds for the estimated endpoints.
The empirical results in the main text report confidence bounds for the two estimated endpoints above. Let denote the cross-fitted EIF contribution of . Define the centered plugin signals
and write
The delta method gives
Writing
the corresponding standard errors are
The sensitivity-adjusted interval is obtained by combining a lower confidence bound for and an upper confidence bound for :
Selection of sensitivity parameters.
The parameters and have direct interpretations: is the maximal fraction of residual outcome variance explainable by given , and is the maximal fraction of RR variation explainable by for the policy indexed by . These can be calibrated by subject-matter knowledge and by benchmarking against observed covariates, and then used in the bound and the endpoint confidence bounds above.
E.7 Formal benchmarking and calibration
Following Chernozhukov et al. (2022), we calibrate the sensitivity parameters on the scale induced by nested linear projections. This construction translates the observed contribution of a single covariate into a benchmark that is commensurate with the omitted-variable calibration in the bias bound. We carry out this comparison for each of the 22 observed covariates.
Outcome-side benchmark.
For each observed covariate , let
where denotes the observed covariate vector with removed. The associated benchmark statistics are
Thus is the observed partial for after adjusting for , and expresses the same gain relative to the residual variation in the full projection. Table 3 reports these quantities for all 22 covariates. The largest outcome-side benchmark is White, with and .
RR-side benchmark.
For the RR-side, we first evaluate the fitted short RR at every intervention target retained in the application analysis. For each and each target , we then compute the nested linear projections
The corresponding pointwise benchmark statistics are
This is the direct analogue of the outcome-side construction, with the fitted short RR taking the place of the observed outcome.
The benchmark reported in the main text is attached to an intervention path rather than to a single target. For each scenario, we therefore rank the 22 covariates by the average of over the positive Gelbrich targets on that path. This average is the RR-side score used in the formal benchmark. It summarizes how much adding improves the linear approximation to the fitted short RR along the displayed path, and it keeps the benchmark tied to an intervention curve that is actually reported in the application. Table 4 reports these scenario-level mean values for all 22 covariates.
With this aggregation, the selected RR-side benchmark covariates are Housing More People Units for BC, OM, and BC+OM; Cigarette Smoking for NO3, NH4, NO3+SO4+NH4, and All; Households Smartphone for SO4; and Housing Vacant for BFGS.
Benchmark calibration.
For the selected outcome-side covariate,
For the selected RR-side covariate in a given scenario,
These calibrated values are inserted directly into the point bounds and the endpoint confidence bounds above. The main application figure uses and , so the omitted confounder is calibrated to the observed benchmark strength on both the outcome side and the RR side.
| Covariate | ||
|---|---|---|
| White | 0.0642 | 0.0686 |
| Poverty | 0.0341 | 0.0353 |
| Physical Activity | 0.0322 | 0.0333 |
| Housing More People Units | 0.0151 | 0.0154 |
| Binge Drinking | 0.0132 | 0.0134 |
| Households No Internet | 0.0123 | 0.0124 |
| Housing No Vehicle | 0.0112 | 0.0113 |
| Housing 10 Units | 0.0087 | 0.0087 |
| Median Income | 0.0082 | 0.0083 |
| Cigarette Smoking | 0.0052 | 0.0052 |
| Low Education Computer No Internet | 0.0042 | 0.0042 |
| Percentage No Insurance | 0.0037 | 0.0037 |
| Male | 0.0032 | 0.0032 |
| Households Smartphone | 0.0023 | 0.0023 |
| Housing Renter | 0.0011 | 0.0011 |
| Housing Vacant | 0.0010 | 0.0010 |
| HS Higher | 0.0006 | 0.0006 |
| Housing Mobile | 0.0001 | 0.0001 |
| Obesity | 0.0001 | 0.0001 |
| Unemployed | 0.0001 | 0.0001 |
| Households Low Income No Internet | 0.0001 | 0.0001 |
| Households Only Smartphone | 0.0000 | 0.0000 |
| Covariate | BC | NO3 | OM | SO4 | NH4 | BC+OM | NO3+SO4+NH4 | All | BFGS |
|---|---|---|---|---|---|---|---|---|---|
| White | 0.0003 | 0.0049 | 0.0029 | 0.0017 | 0.0005 | 0.0002 | 0.0012 | 0.0008 | 0.0024 |
| Poverty | 0.0011 | 0.0171 | 0.0005 | 0.0015 | 0.0081 | 0.0012 | 0.0023 | 0.0026 | 0.0089 |
| Physical Activity | 0.0014 | 0.0018 | 0.0011 | 0.0003 | 0.0007 | 0.0014 | 0.0014 | 0.0001 | 0.0002 |
| Housing More People Units | 0.0089 | 0.0017 | 0.0124 | 0.0026 | 0.0007 | 0.0043 | 0.0007 | 0.0003 | 0.0043 |
| Binge Drinking | 0.0003 | 0.0002 | 0.0027 | 0.0002 | 0.0005 | 0.0004 | 0.0017 | 0.0002 | 0.0021 |
| Households No Internet | 0.0003 | 0.0008 | 0.0003 | 0.0002 | 0.0009 | 0.0017 | 0.0003 | 0.0000 | 0.0004 |
| Housing No Vehicle | 0.0021 | 0.0006 | 0.0054 | 0.0001 | 0.0006 | 0.0027 | 0.0125 | 0.0022 | 0.0006 |
| Housing 10 Units | 0.0004 | 0.0004 | 0.0023 | 0.0002 | 0.0001 | 0.0003 | 0.0010 | 0.0003 | 0.0003 |
| Median Income | 0.0018 | 0.0210 | 0.0005 | 0.0016 | 0.0058 | 0.0020 | 0.0025 | 0.0028 | 0.0071 |
| Cigarette Smoking | 0.0005 | 0.0259 | 0.0015 | 0.0022 | 0.0167 | 0.0014 | 0.0174 | 0.0079 | 0.0036 |
| Low Education Computer No Internet | 0.0001 | 0.0001 | 0.0003 | 0.0004 | 0.0008 | 0.0005 | 0.0009 | 0.0009 | 0.0038 |
| Percentage No Insurance | 0.0001 | 0.0015 | 0.0004 | 0.0000 | 0.0016 | 0.0002 | 0.0003 | 0.0002 | 0.0015 |
| Male | 0.0000 | 0.0008 | 0.0020 | 0.0008 | 0.0007 | 0.0001 | 0.0002 | 0.0006 | 0.0017 |
| Households Smartphone | 0.0009 | 0.0042 | 0.0004 | 0.0037 | 0.0029 | 0.0004 | 0.0009 | 0.0015 | 0.0006 |
| Housing Renter | 0.0026 | 0.0004 | 0.0001 | 0.0001 | 0.0001 | 0.0013 | 0.0008 | 0.0000 | 0.0002 |
| Housing Vacant | 0.0007 | 0.0224 | 0.0006 | 0.0010 | 0.0035 | 0.0011 | 0.0157 | 0.0064 | 0.0138 |
| HS Higher | 0.0003 | 0.0005 | 0.0007 | 0.0006 | 0.0000 | 0.0004 | 0.0004 | 0.0005 | 0.0025 |
| Housing Mobile | 0.0008 | 0.0060 | 0.0000 | 0.0006 | 0.0002 | 0.0015 | 0.0145 | 0.0072 | 0.0031 |
| Obesity | 0.0003 | 0.0001 | 0.0012 | 0.0000 | 0.0011 | 0.0009 | 0.0001 | 0.0003 | 0.0006 |
| Unemployed | 0.0002 | 0.0026 | 0.0001 | 0.0005 | 0.0003 | 0.0000 | 0.0008 | 0.0006 | 0.0029 |
| Households Low Income No Internet | 0.0002 | 0.0010 | 0.0000 | 0.0013 | 0.0002 | 0.0000 | 0.0016 | 0.0009 | 0.0005 |
| Households Only Smartphone | 0.0002 | 0.0004 | 0.0003 | 0.0009 | 0.0003 | 0.0001 | 0.0003 | 0.0002 | 0.0016 |