Handbook of Matching and Weighting Adjustments for Causal Inference
(Chapman & Hall/CRC Handbooks of Modern Statistical Methods)
An observational study infers the effects caused by a treatment, policy, program, in-
tervention, or exposure in a context in which randomized experimentation is unethi-
cal or impractical. One task in an observational study is to adjust for visible pretreat-
ment differences between the treated and control groups. Multivariate matching and
weighting are two modern forms of adjustment. This handbook provides a compre-
hensive survey of the most recent methods of adjustment by matching, weighting,
outcome modeling, and their combinations. Three additional chapters introduce
the steps from association to causation that follow after adjustments are complete.
When used alone, matching and weighting do not use outcome information, so they are
part of the design of an observational study. When used in conjunction with models for
the outcome, matching and weighting may enhance the robustness of model-based ad-
justments. The book is for researchers in medicine, economics, public health, psychol-
ogy, epidemiology, public program evaluation, and statistics, who examine evidence of
the effects on human beings of treatments, policies, or exposures.
Chapman & Hall/CRC Handbooks of Modern Statistical Methods
Series Editor
Garrett Fitzmaurice, Department of Biostatistics, Harvard School of Public Health, Boston, MA, U.S.A.
The objective of the series is to provide high-quality volumes covering the state-of-the-art in the theory
and applications of statistical methodology. The books in the series are thoroughly edited and present
comprehensive, coherent, and unified summaries of specific methodological topics from statistics. The
chapters are written by the leading researchers in the field and present a good balance of theory and
application through a synthesis of the key methodological developments and examples and case studies
using real data.
Published Titles
Handbook of Quantile Regression
Roger Koenker, Victor Chernozhukov, Xuming He, and Limin Peng
Handbook of Meta-Analysis
Christopher H. Schmid, Theo Stijnen, and Ian White
Edited by
José R. Zubizarreta
Elizabeth A. Stuart
Dylan S. Small
Paul R. Rosenbaum
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
© 2023 selection and editorial matter, José Zubizarreta, Elizabeth A. Stuart, Dylan S. Small, Paul R. Rosenbaum; indi-
vidual chapters, the contributors
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot as-
sume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9781003102670
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
To our families, students, and collaborators,
and the causal inference community.
Contents
Contributors ix
I Conceptual Issues 1
1 Overview of Methods for Adjustment and Applications in the Social and Behavioral
Sciences: The Role of Study Design 3
Ting-Hsuan Chang and Elizabeth A. Stuart
2 Propensity Score 21
Paul R. Rosenbaum
II Matching 61
4 Optimization Techniques in Multivariate Matching 63
Paul R. Rosenbaum and José R. Zubizarreta
23 Bayesian Propensity Score Methods and Related Approaches for Confounding Adjust-
ment 501
Joseph Antonelli
Index 611
Contributors
José Zubizarreta, PhD, is professor in the Department of Health Care Policy at Harvard Medical
School and in the Department of Biostatistics at Harvard T.H. Chan School of Public Health. He
is also a Faculty Affiliate in the Department of Statistics at Harvard University. He is a Fellow of
the American Statistical Association and is a recipient of the Kenneth Rothman Award, the William
Cochran Award, and the Tom Ten Have Memorial Award.
Elizabeth A. Stuart, PhD, is Bloomberg Professor of American Health in the Department of Mental
Health, the Department of Biostatistics and the Department of Health Policy and Management at
Johns Hopkins Bloomberg School of Public Health. She is a Fellow of the American Statistical
Association and the American Association for the Advancement of Science, and is a recipient of the
Gertrude Cox Award for applied statistics, Harvard University’s Myrto Lefkopoulou Award for
excellence in Biostatistics, and the Society for Epidemiologic Research Marshall Joffe Epidemiologic
Methods award.
Dylan S. Small, PhD, is the Universal Furniture Professor in the Department of Statistics and Data
Science of the Wharton School of the University of Pennsylvania. He is a Fellow of the American
Statistical Association and an Institute of Mathematical Statistics Medallion Lecturer.
Paul R. Rosenbaum is emeritus professor of Statistics and Data Science at the Wharton School
of the University of Pennsylvania. From the Committee of Presidents of Statistical Societies, he
received the R. A. Fisher Award and the George W. Snedecor Award. He is the author of several books,
including Design of Observational Studies and Replication and Evidence Factors in Observational
Studies.
Part I
Conceptual Issues
1
Overview of Methods for Adjustment and
Applications in the Social and Behavioral Sciences:
The Role of Study Design
Ting-Hsuan Chang and Elizabeth A. Stuart
CONTENTS
1.1 Introduction to Causal Effects and Non-experimental Studies . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Metrics for Adjustment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Design Strategies for Adjustment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.2 Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.3 Stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.4 Covariate adjustment in a model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.5 The role of balance checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Applications of Design Approaches for Covariate Adjustment . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Non-experimental studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Non-response adjustments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.3 Generalizability of randomized trial results to target populations . . . . . . . . . . . 11
1.4.4 Mediation analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Sensitivity Analyses to an Unobserved Confounder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
DOI: 10.1201/9781003102670-1
individual i, respectively. The causal effect for individual i is defined as Y1i − Y0i . This definition
of the individual causal effect entails the assumptions that an individual’s outcome is unaffected by
whether or not other individuals are exposed and that the exposure of interest is well-defined – these
assumptions together form the stable unit treatment value assumption, or SUTVA for short [3]. In
epidemiology the second piece of this is sometimes known as “consistency” and is encapsulated
by the assumption that the outcome observed when someone receives the treatment is actually
Y1i and the outcome observed when someone receives the comparison condition is actually Y0i :
Yi = Ti × Y1i + (1 − Ti ) × Y0i , where Y denotes the observed outcome and T denotes the exposure
status (T = 1 if exposed and T = 0 if not exposed).
The fundamental problem of causal inference refers to the fact that Y1i − Y0i can never be
directly observed because for each individual, only their potential outcome corresponding to their
actual exposure status is observed [2]. That is, if individual i were exposed, we would only observe
Y1i and not Y0i and vice versa for those who receive the comparison condition. This inability to
directly observe the quantity of interest is what distinguishes causal inference from standard statistical
inference. And because individual causal effects are inherently unobservable directly, the goal of
estimation is instead often an average treatment effect, for example, the average across all individuals
(average treatment effect; ATE), defined as E[Y1i − Y0i ]. Other estimands are also sometimes of
interest, such as the average treatment effect on the treated (ATT), defined as E[Y1i − Y0i |Ti = 1].
Randomized controlled trials (RCTs) are often seen as the gold standard for estimating causal
effects because the ATE can be readily estimated from the study sample as the difference in the mean
outcome between the exposed and unexposed. This is justified by the property that in expectation
randomization produces two groups (exposed and unexposed, sometimes termed treated and control)
comparable with respect to all pre-exposure characteristics. Thus, the outcome of the exposed group
is expected to be representative of the outcome of the unexposed group (or all individuals in the study)
had they all been exposed, and vice versa; this property is often referred to as “exchangeability” in
the epidemiologic literature [4, 5]. E[Y1i ] can thus be estimated by the mean outcome among the
exposed, and similarly for E[Y0i ], and the difference in outcomes observed in the two groups yields
an unbiased estimate of the causal effect.
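To make these quantities concrete, consider the following small simulation sketch in R (the data and numbers are purely hypothetical): potential outcomes are generated for every individual, exposure is randomized, and the difference in group means approximately recovers the ATE.

```r
# A minimal sketch with simulated (hypothetical) data: under randomization,
# the difference in observed group means estimates the ATE without bias.
set.seed(1)
n  <- 10000
x  <- rnorm(n)                      # a pre-exposure covariate
y0 <- x + rnorm(n)                  # potential outcome under the comparison condition
y1 <- y0 + 2                        # potential outcome under exposure; true ATE = 2
t  <- rbinom(n, 1, 0.5)             # randomized exposure, independent of (y0, y1)
y  <- t * y1 + (1 - t) * y0         # observed outcome: Y = T*Y1 + (1 - T)*Y0
mean(y[t == 1]) - mean(y[t == 0])   # close to the true ATE of 2
```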
However, randomization is often hindered by ethical concerns and feasibility; for example, it is
unethical to randomize individuals to potentially harmful exposures, such as childhood maltreatment.
When randomization is not possible, non-experimental studies are often conducted to estimate causal
effects, but require addressing a set of challenges first. The main challenge, and also the most common
criticism of non-experimental studies, is that of confounding. In the presence of confounding, the
outcome among the exposed is not representative of the outcome among the unexposed had they
been exposed (for more formal definitions of confounding, see VanderWeele [5] and VanderWeele
and Shpitser [6]). This is driven by differences between the exposed and unexposed groups in the
distributions of covariates that predict both the exposure and outcome. For example, prognostic
factors often predict whether a patient would receive treatment as well as their outcome; thus, treated
individuals may be on average in worse health than untreated individuals and the contrast of outcomes
between the two groups may not accurately reflect the effect of treatment.
Non-experimental studies thus must grapple with confounding and use strategies to adjust for
potential confounders. Such approaches for dealing with confounding often rely on some degree of
substantive knowledge of the causal structure, specifically covariates that associate with both the
exposure and outcome. Causal graphs (or “causal directed acyclic graphs”) can be a useful visual
tool for identifying sources of confounding. A causal graph encodes the investigator’s assumptions
of the underlying relationships among the exposure, outcome, and other covariates (either measured
or unmeasured) relevant to the research question. In a causal graph, covariates are represented as
nodes and arrows represent causal relationships. Assuming the graph is correct, a set of rules have
been developed and mathematically proven to identify confounders (i.e., covariates that need to be
measured and adjusted for in order to obtain unconfounded effect estimates) from causal graphs [7,8].
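As an illustration, rules of this kind are implemented in software; for instance, the R package dagitty can enumerate sufficient adjustment sets for a stated graph. The following is a minimal sketch with a hypothetical three-node graph.

```r
# Encode an assumed causal graph and ask which covariate sets suffice to
# adjust for confounding of the X -> Y effect.
library(dagitty)
g <- dagitty("dag { Z -> X ; Z -> Y ; X -> Y }")  # Z confounds X -> Y
adjustmentSets(g, exposure = "X", outcome = "Y")  # returns { Z }
```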
then the implicit imputation of the missing potential outcomes under exposure for the unexposed
group is going to be based on a model of that potential outcome fit among people much younger
than the people for whom the prediction is being made (and vice versa for imputing the missing
potential outcome for the exposed group). While researchers are generally cautioned against
out-of-sample prediction and extrapolation, this can often happen in a causal inference context
without researchers being fully aware (see Imai et al. [18] for more discussion of this point). Rubin [15] lays
out the conditions under which regression adjustment for confounders is “trustworthy” (i.e., when a
design approach is needed in addition to an analysis approach) and shows that analysis methods are
sufficient only if the covariates differ across groups by less than 0.1 or 0.2 standard deviations.
This idea of a clear separation of design and outcome analysis traces back to the concepts
discussed by William G. Cochran [19], who stressed that non-experimental studies should be planned
carefully, as if an RCT were being conducted. These ideas have re-emerged in the epidemiology literature
through emphasis on “trial emulation,” which extends some of these basic study design ideas
to longitudinal settings, such as when using electronic health record data to study the effects of
healthcare interventions on outcomes [20].
It is also the case that combining the analysis- and design-based approaches – for example,
regression adjustment on the resulting sample of a design-based method – has been shown to
work better than either approach alone [21, 22]. This is similar to the concept of “doubly robust”
methods [23–25], in that regression adjustment may help reduce any remaining imbalance in the
sample after a design-based method is implemented [26]. Common analysis- and design-based
methods are presented in Section 1.3.
We now turn to more specifics on the approaches that can be used to adjust for confounders in a
design stage.
individuals using propensity scores often leads to better matched samples than using the Mahalanobis
distance [34]. However, because the propensity scores can only be estimated in non-experimental
studies – they are not directly observed or known – investigators should check the resulting balance
from the proposed propensity score model and choose a model that achieves adequate balance.
For recent reviews on the use of propensity scores for adjustment in non-experimental studies,
see for example, Desai and Franklin [35] and Pan and Bai [36]. A more recent approach, entropy
balancing, uses an algorithm to find individual weights such that the weighted exposed and unexposed
groups satisfy a set of prespecified balance conditions, typically involving equalization of covariate
moments [37]. Thus, unlike the propensity score approach, entropy balancing does not require
iteratively checking covariate balance and refining models.
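As a point of reference for balance checking, the following R sketch computes the standardized mean difference for one covariate; the names df, covs, and w are hypothetical, and dedicated packages such as cobalt provide richer diagnostics.

```r
# Standardized mean difference (SMD) for one covariate, with optional weights;
# |SMD| above roughly 0.1 is a common flag for inadequate balance.
smd <- function(x, t, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[t == 1], w[t == 1])
  m0 <- weighted.mean(x[t == 0], w[t == 0])
  s  <- sqrt((var(x[t == 1]) + var(x[t == 0])) / 2)  # pooled unweighted SD
  (m1 - m0) / s
}
# sapply(df[, covs], smd, t = df$treat, w = df$w)
```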
1.3.1 Matching
Matching methods generally involve selecting a subset of the sample such that the observed covariates
are balanced across exposure groups in this subset. The most common method is k:1 nearest neighbor
matching, which is usually done by matching k controls to an exposed individual based on the chosen
distance measure and discarding unmatched controls [26, 38]. Thus, k:1 nearest neighbor matching is
mostly used to estimate the ATT [26]. In addition to choosing a distance measure, there are several
additional specifications to consider with k:1 nearest neighbor matching, including the number of
matches (i.e., the value of k), the closeness of matches, the matching algorithm, and matching with or
without replacement. Here we provide a brief discussion of these specifications; for more details we
refer the reader to Stuart [26] and Greifer and Stuart [9]. We also refer the reader to Greifer [39] for a
comprehensive and updated introduction to the R package MatchIt, which provides implementation
of several matching methods as well as balance assessment [40].
1:1 nearest neighbor matching selects for each exposed individual a single control individual
who is most similar to them (e.g., has the closest propensity score value), but this may result in
many control individuals being discarded. Increasing k retains a larger sample size, which may
improve precision of the treatment effect estimate, but may lead to larger bias since some of the
matches will likely be of lower quality than the closest match [41, 42]. One way to mitigate this
bias-variance tradeoff is to perform variable ratio matching, which allows each exposed individual to
have a different number of matches [43]. To ensure the quality of matches, one may also impose a
prespecified caliper distance, usually on the difference in the propensity scores of matched individuals
(e.g., 0.2 of the standard deviation of the logit of the propensity score [44]); exposed individuals
with no controls falling within the caliper are discarded. The result is then an estimate of a
restricted ATT – the effect of treatment among the treated individuals who could be matched –
which affects for whom inferences can be drawn.
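Putting these specifications together, the following is a hedged sketch using the MatchIt package mentioned above; the data frame df, exposure treat, and covariates x1 and x2 are hypothetical.

```r
# 1:1 nearest neighbor matching on a logistic-regression propensity score,
# without replacement and with a caliper; unmatched controls (and exposed
# individuals with no control inside the caliper) are discarded.
library(MatchIt)
m <- matchit(treat ~ x1 + x2, data = df,
             method   = "nearest",  # greedy nearest neighbor matching
             distance = "glm",      # propensity score from logistic regression
             ratio    = 1,          # k = 1: one control per exposed individual
             caliper  = 0.2,        # caliper in standard-deviation units
             replace  = FALSE)      # each control matched at most once
summary(m)                          # balance diagnostics for the matched sample
matched <- match.data(m)            # matched sample for the outcome analysis
```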
One question is how exactly the matches are obtained. Two commonly used matching algorithms
are greedy matching and optimal matching. Greedy matching forms matched pairs/sets one exposed
individual at a time, and the order in which the exposed individuals are matched may affect the
quality of matches [26, 45]. Optimal matching is “optimal” in the sense that the matched sample
is formed such that the average distance across all matched pairs/sets is minimized [46], but it is
more computationally intensive and may be challenging to implement in large samples. Studies have
shown that the resulting matched sample usually does not differ much between the two matching
algorithms [34, 45].
Matching without replacement requires control individuals to be matched with at most one
exposed individual, whereas matching with replacement allows control individuals to be reused
in matching. Once again investigators are faced with a bias-variance tradeoff as matching with
replacement may decrease bias by selecting the closest match for each exposed individual, but
the matched sample may only retain a small number of controls [26]. Matching with replacement
may also complicate subsequent analyses since the repeated use of controls should be taken into
account [26, 47].
Additionally, optimization methods can be used to improve matching. A common optimization-
based matching method is cardinality matching, which maximizes the size of the matched sample
under covariate balance constraints specified by the investigator [48, 49]. Some of the recently
developed optimization-based matching methods can be found in Cho [50] and Sharma et al. [51].
1.3.2 Weighting
Weighting adjustments typically involve direct use of the estimated propensity scores to assign
weights for each individual. For estimating the ATE, inverse probability of treatment weights (IPTW)
are commonly used [52]; these are 1/êi for those exposed and 1/(1 − êi) for those unexposed,
where êi represents the estimated propensity score for individual i. These weights are applied to
weight both exposure groups to resemble the combined group with respect to the observed covariates.
For estimating the ATT, exposed individuals receive a weight of 1, and unexposed individuals are
assigned a weight equal to êi/(1 − êi), which weights the unexposed group to resemble the exposed
group.
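As a brief sketch (with a hypothetical data frame df containing an exposure treat and covariates x1 and x2), these weights can be computed directly from a fitted logistic regression.

```r
# Estimate propensity scores by logistic regression, then form ATE (inverse
# probability) and ATT weights as defined above.
fit <- glm(treat ~ x1 + x2, data = df, family = binomial)
e   <- fitted(fit)                                  # estimated propensity scores ê
w_ate <- ifelse(df$treat == 1, 1 / e, 1 / (1 - e))  # 1/ê exposed, 1/(1 - ê) unexposed
w_att <- ifelse(df$treat == 1, 1, e / (1 - e))      # 1 exposed, ê/(1 - ê) unexposed
```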
A potential drawback of the inverse probability weighting approach is that extreme weights (i.e.,
extremely small or extremely large propensity scores) may lead to bias and large variance in the
effect estimate [15, 26]. Extreme weights may signal that the propensity score model is severely
misspecified [24, 53], or that the exposed and unexposed groups lack overlap in the observed
covariates (i.e., the two groups are characteristically very different) [54].
To address the problems from extreme weights, a common solution is weight trimming. Weight
trimming may refer to the exclusion of individuals with weights falling outside a pre-specified range
[55, 56], or the reduction of weights larger than some pre-specified maximum to that maximum [57].
Reducing large weights to some maximum value has been shown to help with reducing bias and
variance when the weights are estimated using logistic regression [53]. However, there is currently
little guidance on the optimal way to trim weights in the context of extreme propensity scores.
It has been suggested that more attention should instead be paid to the specification of the
propensity score model itself; there is some evidence that machine learning based estimation
approaches are less likely than logistic regression to produce extreme weights, and thus less
likely to require trimming [53].
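For the second form of trimming described above, a minimal sketch follows; the 99th percentile cutoff is an arbitrary illustration, not a recommendation.

```r
# Reduce weights larger than a pre-specified maximum to that maximum,
# continuing the weighting sketch above.
w_max     <- quantile(w_ate, 0.99)  # hypothetical choice of maximum
w_trimmed <- pmin(w_ate, w_max)
```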
Overlap weighting is a more recently developed method to address some of the issues of inverse
probability weighting [54, 58]. The overlap weight is defined as the estimated probability of being
in the exposed group (êi ) for unexposed individuals, and being in the unexposed group (1 − êi )
for exposed individuals. Details on overlap weighting are discussed in Chapter 14. In brief, the
estimand of overlap weighting estimators is the average treatment effect in the overlap population
(i.e., the population that has both exposed and unexposed individuals), and overlap weights have been
shown to be advantageous in the presence of extreme propensity scores compared to other weighting
methods [54, 58]. If propensity score model misspecification (which may fail to achieve adequate
balance and lead to biased results) is of great concern, an alternative approach is to directly solve for
weights that target the balance of the sample at hand [59]. Details on such balancing approach to
weighting are discussed in Chapter 16.
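Continuing the weighting sketch above, overlap weights take one line; because they are bounded between 0 and 1, extreme propensity scores cannot produce extreme weights.

```r
# Overlap weights: 1 - ê for the exposed, ê for the unexposed.
w_overlap <- ifelse(df$treat == 1, 1 - e, e)
```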
1.3.3 Stratification
Stratification refers to dividing the sample into mutually exclusive strata of individuals who are
similar, usually based on quantiles of the propensity scores. The distribution of the observed covariates
is thus expected to be roughly identical between the exposure groups within each stratum. The number
of strata should be chosen such that the strata are small enough to have adequate balance, but large
enough to get stable within-stratum treatment effect estimates [26]. Five has been the most common
number of strata in applications [60, 61], but more work on stratification approaches is needed; recent
studies on the optimal number of strata can be found in, e.g., Neuhäuser et al. [62] and Orihara and
Hamada [63].
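A minimal sketch of stratification with the conventional five strata, continuing the weighting example above (the outcome column y is hypothetical):

```r
# Divide the sample into quintiles of the estimated propensity score ê and
# average the within-stratum differences in mean outcomes.
strata <- cut(e, breaks = quantile(e, probs = seq(0, 1, by = 0.2)),
              include.lowest = TRUE, labels = FALSE)
within_stratum <- sapply(1:5, function(s) {
  d <- df[strata == s, ]
  mean(d$y[d$treat == 1]) - mean(d$y[d$treat == 0])
})
mean(within_stratum)  # equal stratum weights; quintiles give equal-sized strata
```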
A method that overcomes the difficulty in choosing the number of strata for stratification, and
which generally provides better confounder adjustment, is full matching [64]. Unlike most matching
methods, which may discard many unmatched individuals, full matching forms subsets that contain
at least one treated and at least one control using all individuals in the data [65, 66]. Full matching
also differs from stratification in that it forms subsets in an optimal way, such that the number of
strata and the stratification of individuals are chosen to minimize the average within-stratum
distance between exposure groups. Interestingly, full matching can be thought of as an approach
somewhat in between matching, weighting, and stratification, in that it is implemented by creating
many (small) strata, it is called “matching,” and effect estimation following full matching often
involves a weighted analysis, using weights determined by the full matching strata. More discussions
on full matching are presented in Chapter 5.
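As a hedged sketch, full matching is also available through the MatchIt package (which calls the optmatch package); estimation after full matching then typically uses the weights implied by the matched strata.

```r
# Optimal full matching; match.data() returns stratum-derived weights.
m_full <- matchit(treat ~ x1 + x2, data = df, method = "full", distance = "glm")
fm <- match.data(m_full)
# e.g., a weighted outcome analysis: lm(y ~ treat, data = fm, weights = weights)
```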
burnout prevalence among all employees (including both responders and non-responders), the survey
responses were weighted by the inverse of this stratum-specific response rate.
more. According to Rosenbaum [83], a well-designed study is one that “does more to address the
inevitable critical discussion of possible bias from unmeasured covariates” (p. 145).
Sensitivity analysis assesses the robustness of effect estimates to unobserved confounders. Various
techniques for sensitivity analysis have been developed over the years (e.g., [84–92]). Sensitivity
analysis considers how strong the associations between a hypothetical unobserved confounder and the
exposure and outcome would need to be to change the study conclusion. These techniques typically
require a number of assumptions (e.g., that the unobserved confounder is binary [84]) or user-specified
parameters such as sensitivity parameters (the associations between an unobserved confounder and
the exposure and outcome) and the prevalence of the unobserved confounder in the exposed and
unexposed groups (e.g., [85, 86, 90]). Newer approaches have been developed to accommodate more
general scenarios and require fewer parameter specifications. For example, the E-value is a measure
that represents the minimum strength of association between an unobserved confounder and the
exposure and outcome for the unobserved confounder to explain away the observed effect [93]; it
makes no assumptions regarding the type, distribution, or number of unobserved confounders and
can be computed with a variety of outcome types [94]. Another recent approach is one that seeks to
find the minimum proportion of confounded individuals such that the results indicate a null average
treatment effect, which has the benefit of capturing unmeasured confounding heterogeneity [92]. See
Chapter 25 for more discussion of these and related methods.
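For an estimate on the risk ratio scale, the E-value has a closed form [93]; a small sketch follows (the R package described in [94] implements this and related quantities).

```r
# E-value for an observed risk ratio RR: the minimum strength of association,
# on the risk ratio scale, that an unobserved confounder would need with both
# exposure and outcome to explain away the estimate [93].
e_value <- function(rr) {
  if (rr < 1) rr <- 1 / rr     # protective estimates are inverted first
  rr + sqrt(rr * (rr - 1))
}
e_value(2)   # an observed RR of 2 yields an E-value of about 3.41
```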
1.6 Conclusion
Although much of the focus in the existing literature on non-experimental studies has been on confounding, the
validity of results from non-experimental studies hinges on many factors including the specification
of the causal question, design of the study, analysis and modeling choices, and data quality/usefulness
[95].
There are also a number of complications that arise in practice, such as missing data, multilevel
structures, and measurement challenges. Missing values can occur in the exposure, outcome, and
covariates. Although missing data is common in randomized studies as well, non-experimental
studies tend to be more vulnerable to missing data in that measurements of covariates are necessary
for confounding adjustments [96]. Causal diagrams can be extended to encode assumptions about
the missingness mechanisms (e.g., missing at random, missing not at random [97–99]) and studies
have shown the importance of choosing the appropriate method for dealing with missing data (e.g.,
complete case analysis or multiple imputation) based on different missingness mechanisms [96, 100].
In non-experimental studies with a multilevel structure, individuals are nested within one or more
clusters (e.g., patients within hospitals, students within classrooms) and covariates are often measured
at both the individual and cluster levels (see Chapter 10). The exposure may be assigned at either the
cluster or individual level, which would have different implications for adjustment approaches as
well as the underlying assumptions. Chapter 10 provides an introduction to matching methods in
multilevel settings, with a special focus on the effects of cluster-level exposures on individual-level
outcomes. Some studies have extended adjustment methods using propensity scores, including
matching [101, 102] and weighting [103], to multilevel settings with individual-level exposures.
Measurement can also be a particular challenge in the social and behavioral sciences, and
propensity score methods need to be adjusted for use when some of the covariates are latent variables,
measured through a set of observed items [104, 105].
The validity or plausibility of the key assumptions and the need to think about these assumptions
may differ across fields. For example, in non-experimental studies in fields such as psychology,
economics, and education, participants often self-select into the exposure condition (e.g., a job
training program), whereas in medical contexts, the treatment decision process is often based on
a set of factors (e.g., lab tests, patient age) in compliance with certain medical guidelines. In the
latter case, investigators may have more information regarding the assignment mechanism and thus
the “no unmeasured confounding” assumption may be more likely to hold compared to the former
case. The availability of data may also differ across fields. For example, when using administrative
data in educational settings to make causal inferences, covariate adjustment is often limited by the
variables that are available in the data. Researchers should also use their substantive knowledge
about the setting to assess the validity of the underlying assumptions. In short, different choices and
assumptions may need to be made depending on the scientific context and the data that is typically
available in that context, and it is crucial to assess the underlying assumptions within the context of
each specific study.
References
[1] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66(5):688–701, 1974.
[2] Paul W Holland. Statistics and causal inference. Journal of the American Statistical Associa-
tion, 81(396):945–960, 1986.
[3] Donald B Rubin. Randomization analysis of experimental data: The Fisher randomization test
comment. Journal of the American Statistical Association, 75(371):591–593, 1980.
[4] Sander Greenland and James M Robins. Identifiability, exchangeability, and epidemiological
confounding. International Journal of Epidemiology, 15(3):413–419, 1986.
[5] Tyler J VanderWeele. Confounding and effect modification: Distribution and measure. Epi-
demiologic Methods, 1(1):55–82, 2012.
[6] Tyler J VanderWeele and Ilya Shpitser. On the definition of a confounder. Annals of Statistics,
41(1):196–220, 2013.
[7] Judea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
[8] Sander Greenland, Judea Pearl, and James M Robins. Causal diagrams for epidemiologic
research. Epidemiology, 10(1):37–48, 1999.
[9] Noah Greifer and Elizabeth A Stuart. Matching methods for confounder adjustment: An
addition to the epidemiologist’s toolbox. Epidemiologic Reviews, 43(1):118–129, 2021.
[10] Donald B Rubin. The design versus the analysis of observational studies for causal effects:
Parallels with the design of randomized trials. Statistics in Medicine, 26(1):20–36, 2007.
[11] Donald B Rubin. For objective causal inference, design trumps analysis. The Annals of
Applied Statistics, 2(3):808–840, 2008.
[12] Paul R Rosenbaum. Choice as an alternative to control in observational studies. Statistical
Science, 14(3):259–304, 1999.
[13] Paul R Rosenbaum. Observational Studies. Springer, New York, NY, 2002.
[14] Paul R. Rosenbaum. Design of Observational Studies. Springer-Verlag, New York, 2010.
[15] Donald B Rubin. Using propensity scores to help design observational studies: Application to
the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3):169–188,
2001.
[16] Richard J Murnane and John B Willett. Methods Matter: Improving Causal Inference in
Educational and Social Science Research. New York, NY, Oxford University Press, 2010.
[17] Daniel E Ho, Kosuke Imai, Gary King, and Elizabeth A Stuart. Matching as nonparametric
preprocessing for reducing model dependence in parametric causal inference. Political
Analysis, 15(3):199–236, 2007.
[18] Kosuke Imai, Gary King, and Elizabeth A Stuart. Misunderstandings between experimentalists
and observationalists about causal inference. Journal of the Royal Statistical Society: Series A
(Statistics in Society), 171(2):481–502, 2008.
[19] William G Cochran and S Paul Chambers. The planning of observational studies of human
populations. Journal of the Royal Statistical Society. Series A (General), 128(2):234–266,
1965.
[20] Miguel A Hernán and James M Robins. Using big data to emulate a target trial when a
randomized trial is not available. American Journal of Epidemiology, 183(8):758–764, 2016.
[21] Donald B Rubin. The use of matched sampling and regression adjustment to remove bias in
observational studies. Biometrics, 29(1):185–203, 1973.
[22] Donald B Rubin and Neal Thomas. Combining propensity score matching with additional
adjustments for prognostic covariates. Journal of the American Statistical Association,
95(450):573–585, 2000.
[23] James M Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression
models with missing data. Journal of the American Statistical Association, 90(429):122–129,
1995.
[24] Joseph DY Kang and Joseph L Schafer. Demystifying double robustness: A comparison
of alternative strategies for estimating a population mean from incomplete data. Statistical
Science, 22(4):523–539, 2007.
[25] Michele Jonsson Funk, Daniel Westreich, Chris Wiesen, Til Stürmer, M Alan Brookhart,
and Marie Davidian. Doubly robust estimation of causal effects. American Journal of
Epidemiology, 173(7):761–767, 2011.
[26] Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward.
Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 25(1):1–21,
2010.
[27] Stefano M Iacus, Gary King, and Giuseppe Porro. Causal inference without balance checking:
Coarsened exact matching. Political Analysis, 20(1):1–24, 2012.
[28] Paul R Rosenbaum, Richard N Ross, and Jeffrey H Silber. Minimum distance matched
sampling with fine balance in an observational study of treatment for ovarian cancer. Journal
of the American Statistical Association, 102(477):75–83, 2007.
[29] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observa-
tional studies for causal effects. Biometrika, 70(1):41–55, 1983.
[30] Daniel F McCaffrey, Greg Ridgeway, and Andrew R Morral. Propensity score estimation
with boosted regression for evaluating causal effects in observational studies. Psychological
Methods, 9(4):403–425, 2004.
[31] Brian K Lee, Justin Lessler, and Elizabeth A Stuart. Improving propensity score weighting
using machine learning. Statistics in Medicine, 29(3):337–346, 2010.
[32] Soko Setoguchi, Sebastian Schneeweiss, M Alan Brookhart, Robert J Glynn, and E Francis
Cook. Evaluating uses of data mining techniques in propensity score estimation: A simulation
study. Pharmacoepidemiology and Drug Safety, 17(6):546–555, 2008.
[33] Romain Pirracchio, Maya L Petersen, and Mark Van Der Laan. Improving propensity score
estimators’ robustness to model misspecification using super learner. American Journal of
Epidemiology, 181(2):108–119, 2015.
[34] Xing Sam Gu and Paul R Rosenbaum. Comparison of multivariate matching methods:
Structures, distances, and algorithms. Journal of Computational and Graphical Statistics,
2(4):405–420, 1993.
[35] Rishi J Desai and Jessica M Franklin. Alternative approaches for confounding adjustment in
observational studies using weighting based on the propensity score: A primer for practitioners.
BMJ, 367, 2019.
[36] Wei Pan and Haiyan Bai. Propensity score methods for causal inference: An overview.
Behaviormetrika, 45(2):317–334, 2018.
[37] Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to
produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
[38] Donald B Rubin. Matching to remove bias in observational studies. Biometrics, 29(1):
159–183, 1973.
[39] Noah Greifer. MatchIt: Getting started. http://cran.r-project.org/web/packages/MatchIt/vignettes/MatchIt.html. Accessed: 2022-03-27.
[40] Daniel Ho, Kosuke Imai, Gary King, and Elizabeth A Stuart. MatchIt: Nonparametric prepro-
cessing for parametric causal inference. Journal of Statistical Software, 42(8):1–28, 2011.
[41] Peter C Austin. Statistical criteria for selecting the optimal number of untreated subjects
matched to each treated subject when using many-to-one matching on the propensity score.
American Journal of Epidemiology, 172(9):1092–1097, 2010.
[42] Jeremy A Rassen, Abhi A Shelat, Jessica Myers, Robert J Glynn, Kenneth J Rothman, and
Sebastian Schneeweiss. One-to-many propensity score matching in cohort studies. Pharma-
coepidemiology and Drug Safety, 21:69–80, 2012.
[43] Kewei Ming and Paul R Rosenbaum. Substantial gains in bias reduction from matching with
a variable number of controls. Biometrics, 56(1):118–124, 2000.
[44] Peter C Austin. Optimal caliper widths for propensity-score matching when estimating
differences in means and differences in proportions in observational studies. Pharmaceutical
Statistics, 10(2):150–161, 2011.
[45] Peter C Austin. A comparison of 12 algorithms for matching on the propensity score. Statistics
in Medicine, 33(6):1057–1069, 2014.
[46] Paul R Rosenbaum. Optimal matching for observational studies. Journal of the American
Statistical Association, 84(408):1024–1032, 1989.
[47] Peter C Austin and Guy Cafri. Variance estimation when using propensity-score matching
with replacement with survival or time-to-event outcomes. Statistics in Medicine, 39(11):1623–
1640, 2020.
[48] José R Zubizarreta, Ricardo D Paredes, and Paul R Rosenbaum. Matching for balance, pairing
for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit
high schools in Chile. The Annals of Applied Statistics, 8(1):204–231, 2014.
[49] Giancarlo Visconti and José R Zubizarreta. Handling limited overlap in observational studies
with cardinality matching. Observational Studies, 4(1):217–249, 2018.
[50] Wendy K Tam Cho. An evolutionary algorithm for subset selection in causal inference models.
Journal of the Operational Research Society, 69(4):630–644, 2018.
[51] Dhruv Sharma, Christopher Willy, and John Bischoff. Optimal subset selection for causal
inference using machine learning ensembles and particle swarm optimization. Complex &
Intelligent Systems, 7(1):41–59, 2021.
[52] Keisuke Hirano, Guido W Imbens, and Geert Ridder. Efficient estimation of average treatment
effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
[53] Brian K Lee, Justin Lessler, and Elizabeth A Stuart. Weight trimming and propensity score
weighting. PloS One, 6(3):e18174, 2011.
[54] Fan Li, Laine E Thomas, and Fan Li. Addressing extreme propensity scores via the overlap
weights. American Journal of Epidemiology, 188(1):250–257, 2019.
[55] Richard K Crump, V Joseph Hotz, Guido W Imbens, and Oscar A Mitnik. Dealing with
limited overlap in estimation of average treatment effects. Biometrika, 96(1):187–199, 2009.
[56] Til Stürmer, Kenneth J Rothman, Jerry Avorn, and Robert J Glynn. Treatment effects in the
presence of unmeasured confounding: Dealing with observations in the tails of the propensity
score distribution—a simulation study. American Journal of Epidemiology, 172(7):843–854,
2010.
[57] Michael R Elliott. Model averaging methods for weight trimming in generalized linear
regression models. Journal of Official Statistics, 25(1):1–20, 2009.
[58] Fan Li, Kari Lock Morgan, and Alan M Zaslavsky. Balancing covariates via propensity score
weighting. Journal of the American Statistical Association, 113(521):390–400, 2018.
[59] Eli Ben-Michael, Avi Feller, David A. Hirshberg, and José R. Zubizarreta. The balancing act
in causal inference. arXiv preprint, 2021.
[60] Paul R Rosenbaum and Donald B Rubin. Reducing bias in observational studies using
subclassification on the propensity score. Journal of the American Statistical Association,
79(387):516–524, 1984.
[61] Jared K Lunceford and Marie Davidian. Stratification and weighting via the propensity
score in estimation of causal treatment effects: A comparative study. Statistics in Medicine,
23(19):2937–2960, 2004.
[62] Markus Neuhäuser, Matthias Thielmann, and Graeme D Ruxton. The number of strata in
propensity score stratification for a binary outcome. Archives of Medical Science: AMS,
14(3):695–700, 2018.
[63] Shunichiro Orihara and Etsuo Hamada. Determination of the optimal number of strata for
propensity score subclassification. Statistics & Probability Letters, 168:108951, 2021.
[64] Elizabeth A Stuart and Kerry M Green. Using full matching to estimate causal effects in
nonexperimental studies: Examining the relationship between adolescent marijuana use and
adult outcomes. Developmental Psychology, 44(2):395–406, 2008.
[65] Paul R Rosenbaum. A characterization of optimal designs for observational studies. Journal
of the Royal Statistical Society: Series B (Methodological), 53(3):597–610, 1991.
[66] Ben B Hansen. Full matching in an observational study of coaching for the SAT. Journal of the
American Statistical Association, 99(467):609–618, 2004.
[67] Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal
inference models. Biometrics, 61(4):962–973, 2005.
[68] Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computa-
tional and Graphical Statistics, 20(1):217–240, 2011.
[69] Sharon-Lise T Normand, Mary Beth Landrum, Edward Guadagnoli, John Z Ayanian, Thomas J
Ryan, Paul D Cleary, and Barbara J McNeil. Validating recommendations for coronary
angiography following acute myocardial infarction in the elderly: A matched analysis using
propensity scores. Journal of Clinical Epidemiology, 54(4):387–398, 2001.
[70] Muhammad Mamdani, Kathy Sykora, Ping Li, Sharon-Lise T Normand, David L Streiner,
Peter C Austin, Paula A Rochon, and Geoffrey M Anderson. Reader’s guide to critical
appraisal of cohort studies: 2. assessing potential for confounding. BMJ, 330(7497):960–962,
2005.
[71] Peter C Austin. Balance diagnostics for comparing the distribution of baseline covariates
between treatment groups in propensity-score matched samples. Statistics in Medicine,
28(25):3083–3107, 2009.
[72] Joe Amoah, Elizabeth A Stuart, Sara E Cosgrove, Anthony D Harris, Jennifer H Han, Ebbing
Lautenbach, and Pranita D Tamma. Comparing propensity score methods versus traditional
regression analysis for the evaluation of observational data: A case study evaluating the
treatment of gram-negative bloodstream infections. Clinical Infectious Diseases, 71(9):e497–
e505, 2020.
[73] Roderick JA Little. Survey nonresponse adjustments for estimates of means. International
Statistical Review/Revue Internationale de Statistique, 54(2):139–157, 1986.
[74] Robert M. Groves, Don A. Dillman, John L. Eltinge, and Roderick J. A. Little. Survey
Nonresponse. Hoboken, NJ: Wiley, 2002.
[75] Joseph A Simonetti, Walter L Clinton, Leslie Taylor, Alaina Mori, Stephan D Fihn, Christian D
Helfrich, and Karin Nelson. The impact of survey nonresponse on estimates of healthcare
employee burnout. Healthcare, 8(3):100451, 2020.
[76] Stephen R Cole and Elizabeth A Stuart. Generalizing evidence from randomized clinical trials
to target populations: The ACTG 320 trial. American Journal of Epidemiology, 172(1):107–115,
2010.
[77] Elizabeth A Stuart, Stephen R Cole, Catherine P Bradshaw, and Philip J Leaf. The use of
propensity scores to assess the generalizability of results from randomized trials. Journal of
the Royal Statistical Society: Series A (Statistics in Society), 174(2):369–386, 2011.
[78] Ryoko Susukida, Rosa M Crum, Cyrus Ebnesajjad, Elizabeth A Stuart, and Ramin Mojtabai.
Generalizability of findings from randomized controlled trials: Application to the National
Institute of Drug Abuse Clinical Trials Network. Addiction, 112(7):1210–1219, 2017.
[79] Tyler VanderWeele. Explanation in Causal Inference: Methods for Mediation and Interaction.
New York, NY: Oxford University Press, 2015.
[80] Booil Jo, Elizabeth A Stuart, David P MacKinnon, and Amiram D Vinokur. The use of
propensity scores in mediation analysis. Multivariate Behavioral Research, 46(3):425–452,
2011.
[81] Paul R Rosenbaum. Design sensitivity in observational studies. Biometrika, 91(1):153–164,
2004.
[82] José R Zubizarreta, Magdalena Cerda, and Paul R Rosenbaum. Effect of the 2010 Chilean
earthquake on posttraumatic stress: Reducing sensitivity to unmeasured bias through study
design. Epidemiology, 24(1):79–87, 2013.
[83] Paul R Rosenbaum. Modern algorithms for matching in observational studies. Annual Review
of Statistics and Its Application, 7:143–176, 2020.
[84] Paul R Rosenbaum and Donald B Rubin. Assessing sensitivity to an unobserved binary
covariate in an observational study with binary outcome. Journal of the Royal Statistical
Society: Series B (Methodological), 45(2):212–218, 1983.
[85] Sander Greenland. Basic methods for sensitivity analysis of biases. International Journal of
Epidemiology, 25(6):1107–1116, 1996.
[86] Danyu Y Lin, Bruce M Psaty, and Richard A Kronmal. Assessing the sensitivity of regression
results to unmeasured confounders in observational studies. Biometrics, 54(3):948–963, 1998.
[87] Joseph L Gastwirth, Abba M Krieger, and Paul R Rosenbaum. Dual and simultaneous
sensitivity analysis for matched pairs. Biometrika, 85(4):907–920, 1998.
[88] David J Harding. Counterfactual models of neighborhood effects: The effect of neighborhood
poverty on dropping out and teenage pregnancy. American Journal of Sociology, 109(3):676–
719, 2003.
[89] Onyebuchi A Arah, Yasutaka Chiba, and Sander Greenland. Bias formulas for external
adjustment and sensitivity analysis of unmeasured confounders. Annals of Epidemiology,
18(8):637–646, 2008.
[90] Tyler J VanderWeele and Onyebuchi A Arah. Bias formulas for sensitivity analysis of
unmeasured confounding for general outcomes, treatments, and confounders. Epidemiology,
22(1):42–52, 2011.
[91] Liangyuan Hu, Jungang Zou, Chenyang Gu, Jiayi Ji, Michael Lopez, and Minal Kale. A
flexible sensitivity analysis approach for unmeasured confounding with multiple treatments
and a binary outcome with application to SEER-Medicare lung cancer data. arXiv preprint
arXiv:2012.06093, 2020.
[92] Matteo Bonvini and Edward H Kennedy. Sensitivity analysis via the proportion of unmeasured
confounding. Journal of the American Statistical Association, 117(539):1540–1550, 2022.
[93] Tyler J VanderWeele and Peng Ding. Sensitivity analysis in observational research: Introducing
the E-value. Annals of Internal Medicine, 167(4):268–274, 2017.
[94] Maya B Mathur, Peng Ding, Corinne A Riddell, and Tyler J VanderWeele. Website and R
package for computing E-values. Epidemiology, 29(5):e45–e47, 2018.
[95] Steven N Goodman, Sebastian Schneeweiss, and Michael Baiocchi. Using design thinking to
differentiate useful from misleading evidence in observational research. JAMA, 317(7):705–
707, 2017.
[96] Neil J Perkins, Stephen R Cole, Ofer Harel, Eric J Tchetgen Tchetgen, BaoLuo Sun, Emily M
Mitchell, and Enrique F Schisterman. Principled approaches to missing data in epidemiologic
studies. American Journal of Epidemiology, 187(3):568–575, 2018.
[97] Rhian M Daniel, Michael G Kenward, Simon N Cousens, and Bianca L De Stavola. Using
causal diagrams to guide analysis in missing data problems. Statistical Methods in Medical
Research, 21(3):243–256, 2012.
[98] Karthika Mohan, Judea Pearl, and Jin Tian. Graphical models for inference with missing
data. In Proceedings of the 26th International Conference on Neural Information Processing
Systems, pages 1277–1285, 2013.
[99] Felix Thoemmes and Karthika Mohan. Graphical representation of missing data problems.
Structural Equation Modeling: A Multidisciplinary Journal, 22(4):631–642, 2015.
[100] Rachael A Hughes, Jon Heron, Jonathan AC Sterne, and Kate Tilling. Accounting for missing
data in statistical analyses: Multiple imputation is not always the answer. International Journal
of Epidemiology, 48(4):1294–1304, 2019.
[101] Bruno Arpino and Fabrizia Mealli. The specification of the propensity score in multilevel
observational studies. Computational Statistics & Data Analysis, 55(4):1770–1780, 2011.
[102] Bruno Arpino and Massimo Cannas. Propensity score matching with clustered data: An
application to the estimation of the impact of caesarean section on the Apgar score. Statistics
in Medicine, 35(12):2074–2091, 2016.
[103] Fan Li, Alan M Zaslavsky, and Mary Beth Landrum. Propensity score weighting with
multilevel data. Statistics in Medicine, 32(19):3373–3387, 2013.
[104] Hwanhee Hong, David A Aaby, Juned Siddique, and Elizabeth A Stuart. Propensity score–
based estimators with multiple error-prone covariates. American Journal of Epidemiology,
188(1):222–230, 2019.
[105] Trang Quynh Nguyen and Elizabeth A Stuart. Propensity score analysis with latent covariates:
Measurement error bias correction using the covariate’s posterior mean, aka the inclusive
factor score. Journal of Educational and Behavioral Statistics, 45(5):598–636, 2020.
2
Propensity Score
Paul R. Rosenbaum
CONTENTS
2.1 Goal of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Example: nursing and surgical mortality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.2 Adjustments for covariates in observational studies . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.3 Dawid’s notation for conditional independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.4 The effects caused by treatments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.5 Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 The Propensity Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Definition of the propensity score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Balancing properties of the propensity score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Estimating treatment effects using propensity scores . . . . . . . . . . . . . . . . . . . . . . 25
2.3.4 Unmeasured covariates and the propensity score . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.5 Kullback-Leibler information and propensity scores . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Aspects of Estimated Propensity Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Matching and stratification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.2 Conditional inference given a sufficient statistic . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.3 Inverse probability weighting and post-stratification . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Limitations of Propensity Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Nursing and Surgical Mortality, Concluded . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
DOI: 10.1201/9781003102670-2
2.2 Preliminaries
2.2.1 Example: nursing and surgical mortality
The following recent observational study will be mentioned at various points to make general concepts
tangible. In this study a propensity score was used in conjunction with other matching techniques.
For instance, the match also used a covariate distance, a network optimization algorithm, and an
externally estimated risk-of-death or prognostic score; see [5, 14, 15].
Is mortality following general surgery lower at hospitals that have superior nursing environments?
Silber et al. [16] compared mortality in 35 hospitals with superior nursing and 293 hospitals with
inferior nursing. They took the patients undergoing general surgery at the 35 superior hospitals
and matched each one with a different patient at one of the 293 inferior hospitals, forming 25,076
matched pairs. The matching paired exactly for 130 surgical procedures (i.e., 4-digit ICD-9 codes),
and it balanced a total of 172 covariates, including the patient’s other recorded health problems, such
as congestive heart failure or diabetes, plus age, sex, emergency admission or not, and so on. The
data were from the US Medicare program.
By definition, a hospital had superior nursing if it was so designated by the Magnet Nursing
Services Recognition Program and also had a nurse-to-bed ratio of at least 1; see Aiken et al. [17]. A
hospital had inferior nursing if it had neither of these attributes. For brevity, surgery at a hospital
with superior nursing is called “treated,” while surgery at a hospital with inferior nursing is called
“control.” Mortality refers to death within 30 days of surgery.
The question being asked is: Would your risk of death be lower if your operation were performed
at one of the 35 superior hospitals, rather than one of the 293 inferior hospitals? It is the effect of
going to one existing hospital in lieu of going to another existing hospital. The question is not about
the effect of rebuilding hospitals to have attributes that they do not currently have. An important
subquestion of the main question is whether patients are being channeled to the appropriate hospitals,
in particular, whether high-risk patients are being channeled to the most capable hospitals, and
whether that matters for surgical mortality.
Hospitals with superior nursing disproportionately have other desirable attributes as well, at-
tributes that are not part of the explicit definition of a superior hospital. Nonetheless, these are
genuine features of a superior hospital, and for the question under study these features should not
be removed by adjustments. Superior hospitals tended to have a larger proportion of nurses with
advanced degrees, and a larger proportion of registered nurses. Superior hospitals were more often
major teaching hospitals (21.5% for the superior group versus 5.7% for the inferior group). Superior
hospitals were somewhat larger on average (595 beds versus 430 beds).
Blocking (or matching) and adjustments in the analysis are frequently used. There is, how-
ever, an important difference between the demands made on blocking or adjustments in
controlled experiments and in observational studies. In controlled experiments, the skillful
use of randomization protects against most types of bias arising from disturbing variables.
Consequently, the function of blocking or adjustment is to increase precision rather than
to guard against bias. In observational studies, in which no random assignment of subjects
to comparison groups is possible, blocking and adjustment take on the additional role of
protecting against bias. Indeed, this is often their primary role.
A precursor of propensity scores was proposed by Cochran and Rubin [19, §6.3]. They observed
that if covariates had a multivariate Normal distribution in treated and control groups, with different
expectations but the same covariance matrix, then it was possible to linearly transform the covariates
so that all of the bias is captured by the first coordinate – the linear discriminant – and the remaining
coordinates have the same distribution in treated and control groups. In this multivariate Normal
case, Cochran’s “primary role” for blocking and matching, namely, bias reduction, involves a unidi-
mensional random variable, while the secondary role of blocking and matching, namely, increasing
precision, involves all of the coordinates. In a sense, the basic results about propensity scores say that
these properties are true in general and require none of the following: Normal distributions, linear
transformations, equal covariance matrices. In the Normal case with equal covariance matrices, the
linear discriminant reflects the likelihood ratio – the ratio of the density of the covariate in treated
and control groups – and in all cases, by Bayes theorem, this likelihood ratio is a strictly monotone
function of the propensity score.
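To spell out that last step, here is a short derivation (a sketch; π = Pr(Z = 1) and f_z denotes the density of x in group Z = z, notation introduced only for this display). By Bayes' theorem,
\[
\lambda(x) \;=\; \Pr(Z=1 \mid x) \;=\; \frac{\pi f_1(x)}{\pi f_1(x) + (1-\pi) f_0(x)} \;=\; \left\{ 1 + \frac{1-\pi}{\pi}\,\frac{1}{\mathrm{LR}(x)} \right\}^{-1}, \qquad \mathrm{LR}(x) = \frac{f_1(x)}{f_0(x)},
\]
so λ(x) is a strictly increasing function of the likelihood ratio LR(x). In the Normal case with equal covariance matrices, log LR(x) = (μ1 − μ0)ᵀΣ⁻¹x + constant, which is the linear discriminant.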
In general, that difference estimates something else, not the effect caused by the treatment. To see this, note that the expected response of a treated individual is E(R | Z = 1) = E(rT | Z = 1), because a treated individual has Z = 1 and observed response R = rT if Z = 1. In parallel, the expected response of a control is E(R | Z = 0) = E(rC | Z = 0). In general, the expected treatment effect is not the difference in expectations in the treated and control groups; that is, in general, E(rT − rC) ≠ E(rT | Z = 1) − E(rC | Z = 0). The two quantities are equal if treatments are assigned by flipping a fair coin; then Z ⊥⊥ (rT, rC), precisely because the coin is fair and the probability of a head does not change with (rT, rC). Because Z ⊥⊥ (rT, rC) in a completely randomized experiment, E(rT | Z = 1) = E(rT) and E(rC | Z = 0) = E(rC), so we can estimate the expected treatment effect by comparing the mean outcomes in treated and control groups; that is, in this case, E(rT | Z = 1) − E(rC | Z = 0) = E(rT) − E(rC) = E(rT − rC). This is one of the simplest ways of saying that randomization plays a critical role in causal inference in randomized experiments. If people picked their own treatments, deciding Z for themselves, then there is no reason to expect Z ⊥⊥ (rT, rC). In §2.2.1, if high-risk patients were channeled to more capable hospitals, then there is no reason to expect Z ⊥⊥ (rT, rC).
2.2.5 Covariates
A covariate describes a person prior to treatment assignment, so that, unlike an outcome, there is only
one version of a covariate. Changing the treatment to which a person is assigned does not change
the value of a covariate.
Each individual is described by an observed covariate x. Central to observational studies is the
concern that people who appear comparable in terms of the observed x may not be comparable in
terms of a covariate u that was not observed. In §2.2.1, matching balanced an x containing 172
covariates, but inevitably critical discussion focused on the 173rd covariate, the one that was not
measured.
Matching on the propensity score, λ = λ(x) = Pr(Z = 1 | x), tends to balance the 172-dimensional covariate x. Two individuals with the same λ, one treated with Z = 1, the other a control with Z = 0, may have different values of x, say x∗ and x∗∗ with λ(x∗) = λ(x∗∗), but it is just as likely that the treated individual has x∗ and the control has x∗∗ as it is that the treated individual has x∗∗ and the control has x∗; so, over many pairs, x tends to balance. Stated informally, taking care of λ(x) takes care of all of x, but if there are important unmeasured covariates, then taking care of all of x may not be enough.
Proposition 2.1 is from Rosenbaum and Rubin [3], and it states the balancing property.
Proposition 2.1. If λ = λ(x) = Pr(Z = 1 | x), then
Z ⊥⊥ x | λ, (2.1)
and, for any function f(·) of x,
Z ⊥⊥ x | {λ, f(x)}. (2.2)
Remark 2.1. Condition (2.1) can be read in two ways. As is, (2.1) says that the propensity score
λ (x) contains all of the information in the observed covariates x that is useful in predicting treatment
assignment Z. Rewriting (2.1) as x ⊥⊥ Z | λ, it says Pr(x | Z = 1, λ) = Pr(x | Z = 0, λ), so
treated and control individuals with the same λ have the same distribution of x. As is, (2.1) is
agnostic about whether x and λ are random variables; that is, perhaps only Z is random.
Remark 2.2. Condition (2.2) says that if you match on the propensity score plus other aspects of
x, then the balancing property is preserved. In the example in §2.2.1, in asking whether high-risk
patients benefit more from more capable hospitals, it was important for pairs to be close not only on
the propensity score, λ (x), but also on the externally estimated risk-of-death score or prognostic
score [14], which is another function f (x) of x; see [16, Table 5]. Condition (2.2) says x will
balance if you match for both scores.
Proof The proof is simple. It suffices to prove (2.2). Trivially, λ = Pr(Z = 1 | x) = Pr{Z = 1 | x, λ, f(x)} because λ = λ(x) and f(x) are functions of x. Also, Pr{Z = 1 | λ, f(x)} = E[Pr{Z = 1 | x, λ, f(x)} | λ, f(x)] = E{λ | λ, f(x)} = λ, so Pr{Z = 1 | x, λ, f(x)} = Pr{Z = 1 | λ, f(x)}, which is condition (2.2).
After matching for the propensity score, it is standard and recommended practice to check whether
the covariates x are balanced, whether the treated and matched control groups look comparable in
terms of x, that is, to check condition (2.1) or (2.2). This check for balance of x can be viewed
as a diagnostic check of the model for the propensity score λ (x). For the example in §2.2.1, the
check was done in Silber et al. [16, Table 2 and Appendix]. A general approach to checking covariate
balance is discussed by Yu [26].
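As a concrete illustration of this check, here is a minimal sketch (not the authors' code; the function and array names are hypothetical) of the standardized-difference diagnostic, using one common convention (also used in §2.6) in which the denominator is the pooled standard deviation computed before matching:

```python
# Standardized differences for each covariate, before and after matching.
# X_treated, X_control_all, X_control_matched are (n, p) arrays whose
# columns are the same covariates in the same order.
import numpy as np

def standardized_differences(X_treated, X_control_all, X_control_matched):
    # Pooled pre-matching standard deviation, per covariate
    sd_pool = np.sqrt((X_treated.var(axis=0, ddof=1)
                       + X_control_all.var(axis=0, ddof=1)) / 2.0)
    before = (X_treated.mean(axis=0) - X_control_all.mean(axis=0)) / sd_pool
    after = (X_treated.mean(axis=0) - X_control_matched.mean(axis=0)) / sd_pool
    return np.abs(before), np.abs(after)
```

In practice, each covariate's absolute standardized difference is examined before and after matching; values near zero for all coordinates of x indicate the balance that (2.1) or (2.2) predicts.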
Z ⊥⊥ (x, rT, rC) | ζ, (2.3)
Suppose that we sampled a value of ζ, then sampled a treated individual, Z = 1, and a con-
trol individual, Z = 0, with this value of ζ, and we took the treated-minus-control difference
in their responses. That is, we sample ζ from its marginal distribution, then sample rT from
Pr ( R | Z = 1, ζ) = Pr ( rT | Z = 1, ζ) and rC from Pr ( R | Z = 0, ζ) = Pr ( rC | Z = 0, ζ), tak-
ing the treated-minus-control difference of the two responses matched for ζ. Given ζ, the expected
difference in responses is
E(rT | Z = 1, ζ) − E(rC | Z = 0, ζ) = E(rT | ζ) − E(rC | ζ) = E(rT − rC | ζ), (2.6)
using (rT, rC) ⊥⊥ Z | ζ from (2.3). However, ζ was picked at random from its marginal distribution, so the unconditional expectation of the treated-minus-control difference in the matched pair is the expectation of E(rT − rC | ζ) with respect to the marginal distribution of ζ, so it is E(rT − rC), which is the expected treatment effect.
In brief, if we could match for ζ, instead of matching for the propensity score λ, then we would
have an unbiased estimate of the expected effect caused by the treatment, E (rT − rC ). If we could
match for ζ, causal inference without randomization would be straightforward. There is the difficulty,
alas not inconsequential, that we have no access to ζ, and so cannot match for it.
In [3], treatment assignment is said to be (strongly) ignorable given x if
0 < Pr(Z = 1 | x, rT, rC) = Pr(Z = 1 | x) < 1 for every x. (2.7)
Condition (2.7) says two things. First, (2.7) says the principal unobserved covariate and the propensity score are equal, or equivalently that (rT, rC) ⊥⊥ Z | x. Second, (2.7) adds that treated and control individuals, Z = 1 and Z = 0, occur at every x. That is, (2.7) says:
(rT, rC) ⊥⊥ Z | x, with 0 < Pr(Z = 1 | x) < 1 for every x. (2.8)
Again, form (2.8) takes (rT, rC) to be random variables, while form (2.7) is agnostic about whether (rT, rC) are random variables, or fixed quantities as in Fisher's [28] randomization inferences.
If treatment assignment is ignorable given x in the sense that (2.7) holds, then matching for the
propensity score is matching for the principal unobserved covariate, and the above argument then
shows that matching for the propensity score provides an unbiased estimate of the expected treatment
effect, E (rT − rC ). In brief, if treatment assignment is ignorable given x, then it suffices to adjust
for all of x, but it also suffices to adjust for the scalar propensity score λ(x) alone.
There are many minor variations on this theme. If (2.7) holds, matching for x instead of ζ yields
an unbiased estimate of E (rT − rC ). If (2.7) holds, matching for {λ, f (x)} yields an unbiased
estimate of E { rT − rC | λ, f (x)} and of E { rT − rC | f (x)}. In §2.2.1, a treated individual was
sampled, her value of λ was noted, and she was matched to a control with the same value of λ, leading
to an estimate of the average effect of the treatment on the treated group, E ( rT − rC | Z = 1), if
(2.7) holds [6]. If (2.7) holds, the response surface E ( rT | Z = 1, x) of R on x in the treated group
equals E ( rT | x), the response surface E ( rC | Z = 0, x) of R on x in the control group estimates
E ( rC | x), and differencing yields the causal effect E ( rT − rC | x); however, all of this is still true
if x is replaced by {λ, f (x)}.
In brief, in estimating expected causal effects, say E(rT − rC) or E{rT − rC | f(x)}, if it suffices to adjust for x then it suffices to adjust for the propensity score, λ = Pr(Z = 1 | x). In §2.2.1, if it suffices to adjust for 172 covariates in x, then it suffices to adjust for one covariate, λ. Adjustments for all of x, for λ alone, or for λ and f(x), work if treatment assignment is ignorable given x, that is, if (2.7) or (2.8) is true, and they work only by accident otherwise. The problem is that (2.7) or (2.8) may not be true.
Pr(Z = 1 | x, ζ, rT, rC) = Pr(Z = 1 | x, rT, rC) = ζ, (2.10)
Pr(Z = 1 | x, ζ) = E{Pr(Z = 1 | x, ζ, rT, rC) | x, ζ} = E(ζ | x, ζ) = ζ = Pr(Z = 1 | x, ζ, rT, rC). (2.11)
Frangakis and Rubin [27] refer to conditioning upon (rT , rC ) as forming principal strata. In
recognition of this, we might reasonably refer to the u = ζ in Proposition 2.3 as the principal
unobserved covariate. Proposition 2.1 says that the aspects of an observed covariate x that bias
treatment assignment may be summarized in a scalar covariate, namely the propensity score, λ =
Pr ( Z = 1 | x). Proposition 2.3 says that the aspects of unobserved covariates that bias treatment
assignment may be summarized in a scalar, namely, the principal unobserved covariate.
The sensitivity analysis in Rosenbaum [7, 36] refers to the principal unobserved covariate.
The principal unobserved covariate equals u = ζ = Pr ( Z = 1 | x, u) using (2.11) and satisfies
0 ≤ u ≤ 1 because u is a probability. The sensitivity analysis quantifies the influence of u on Z by
a parameter Γ ≥ 1 expressed in terms of the principal unobserved covariate u. Specifically, it says:
any two individuals, i and j, with the same value of the observed covariate, xi = xj , may differ in
their odds of treatment by at most a factor of Γ:
1/Γ ≤ [Pr(Z = 1 | xi, ui) Pr(Z = 0 | xj, uj)] / [Pr(Z = 1 | xj, uj) Pr(Z = 0 | xi, ui)] = ui(1 − uj) / {uj(1 − ui)} ≤ Γ if xi = xj. (2.12)
In [36], the parameter γ = log(Γ) is the coefficient of u in a logit model that predicts treatment Z
from x and u. For any sensitivity analysis expressed in terms of the principal unobserved covariate u
in (2.12), the amplification in Rosenbaum and Silber [37] and Rosenbaum [38, Table 9.1] reexpresses
the sensitivity analysis in terms of other covariates with different properties that have exactly the
same impact on the study’s conclusions.
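As a concrete illustration of how Γ is used, consider McNemar's test for matched pairs with a binary outcome, as in the nursing study. Under the bound (2.12), among discordant pairs the probability that the death falls on a given side is at most Γ/(1 + Γ), which yields an upper bound on the one-sided P-value. The sketch below uses hypothetical counts, not the analysis of [16]:

```python
# Upper bound on the one-sided McNemar P-value under sensitivity parameter
# Gamma, per the bound (2.12). Counts here are hypothetical.
from scipy.stats import binom

def mcnemar_sensitivity_upper_pvalue(n_discordant, n_events_on_tested_side, gamma):
    p_plus = gamma / (1.0 + gamma)  # worst-case per-pair probability under Gamma
    # P(Binomial(n_discordant, p_plus) >= n_events_on_tested_side)
    return binom.sf(n_events_on_tested_side - 1, n_discordant, p_plus)

# Hypothetical example: 1000 discordant pairs, 560 with the death on the
# tested side; how large must Gamma be before the bound exceeds 0.05?
for gamma in (1.0, 1.1, 1.2, 1.3):
    print(gamma, mcnemar_sensitivity_upper_pvalue(1000, 560, gamma))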
As noted by Kullback and Leibler, J (1 : 0; x) ≥ 0 with equality if and only if the distribution of
x is the same in treated and control groups. Using Bayes theorem, J (1 : 0; x) may be rewritten in
terms of the propensity score:
J(1 : 0; x) = ∫ {Pr(x | Z = 1) − Pr(x | Z = 0)} log[λ(x) / (1 − λ(x))] dx,
so J (1 : 0; x) is the difference in the expectations of the log-odds of the propensity score in treated
and control groups. Yu et al. [40] propose estimating J (1 : 0; x) by the difference in sample means
of the estimated log-odds of the propensity score, and they use J (1 : 0; x) as an accounting tool that
keeps track of magnitudes of bias before matching and in various matched samples.
As a measure of covariate imbalance, J (1 : 0; x) has several attractions. It is applicable when
some covariates are continuous, others are nominal, and others are ordinal. If the covariates are
partitioned into two sets, then the total divergence may be written as the sum of a marginal divergence
from the first set and a residual or conditional divergence from the second set. The divergence in all of
x equals the divergence in the propensity score λ (x) alone; that is, J (1 : 0; x) = J (1 : 0; λ). The
divergence reduces to familiar measures when Pr ( x | Z = z) is multivariate Normal with common
covariance matrix for z = 1 and z = 0. These properties are discussed by Yu et al. [40].
Two methods are discussed: an exact test that works when the coordinates of xi are discrete and coarse, and a
large-sample test that applies generally. It is assumed that X “contains” a constant term, meaning
that the I-dimensional vector of 1s is in the space spanned by the columns of X.
For the exact method, what are examples of a discrete, coarse X? The simplest example is M mutually exclusive strata, so that X has M columns, and xim = 1 if individual i is in stratum m, xim = 0 otherwise, with 1 = Σ_{m=1}^{M} xim for each i, say m = 1 for female, m = 2 for male. Instead, the first two columns of X could be a stratification of individuals by gender, followed by four columns that stratify by ethnicity, so 1 = Σ_{m=1}^{2} xim and 1 = Σ_{m=3}^{6} xim for each i. The seventh column of X might be age rounded to the nearest year. For the exact test, the columns of X need not be linearly independent. If C is a finite set, then write |C| for the number of elements of C.
Under the logit model, S = XᵀZ is a sufficient statistic for β, so the conditional distribution of Z given XᵀZ does not depend upon the unknown β. If treatment assignment is ignorable given X, then using both ignorability and sufficiency,
Pr(Z = z | X, rT, rC, β, XᵀZ = s) = Pr(Z = z | X, XᵀZ = s), (2.13)
and Pr(Z = z | X, XᵀZ = s) is constant on the set Cs = {z : Xᵀz = s} of vectors of binary treatment assignments z that give rise to s as the value of the sufficient statistic, so
Pr(Z = z | X, rT, rC, β, XᵀZ = s) = 1/|Cs| for each z ∈ Cs. (2.14)
Under Fisher’s hypothesis H0 of no effect, R = rT = rC is fixed by conditioning in (2.13) and
(2.14), so that t (Z, R) = t (Z, rC ) for any test statistic t (Z, R). The distribution of t (Z, R) =
t (Z, rC ) under H0 is a known, exact permutation distribution,
A rank test may be obtained by replacing rCi in κi = xᵀi β + ω rCi by the rank of rCi, ranking from 1 to I.
Whether rCi or its rank is used, ω equals zero when treatment assignment is ignorable and Hτ is
true. Instead of ranks, when testing Hτ , rCi may be replaced by residuals of R − τ Z = rC when
robustly regressed on xi ; see [45] and [46, §2.3].
The material in this subsection is from [1]. See also [47], [7, §3.4-§3.6] and [45].
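In the simplest case of mutually exclusive strata, the uniform distribution (2.14) on Cs amounts to independently permuting treatment labels within strata, since conditioning on XᵀZ = s fixes the number treated in each stratum. Here is a minimal sketch (simulated data; the function name is hypothetical) of such a conditional permutation test:

```python
# Conditional permutation test when X encodes mutually exclusive strata:
# permuting treatment labels within strata keeps the sufficient statistic
# X^T Z = s fixed, so each draw is uniform on C_s.
import numpy as np

rng = np.random.default_rng(1)

def conditional_permutation_pvalue(Z, R, strata, n_perm=10000):
    observed = R[Z == 1].sum()               # test statistic t(Z, R)
    stats = np.empty(n_perm)
    for b in range(n_perm):
        Zb = Z.copy()
        for s in np.unique(strata):
            idx = np.flatnonzero(strata == s)
            Zb[idx] = rng.permutation(Zb[idx])   # stratum totals stay fixed
        stats[b] = R[Zb == 1].sum()
    return (1 + np.sum(stats >= observed)) / (1 + n_perm)

# Tiny illustration: two strata, a null response
strata = np.repeat([0, 1], 50)
Z = np.concatenate([rng.permutation([1] * 20 + [0] * 30),
                    rng.permutation([1] * 25 + [0] * 25)])
R = rng.normal(size=100)
print(conditional_permutation_pvalue(Z, R, strata))
```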
λi = |{z ∈ Cs : zi = 1}| / |Cs|.
That is, λi is known, even though β is not. Note also that λi ≥ 1/|Cs| > 0 whenever Zi = 1, and similarly λi < 1 whenever Zi = 0, because Z ∈ Cs for s = XᵀZ. If 0 < λi < 1,
E{ Zi rTi/λi − (1 − Zi) rCi/(1 − λi) | X, rT, rC, XᵀZ = s } = rTi − rCi,
so δ̂ is unbiased for δ.
In practice, these calculations work for discrete, coarse covariates, and even then computation of λi can be difficult, so the maximum-likelihood estimate λ̂i of λi is commonly used instead. With this substitution, strict unbiasedness is lost. Here, as before, it is assumed that treatment assignment is ignorable and that λi follows a linear logit model, κ = Xβ; so, both λi and λ̂i are estimators of λi that are functions of the sufficient statistic, XᵀZ = s, and X. In this case, E(Zi | xi) = Pr(Zi = 1 | xi) = Pr(Zi = 1 | rTi, rCi, xi) = λi, so Zi is a poor but unbiased estimate of λi, and λi = E(Zi | X, rT, rC, XᵀZ = s) is the corresponding Rao–Blackwell estimator of λi.
In the simplest case, X encodes precisely M mutually exclusive strata, where M is small, as in the simplest case in §2.4.2, and the estimator δ̂ is called poststratification; see [48]. In this simplest case, the logit model, κ = Xβ, says that the propensity score λi is constant for all individuals in the same stratum, and λi is the sample proportion of treated individuals in that stratum. In this simplest case, if each λi is neither 0 nor 1, then λi is also the maximum-likelihood estimate λ̂i of λi. When the true λi are constant, λ1 = · · · = λI, it is known that the poststratified estimate δ̂, which allows λi = λ̂i to vary among strata, tends to be more efficient than δ∗, because δ̂ corrects for chance imbalances across strata while δ∗ does not; see [48]. In other words, in this simplest case, estimated propensity scores, λi = λ̂i, often outperform the true propensity score λi.
This subsection is based on [2]. The example in that paper uses maximum likelihood to estimate not just the expectations of I⁻¹ Σ_{i=1}^{I} rTi and I⁻¹ Σ_{i=1}^{I} rCi, but also the cumulative distributions of rTi and rCi adjusting for biased selection. The adjustments are checked by applying them to the covariates in X, where there is no effect, so the estimator is known to be trying to estimate an average treatment effect of zero.
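A small simulation illustrates the efficiency claim above. This is a sketch under stated assumptions (equal-sized strata, constant true λ = 0.5, and large stratum effects on the response), comparing the poststratified estimator δ̂, which uses estimated within-stratum propensities, with the estimator δ∗ that weights by the true λ:

```python
# Compare var(delta-star), weighting by the true lambda, with
# var(delta-hat), poststratifying on estimated stratum propensities.
import numpy as np

rng = np.random.default_rng(2)
lam, n_strata, n_per = 0.5, 5, 40
est_post, est_true = [], []
for _ in range(5000):
    strata = np.repeat(np.arange(n_strata), n_per)
    Z = rng.binomial(1, lam, size=strata.size)
    R = 2.0 * strata + 1.0 * Z + rng.normal(size=strata.size)  # true effect = 1
    # delta-star: inverse weighting by the true, constant lambda
    est_true.append(np.mean(Z * R / lam - (1 - Z) * R / (1 - lam)))
    # delta-hat: poststratify, using the sample proportion treated per stratum
    diffs = [R[(strata == s) & (Z == 1)].mean() - R[(strata == s) & (Z == 0)].mean()
             for s in range(n_strata)]
    est_post.append(np.mean(diffs))
print("var(delta-star):", np.var(est_true))
print("var(delta-hat): ", np.var(est_post))
```

Because δ̂ removes chance imbalances in treated counts across strata with very different response levels, its variance is markedly smaller than that of δ∗ in this design.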
Table 2.1 shows covariate means or percentages in matched pairs. The standardized difference
(Std. Dif.) is a traditional measure of covariate imbalance: it is the difference in means after matching
divided by a pooled within-group standard deviation before matching [19]. The pooled standard
deviation is the square root of the equally weighted average of the two within-group variances. Of
course, the Std. Dif. is zero for the 130 Principal Procedure codes, as the difference in percentages for
each code is zero. Additional detail is found in the Appendix to [16].
This example is typical. The propensity score is one of several mutually supporting techniques
used in matching. In addition to matching for the propensity score, other techniques used were exact
matching for principal procedures, an externally estimated prognostic score, and optimization of a
covariate distance.
Matching for the prognostic score played an important role in the analysis: pairs were separated
into five groups of pairs based on the quintile of the estimated probability of death. The treatment
effect was estimated for all pairs, and for pairs in each quintile of risk.
For all 25,076 pairs, the 30-day mortality rate was 4.8% in hospitals with superior nursing and
5.8% in hospitals with inferior nursing. As discussed in §2.2.1, this is an attempt to estimate the
effect of having surgery in one group of existing hospitals rather than another group of existing
hospitals, not the effect of changing the nursing environment in any hospital. The difference in
mortality rates is not plausibly due to chance, with P -value less than 0.001 from McNemar’s test. The
within-pair odds ratio was 0.79 with 95% confidence interval 0.73 to 0.86. An unobserved covariate
that doubled the odds of death and increased by 50% the odds of treatment is insufficient to produce
this association in the absence of a treatment effect, but larger unmeasured biases could explain it;
see [16, eAppendix 10].
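For readers who wish to reproduce this style of paired analysis, here is a minimal sketch; the discordant-pair counts are hypothetical, chosen only to land near the reported odds ratio, and the study's actual counts are in [16]:

```python
# McNemar's exact test and the conditional (within-pair) odds ratio with a
# 95% confidence interval, from discordant-pair counts.
import numpy as np
from scipy.stats import binomtest

b = 950   # hypothetical: pairs where only the superior-nursing patient died
c = 1200  # hypothetical: pairs where only the inferior-nursing patient died
p_value = binomtest(b, b + c, 0.5).pvalue        # McNemar's exact test
or_hat = b / c                                   # conditional odds ratio
se_log = np.sqrt(1 / b + 1 / c)
ci = np.exp(np.log(or_hat) + np.array([-1.96, 1.96]) * se_log)
print(f"OR = {or_hat:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f}), P = {p_value:.2g}")
```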
Dividing the 25,076 pairs into five groups of roughly 5,015 pairs based on the quintile of the prognostic score yields five estimates of treatment effect for lower and higher risk patients; see [16, Table 5]. For the lowest three quintiles, the difference in mortality rates was small and not significantly different from zero, although the point estimate was negative, with slightly lower mortality rates in hospitals with superior nursing. The two highest risk quintiles had lower rates of
death in hospitals with superior nursing. In the highest risk quintile, the mortality rates were 17.3% at
hospitals with superior nursing and 19.9% at hospitals with inferior nursing, a difference of −2.6%,
with P -value less than 0.001 from McNemar’s test. This analysis raises the possibility that mortality
rates in the population might be lower if high-risk patients were directed to have surgery at hospitals
with superior nursing.
2.7 Summary
This chapter has reviewed, with an example, the basic properties of propensity scores, as developed in the 1980s in joint work with Donald B. Rubin [1–6]. The discussion has focused on the role of propensity scores in matching, stratification, inverse probability weighting, and conditional permutation inference. Section 2.5 discussed limitations of propensity scores, and various methods for addressing them.
References
[1] Paul R Rosenbaum. Conditional permutation tests and the propensity score in observational
studies. Journal of the American Statistical Association, 79(387):565–574, 1984.
[2] Paul R Rosenbaum. Model-based direct adjustment. Journal of the American Statistical
Association, 82(398):387–394, 1987.
[3] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observa-
tional studies for causal effects. Biometrika, 70(1):41–55, 1983.
[4] Paul R Rosenbaum and Donald B Rubin. Reducing bias in observational studies using
subclassification on the propensity score. Journal of the American Statistical Association,
79(387):516–524, 1984.
[5] Paul R Rosenbaum and Donald B Rubin. Constructing a control group using multivariate
matched sampling methods that incorporate the propensity score. The American Statistician,
39(1):33–38, 1985.
[6] Paul R Rosenbaum and Donald B Rubin. The bias due to incomplete matching. Biometrics,
41:103–116, 1985.
[7] Paul R Rosenbaum. Observational Studies. Springer, New York, 2nd edition, 2002.
[8] Donald B Rubin. Estimating causal effects from large data sets using propensity scores.
Annals of Internal Medicine, 127(8):757–763, 1997.
[9] Donald B Rubin. Matched Sampling for Causal Effects. Cambridge University Press, New
York, 2006.
[10] Donald B Rubin and Richard P Waterman. Estimating the causal effects of marketing
interventions using propensity score methodology. Statistical Science, 21:206–222, 2006.
[11] Noah Greifer and Elizabeth A Stuart. Matching methods for confounder adjustment: An
addition to the epidemiologist’s toolbox. Epidemiologic Reviews, 43(1):118–129, 2021.
[12] Paul R Rosenbaum. Design of Observational Studies. Springer, New York, 2nd edition, 2020.
[13] Paul R Rosenbaum. Modern algorithms for matching in observational studies. Annual Review
of Statistics and Its Application, 7:143–176, 2020.
[14] Ben B Hansen. The prognostic analogue of the propensity score. Biometrika, 95(2):481–488,
2008.
[15] Paul R Rosenbaum. Optimal matching for observational studies. Journal of the American
Statistical Association, 84(408):1024–1032, 1989.
[16] Jeffrey H Silber, Paul R Rosenbaum, Matthew D McHugh, Justin M Ludwig, Herbert L Smith,
Bijan A Niknam, Orit Even-Shoshan, Lee A Fleisher, Rachel R Kelz, and Linda H Aiken.
Comparison of the value of nursing work environments in hospitals across different levels of
patient risk. JAMA Surgery, 151(6):527–536, 2016.
[17] Linda H Aiken, Donna S Havens, and Douglas M Sloane. The magnet nursing services
recognition program: A comparison of two groups of magnet hospitals. AJN The American
Journal of Nursing, 100(3):26–36, 2000.
[18] William G Cochran. The planning of observational studies of human populations. Journal of
the Royal Statistical Society, A, 128(2):234–266, 1965.
[19] William G Cochran and Donald B Rubin. Controlling bias in observational studies: A review.
Sankhyā, Series A, 35(4):417–446, 1973.
[20] A Philip Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical
Society B, 41(1):1–15, 1979.
[21] Jerzy Neyman. On the application of probability theory to agricultural experiments (English translation of Neyman, 1923). Statistical Science, 5(4):465–472, 1990.
[22] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66(5):688, 1974.
[23] David R Cox. The interpretation of the effects of non-additivity in the latin square. Biometrika,
45:69–73, 1958.
[24] B L Welch. On the z-test in randomized blocks and latin squares. Biometrika, 29(1/2):21–52,
1937.
[25] Martin B Wilk and Oscar Kempthorne. Some aspects of the analysis of factorial experiments
in a completely randomized design. Annals of Mathematical Statistics, 27(4):950–985, 1956.
[26] Ruoqi Yu. Evaluating and improving a matched comparison of antidepressants and bone
density. Biometrics, 77(4):1276–1288, 2021.
[27] Constantine E Frangakis and Donald B Rubin. Principal stratification in causal inference.
Biometrics, 58(1):21–29, 2002.
[28] R A Fisher. The Design of Experiments. Oliver and Boyd, Edinburgh, 1935.
[29] Irwin DJ Bross. Statistical criticism. Cancer, 13(2):394–400, 1960.
[30] Paul R Rosenbaum. From association to causation in observational studies: The role of tests
of strongly ignorable treatment assignment. Journal of the American Statistical Association,
79(385):41–48, 1984.
[31] Paul R Rosenbaum. The role of a second control group in an observational study. Statistical
Science, 2(3):292–306, 1987.
[32] Paul R Rosenbaum. On permutation tests for hidden biases in observational studies. Annals of
Statistics, 17(2):643–653, 1989.
[33] Paul R Rosenbaum. The role of known effects in observational studies. Biometrics, 45:557–
569, 1989.
[34] Eric Tchetgen Tchetgen. The control outcome calibration approach for causal inference with
unobserved confounding. American Journal of Epidemiology, 179(5):633–640, 2014.
[35] Jerome Cornfield, William Haenszel, E Cuyler Hammond, Abraham M Lilienfeld, Michael B
Shimkin, and Ernst L Wynder. Smoking and lung cancer: Recent evidence and a discussion
of some questions. (Reprint of a paper from 1959 with four new comments). International
Journal of Epidemiology, 38(5):1175–1201, 2009.
[36] Paul R Rosenbaum. Sensitivity analysis for certain permutation inferences in matched
observational studies. Biometrika, 74(1):13–26, 1987.
[37] Paul R Rosenbaum and Jeffrey H Silber. Amplification of sensitivity analysis in matched
observational studies. Journal of the American Statistical Association, 104(488):1398–1405,
2009.
[38] Paul R Rosenbaum. Observation and Experiment: An Introduction to Causal Inference.
Harvard University Press, Cambridge, MA, 2017.
[39] Solomon Kullback and Richard A Leibler. On information and sufficiency. Annals of
Mathematical Statistics, 22(1):79–86, 1951.
[40] Ruoqi Yu, Jeffrey H Silber, and Paul R Rosenbaum. The information in covariate imbalance
in studies of hormone replacement therapy. Annals of Applied Statistics, 15, 2021.
[41] Donald B Rubin. The design versus the analysis of observational studies for causal effects:
Parallels with the design of randomized trials. Statistics in Medicine, 26(1):20–36, 2007.
[42] M W Birch. The detection of partial association, I: The 2 × 2 case. Journal of the Royal Statistical Society B, 26(2):313–324, 1964.
[43] J L Hodges and E L Lehmann. Rank methods for combination of independent experiments in
analysis of variance. The Annals of Mathematical Statistics, 33:482–497, 1962.
[44] David R Cox. Analysis of Binary Data. Methuen, London, 1970.
[45] Paul R Rosenbaum. Covariance adjustment in randomized experiments and observational
studies. Statistical Science, 17(3):286–327, 2002.
[46] Paul R Rosenbaum. Replication and Evidence Factors in Observational Studies. Chapman
and Hall/CRC, New York, 2021.
[47] James M Robins, Steven D Mark, and Whitney K Newey. Estimating exposure effects by
modelling the expectation of exposure conditional on confounders. Biometrics, 48:479–495,
1992.
[48] David Holt and T M Fred Smith. Post stratification. Journal of the Royal Statistical Society A,
142(1):33–46, 1979.
[49] Donald B Rubin and Neal Thomas. Combining propensity score matching with additional
adjustments for prognostic covariates. Journal of the American Statistical Association,
95(450):573–585, 2000.
[50] Samuel D Pimentel, Rachel R Kelz, Jeffrey H Silber, and Paul R Rosenbaum. Large, sparse op-
timal matching with refined covariate balance in an observational study of the health outcomes
produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527,
2015.
[51] Paul R Rosenbaum, Richard N Ross, and Jeffrey H Silber. Minimum distance matched
sampling with fine balance in an observational study of treatment for ovarian cancer. Journal
of the American Statistical Association, 102(477):75–83, 2007.
[52] Ruoqi Yu, Jeffrey H Silber, and Paul R Rosenbaum. Matching methods for observational
studies derived from large administrative databases. Statistical Science, 35(3):338–355, 2020.
[53] José R Zubizarreta. Using mixed integer programming for matching in an observational study
of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–
1371, 2012.
[54] José R Zubizarreta, Ricardo D Paredes, and Paul R Rosenbaum. Matching for balance, pairing
for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit
high schools in chile. The Annals of Applied Statistics, 8(1):204–231, 2014.
[55] Ruoqi Yu and Paul R Rosenbaum. Directional penalties for optimal matching in observational
studies. Biometrics, 75(4):1380–1390, 2019.
[56] Paul R Rosenbaum. How to see more in observational studies: Some new quasi-experimental
devices. Annual Review of Statistics and Its Application, 2:21–48, 2015.
3
Generalizability and Transportability
CONTENTS
3.1 The Generalization and Transportation Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.1 Validity concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.2 Target validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Data and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Estimate sample selection probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.2 Assess similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.3 When positivity fails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4 Estimate the PATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.4.1 Subclassification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.4.2 Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.4.3 Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.4.4 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.4.5 Doubly robust . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.4.6 Changes to setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.5 Evaluate PATE estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.1 Interval estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.2 Sign-generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.3 Heterogeneity and moderators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.4 Meta-analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5 Planning Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5.1 Planning a single study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5.2 Prospective meta-analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.7 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
In practical uses of causal inference methods, the goal is not only to estimate the causal effect of a treatment in the data in hand, but also to use this estimate to infer the causal effect in a broader population. In medicine, for example, doctors or public health officials may want to know how to apply findings of a study to their population of patients or in their community. Or, in education or social welfare, school officials or policy-makers may want to know how to apply the findings to students in their schools, school district, or state. If unit-specific treatment effects are all identical, these inferences
from the sample to the population are straightforward. However, when treatment impacts vary, the
causal effect estimated in a sample of data may not directly generalize to the target population of
interest.
In the causal inference literature, this is referred to as the generalizability or transportability of
causal effects from samples to target populations. The distinction between these two concepts has to
do with the relationship between the sample and target population. If the sample can be conceived of
as a sample from the target population, this inference is referred to as generalization. If, on the other
hand, the sample is not from the target population, this inference is referred to as one of transportation.
Of course, we note that in practice this distinction can be difficult to draw, since it is not always clear how a sample was generated, making its target population of origin difficult to surmise.
The goal of this chapter is to provide an introduction to the problems of generalization and
transportation and methods for addressing these concerns. We begin by providing an overview
of validity concerns, then introduce a framework for addressing questions of generalization and
transportation, methods for estimating population-level causal effects, extensions to other estimands,
and approaches to planning studies that address these concerns as part of the study design. Note that
this chapter is not comprehensive, and that there are several other field specific reviews that are also
relevant [1–6].
study may be described as coming from “low socioeconomic status households” (“low-SES”), but
how this is operationalized matters. While various complex measures of SES exist – including family income, assets, and parental education – in practice the operationalization that is far more common in U.S. education studies is an indicator that a student receives a Free or Reduced-Price Lunch (FRPL); unlike these more complex measures, this only takes into account income and household size, and requires families to apply. Similarly, these same concerns hold with outcomes in a study.
For example, the purpose of a treatment may be to reduce depression. However, the construct of
“depression” also requires operationalization, often including answering 21 questions from Beck’s
Depression Inventory.
In the remainder of this chapter, we will focus on methods for making inferences from the variety
of units and settings found in a study to those found in a target population. Throughout, we will
assume that the treatment and outcomes, as defined in the study, are the focus in the target population.
Similarly, we will put aside deeper questions of construct validity in this work. However, we do
so only because these have not been the focus of this literature, not because these issues are not
important.
In a randomized experiment, for example, τS can be estimated unbiasedly using the simple
difference in means estimator.
Suppose we are interested instead in the Population Average Treatment Effect (PATE),
τP = PATE = (1/N) Σ_{i=1}^{N} δi(c∗), (3.3)
where this target population P includes N units all with context c∗. Note that if the sample S ⊂ P and c = c∗ then this is a question of generalizability, while if S ⊄ P or c∗ is not included in the sample, this is a question of transportability.
Now, assume a general estimator τ̂ of this PATE. The total bias in this estimator (∆) can be decomposed into four parts [10]:
∆ = τ̂ − τP
= ∆S + ∆T
= ∆Sx + ∆Su + ∆Tx + ∆Tu ,
where ∆S = τS − τP is the difference between the SATE and the PATE and ∆T = τ̂ − τS is the bias of the estimator for the SATE. In the third line, note that each of these biases is further divided into
TABLE 3.1
Overview of validity types
two parts – those based on observed covariates X and those resulting from unobserved covariates U .
Olsen and colleagues provide a similar decomposition, though from a design-based perspective [11].
By focusing on total bias – also called target validity [12] – it is clear that bias results both from
how units selected into or were assigned to treatment (∆T ) and from how these units selected or were
sampled into the study (∆S ). This also clarifies the benefits and costs of different study designs. For
example, in a randomized experiment with no attrition, we can expect that ∆T = 0, but, depending
upon study inclusion criteria and the degree of heterogeneity in treatment effects across units, ∆S is
likely non-zero and could be large. Conversely, in an observational study using large administrative data on the entire population of students in a school system, it is likely that ∆S = 0, while, since treatment was not randomly assigned, depending upon the adjustments and covariates accounted for, ∆T is likely non-zero. This target validity framework makes clear that there are strong trade-offs
between internal and external validity and that focusing on one without the other can severely limit
the validity of inferences from the study to its target population.
3.2.1 Data
To begin, we require data on both a sample S of units i = 1, . . . , n and on a target population P with
units i = 1, . . . , N . In the sample, let Zi indicate if unit i receives the treatment (versus comparison),
and let Yi = Yi (1, c)Zi + Yi (0, c)(1 − Zi ) be the observed outcome. In the target population, we
do not observe Zi nor do we require that the same Yi is observed (though when it is observed, this
information is useful; see Section 3.3.5).
Let P∗ = P ∪ S be the combined data across the sample and target population. When the sample is a subset of the target population (S ⊂ P), this combined data includes N′ = N rows, and moving from the sample to the target population is referred to as generalization. When the sample is not a subset of the target population (S ⊄ P), the combined data has N′ = N + n rows, and moving from the sample to the target population is referred to as transportation [13, 14] or synthetic generalization [15].
For all units in P∗, let Wi indicate if unit i is in the sample. Finally, for each unit in this combined P∗, a vector of covariates is also required, xi = (x1i, x2i, . . . , xpi). Importantly, these covariates need to be measured the same way in the sample and target population, a criterion that can be difficult to meet in many studies (see [16] for a case study).
3.2.2 Assumptions
In order for an estimator τ̂ to be unbiased for the causal effect τP in the target population, the
following assumptions are required.
Assumption 1 (Strongly ignorable treatment assignment [15, 17]). For all units with W = 1, let
X = (X1 , X2 , . . . , Xp ) be a vector of covariates that are observed for each. Let Z indicate if each
of these units receives the treatment. Then the treatment assignment is strongly ignorable if
(Y(1, c), Y(0, c)) ⊥⊥ Z | X, and 0 < Pr(Z = 1 | X) < 1.
In practice, this condition is met without covariates in a randomized experiment without attrition,
or in an observational study in which X includes all covariates that are both related to the outcome
and to treatment selection. Moving forward, we will assume that Assumption 1 is met, since the
remainder of the chapter focuses on generalization and transportation.
Assumption 2 (Strongly ignorable sample selection [15, 17]). In the combined population P ∗ , the
sample selection process is strongly ignorable given these covariates X if
δ ⊥⊥ W | X, and Pr(W = 1 | X) > 0.
Notice that this condition is of similar form to Assumption 1 but differs in two regards. First, it is
the unit-specific treatment effects (δ = Y (1, c) − Y (0, c)) that must be conditionally orthogonal to
selection, not the vector of potential outcomes (Y(1, c), Y(0, c)). This means that we are concerned here with identifying covariates that explain variation in treatment effects in the population, or put
another way, the subset of those covariates X that are related to the outcome Y that also exhibit a
different relationship with Y under treatment and comparison. The second difference here is with
regard to the final inequality. Whereas in order to estimate the SATE, every unit needs to have some
probability of being in either treatment or control, when estimating the PATE, it is perfectly fine for
some population units to be deterministically included in the sample.
In practice, meeting Assumption 2 requires access to a rich set of covariates. Here the covariates
that are essential are those that moderate or explain variation in treatment effects across units in the
target population. The identification of potential moderators requires scientific knowledge and theory
regarding the mechanism through which the intervention changes outcomes. This is a place in which
substantive expertise plays an important role.
Researchers can often exert more control over what variables are collected in the sample than
which variables are available in the target population. Egami and Hartman [18] provide a data-driven
method for selecting covariates for generalization that relies only on sample data. Maintaining the
common assumption that the researcher can specify the variables related to how the sample was
selected, they estimate a Markov random field [19, 20] which is then used to estimate which pre-
treatment covariates are sufficient for meeting Assumption 2. The algorithm can include constraints
on what variables are measurable in the target population, which allows researchers to determine
if generalization is feasible, and if so, what variables to use in the estimation techniques for the
PATE we describe in Section 3.3.4. By focusing on estimating a set that is sufficient for meeting
Assumption 2, the method allows researchers to estimate the PATE in scenarios where methods that
require measurement of all variables related to sampling cannot.
Regardless of approach, researchers should keep in mind that in many cases – particularly in
randomized experiments – the samples are just large enough to have adequate statistical power for
testing hypotheses regarding the SATE and are quite under-powered for tests of treatment effect
moderators. Thus, moderators that are detected should clearly be included, but those that are not
precisely estimated should not be excluded based on this alone. Furthermore, in highly selected
samples, it is possible that there are some moderators that simply do not vary within the sample at all
and thus cannot be tested.
All of this means that it may be hard to empirically verify that this condition has been met,
and instead the warrant for meeting Assumption 2 may fall on the logic of the intervention and
theoretical considerations. In this vein, one approach to determining what variables to adjust for
relies on a directed acyclic graphical (DAG) approach [14, 21]. Whereas the method in Egami and
Hartman [18] empirically estimates the variables sufficient for meeting Assumption 2, Pearl and
Bareinboim [14, 21] and related approaches first fully specify the DAG and then analytically select
variables that address Assumption 2. The graphical approach also allows for alternative approaches
to identification, under the assumption the DAG is fully specified.
Finally, regardless of the approach taken when identifying potential moderators, it is wise to
conduct analyses regarding how sensitive the results are to an unobserved moderator. See Section 3.3.5
for more on this approach.
Assumption 3 (Contextual Exclusion Restriction [18]). When generalizing or transporting to settings
or contexts different from the experimental setting, we must assume
Y (Z = z, M = m, c) = Y (Z = z, M = m, c = c∗ )
where we expand the potential outcome with a vector of potential context-moderators M(c), and fix
M(c) = m.
When we change the setting or context, the concern is that the causal effect is different across
settings even for the same units. For example, would the effect for student i be the same or different
in a public vs. a private high school? At its core, this is a question about differences in mechanisms
across settings. To address concerns about changes to settings, researchers must be able to adjust for
context-moderators that capture the reasons why causal effects differ across settings. Assumption 3
states that, conditional on these (pre-treatment) context-moderators, the treatment effect for each
unit is the same across settings. Much like the exclusion restriction in instrumental variables, to be
plausible the researcher must assume that, conditional on the context-moderators M = m, there are
no other pathways through which setting directly affects the treatment effect. As with instrumental
variables, this assumption cannot be guaranteed by the design of the study, and must be justified with
domain knowledge. This strong assumption emphasizes the role of strong theory when considering
generalizability and transportability across settings [22, 23].
Assumption 4 (Stable Unit Treatment Value Assumptions (SUTVA) [15]). There are two SUTVA conditions that must be met, SUTVA(S) and SUTVA(P). Again, let W indicate if a unit in P∗ is in the sample and let Z indicate if a unit receives the treatment.
1. SUTVA(S): For all possible pairs of treatment assignment vectors Z = (Z1, . . . , ZN′) and Z′ = (Z′1, . . . , Z′N′), SUTVA(S) holds if, whenever Zi = Z′i, then Yi(Z, W = 1, c) = Yi(Z′, W = 1, c).
3.3 Estimation
In order to estimate the PATE, there are five steps, which we describe here. First, sample selection
probabilities must be estimated, and based upon the distributions of these probabilities, the ability
to generalize needs to be assessed. In many cases problems of common support lead to a need
to redefine the target population; this is a nuanced problem, since doing so involves changing the
question of practical interest. Then the PATE can be estimated using one of several approaches.
Finally, the PATE estimate needs to be evaluated based upon its ability to remove bias.
In this section we focus on estimation when generalizing to a broader, or different, population of
units, and we assume that our target setting is the same as that in the experimental setting, i.e. c = c∗ .
In effect this means that Assumption 3 holds for all units i, which allows us to collapse our potential
outcomes as Yi (w), w ∈ (0, 1), which are no longer dependent on context. We return to estimation
under changes to settings at the end of the section.
Here the threat is that there are some units in the target population P that have zero probability
of being in the sample S. Tipton [15] refers to this problem as a coverage error, or under-coverage.
This can be diagnosed by comparing the distributions of the estimated sample selection probabilities
(or their logits) in the sample S and the target population P . Here it can be helpful to focus on the
5th and 95th percentiles of these distributions, instead of on the full range, given sampling error.
Beyond determining if there is under-coverage, there are many situations in which it is helpful to
summarize the similarity of these distributions. When these distributions are highly similar to one
another, this suggests that the estimator of the PATE may not differ much from the SATE and may
have similar precision. However, when these are highly different, considerable adjustments may be
required, resulting in possible extrapolations and significantly less precision. This dissimilarity can
arise either because the sample and population are truly very different or because the treatment effect
heterogeneity is not strongly correlated with sample selection (see [27]). Thus, diagnostic measures
are helpful when determining if it is reasonable to proceed to estimate the PATE or if, instead, a more
credible sub-population (or different covariate set) needs to be defined (see Step 3).
One such metric for summarizing similarity is called the B-index or the generalizability index [28]. This index is defined as the geometric mean of the densities of the logits of the sample selection probabilities (i.e., l(x) = logit(s(x))) in the sample, fs = fs(l(x)), and target population, fp = fp(l(x)),
β = ∫ √(fs fp) dfp. (3.6)
This index is bounded 0 ≤ β ≤ 1, where larger values indicate greater similarity between the
sample and target population. Tipton shows that the value of β is highly diagnostic of the degree of
adjustment required by the PATE estimator and its effect on the standard error. Simulations indicate
that when β > 0.7 or so, the sample is similar enough to the target population that adjustments have
minimal effect on precision, and – in the case of cluster-randomized field trials – when β > 0.9 or so
the sample is as similar to the target population as a random sample of the same size [29]. Similarly,
when β ≤ 0.5, typically there is significant under-coverage and the sample and target population
exhibit large differences, thus requiring extrapolations in order to estimate the PATE [28]. In cases
in which the distributions of the logits are normally distributed, this reduces to a function of the
standardized mean difference of the logits – another diagnostic – defined as
SMD = { l̂(x | W = 0) − l̂(x | W = 1) } / sd, (3.7)
where sd indicates the standard deviation of the logit distribution in the target population [17].
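Here is a minimal sketch of these diagnostics (simulated data; the selection model and names are illustrative, not from any cited study): estimate s(x) by logistic regression, then compare the estimated logits in the sample and target population.

```python
# Estimate sample selection probabilities, then compute the SMD of the
# logits (3.7) and a simple check of the sample's 5th-95th percentile support.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
N, p = 5000, 4
X = rng.normal(size=(N, p))                               # covariates for P*
W = rng.binomial(1, 1 / (1 + np.exp(2 - 0.8 * X[:, 0])))  # 1 = in sample

fit = LogisticRegression(C=1e6).fit(X, W)     # near-unpenalized logit fit
logits = fit.decision_function(X)             # l(x) = logit of estimated s(x)
sd = logits.std(ddof=1)                       # sd in the target population (all N units here)
smd = (logits[W == 0].mean() - logits[W == 1].mean()) / sd
lo, hi = np.percentile(logits[W == 1], [5, 95])          # sample support
coverage = np.mean((logits >= lo) & (logits <= hi))      # share of P inside it
print(f"SMD of logits: {smd:.2f}; population share within sample "
      f"5th-95th percentiles: {coverage:.2%}")
```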
The approaches described so far for diagnosing similarity only require covariate information in
both the sample and target population. When the same outcome measure, Y, is available in both the sample and target population, however, this information can also be used to assess similarity through the use of placebo tests. Stuart and colleagues [17] describe a test that relies only on the sample outcomes observed for those units in the control condition (i.e., Yi(0, c)), comparing these values with those observed for all units in the target population, assumed to be unexposed to
treatment, under different estimation strategies. Hartman and colleagues [26] describe a placebo test
for verifying the identifying assumption for the population average treatment effect on the treated
(PATT) that compares the adjusted sample treated units to treated units in the target population. We
will return to this option at the end of Step 4.
Tipton and colleagues [31] provide a case study from a situation in which the resulting samples from two separate randomized trials were quite different from the target populations the studies initially intended to represent. In these studies, the eligibility criteria used by the recruitment team
were not clear. They then explored possible eligibility criteria – based on one, two, or three covariates
– to define a series of possible sub-populations. The sample was then compared to each of these
sub-populations, and for each a generalizability index was calculated. The final target populations
were then selected based on this index. This enabled the target population for the study to be clearly
defined – using clear eligibility criteria that practitioners could apply – and for PATEs to be estimated.
Finally, an alternative to changing the estimand is to rely on stronger modeling assumptions. For example, balancing weights do not rely on a positivity assumption for identification of the PATE; rather, the researcher must assume that either the treatment effect heterogeneity model is linear or the sample selection model is link-linear in the included X used to construct the weights [32, 33], allowing researchers to use weighting to interpolate or extrapolate beyond the area of common support. Cross-design synthesis, which combines treatment and outcome data collected in both the trial and the observational target population, using model-based adjustments to project onto strata with no observations, provides an alternative modeling approach to estimating generalized effects [34]. These approaches are particularly useful when considering issues of transportability, where the study might have had strict exclusion criteria for some units. Of course, the stronger the degree of extrapolation, such as in the studies described above, the more a researcher must rely on these modeling assumptions. Researchers should carefully consider the trade-offs between restricting the target population and relying on modeling assumptions when considering failures of the positivity assumption. As discussed in Section 3.5, better study design, where feasible, can mitigate some of these issues.
3.3.4.1 Subclassification
The first approach estimates the PATE as a weighted combination of stratum-specific estimates,
τ̂ps = Σ_{j=1}^{k} wj τ̂j, (3.8)
where wj is the proportion of the population P in stratum j (so that Σ wj = 1) and τ̂j is an estimate of the SATE in stratum j. In a simple randomized experiment, for example, τ̂j = Ȳtj − Ȳcj. Here typically strata are defined in relation to the distribution of the estimated sampling probabilities in the target population P so that in each stratum wj = 1/k; as few as k = 5 strata can successfully reduce most, but not all, of the bias (Cochran, 1968). The variance of this estimator can be estimated using
V̂(τ̂ps) = Σ_{j=1}^{k} wj² SE(τ̂j)². (3.9)
There are several extensions to this estimator that have been proposed, including the use of full
matching instead [17] and estimators that combine subclassification with small area estimation [35].
3.3.4.2 Weighting
Another approach is weighting-based, in which weights are constructed that make the sample representative of the target population. When treatment assignment probabilities are equal across units, the PATE can be estimated with the normalized weighting estimator
τ̂w = Σi Wi Zi ŵi Yi / Σi Wi Zi ŵi − Σi Wi (1 − Zi) ŵi Yi / Σi Wi (1 − Zi) ŵi, (3.10)
where recall that Wi indicates inclusion in the sample, Zi indicates a unit is in the treatment group,
and ŵi is the estimated probability that a unit is in the sample if generalizing, or odds of inclusion
if transporting. To improve precision and reduce the impact of extreme weights, the denominator
normalizes the weights to sum to 1 [27]. This also ensures that the outcome is bounded by the convex
hull of the observed sample outcomes.
For generalizability, in which the sample is a subset of the target population, weights are estimated
as inverse probability weights [12, 17, 36],

$$w_{i,\mathrm{ipw}} = \frac{1}{\hat{s}_i}. \qquad (3.11)$$

When transporting to a target population disjoint from the experimental sample, i.e., with Zi = 0,
the weights are estimated as the odds of inclusion,

$$w_{i,\mathrm{odds}} = \frac{1}{\hat{s}_i} \times \frac{1 - \hat{s}_i}{\Pr(W_i = 0)}. \qquad (3.12)$$
A benefit of this weighting-based estimator, τ̂w, is that it weights the outcomes in the treatment and
comparison groups separately; in comparison, the subclassification estimator cannot be calculated if
one or more strata have either zero treatment or control units. Variance estimation is trickier with
the IPW estimator, however. Buchanan and colleagues [36] provide a sandwich estimator for the
variance of this IPW estimator, as well as code for implementing this in R. The bootstrap can also be
used to construct variance estimators [37].
A related method for estimating the weights for generalizability and transportability is through
calibration, or balancing weights [26, 38], which ensure the sample is representative of the target
population on important descriptive moments, such as matching population means. This can be
beneficial for precision, and it provides an alternative approach to estimation when positivity fails.
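The weights in (3.11) and (3.12), and the estimator in (3.10), can be sketched in a few lines of R; the names below are illustrative, and the sampling probabilities are estimated by logistic regression of sample membership on covariates in stacked sample-plus-population data:

fit  <- glm(W ~ x1 + x2, family = binomial, data = dat)
shat <- predict(fit, type = "response")               # estimated Pr(W = 1 | X)

s      <- dat$W == 1                                  # the experimental sample
w_ipw  <- 1 / shat[s]                                 # generalizing, as in (3.11)
w_odds <- (1 / shat[s]) * (1 - shat[s]) / mean(dat$W == 0)  # transporting, (3.12)

z <- dat$Z[s]; y <- dat$y[s]; w <- w_ipw              # or w_odds, if transporting
tau_w <- sum(w * z * y) / sum(w * z) -                # estimator (3.10), with
  sum(w * (1 - z) * y) / sum(w * (1 - z))             # weights normalized per arm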
3.3.4.3 Matching
A third strategy is to match population and sample units. One approach would be to caliper match
with replacement based on the estimated sampling probabilities. In this approach, the goal would be to find the
“nearest” match to each target population unit in the sample, keeping in mind that the same sample unit
would likely be matched to multiple population units. Another approach, building off of optimization
approaches for causal inference more generally, uses mixed integer programming to directly match
sample units to each target population unit. This approach does not require propensity scores, instead
seeking balance (between the sample and target population) directly [39].
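A minimal sketch of the first approach, with illustrative vectors samp_shat and pop_shat of estimated sampling scores for sample and population units:

caliper <- 0.1
match_id <- sapply(pop_shat, function(p) {
  d <- abs(samp_shat - p)
  if (min(d) <= caliper) which.min(d) else NA_integer_  # unmatched if none is close
})
# A sample unit matched to several population units carries a weight equal
# to the number of population units it represents.
wts <- tabulate(match_id[!is.na(match_id)], nbins = length(samp_shat))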
3.3.4.4 Modeling
In the previous approaches, the strata, weights, or matches are designed without any outcome data.
Another approach is to model the outcome directly. The simplest version of this is linear regression,
using a model like

$$Y_i = \beta_0 + \beta_1 Z_i + \beta_2^{\top} X_i + \beta_3^{\top} Z_i X_i + \epsilon_i,$$

which explicitly models the heterogeneity in treatment effects via interaction terms. This model could
be fit using either only data from the sample or, when data on the outcome Y are available in the target
population, these data can be used as well [40]. Based on this model, predicted outcomes are generated for
each unit in the target population and these are then averaged and differenced to estimate the PATE.
In addition to this approach, Kern and colleagues [41] investigated two additional approaches,
with a focus on transporting a causal effect from a study to a new population. One of these approaches
combines weighting (by the odds) with a specified outcome model, and they explore several different
approaches to estimating the sampling probabilities (e.g., random forests). The other approach
they investigate is the use of Bayesian Additive Regression Trees (BART), which does not require
the assumptions of linear regression. Their simulations suggest that BART or IPW with random forests works best
when sample ignorability holds, though linear regression performed fairly similarly. More recent
work in this area further investigates the use of Bayesian methods using multilevel modeling and
post-stratification in combination [42].
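A minimal R sketch of the regression-based approach, assuming illustrative data frames samp (sample, with y, z, x1, x2) and pop (target population, with x1, x2):

fit <- lm(y ~ z * (x1 + x2), data = samp)             # treatment-by-covariate interactions
y1  <- predict(fit, newdata = transform(pop, z = 1))  # predicted outcome under treatment
y0  <- predict(fit, newdata = transform(pop, z = 0))  # predicted outcome under control
pate_hat <- mean(y1 - y0)                             # averaged and differenced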
Balance can be assessed, for example with standardized mean differences, for the unweighted sample versus the target population (before weighting)
and for the weighted sample versus target population (after). Ideally, the same assessment strategies
used for propensity scores in observational studies can be applied here as well, e.g., that these
values should be 0.25 or smaller.
Additionally, when the same outcome Y is available in the sample and target population, balance
can also be assessed in the same way with regard to the outcome, a form of placebo test. To do so,
apply the estimator (e.g., the weights) to units in the treatment group (Z = 1) in the sample (W = 1)
and then compare this estimate to the population average of the treated units [26]. A similar
placebo test that relies only on control units is described by Stuart and colleagues [17].
Importantly, the approaches given focus only on establishing balance with regard to observed
moderators. Even when balance has been achieved on these moderators, it is possible that there re-
mains an unobserved moderator for which balance is not achieved. One way to approach this problem
is by asking how sensitive the estimate is to an unobserved moderator. Nguyen and colleagues [47]
discuss a sensitivity analysis for partially observed confounders, in which researchers have measured
important treatment effect modifiers in the experimental sample but not in the target population.
Researchers can assess sensitivity to exclusion of such a variable from either the weights or outcome
model by specifying a sensitivity parameter, namely plausible values for the distribution of the par-
tially observed modifier in the target population. Dahabreh and colleagues [48] provide a sensitivity
analysis that bypasses the need for knowledge about partially or fully unmeasured confounders
by directly specifying a bias function. Andrews and colleagues [49] provide an approximation to
possible bias when researchers can assume that residual confounding from unobservables is small
once adjustment for observables has occurred.
3.4 Extensions
The previous section focused on point estimation of the PATE in either the target population of origin
(generalization) or in a new target population (transportation). But when treatment effects vary, other
estimands may also be of interest.
$$\tau_p^U = \tau_s, \qquad (3.15)$$

where τs is the SATE. Thus, if the outcome is dichotomous, Y ∈ {0, 1}, and Pr(W = 1) = n/N =
0.05, then τp ∈ (0.05τs − 0.95, τs). However, these bounds are also not typically very tight and, for
this reason, Chan suggests that they are perhaps most useful when implemented in combination with
other approaches such as subclassification.
3.4.2 Sign-generalization
In between the point- and interval-estimation approaches is a focus on the sign of the PATE. In this
approach, researchers hypothesize the direction, but not the magnitude, of the average treatment effect
in a different population or setting. Sign-generalization is more limited in the strength of the possible
claim about the PATE, but it answers an important aspect of external validity for many researchers.
It may also be a practical compromise when the required assumptions for point-identification are
implausible.
Egami and Hartman [45] provide a statistical test of sign-generalization. The authors describe the
design of purposive variations, where researchers include variations of units, treatments, outcomes, or
settings/contexts in their study sample. These variations are designed to meet an overlap assumption,
which states that the PATE in the target population and setting lies within the convex hull of the
average treatment effects observed in the study across purposive variations. These average treatment
effect estimates are combined using a partial conjunction test [51, 52] to provide statistical evidence
for how many purposive variations support the hypothesized sign of the PATE. An advantage of this
method, in addition to weaker assumptions, is that it allows researchers to gain some leverage on
answering questions regarding target validity with respect to changes in treatments and outcomes.
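One way to implement the partial conjunction test, assuming independent one-sided p-values p from the m purposive variations: following Benjamini and Heller [51], the p-value for "at least u of m variations have an effect of the hypothesized sign" combines the m − u + 1 largest p-values, here with Fisher's method.

partial_conjunction <- function(p, u) {
  m     <- length(p)
  p_top <- sort(p)[u:m]                    # the m - u + 1 largest p-values
  stat  <- -2 * sum(log(p_top))            # Fisher combination
  pchisq(stat, df = 2 * length(p_top), lower.tail = FALSE)
}
# The largest u with partial_conjunction(p, u) <= alpha is the number of
# variations for which there is evidence of the hypothesized sign.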
3.4.4 Meta-analysis
A final approach, often used for generalizing causal effects, is to combine the results of
multiple studies using meta-analysis [65]. In this approach, the goal is to estimate the ATE (and
its variation) in the population of studies, not of people or other units. To do so, outcomes are converted to
effect sizes, and a weight for each is calculated (typically inverse-variance). By conceiving of the
observed studies as a random sample of the super-population of possible studies, the average
effect size and the degree of variation in these effect sizes within and across studies can be estimated.
When covariates are also encoded for each effect size (or study), meta-regression models can be used
to estimate the degree to which these moderate the treatment effects.
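A minimal sketch of the inverse-variance weighted summary, assuming illustrative vectors es (study effect sizes) and v (their squared standard errors); a random-effects version would add an estimate of between-study variance to v before weighting:

w_meta <- 1 / v
es_bar <- sum(w_meta * es) / sum(w_meta)   # pooled average effect size
se_bar <- sqrt(1 / sum(w_meta))            # its standard error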
When individual participant data are available for all units in each of these primary studies, Dahabreh
and colleagues [66] provide an approach that allows the PATE to be estimated for the population of units, as in the
remainder of this chapter. This requires each study to include the same set of moderator data, though
it does not require the positivity assumption to hold in every study.
itself not be of direct interest, doing so may result in better estimates for each of the actual target
populations of interest (i.e., subsets of the broad population).
If treatment effects vary, when possible it may be important not only to design the study to estimate
a PATE, but also to predict unit-level treatment effects or to test hypotheses about moderators. As with
the considerations for transportability, this means that there are multiple estimands of interest – e.g.,
the PATE, moderators, and CATEs. Tipton and colleagues [71] provide an example in which the
goal was to estimate both an overall PATE and several subgroup PATEs. The problem here is that
the optimal sampling probabilities or design may be different for different estimands. Similarly,
estimands other than averages may also be of interest, including tests of treatment effect heterogeneity
and moderators. Tipton [64] provides an approach to sample selection in this case using a response
surface model framework.
3.6 Conclusion
In this chapter, we have provided an overview of methods for generalizing and transporting causal
effects from a sample to one or more target populations. These involve approaching bias from
the sample selection process in a similar vein to the bias from treatment selection. Like those for
addressing treatment selection, these require careful consideration of assumptions, the right covariate
(moderator) data to be available, and the need to clearly identify the estimand of interest.
We close by reminding readers that these approaches address but one part of the problems of
external validity that apply to all research. Questions of generalization and transportation call into
question the reasons that research is conducted, the questions asked, and the ways in which research
will be used for decision-making for both individuals and policies. For this reason, perhaps the most
important step researchers can take – whether using the methods described in this chapter or not – is
to clearly define characteristics of the sample, the limits to their study, and where and under what
conditions they expect results might generalize (or not), and to hypothesize the mechanism through
which the treatment may work.
3.7 Glossary
Internal validity: describes the extent to which a cause-and-effect relationship established in a
study cannot be explained by other factors.
External validity: is the extent to which the findings of a study generalize to other situations,
people, settings, and measures.
Construct validity: concerns the extent to which a test or measure accurately assesses what it is
supposed to measure.
Statistical conclusion validity: concerns the degree to which the conclusions drawn from statistical
analyses of data are accurate and appropriate.
Generalization: concerns inferences about the value of a parameter estimated in a sample to a target
population that the sample is a part of.
References
[1] Benjamin Ackerman, Ian Schmid, Kara E Rudolph, Marissa J Seamans, Ryoko Susukida,
Ramin Mojtabai, and Elizabeth A Stuart. Implementing statistical methods for generalizing
randomized trial findings to a target population. Addictive Behaviors, 94:124–132, 2019.
[2] Irina Degtiar and Sherri Rose. A review of generalizability and transportability, 2021.
[3] Michael G. Findley, Kyosuke Kikuta, and Michael Denly. External validity. Annual Review of
Political Science, 24(1):365–393, 2021.
[4] Erin Hartman. Generalizing experimental results. In James N. Druckman and Donald P. Green,
editors, Advances in Experimental Political Science, pages 385–410. Cambridge University Press,
Cambridge, 2021.
[5] Catherine R Lesko, Ashley L Buchanan, Daniel Westreich, Jessie K Edwards, Michael G
Hudgens, and Stephen R Cole. Generalizing study results: A potential outcomes perspective.
Epidemiology (Cambridge, Mass.), 28(4):553, 2017.
[6] Elizabeth Tipton and Robert B Olsen. A review of statistical methods for generalizing from
evaluations of educational interventions. Educational Researcher, 47(8):516–524, 2018.
[7] Thomas D Cook, Donald Thomas Campbell, and William Shadish. Experimental and quasi-
experimental designs for generalized causal inference. Houghton Mifflin, Boston, MA, 2002.
[8] Lee J Cronbach and Karen Shapiro. Designing evaluations of educational and social programs.
Jossey-Bass, San Francisco, CA, 1982.
[9] Paul W Holland. Statistics and causal inference. Journal of the American statistical Association,
81(396):945–960, 1986.
[10] Kosuke Imai, Gary King, and Elizabeth A Stuart. Misunderstandings between experimentalists
and observationalists about causal inference. Journal of the Royal Statistical Society: Series A
(Statistics in Society), 171(2):481–502, 2008.
[11] Robert B Olsen, Larry L Orr, Stephen H Bell, and Elizabeth A Stuart. External validity in
policy evaluations that choose sites purposively. Journal of Policy Analysis and Management,
32(1):107–121, 2013.
[12] Daniel Westreich, Jessie K Edwards, Catherine R Lesko, Stephen R Cole, and Elizabeth A
Stuart. Target validity and the hierarchy of study designs. American Journal of Epidemiology,
188(2):438–443, 2019.
[13] Elias Bareinboim and Judea Pearl. A general algorithm for deciding transportability of experi-
mental results. Journal of Causal Inference, 1(1):107–134, 2013.
[14] Judea Pearl and Elias Bareinboim. External validity: From do-calculus to transportability across
populations. Statistical Science, 29(4):579–595, 2014.
[15] Elizabeth Tipton. Improving generalizations from experiments using propensity score sub-
classification: Assumptions, properties, and contexts. Journal of Educational and Behavioral
Statistics, 38(3):239–266, 2013.
[16] Elizabeth A Stuart and Anna Rhodes. Generalizing treatment effect estimates from sample
to population: A case study in the difficulties of finding sufficient data. Evaluation Review,
41(4):357–388, 2017.
[17] Elizabeth A Stuart, Stephen R Cole, Catherine P Bradshaw, and Philip J Leaf. The use of
propensity scores to assess the generalizability of results from randomized trials. Journal of the
Royal Statistical Society: Series A (Statistics in Society), 174(2):369–386, 2011.
[18] Naoki Egami and Erin Hartman. Covariate selection for generalizing experimental results:
Application to a large-scale development program in Uganda. Journal of the Royal Statistical
Society: Series A (Statistics in Society), forthcoming.
[19] Steffen L Lauritzen. Graphical models. Clarendon Press, Oxford, 1996.
[20] Jonas Haslbeck and Lourens J Waldorp. mgm: Estimating time-varying mixed graphical models
in high-dimensional data. Journal of Statistical Software, 93(8):1–46, 2020.
[21] Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings
of the National Academy of Sciences, 113(27):7345–7352, 2016.
[22] Anna M. Wilke and Macartan Humphreys. Field experiments, theory, and external validity.
In The SAGE Handbook of Research Methods in Political Science and International Relations.
SAGE Publications Ltd, London, 2020.
[23] Carlos Cinelli and Judea Pearl. Generalizing experimental results by leveraging knowledge of
mechanisms. European Journal of Epidemiology, 36(2):149–164, 2021.
[24] Donald B Rubin. Bayesian inference for causal effects: The role of randomization. The Annals
of Statistics, 6(1): 34–58, 1978.
[25] Donald B Rubin. Formal mode of statistical inference for causal effects. Journal of Statistical
Planning and Inference, 25(3):279–292, 1990.
[26] Erin Hartman, Richard Grieve, Roland Ramsahai, and Jasjeet S Sekhon. From sample average
treatment effect to population average treatment effect on the treated: Combining experimental
with observational studies to estimate population treatment effects. Journal of the Royal
Statistical Society: Series A (Statistics in Society), 178(3):757–778, 2015.
[27] Luke W. Miratrix, Jasjeet S. Sekhon, Alexander G. Theodoridis, and Luis F. Campos. Worth
weighting? how to think about and use weights in survey experiments. Political Analysis,
26(3):275–291, 2018.
[28] Elizabeth Tipton. How generalizable is your experiment? an index for comparing experimental
samples and populations. Journal of Educational and Behavioral Statistics, 39(6):478–501,
2014.
[29] Elizabeth Tipton, Kelly Hallberg, Larry V Hedges, and Wendy Chan. Implications of small
samples for generalization: Adjustments and rules of thumb. Evaluation Review, 41(5):472–505,
2017.
[30] Colm O’Muircheartaigh and Larry V Hedges. Generalizing from unrepresentative experiments:
A stratified propensity score approach. Journal of the Royal Statistical Society: Series C
(Applied Statistics), 63(2):195–210, 2014.
[31] Elizabeth Tipton, Lauren Fellers, Sarah Caverly, Michael Vaden-Kiernan, Geoffrey Borman,
Kate Sullivan, and Veronica Ruiz de Castilla. Site selection in experiments: An assessment of
site recruitment and generalizability in two scale-up studies. Journal of Research on Educational
Effectiveness, 9(sup1):209–228, 2016.
[32] Qingyuan Zhao and Daniel Percival. Entropy balancing is doubly robust. Journal of Causal
Inference, 5(1): 20160010, 2016.
[33] Erin Hartman, Chad Hazlett, and Ciara Sterbenz. Kpop: A kernel balancing approach for
reducing specification assumptions in survey weighting. arXiv preprint arXiv:2107.08075,
2021.
[34] Eloise E Kaizar. Estimating treatment effect via simple cross design synthesis. Statistics in
Medicine, 30(25):2986–3009, 2011.
[35] Wendy Chan. The sensitivity of small area estimates under propensity score subclassification
for generalization. Journal of Research on Educational Effectiveness, 15(1):178–215, 2021.
[36] Ashley L Buchanan, Michael G Hudgens, Stephen R Cole, Katie R Mollan, Paul E Sax, Eric S
Daar, Adaora A Adimora, Joseph J Eron, and Michael J Mugavero. Generalizing evidence
from randomized trials using inverse probability of sampling weights. Journal of the Royal
Statistical Society: Series A (Statistics in Society), 181(4):1193, 2018.
[37] Ziyue Chen and Eloise Kaizar. On variance estimation for generalizing from a trial to a target
population, 2017.
[38] Kevin P Josey, Fan Yang, Debashis Ghosh, and Sridharan Raghavan. A calibration approach to
transportability with observational data. arXiv preprint arXiv:2008.06615, 2020.
[39] Magdalena Bennett, Juan Pablo Vielma, and José R Zubizarreta. Building representative
matched samples with multi-valued treatments in large observational studies. Journal of
Computational and Graphical Statistics, 29(4):744–757, 2020.
[40] Melody Huang, Naoki Egami, Erin Hartman, and Luke Miratrix. Leveraging population
outcomes to improve the generalization of experimental results, 2021.
[41] Holger L Kern, Elizabeth A Stuart, Jennifer Hill, and Donald P Green. Assessing methods
for generalizing experimental impact estimates to target populations. Journal of Research on
Educational Effectiveness, 9(1):103–127, 2016.
[42] Lauren Kennedy and Andrew Gelman. Know your population and know your model: Using
model-based regression and poststratification to generalize findings beyond the observed sample.
Psychological Methods, 48(6):3283–3311, 2021.
[43] Issa J Dahabreh, Sarah E Robertson, Jon A Steingrimsson, Elizabeth A Stuart, and Miguel A
Hernan. Extending inferences from a randomized trial to a new target population. Statistics in
Medicine, 39(14):1999–2014, 2020.
[44] Nianbo Dong, Elizabeth A Stuart, David Lenis, and Trang Quynh Nguyen. Using propensity
score analysis of survey data to estimate population average treatment effects: A case study
comparing different methods. Evaluation Review, 44(1):84–108, 2020.
[45] Naoki Egami and Erin Hartman. Elements of external validity: Framework, design, and analysis.
American Political Science Review, pages 1–19, 2022.
[46] Kara E Rudolph and Mark J van der Laan. Robust estimation of encouragement-design
intervention effects transported across sites. Journal of the Royal Statistical Society. Series B,
Statistical Methodology, 79(5):1509, 2017.
[47] Trang Quynh Nguyen, Benjamin Ackerman, Ian Schmid, Stephen R Cole, and Elizabeth A
Stuart. Sensitivity analyses for effect modifiers not observed in the target population when
generalizing treatment effects from a randomized controlled trial: Assumptions, models, effect
scales, data scenarios, and implementation details. PloS One, 13(12):e0208795, 2018.
[48] Issa J Dahabreh, James M Robins, Sebastien JP Haneuse, Iman Saeed, Sarah E Robertson, Elisa-
beth A Stuart, and Miguel A Hernán. Sensitivity analysis using bias functions for studies extend-
ing inferences from a randomized trial to a target population. arXiv preprint arXiv:1905.10684,
2019.
[49] Isaiah Andrews and Emily Oster. Weighting for external validity. NBER working paper,
(w23826), 2017.
[50] Wendy Chan. Partially identified treatment effects for generalizability. Journal of Research on
Educational Effectiveness, 10(3):646–669, 2017.
[51] Yoav Benjamini and Ruth Heller. Screening for Partial Conjunction Hypotheses. Biometrics,
64(4):1215–1222, 2008.
[52] Bikram Karmakar and Dylan S. Small. Assessment of the extent of corroboration of an
elaborate theory of a causal hypothesis using partial conjunctions of evidence factors. Annals
of Statistics, 2020.
[53] David S Yeager, Paul Hanselman, Gregory M Walton, Jared S Murray, Robert Crosnoe,
Chandra Muller, Elizabeth Tipton, Barbara Schneider, Chris S Hulleman, Cintia P Hinojosa,
et al. A national experiment reveals where a growth mindset improves achievement. Nature,
573(7774):364–369, 2019.
[54] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects.
Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
[55] Kosuke Imai and Marc Ratkovic. Estimating treatment effect heterogeneity in randomized
program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013.
[56] Hugh A Chipman, Edward I George, and Robert E McCulloch. BART: Bayesian additive
regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
[57] Donald P Green and Holger L Kern. Modeling heterogeneous treatment effects in survey
experiments with bayesian additive regression trees. Public Opinion Quarterly, 76(3):491–511,
2012.
[58] P Richard Hahn, Jared S Murray, and Carlos M Carvalho. Bayesian regression tree models for
causal inference: Regularization, confounding, and heterogeneous effects (with discussion).
Bayesian Analysis, 15(3):965–1056, 2020.
[59] Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computa-
tional and Graphical Statistics, 20(1):217–240, 2011.
[60] Richard K Crump, V Joseph Hotz, Guido W Imbens, and Oscar A Mitnik. Nonparametric tests
for treatment effect heterogeneity. The Review of Economics and Statistics, 90(3):389–405,
2008.
[61] Peng Ding, Avi Feller, and Luke Miratrix. Randomization inference for treatment effect
variation. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
78(3):655–671, 2016.
[62] P Richard Hahn, Vincent Dorie, and Jared S Murray. Atlantic causal inference conference (acic)
data analysis challenge 2017. arXiv preprint arXiv:1905.09515, 2019.
[63] Elizabeth Tipton and Larry V Hedges. The role of the sample in estimating and explaining
treatment effect heterogeneity. Journal of Research on Educational Effectiveness, 10(4):903–
906, 2017.
[64] Elizabeth Tipton. Beyond generalization of the ate: Designing randomized trials to understand
treatment effect heterogeneity. Journal of the Royal Statistical Society: Series A (Statistics in
Society), 184(2):504–521, 2021.
[65] Larry V Hedges and Ingram Olkin. Statistical methods for meta-analysis. Academic Press,
Orlando, FL, 1985.
[66] Issa J Dahabreh, Lucia C Petito, Sarah E Robertson, Miguel A Hernán, and Jon A Steingrimsson.
Towards causally interpretable meta-analysis: Transporting inferences from multiple studies to
a target population. arXiv preprint arXiv:1903.11455, 2019.
[67] Robert B Olsen and Larry L Orr. On the “where” of social experiments: Selecting more
representative samples to inform policy. New Directions for Evaluation, 2016(152):61–71,
2016.
[68] Elizabeth Tipton. Stratified sampling using cluster analysis: A sample selection strategy for
improved generalizations from experiments. Evaluation Review, 37(2):109–139, 2014.
[69] Elizabeth Tipton and Laura R Peck. A design-based approach to improve external validity in
welfare policy evaluations. Evaluation Review, 41(4):326–356, 2017.
[70] Elizabeth Tipton. Sample selection in randomized trials with multiple target populations.
American Journal of Evaluation, 43(1):70–89, 2022.
[71] Elizabeth Tipton, David S Yeager, Ronaldo Iachan, and Barbara Schneider. Designing probability
samples to study treatment effect heterogeneity. In Paul J. Lavrakas, Michael W. Traugott,
Courtney Kennedy, Allyson L. Holbrook, Edith D. de Leeuw, and Brady T. West, editors,
Experimental Methods in Survey Research: Techniques That Combine Random Sampling with
Random Assignment, pages 435–456. John Wiley & Sons, 2019.
[72] Elizabeth Tipton, Larry Hedges, David Yeager, Jared Murray, and Maithreyi Gopalan.
Global mindset initiative paper 4: Research infrastructure and study design. SSRN, 2021.
https://ssrn.com/abstract=3911643.
Part II
Matching
4
Optimization Techniques in Multivariate Matching
CONTENTS
4.1 Introduction 64
 4.1.1 Goal of this chapter 64
 4.1.2 The role of optimization in matching 64
 4.1.3 Outline 65
4.2 Optimal Assignment: Pair Matching to Minimize a Distance 66
 4.2.1 Notation for the assignment problem 66
 4.2.2 Greedy algorithms can produce poor solutions to the assignment problem 66
 4.2.3 Nearest neighbor matching with replacement can waste controls 69
 4.2.4 Finding an optimal assignment by the auction algorithm 70
 4.2.5 Simple variations on optimal assignment 71
4.3 The Speed of Algorithms 71
4.4 Optimal Matching by Minimum Cost Flow in a Network 72
 4.4.1 Minimum cost flow and optimal assignment 72
 4.4.2 The minimum cost flow problem 72
 4.4.3 Simple examples of matching by minimum cost flow 73
 4.4.4 Solving the minimum cost flow problem in R 74
 4.4.5 Size and speed of minimum cost flow problems 75
4.5 Tactics for Matching by Minimum Cost Flow 75
 4.5.1 What are tactics? 75
 4.5.2 Fine, near-fine balance and refined balance 75
 4.5.3 Adjusting edge costs to favor some types of pairs 77
 4.5.4 Using more controls or fewer treated individuals 77
 4.5.5 Threshold algorithms that remove edges 77
4.6 Optimal Matching using Mixed Integer Programming 78
 4.6.1 Working without a worst-case time bound 78
 4.6.2 Linear side constraints for minimum distance matching 79
 4.6.3 Cardinality matching: Matching without pairing 81
4.7 Discussion 81
References 82
DOI: 10.1201/9781003102670-4
4.1 Introduction
4.1.1 Goal of this chapter
Optimization techniques for multivariate matching in observational studies are widely used and
extensively developed, and the subject continues to grow at a quick pace. The typical research
article published today presumes acquaintance with technical background that is well known
in computer science and operations research, but is not a standard part of doctoral education
in statistics. The relevant technical material is interesting and not especially difficult, but it is
not curated in a form that permits relatively quick study by someone who wishes to develop new
techniques for matching in statistics. The goal of this chapter is to review some of this technical
background at a level appropriate for a PhD student in statistics. This chapter assumes no background
in combinatorial optimization, and the chapter may serve as an entry point to the literature on
combinatorial optimization for a student of statistics. In contrast, a scientist seeking a review of
methods for constructing a matched sample in an empirical article might turn to references [1, 2].
Matching is used to achieve a variety of goals. Most commonly, it is used to reduce confounding
from measured covariates in observational studies [3]. Matching may be used to strengthen an
instrument or instrumental variable [4–7]. In addition, it may be used to depict the processes that
produce racial or gender disparities in health or economic outcomes [8,9]. Close pairing for covariates
highly predictive of the outcome can also reduce sensitivity of causal inferences to biases from
unmeasured covariates [10, 11].
For instance, Neuman and colleagues [6] used matching to reduce confounding from measured
covariates, while also strengthening an instrument intended to address certain unmeasured covariates,
in a study of the effects of regional versus general anesthesia in hip fracture surgery. Some hospitals
commonly use regional/local anesthesia for hip fracture surgery, while other hospitals commonly use
general anesthesia. Someone who breaks their hip is likely to be taken to an emergency room, and to
receive, during hip fracture surgery, the type of anesthesia commonly used in that hospital. A possible
instrument for regional-versus-general anesthesia is the differential distance to the nearest hospital
that commonly uses each type of anesthesia. In New York State as a whole, this instrument is weak:
in the densely populated parts of the State, such as New York City with its many hospitals in a small
space, geography exerts only a slight nudge, pushing a patient towards one type of anesthesia or
the other. In contrast, in the sparsely populated parts of upstate New York, there may be only one
hospital in a mid-sized town, and geography strongly influences the type of anesthesia that a patient
receives. This distinction is vividly displayed on a map of New York State in Figure 2 of Neuman
et al. [6]. Matching sought to form pairs close in terms of measured covariates in which one person
was much closer to a hospital using regional anesthesia, and the other person was much closer to a
hospital using general anesthesia. In this way, a large study with a weak instrument is replaced by a
smaller study with a strong instrument [4, 5]. Using data from New York’s Statewide Planning and
Research Cooperative System, Neuman and colleagues [6] built a closely matched study with a strong
instrument consisting of 10,757 matched pairs of two patients undergoing hip fracture surgery, one
close only to a hospital using regional anesthesia, the other close only to a hospital using general
anesthesia. Numerous covariates were controlled by matching.
analysis is conducted when outcomes are examined for the first time. A single primary analysis
does not preclude secondary analyses or exploratory analyses; rather, it distinguishes among such
analyses.
Optimization is a framework within which matched samples may be constructed and compared.
An optimal match is best in terms of a certain criterion subject to certain requirements or constraints.
However, optimization of one criterion is not a recommendation; rather, human judgement is commonly
required to compare multiple criteria. An optimal match is one study design that may be compared
with other study designs optimal in terms of altered criteria or constraints, as a recommended design
is chosen. It is difficult to compare study designs that lack definite properties, and optimization gives
each study design certain definite properties.
In truth, we rarely need exact optimality except as a foundation for comparison. Wald [13] proved
the consistency of maximum likelihood estimators under extremely weak conditions by showing
that all estimators with sufficiently high likelihoods are consistent. Huber [14] showed that there
are estimators that are very nearly the equal of maximum likelihood estimators when the underlying
model is true, and that are much better when the model is false in particular ways. Both Wald and
Huber asserted that a suboptimal estimator was adequate or in some way superior, but they could
only say this because they could compare its performance with an optimal procedure. Optimality is
a benchmark, not a recommendation.
Typically, an investigator has several objectives for a matched sample, some of which cannot
be measured by quantitative criteria. For instance, the investigator wants the match to be correctly
and appropriately persuasive to its target audience – perhaps scientists or medical researchers in a
particular field, or a regulatory agency, or a judge and jury – but it is difficult to give a mathematical
definition of persuasive. A persuasive study design is public, open to view, open to assessment, and
open to criticism by its target audience. The design has to be intelligible to its target audience in
a way that permits criticism by experts in that audience. It can be a powerful endorsement of a
study and its findings if the design is intelligible to experts and the only criticisms that arise seem
inconsequential. The audience must understand the design if it is to judge certain criticisms as
inconsequential. Because a study design is proposed before outcomes are examined, the design may
be shared with its target audience before it is used, so that reasonable criticisms may be addressed
before any analysis begins.
In brief, optimality is a framework within which competing study designs may be compared.
Within such a framework, one quickly discovers new ways to improve study designs, replacing a
design optimal in an impoverished sense by a new design optimal in an enriched sense.
4.1.3 Outline
Section 4.2 begins with an old problem that had been well-solved in 1955. This optimal assignment
problem forms matched treated-control pairs to minimize the total covariate distance within the pairs.
This problem is not yet a practical problem from the perspective of statistics, but it is a good place
to clarify certain issues. Section 4.3 discusses the way computer scientists traditionally measure
the speed of an algorithm, including the limitations in statistics of this traditional measure. The
assignment algorithm generalizes to the problem of minimum cost flow in a network, as discussed in
§4.4, leading to many practical methods for statistical problems. Optimal assignment and minimum
cost flow problems of the sizes commonly encountered can be solved quite quickly by the traditional
standard of computational speed in §4.3. Network techniques are easily enhanced by various tactics,
as discussed in §4.5. Section 4.6 greatly expands the set of possible matched designs using mixed
integer programming. Unlike minimum cost flow problems, integer programs can be difficult to
solve by the traditional standard of computational speed in §4.3; however, this may be a limitation
of the traditional standard when used in this context, not a limitation of matching using integer
programming; see §4.6.
4.2 Optimal Assignment: Pair Matching to Minimize a Distance

4.2.1 Notation for the assignment problem

There are T treated individuals, τ1, . . . , τT, and C ≥ T potential controls, γ1, . . . , γC, with a covariate distance δtc ≥ 0 between τt and γc. A pair matching is represented by indicators atc, with atc = 1 if τt is paired with γc and atc = 0 otherwise. The assignment problem chooses the atc to minimize the total within-pair distance

$$\sum_{t=1}^{T} \sum_{c=1}^{C} a_{tc}\, \delta_{tc} \qquad (4.1)$$

subject to constraints

$$\sum_{c=1}^{C} a_{tc} = 1 \text{ for each } t, \quad \text{and} \quad \sum_{t=1}^{T} a_{tc} \leq 1 \text{ for each } c, \quad \text{with each } a_{tc} \in \{0, 1\}. \qquad (4.2)$$

In (4.2), $\sum_{c=1}^{C} a_{tc} = 1$ insists that τt is paired with exactly one control γc, and $\sum_{t=1}^{T} a_{tc} \leq 1$ insists
that each control is used at most once.
Two misconceptions are common. The assignment problem can appear to be either trivial or
impossible, but it is neither. It can seem trivial if you assign to each τt the control γi who is
closest, the one with δti = min_{c∈C} δtc; however, that fails because it may use the same control many
times, and might even produce a control group consisting of a single control rather than T distinct
controls. Upon realizing that the assignment problem is not trivial, it can seem impossible: there are
C!/(C − T)! possible solutions to the constraints (4.2), or 3.1 × 10^93 solutions for a tiny matching
problem with T = 50 and C = 100. It seems impossible to evaluate (4.1) for C!/(C − T)! solutions
to find the best solution.
4.2.2 Greedy algorithms can produce poor solutions to the assignment problem
A greedy algorithm pairs the two closest individuals, i and j, with δij = mint∈T minc∈C δtc , and
removes i and j from further consideration. From the individuals who remain, the greedy algorithm
pairs the two closest individuals, and so on. There is a small but interesting class of problems for
which greedy algorithms provide optimal solutions [15, Chapter 7]; however, the assignment problem
is not one of these problems.
Greedy and optimal matching are compared in Table 4.1, where T = C = 3, and β ≫ 5ε >
ε > 0. Here, β is for a “big” distance, much larger than 5ε. Greedy would form pairs (τ1, γ2) and
(τ2, γ3) with a total within-pair distance of δ12 + δ23 = 0 + 0 = 0, but then would be forced to
add the pair (τ3, γ1), making the total distance δ12 + δ23 + δ31 = 0 + 0 + β = β. In Table 4.1,
greedy matching starts strong, with very small distances, but paints itself into a corner, and its final
match is poor. The optimal matching is (τ1, γ1), (τ2, γ2), (τ3, γ3), with a total within-pair distance
of ε + ε + ε = 3ε ≪ β. As β can be any large number or even ∞, it follows that β − 3ε can be any
large number or ∞, so the total distance produced by greedy matching can be vastly worse than the
total distance produced by optimal matching. The optimization in optimal matching is needed to
recognize that τ1 had to be paired with γ1 to avoid pairing τ2 or τ3 with γ1.
TABLE 4.1
A T × C = 3 × 3 distance matrix, δtc.

      γ1    γ2      γ3
τ1    ε     0       5 × ε
τ2    β     ε       0
τ3    β     5 × ε   ε
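The contrast can be reproduced in a few lines of R, with illustrative values ε = 1 and β = 100; solve_LSAP from the clue package solves the assignment problem exactly:

library(clue)
eps <- 1; beta <- 100
delta <- rbind(c(eps,  0,       5 * eps),
               c(beta, eps,     0),
               c(beta, 5 * eps, eps))

pairs <- integer(3); d <- delta
while (any(pairs == 0)) {                        # greedy: pair the closest two,
  ij <- which(d == min(d), arr.ind = TRUE)[1, ]  # then remove both from play
  pairs[ij[1]] <- ij[2]
  d[ij[1], ] <- Inf; d[, ij[2]] <- Inf
}
opt <- as.integer(solve_LSAP(delta))             # optimal assignment

sum(delta[cbind(1:3, pairs)])   # greedy total: 0 + 0 + beta = 100
sum(delta[cbind(1:3, opt)])     # optimal total: 3 * eps = 3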
FIGURE 4.1
Boxplots of 300 within pair distances, δtc = (xt − xc )2 , for greedy and optimal matching. Thirty
greedy distances, 30/300 = 10%, exceed the maximum distance, 0.4373, for optimal matching, and
in all 30 pairs, the covariate difference is positive, (xt − xc ) > 0.
The typical distance matrix in current use is broadly similar in structure to Table 4.1. There are
some big distances, like β, and some small distances, like ε or 5ε. The big distances may reflect
penalties imposed to try to enforce some constraint, while the small distances reflect close pairs that
satisfy the constraint. For instance, [3] recommended imposing a caliper on the propensity score and,
within that caliper, matching using the Mahalanobis distance. Large distances like β would then
refer to individuals whose propensities were far apart, violating the caliper, while small distances like
ε or 5ε would distinguish people with similar propensity scores but differing patterns of covariates.
In this context, in Table 4.1, greedy matching would produce one pair of people with very different
propensity scores, while optimal matching would form no pairs with very different propensity scores.
Figure 4.1 is a small simulated example, in which T = 300 treated individuals are matched with
C = 600 potential controls to form 300 matched pairs. There is a single covariate x drawn from
N(0.75, 1) for treated individuals and from N(0, 1) for controls. The distance is δtc = (xt − xc)²,
and 300 within-pair distances are shown in Figure 4.1 for greedy and for optimal matching. Prior to
matching, for a randomly chosen treated individual and a control, xt − xc is N(0.75, 2), so (xt − xc)²
has expectation 0.75² + 2 = 2.56, and has sample mean 2.59 in the simulation. As in Table 4.1, in
Figure 4.1 greedy matching starts strong, finding many close pairs, but paints itself into a corner
and finally accepts some very poor pairs. These poorly matched pairs could have been avoided, and
were avoided, by optimal matching. The largest distance δtc for a pair matched by the greedy method
is 9.39, while the largest distance by the optimal method is 0.44. In total, 30/300 = 10% of the
greedy pairs have larger distances than the largest distance, 0.44, produced by optimal matching. This
is shown in greater detail in Table 4.2, which lists the total of the 300 within-pair distances, their mean,
median, and upper percentiles.
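The simulation behind Figure 4.1 can be sketched in R along the same lines, with the optimal match computed by clue::solve_LSAP (which accepts a rectangular distance matrix):

library(clue)
set.seed(1)
xt <- rnorm(300, mean = 0.75)                    # treated covariate values
xc <- rnorm(600)                                 # control covariate values
delta <- outer(xt, xc, function(a, b) (a - b)^2) # 300 x 600 distance matrix

greedy_pairs <- rep(NA_integer_, 300); d <- delta
while (any(is.na(greedy_pairs))) {               # greedy, as in the sketch above
  ij <- which(d == min(d), arr.ind = TRUE)[1, ]
  greedy_pairs[ij[1]] <- ij[2]
  d[ij[1], ] <- Inf; d[, ij[2]] <- Inf
}
optimal_pairs <- as.integer(solve_LSAP(delta))

summary(delta[cbind(1:300, greedy_pairs)])       # a long right tail of poor pairs
summary(delta[cbind(1:300, optimal_pairs)])      # no comparably large distances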
TABLE 4.2
Summary of 300 within-pair distances, δtc = (xt − xc )2 , for greedy and optimal matching. Also
included is the summary of the 300 × 600 distances before matching.
FIGURE 4.2
The 300th of 300 pairs formed by greedy matching (solid line) compared to the pair for the same
treated individual formed by optimal matching (dashed line).
Also important in Figure 4.1 are the signs of the 300 within-pair differences, xt − xc. Before
matching, the bias was positive, E(xt − xc) = 0.75. In greedy matching, when the distance,
δtc = (xt − xc)², is large, the difference xt − xc tends to be a large positive number, not a large
negative number, so the original bias is intact. All of the 30/300 = 10% largest δtc = (xt − xc)²
from greedy matching are positive differences, xt − xc > 0, often with differences larger than 0.75.
Figure 4.2 shows where greedy matching went wrong. It shows the 300th pair formed by greedy
matching, and the corresponding match for optimal matching. The treated individual with the
highest xt was matched last by greedy matching, when its good controls were gone, whereas optimal
matching paired the highest xt to the highest xc .
Are Figures 4.1 and 4.2 and Table 4.2 typical? Yes and no. When there are many covariates, x is
replaced by the propensity score, so Figures 4.1 and 4.2 and Table 4.2 remain relevant. However,
some matching problems are impossible, say xt ∼ N (5, 1) and xc ∼ N (0, 1), and in impossible
problems neither greedy nor optimal matching is useful. Similarly, some matching problems are easy,
say pair matching with xt ∼ N (0, 1) and xc ∼ N (0, 1) and C/T = 5; then, greedy and optimal
matching both work well. Optimal matching outperforms when matching is neither impossible nor
easy, when there is serious competition among treated individuals for the same control, as in Figure
4.2 and Table 4.2. Even in the easy case just described, with C/T = 5, we might prefer to match
two or three controls to each treated individual, so an easy pair matching problem becomes a more
challenging 3-to-1 matching problem, and optimal matching would outperform greedy matching,
perhaps substantially. For instance, a poor 3-to-1 greedy match may lead an investigator to settle
for a 2-to-1 match, when a good 3-to-1 optimal match is possible. Perhaps equally important, an
investigator using greedy with a good 2-to-1 greedy match and a poor 3-to-1 greedy match does
not know whether a good 3-to-1 match exists, precisely because no criterion of success has been
optimized.
TABLE 4.3
Reuse of the same control in 300 matched pairs by nearest neighbor matching with replacement. The
300 controls are represented by only 184 distinct controls.
Unlike randomized treatment assignment, matching with replacement focuses on the closeness of
matched individuals, ignoring the comparability of treated and matched control groups as a whole.
Like randomized experiments, modern matching methods are more concerned with the comparability
of treated and matched control groups as whole groups, not with close pairing for many covariates;
see §4.5.2, §4.6.2 and §4.6.3. Close pairing for a few important predictors of the outcome does affect
design sensitivity; however, close pairing for all covariates is not required to pair closely for a few
important predictors of the outcome while also balancing many more covariates [10, 11].
A practical alternative to using fewer than 300 controls in matching with replacement is to use
more than 300 controls in full matching. Full matching without replacement can use every
available control, and even when it is allowed to discard some distant controls, it can often use many
more controls than pair matching. In full matching, each matched set contains either one treated
individual and one or more controls, or one control and one or more treated individuals. The sets do
not overlap, unlike matching with replacement. Full matching can produce smaller distances and use
more controls than pair matching, but weights are needed. The increased sample size may partially
offset the need for weights. For discussion of full matching, see [16–25].
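A minimal sketch, assuming the optmatch package and an illustrative data frame dat with treatment indicator z and covariate x:

library(optmatch)
d  <- match_on(z ~ x, data = dat)   # treated-by-control covariate distances
fm <- fullmatch(d, data = dat)      # optimal full matching
table(table(fm))                    # distribution of matched-set sizes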
current price plus $50. That is, τℓ might have been willing to pay pj + ε/2 for γj, but does not
ultimately take γj from τt, because τℓ is not willing to pay pj + ε.
Bertsekas [28, Proposition 2.6, p. 47] shows that the auction terminates in an assignment that
minimizes the total within-pair distance with an error of at most Tε. If the δtc are nonnegative
integers and ε < 1/T, then the auction finds an optimal assignment.
The description of the auction algorithm given here is a conceptual sketch. There are many
technical details omitted from this sketch that are important to the performance of the auction
algorithm. As discussed in §4.4, a good implementation is available inside R.
The minimum cost flow problem is to choose a flow f = (fe, e ∈ E) to minimize the total cost

$$\mathrm{cost}(f) = \sum_{e \in E} \mathrm{cost}(e)\, f_e \qquad (4.3)$$

subject to constraints

$$\sum_{e \text{ leaving } n} f_e - \sum_{e \text{ entering } n} f_e = s_n \quad \text{for each node } n, \qquad (4.4)$$

$$0 \leq f_e \leq \mathrm{cap}(e) \quad \text{for each edge } e \in E. \qquad (4.5)$$
A flow f = (fe , e ∈ E) is feasible if it satisfies the constraints (4.4) and (4.5). A feasible flow
may or may not exist. A feasible flow is optimal if it minimizes the cost (4.3) among all feasible
flows. There may be more than one optimal feasible flow.
FIGURE 4.3
Network representation of the pair-match or assignment problem. Every edge has capacity 1. The
three τt each supply one unit of flow, the sink ω absorbs three units of flow, and the γc pass along all
flow they receive. The node supplies and edge capacities mean that a feasible flow pairs each treated
τt to a different control γc . A minimum cost flow discards one control and pairs the rest to minimize
the total of the three within-pair distances.
A moment’s thought shows that every feasible flow has precisely T edges e = (τt, γc) with
fe = 1, and these comprise the T matched pairs, and it has TC − T = T(C − 1) edges e = (τt, γc)
with fe = 0. Moreover, every feasible flow has matched T distinct controls, specifically the controls
γc with fe = 1 for e = (γc, ω), and it has discarded C − T controls with fe = 0 for e = (γc, ω). So
each feasible flow corresponds with one of the C!/(C − T)! possible pair matchings in §4.2.
If we set cost (e) = δtc for e = (τt , γc ) and cost (e) = 0 for e = (γc , ω), then the cost of a
feasible flow is the total within pair distance, as in §4.2. A minimum cost feasible flow is the solution
to the assignment problem, that is, a minimum distance pair matching.
If C ≥ 2T , to match two controls to each treated individual, as in §4.2.5, then retain the
same network but change the supplies to sτt = 2 for t = 1, . . . , T , sγc = 0, for c = 1, . . . , C,
sω = −2T . Unlike the assignment representation of the 2-to-1 matching problem in §4.2.5, the
network representation of 2-to-1 matching does not involve duplication of distance matrices or treated
individuals.
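The network in Figure 4.3 can also be solved as a 0–1 linear program, a convenient way to experiment on small examples; the following R sketch uses the lpSolve package and an illustrative 3 × 4 distance matrix, although a dedicated network solver is preferable at realistic sizes:

library(lpSolve)
nT <- 3; nC <- 4
delta <- matrix(c(1, 4, 9, 2,
                  6, 1, 3, 8,
                  7, 5, 1, 2), nrow = nT, byrow = TRUE)

# Variables: f[t,c] for edges (tau_t, gamma_c) in row-major order, then
# f[c] for edges (gamma_c, omega); only the pairing edges carry cost.
obj <- c(as.vector(t(delta)), rep(0, nC))

A <- matrix(0, nT + nC, nT * nC + nC)
for (t in 1:nT) A[t, ((t - 1) * nC + 1):(t * nC)] <- 1  # supply at tau_t is 1
for (c in 1:nC) {
  A[nT + c, seq(c, nT * nC, by = nC)] <- 1              # flow into gamma_c
  A[nT + c, nT * nC + c] <- -1                          # minus flow out to omega
}
rhs <- c(rep(1, nT), rep(0, nC))                        # conservation at gamma_c

sol <- lp("min", obj, A, rep("=", nT + nC), rhs, all.bin = TRUE)
matrix(sol$solution[1:(nT * nC)], nrow = nT, byrow = TRUE)  # 1 marks a matched pair
sol$objval                                                  # total within-pair distance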
FIGURE 4.4
Fine balance for gender is achieved by splitting the sink into a female sink, ωF and a male sink ωM ,
and controls γc are connected to the sink that represents their own gender. If two treated individuals
were female and one was male, the supply of ωF would be −2 and the supply of ωM would be −1,
forcing the selection of two female controls and one male control, without forcing women to be
paired with women, or men to be paired with men.
that the deviation from fine balance, though not zero, is minimized; see [33]. Near-fine balance
is implemented by adding one more sink to Figure 4.4. Because there is a deficit of women, the
female sink has supply equal to the negative of the number of women in the control group, while the
male sink has supply equal to the number of men in the treated group. An additional sink absorbs
precisely the flow that cannot pass through the category sinks: it has supply equal to the negative of
the total deficit in the category sinks.
In forming 38,841 matched pairs of two children, one receiving surgery at a children’s hospital,
the other at a conventional hospital, Ruoqi Yu and colleagues [34] constrained the match to achieve
near-fine balance for 973 Principal Diagnoses, while matching exactly for 463 Principal Surgical
Procedures. That is, children were paired for surgical procedure, but diagnoses were balanced
without necessarily being paired. Subject to those constraints, a covariate distance was minimized.
When fine balance is infeasible, it may not be enough to minimize the extent of the imbalance.
Often, nominal categories have a structure, so that some categories are similar to one another. In this
case, if fine balance is infeasible, we prefer to counterbalance with a control in a near-by category.
Refined balance replaces the two sinks in Figure 4.4 by a tree-structure of sinks, whose categories
are the leaves of the tree, preferring to counterbalance with categories that are close together on the
same branch; see Pimentel and colleagues [35] who balance 2.8 million categories.
Fine balance and its variants apply in a similar way when matching in a fixed ratio, say 1-to-2.
Pimentel, Yoon and Keele [36] extend the concept of fine balance to matching with a variable number
of controls.
Bo Zhang and colleagues [37] extend the concept of fine balance. In fine balance, a match must
meet two criteria, one defined in terms of balancing a nominal covariate, the other in terms of close
pairing of treated and control individuals. The balance criterion refers to the control group as a whole,
not to who is paired with whom. In parallel, Zhang and colleagues [37] have two criteria, but the
balance criteria need not be confined to a nominal covariate. In their approach, the balance criteria
do not affect who is paired with whom, but they do affect who is selected for the control group as
a whole. For instance, the balance criteria might include a narrow caliper on the propensity score,
ensuring that the entire distribution of propensity scores is similar in treated and matched control
groups, but individual pairs might have only a loose caliper on the propensity score, so that pairs are
also close in terms of a few covariates highly predictive of the outcome.
A problem with removing edges – even very bad edges – from a network is that, if too many
edges are removed, or the wrong edges are removed, then there may be no feasible flow, hence no
possible match. Let ν : T × C → R assign a real number, ν (e), to each e = (τt , γc ). A threshold
algorithm starts with a candidate threshold, say κ, and removes all edges with ν (e) > κ, and
finds a minimum cost flow in the resulting network. If the problem is infeasible, it increases κ and
tries again. If the problem is feasible, the algorithm reduces κ and tries again. A binary search
for the smallest feasible κ quickly produces a short interval of values of κ such that feasibility is
assured for the upper endpoint of the interval and is lacking at the lower endpoint. If the number of
edges is O(C log C) for each κ, then a dozen iterations of the threshold algorithm may take much
less time and space than a single optimization of a dense network with O(C²) edges; see §4.4.5.
Garfinkel [40] used a threshold algorithm for the so-called bottleneck assignment problem.
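The binary search itself is simple; a sketch, assuming an illustrative function feasible(kappa) that deletes all edges with ν(e) > κ and reports whether a feasible flow still exists:

threshold_search <- function(feasible, lo, hi, tol = 1e-3) {
  stopifnot(feasible(hi))            # the match must exist at the upper endpoint
  while (hi - lo > tol) {
    mid <- (lo + hi) / 2
    if (feasible(mid)) hi <- mid else lo <- mid
  }
  hi                                 # approximately the smallest feasible kappa
}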
A simple version adds a constraint to the match in Figure 4.3. The constraint finds the minimum
total cost pair-match among all pair-matches that minimize the maximum value ν (e) for all e =
(τt , γc ) who are matched; see [41].
The threshold algorithm is particularly fast when the requirement that ν (e) ≤ κ creates a doubly
convex bipartite graph from T × C. For example, this occurs when ν (e) is the absolute difference in
propensity scores. It also occurs if ν (e) = ∞ for e = (τt , γc ) when τt and γc belong to different
categories of a nominal covariate, but if they are in the same category then ν (e) is the absolute
difference in propensity scores. In these and similar cases, Glover’s [42] algorithm may be applied
in each iteration of the binary search for the smallest feasible κ; see also [43]. Ruoqi Yu and
colleagues [34] use this method to determine the smallest feasible caliper for the propensity score in
large matching problems that also employ exact match requirements, a covariate distance and fine
balance. An extension of her method can discard edges using complex distances [44].
useful guide to the difficulty of a general class of problems, but sometimes the worst-case situation is
quite peculiar and unlikely to correspond with the situation that you happen to have.
For a statistician with a matching problem, ask: How important is it to have a worst-case bound
on computation time of the form O(C^λ) as C → ∞? Arguably, the statistician is concerned with
a particular matching problem with a particular size C, and if the computations are feasible for
that problem, then that is all that the statistician requires. If the computations are not feasible for
that problem, then perhaps the original matching problem can be replaced by a different matching
problem that is computationally feasible. The worst that can happen if you attempt to build a match
without a worst-case time bound is that the algorithm will run for longer than you can tolerate; you
will then cancel the algorithm’s execution and reformulate the problem as one that is computationally
easier. Indeed, the same thing can happen when there is a worst-case time bound like O(C^λ) if C is
large and λ is not small. In practice, worst-case time bounds tend to label a computation problem as
tractable or not; however, they do not identify the best algorithm for solution of that problem.
In practice, in designing an observational study, it is common to build several matched
samples – always without access to outcomes – and then select the one matched sample that
best balances various objectives. Because the investigator does not have access to outcomes – only to
treatments and observed covariates – this process of considering alternative designs is analogous to considering several designs for an experiment before selecting one for actual use: it cannot bias the study, because outcomes play no role in the choice among designs. In this context it is not a major inconvenience if there are one or two failed attempts to build certain matched samples because the matching algorithm did not complete its task in a tolerable amount of time. Commonly, with T + C equal to a few thousand, an optimal match is completed in the time it takes to step away from your computer and refill your coffee mug. With T + C equal to tens or hundreds of thousands, greater attention to
computational efficiency is required [34, 37, 44, 46].
There are several methods in common use for matching problems that are not minimum cost
flow problems. First, certain matching problems are very similar to minimum cost flow problems,
but they have some extra linear constraints. This class of problems has been extensively studied;
see Ahuja, Magnanti and Orlin [47, Chapter 16] and Bertsekas [48, Chapter 10]. Second, there are some theorems for very specific situations showing that solving a linear program, rather than an integer program, and randomly rounding the solution to obtain integers, can be only slightly worse than finding the optimal integer solution [49, 50]. Finally, there are several commercial solvers, including CPLEX and Gurobi, that use a variety of techniques to solve mixed integer programs. These commercial solvers are often free to academics and highly effective, the principal downside being that they are black boxes so far as the user is concerned – you cannot look at the code or see how
the solution was produced. The R packages designmatch and mipmatch use these commercial
solvers to produce matched samples, and they have been widely used in practice [6, 9, 51–58].
For example, cutting age into four categories and income into four categories yields a 4 × 4 table of frequencies for the joint distribution of cut-age and cut-income in the treated group; then, linear constraints on the flows f_e can force the same counts in the 4 × 4 table for the matched control group. Indeed, any multiway contingency table describing the treated group can be balanced in the matched control group. Obviously, if too many side constraints are imposed upon the flows, then there may be no flow f_e that satisfies them all; that is, (4.3)–(4.5) plus the side constraints may define an infeasible optimization problem. If you request the impossible, you will have to settle for less than you requested.
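As a small illustration of where such target counts come from, the following R sketch builds the 4 × 4 treated-group table; age_t and income_t are hypothetical vectors of treated-group covariates.

```r
# Sketch: the 4 x 4 target table for cut-age by cut-income, computed from
# the treated group alone; linear side constraints on the flows would then
# force the matched control group to reproduce these 16 counts.
cut_age <- cut(age_t, breaks = quantile(age_t, 0:4 / 4), include.lowest = TRUE)
cut_inc <- cut(income_t, breaks = quantile(income_t, 0:4 / 4), include.lowest = TRUE)
target  <- table(cut_age, cut_inc)   # counts the control group must match
```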
Notice that the constraints just described are constraints on the marginal distributions
of the covariates in treated and control groups as two whole groups, not constraints on who is paired
with whom. In this sense these linear side constraints resemble fine balance constraints in §4.5.2.
The important difference is that a fine balance constraint in §4.5.2 can be implemented as part of a minimum cost flow problem (4.3)–(4.5) for a network without side constraints – see Figure 4.4 – so the O(C^3) time bound applies to matching with fine balance, but that time bound does not apply to minimum cost flow problems with general linear side constraints.
Suppose that there are K binary covariates. Within the framework of a minimum cost flow optimization (4.3)–(4.5), one could finely balance the 2^K categories formed from these K covariates. If K = 30, then 2^K is more than a billion, so exact fine balance with a billion categories may be infeasible even with millions of people. Some of the variants of fine balance in §4.5.2, such as refined balance, might be practical in this case. However, an attractive alternative is to avoid balancing the enormous 2^K table for K = 30, and instead to balance many of its marginal tables. For example, linear side constraints on the flows f_e can balance each of the K covariates one at a
time. Or, one could balance the K(K − 1)/2 tables of size 2 × 2 recording the joint distribution of
pairs of covariates [59].
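A sketch of the counts such marginal constraints would target, for K binary covariates stored as the columns of a hypothetical treated-group matrix X_t:

```r
# Sketch: one-way and two-way target counts for K binary covariates.
# X_t is a hypothetical n1 x K matrix of 0/1 treated-group covariates.
# With the one-way margins fixed, the (1,1) count determines each 2 x 2 table.
balance_targets <- function(X_t) {
  K <- ncol(X_t)
  one_way <- colSums(X_t)                     # K marginal counts
  jk <- combn(K, 2)                           # the K(K-1)/2 covariate pairs
  two_way <- apply(jk, 2, function(p) sum(X_t[, p[1]] * X_t[, p[2]]))
  list(one_way = one_way, two_way = two_way)
}
```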
In a pair match, the mean age in the matched control group is T^{-1} times the sum of f_e times the age of control γ_c over edges e = (τ_t, γ_c). So, the mean age in the control group is another linear function of the flows f_e. It is usually unwise to constrain the means of a continuous covariate to be exactly equal in treated and control groups – in all likelihood that is infeasible. However, an inequality constraint is possible. For instance, one can insist that the mean age in the control group be at most one year older and at most one year younger than the mean age in the treated group – that becomes two linear inequality constraints. By constraining the means of squares of centered ages, one can
constrain the variances of age to be close in treated and matched control groups. In parallel, one can
constrain correlation or skewness to be similar in treated and matched control groups.
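In symbols, writing ā_T for the mean age in the treated group and age(γ_c) for the age of control γ_c (notation introduced here for illustration), the one-year requirement on the control mean in a pair match becomes the two linear inequalities

T (ā_T − 1) ≤ Σ_e f_e age(γ_c) ≤ T (ā_T + 1),

with the sum running over edges e = (τ_t, γ_c); dividing by T shows that these bound the control mean age within one year of ā_T.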
A direct and effective way to implement all of these ideas uses a commercial solver such
as CPLEX or Gurobi, as proposed by Zubizarreta [60] and as implemented in the R pack-
age designmatch. Although no worst-case time bound is available, experience suggests that
designmatch using Gurobi solves certain constrained matching problems about as fast as mini-
mum cost flow algorithms solve unconstrained problems.
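A hedged sketch of such a request through designmatch's bmatch function; the argument names follow our reading of the package documentation, the data objects are hypothetical, and details should be checked against the installed version.

```r
# Hedged sketch: minimum distance matching subject to mean balance on age
# within +/- 1 year. t_ind is a 0/1 treatment indicator (treated rows
# first), dist_mat a treated-by-control distance matrix, age a covariate.
library(designmatch)
out <- bmatch(t_ind = t_ind,
              dist_mat = dist_mat,
              mom = list(covs = cbind(age), tols = c(1)),   # |mean diff| <= 1
              solver = list(name = "gurobi", approximate = 0))
# out$t_id and out$c_id index the matched treated and control units
```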
A common tactic for solving minimum cost flow problems with linear side constraints uses
Lagrangian relaxation; see Fisher [61], Ahuja, Magnanti and Orlin [47, Chapter 16] and Bertsekas [48,
Chapter 10]. In this approach cost(e) in (4.3) is altered to penalize violations of a linear constraint, and the altered minimum cost flow problem (4.3)–(4.5) is solved. At successive iterations, the penalties
are adjusted to gradually enforce the constraint. For instance, if the treated group has a higher median age than the control group, then a penalty is added to cost(e) for e = (τ_t, γ_c) whenever control γ_c has an age below the treated group median age. By solving (4.3)–(4.5) while adjusting the size of the
penalty, one can often force the medians to be the same. This tactic has long been used informally in
matching under the name “directional penalty” [38]. A Lagrangian penalty is directional in that it
pushes in a direction; here, it pushes to select older controls. If the directional penalty is too large, the
control group may become too old, but if the directional penalty is not large enough, it may not equate
the median ages. The penalty must be adjusted to produce the desired effect. It is quite possible that,
inside their black boxes, commercial solvers, such as CPLEX or Gurobi, have automated the use of
Lagrangians and other standard techniques of integer programming.
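The adjustment loop itself is simple to sketch in R; solve_match is a hypothetical minimum cost flow routine that returns, for each treated unit, the index of its matched control.

```r
# Sketch of a directional penalty pushing the match toward older controls.
# `cost` is a treated-by-control distance matrix, age_c the control ages,
# med_t the treated-group median age, and `solve_match` a hypothetical
# solver returning one matched control index per treated unit.
directional_match <- function(cost, age_c, med_t, solve_match,
                              step = 1, max_iter = 25) {
  lambda <- 0
  repeat {
    penalty <- matrix(lambda * (age_c < med_t),    # penalize "too young" controls
                      nrow(cost), ncol(cost), byrow = TRUE)
    sol <- solve_match(cost + penalty)
    if (median(age_c[sol]) >= med_t || lambda >= step * max_iter) return(sol)
    lambda <- lambda + step                        # push harder next iteration
  }
}
```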
4.7 Discussion
As mentioned in §4.1.2, optimization is a framework within which statistical matching problems may
be framed, solved and the solutions compared. In practice, a matched study design has several or
many objectives, so we do not automatically pick the best design for a single objective. Nonetheless,
possessing the optimal designs formulated in terms of several different objectives is almost always
helpful in selecting one design for implementation. For example, the mean within-pair covariate
distance will be smaller with 1-to-1 matching than with 1-to-2 matching, but under simple models
the standard error of the mean difference in outcomes will be larger [24], and the design sensitivity
will be worse [67]. Having the best 1-to-1 match, the best 1-to-2 match, and the best match with variable numbers of controls in hand is helpful in making an informed choice among them.
References
[1] Paul R Rosenbaum. Design of Observational Studies. Springer, New York, 2nd edition, 2020.
[2] Paul R Rosenbaum. Modern algorithms for matching in observational studies. Annual Review
of Statistics and Its Application, 7:143–176, 2020.
[3] Paul R Rosenbaum and Donald B Rubin. Constructing a control group using multivariate
matched sampling methods that incorporate the propensity score. The American Statistician,
39(1):33–38, 1985.
[4] Mike Baiocchi, Dylan S Small, Scott Lorch, and Paul R Rosenbaum. Building a stronger
instrument in an observational study of perinatal care for premature infants. Journal of the
American Statistical Association, 105(492):1285–1296, 2010.
[5] Ashkan Ertefaie, Dylan S Small, and Paul R Rosenbaum. Quantitative evaluation of the trade-off
of strengthened instruments and sample size in observational studies. Journal of the American
Statistical Association, 113(523):1122–1134, 2018.
[6] Mark D Neuman, Paul R Rosenbaum, Justin M Ludwig, Jose R Zubizarreta, and Jeffrey H
Silber. Anesthesia technique, mortality, and length of stay after hip fracture surgery. Journal of
the American Medical Association, 311(24):2508–2517, 2014.
[7] José R Zubizarreta, Dylan S Small, Neera K Goyal, Scott Lorch, and Paul R Rosenbaum.
Stronger instruments via integer programming in an observational study of late preterm birth
outcomes. The Annals of Applied Statistics, 7:25–50, 2013.
[8] Paul R Rosenbaum and Jeffrey H Silber. Using the exterior match to compare two entwined
matched control groups. American Statistician, 67(2):67–75, 2013.
[9] Jeffrey H Silber, Paul R Rosenbaum, Amy S Clark, Bruce J Giantonio, Richard N Ross,
Yun Teng, Min Wang, Bijan A Niknam, Justin M Ludwig, Wei Wang, and Keven R Fox.
Characteristics associated with differences in survival among black and white women with
breast cancer. Journal of the American Medical Association, 310(4):389–397, 2013.
[10] Paul R Rosenbaum. Heterogeneity and causality: Unit heterogeneity and design sensitivity in
observational studies. The American Statistician, 59(2):147–152, 2005.
[11] José R Zubizarreta, Ricardo D Paredes, Paul R Rosenbaum, et al. Matching for balance, pairing
for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit
high schools in Chile. Annals of Applied Statistics, 8(1):204–231, 2014.
[12] Donald B Rubin. The design versus the analysis of observational studies for causal effects:
Parallels with the design of randomized trials. Statistics in Medicine, 26(1):20–36, 2007.
[13] Abraham Wald. Note on the consistency of the maximum likelihood estimate. Annals of
Mathematical Statistics, 20(4):595–601, 1949.
[14] Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics,
35(1):73–101, 1964.
[15] Eugene L Lawler. Combinatorial Optimization: Networks and Matroids. Dover, Mineola, NY,
2001.
[16] Peter C Austin and Elizabeth A Stuart. Optimal full matching for survival outcomes: A method
that merits more widespread use. Statistics in Medicine, 34(30):3949–3967, 2015.
[17] Peter C Austin and Elizabeth A Stuart. The performance of inverse probability of treatment
weighting and full matching on the propensity score in the presence of model misspecification
when estimating the effect of treatment on survival outcomes. Statistical Methods in Medical
Research, 26(4):1654–1670, 2017.
[18] Ben B Hansen. Full matching in an observational study of coaching for the SAT. Journal of the
American Statistical Association, 99(467):609–618, 2004.
[19] Ben B Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via
network flows. Journal of Computational and Graphical Statistics, 15(3):609–627, 2006.
[20] Paul R Rosenbaum. A characterization of optimal designs for observational studies. Journal of
the Royal Statistical Society: Series B (Methodological), 53(3):597–610, 1991.
[21] Elizabeth A Stuart and Kerry M Green. Using full matching to estimate causal effects in
nonexperimental studies: Examining the relationship between adolescent marijuana use and
adult outcomes. Developmental Psychology, 44(2):395, 2008.
[22] Ben B Hansen. Optmatch: Flexible, optimal matching for observational studies. New Functions
for Multivariate Analysis, 7(2):18–24, 2007.
[23] Cinar Kilcioglu, José R Zubizarreta, et al. Maximizing the information content of a balanced
matched sample in a study of the economic performance of green buildings. The Annals of
Applied Statistics, 10(4):1997–2020, 2016.
[24] Kewei Ming and Paul R Rosenbaum. Substantial gains in bias reduction from matching with a
variable number of controls. Biometrics, 56(1):118–124, 2000.
[25] Fredrik Sävje, Michael J Higgins, and Jasjeet S Sekhon. Generalized full matching. Political
Analysis, 29(4):423–447, 2021.
[26] Dimitri P Bertsekas. A new algorithm for the assignment problem. Mathematical Programming,
21(1):152–171, 1981.
[27] Dimitri P Bertsekas and Paul Tseng. The relax codes for linear minimum cost network flow
problems. Annals of Operations Research, 13(1):125–190, 1988.
[28] Dimitri P Bertsekas. Linear Network Optimization. MIT Press, Cambridge, MA, 1991.
[29] Kewei Ming and Paul R Rosenbaum. A note on optimal matching with variable controls using
the assignment algorithm. Journal of Computational and Graphical Statistics, 10(3):455–463,
2001.
[30] Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of Algorithms. Addison-
Wesley, New York, 1996.
[31] Bernhard H Korte and Jens Vygen. Combinatorial Optimization. Springer, New York, 5th edition,
2012.
[32] Paul R Rosenbaum. Optimal matching for observational studies. Journal of the American
Statistical Association, 84(408):1024–1032, 1989.
[33] Dan Yang, Dylan S Small, Jeffrey H Silber, and Paul R Rosenbaum. Optimal matching with
minimal deviation from fine balance in a study of obesity and surgical outcomes. Biometrics,
68(2):628–636, 2012.
[34] Ruoqi Yu, Jeffrey H Silber, and Paul R Rosenbaum. Matching methods for observational
studies derived from large administrative databases (with Discussion). Statistical Science,
35(3):338–355, 2020.
[35] Samuel D Pimentel, Rachel R Kelz, Jeffrey H Silber, and Paul R Rosenbaum. Large, sparse
optimal matching with refined covariate balance in an observational study of the health outcomes
produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527,
2015.
[36] Samuel D Pimentel, Frank Yoon, and Luke Keele. Variable-ratio matching with fine balance in
a study of the peer health exchange. Statistics in Medicine, 34(30):4070–4082, 2015.
[37] Bo Zhang, Dylan S Small, Karen B Lasater, Matt McHugh, Jeffrey H Silber, and P R Rosenbaum.
Matching one sample according to two criteria in observational studies. Journal of the American
Statistical Association, DOI:10.1080/01621459.2021.1981337, 2022.
[38] Ruoqi Yu and Paul R Rosenbaum. Directional penalties for optimal matching in observational
studies. Biometrics, 75(4):1380–1390, 2019.
[39] Paul R Rosenbaum. Optimal matching of an optimally chosen subset in observational studies.
Journal of Computational and Graphical Statistics, 21(1):57–71, 2012.
[40] Robert S Garfinkel. An improved algorithm for the bottleneck assignment problem. Operations
Research, 19(7):1747–1751, 1971.
[41] Paul R Rosenbaum. Imposing minimax and quantile constraints on optimal matching in
observational studies. Journal of Computational and Graphical Statistics, 26(1):66–78, 2017.
[42] Fred Glover. Maximum matching in a convex bipartite graph. Naval Research Logistics
Quarterly, 14(3):313–316, 1967.
[43] George Steiner and Julian Scott Yeomans. A linear time algorithm for maximum matchings in
convex, bipartite graphs. Computers & Mathematics with Applications, 31(12):91–96, 1996.
[44] Ruoqi Yu and Paul R Rosenbaum. Graded matching for large observational studies. Journal of
Computational and Graphical Statistics, 2022.
[45] Alexander Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, New
York, 1998.
[46] Magdalena Bennett, Juan Pablo Vielma, and José R Zubizarreta. Building representative
matched samples with multi-valued treatments in large observational studies. Journal of
Computational and Graphical Statistics, 29(4):744–757, 2020.
[47] Ravindra K Ahuja, Thomas L Magnanti, and James B Orlin. Network Flows. Prentice Hall,
Upper Saddle River, New Jersey, 1993.
[48] Dimitri Bertsekas. Network Optimization: Continuous and Discrete Models, volume 8. Athena
Scientific, Nashua, NH, 1998.
[49] Vijay V Vazirani. Approximation Algorithms. Springer, New York, 2001.
[50] David P Williamson and David B Shmoys. The Design of Approximation Algorithms. Cam-
bridge University Press, New York, 2011.
[51] Wenqi Hu, Carri W Chan, José R Zubizarreta, and Gabriel J Escobar. Incorporating longitudinal
comorbidity and acute physiology data in template matching for assessing hospital quality.
Medical Care, 56(5):448–454, 2018.
[52] Rachel R Kelz, Caroline E Reinke, José R Zubizarreta, Min Wang, Philip Saynisch, Orit
Even-Shoshan, Peter P Reese, Lee A Fleisher, and Jeffrey H Silber. Acute kidney injury, renal
function, and the elderly obese surgical patient: A matched case-control study. Annals of
Surgery, 258(2):359, 2013.
[53] Neel Koyawala, Jeffrey H Silber, Paul R Rosenbaum, Wei Wang, Alexander S Hill, Joseph G
Reiter, Bijan A Niknam, Orit Even-Shoshan, Roy D Bloom, Deirdre Sawinski, and Peter P.
Reese. Comparing outcomes between antibody induction therapies in kidney transplantation.
Journal of the American Society of Nephrology, 28(7):2188–2200, 2017.
[54] Caroline E Reinke, Rachel R Kelz, Jose R Zubizarreta, Lanyu Mi, Philip Saynisch, Fabienne A
Kyle, Orit Even-Shoshan, Lee A Fleisher, and Jeffrey H Silber. Obesity and readmission in
elderly surgical patients. Surgery, 152(3):355–362, 2012.
[55] Jeffrey H Silber, Paul R Rosenbaum, Wei Wang, Justin M Ludwig, Shawna Calhoun, James P
Guevara, Joseph J Zorc, Ashley Zeigler, and Orit Even-Shoshan. Auditing practice style
variation in pediatric inpatient asthma care. JAMA Pediatrics, 170(9):878–886, 2016.
[56] Jeffrey H Silber, Paul R Rosenbaum, Richard N Ross, Joseph G Reiter, Bijan A Niknam,
Alexander S Hill, Diana M Bongiorno, Shivani A Shah, Lauren L Hochman, Orit Even-Shoshan, and Keven R Fox. Disparities in breast cancer survival by socioeconomic status despite Medicare and Medicaid insurance. The Milbank Quarterly, 96(4):706–754, 2018.
[57] Giancarlo Visconti. Economic perceptions and electoral choices: A design-based approach.
Political Science Research and Methods, 7(4):795–813, 2019.
[58] José R Zubizarreta, Magdalena Cerdá, and Paul R Rosenbaum. Effect of the 2010 Chilean earthquake on posttraumatic stress: Reducing sensitivity to unmeasured bias through study design.
Epidemiology (Cambridge, Mass.), 24(1):79, 2013.
[59] Jesse Y Hsu, José R Zubizarreta, Dylan S Small, and Paul R Rosenbaum. Strong control of the
familywise error rate in observational studies that discover effect modification by exploratory
methods. Biometrika, 102(4):767–782, 2015.
[60] José R Zubizarreta. Using mixed integer programming for matching in an observational study
of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–
1371, 2012.
[61] Marshall L Fisher. The Lagrangian relaxation method for solving integer programming prob-
lems. Management Science, 27(1):1–18, 1981.
[62] Eric R Cohn and Jose R Zubizarreta. Profile matching for the generalization and personalization
of causal inferences. Epidemiology, 33(5):678–688, 2022.
[63] Marı́a de los Angeles Resa and José R Zubizarreta. Evaluation of subset matching methods and
forms of covariate balance. Statistics in Medicine, 35(27):4961–4979, 2016.
[64] Bijan A Niknam and Jose R Zubizarreta. Using cardinality matching to design balanced and
representative samples for observational studies. Journal of the American Medical Association,
327(2):173–174, 2022.
[65] José R Zubizarreta and Luke Keele. Optimal multilevel matching in clustered observational
studies: A case study of the effectiveness of private schools under a large-scale voucher system.
Journal of the American Statistical Association, 112(518):547–560, 2017.
[66] Katherine Brumberg, Dylan S Small, and Paul R Rosenbaum. Using randomized rounding of
linear programs to obtain unweighted natural strata that balance many covariates. Journal of
the Royal Statistical Society, Series A, 2022. DOI:10.1111/rssa.12848.
[67] Paul R Rosenbaum. Impact of multiple matched controls on design sensitivity in observational
studies. Biometrics, 69(1):118–127, 2013.
5
Optimal Full Matching
CONTENTS
5.1 Adjusting Outcomes for Confounding with Matching
5.2 Optimal Full Matching
5.3 Algorithms for Optimal Full Matching
5.4 Controlling Local and Global Features of Matches
5.5 Inference for Treatment Effects After Matching
5.6 Simulation Study
5.7 Software
5.8 Related Recent Developments
5.9 Summary
References
In an observational study the distribution of outcomes and covariates for units receiving the treatment
condition may differ greatly from the distribution of outcomes and covariates for units receiving
the control condition. Simple estimates of treatment effects or tests of hypotheses for causal effects
may be confounded if appropriate adjustment is not applied. In this chapter, we discuss optimal
full matching, a particular form of statistical matching to control for background imbalances in
covariates or treatment probabilities. Ideally, the matched sets would be homogeneous on background
variables and the probability of receiving the treatment condition. In practice, such precise matching
is frequently not possible, which suggests matches that minimize observed discrepancies. Optimal
full matches can be found efficiently in moderate-sized problems on modern hardware, and the algorithm permits researcher control over the matched sets, which can be used to restrict which units are matched and to reduce the complexity of large problems. The algorithms for optimal full matches
are readily available in existing software packages and are easily extended to include additional
constraints on acceptable matches. We conclude with a brief discussion of the use of matches in
outcome analysis, a simulation study, and an overview of recent developments in the field.
Additional material on matching is available in the reviews by Sekhon [1], Stuart [2], Gangl [3],
and Rosenbaum [4]. Chapters on optimal matching can be found in the book by Imbens and Rubin [5]
and on optimal full matching particularly in the books by Rosenbaum [6, 7].
DOI: 10.1201/9781003102670-5
5.1 Adjusting Outcomes for Confounding with Matching

Consider a sample of n units, indexed by i, in which each unit receives a binary treatment, W = 1 for treatment and W = 0 for control, and possesses a pair of potential outcomes Y1 and Y0 representing the responses to each of the treatment levels. The
observed outcome is one of the two potential outcomes, depending only on the treatment assignment:
Y = W Y1 + (1 − W )Y0 . Inherent in this notation is the assumption that for each unit in the study,
the potential outcomes depend only on the treatment level received by the unit itself and not any
spillover effect from other units, the Stable Unit Treatment Value Assumption (SUTVA) [8]. In the
following discussion this assumption is not required for the key results, but it does greatly simplify
the exposition. (For discussions of relaxing this assumption, see Rosenbaum [9], Hudgens and Halloran [10], Bowers et al. [11], Aronow and Samii [12], and Athey et al. [13].) One particular implication of SUTVA is that treatment effects are limited to comparisons between unit-level potential outcomes under different treatment levels, such as τ = Y1 − Y0.
In the observational context the potential outcomes and treatment assignment may not be inde-
pendent, and the conditional distributions of Y1 | W = 1 and Y0 | W = 0 may be quite different
from the marginal distributions of Y1 and Y0 , imparting bias to impact estimates and hypothesis tests
that do not employ corrective steps. In addition to the observed outcomes {Y1i | i ≤ n, Wi = 1}
and {Y0i | i ≤ n, Wi = 0}, we also observe covariates X, attributes fixed at the time that units are
assigned to treatment conditions. Classical approaches to causal inference employ a parametric model connecting the covariates to the potential outcomes, Y1, Y0 | X, but the validity of this approach depends on the correctness of the parametric model. More generally, if treatment assignment can be assumed strongly ignorable given covariates X = x for all x, that is, W ⊥⊥ (Y0, Y1) | X = x and Pr(W = 1 | X = x) ∈ (0, 1), then conditioning on X = x is sufficient to establish unbiased estimates of treatment effects and valid hypothesis tests – with or without a model for Y1, Y0 | X [14].
This insight suggests that researchers could stratify units by X to make inferences about treatment
effects for observational data. The practical difficulty is that unless X is rather coarse, there may be
no treated or control units for certain levels of X, making such an endeavor infeasible [15].
One solution to this problem is to forgo perfect stratification for an approximate stratification in
which all units within each stratum do not differ very much on the observed characteristics X (or
some suitable transformation of X) and each stratum includes at least one treated and one control
unit. Define a match M = {S1 , S2 , . . . , Ss } as a partition of nM ≤ n units in the sample into s
disjoint strata that include at least one treated unit (W = 1) and at least one control unit (W = 0),
writing i ∈ Sb if unit i is in stratum b. In some situations, it may be necessary to discard units in
the original sample so that nM < n. Define a distance function d(Xi , Xj ) such that d(Xi , Xj ) is
small for units with similar covariates. Overall, the total distance between treated and control units is
given in Equation (5.1):
d(M) = Σ_{b=1}^{s} Σ_{i,j ∈ Sb} Wi (1 − Wj) d(Xi, Xj).   (5.1)
One commonly used distance function is the Mahalanobis distance between units i and j,

d(Xi, Xj) = (Xi − Xj)^T S(X)^{-1} (Xi − Xj),   (5.2)

where S(X) is the sample variance–covariance matrix [16]. Mahalanobis distance rescales the
observed data such that all variables are uncorrelated and have unit variance, and can be viewed as finding Euclidean distance on the principal component scores. Mahalanobis distance matching
provides for equal percent bias reduction, meaning matched samples have comparable reductions in
the difference of means between treated and control groups across all covariates [17].
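A minimal sketch computing this distance for all treated–control pairs; X and treat are hypothetical names, and taking the square root of the quadratic form would leave the ranking of potential matches unchanged.

```r
# Sketch of the Mahalanobis distance (5.2) for every treated-control pair.
# X is a hypothetical numeric covariate matrix; treat is a logical vector.
S_inv <- solve(cov(X))                 # inverse sample covariance S(X)^{-1}
Xt <- X[treat, , drop = FALSE]
Xc <- X[!treat, , drop = FALSE]
D <- matrix(NA_real_, nrow(Xt), nrow(Xc))
for (i in seq_len(nrow(Xt)))
  for (j in seq_len(nrow(Xc))) {
    d <- Xt[i, ] - Xc[j, ]
    D[i, j] <- drop(t(d) %*% S_inv %*% d)   # quadratic form of Equation (5.2)
  }
```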
As an alternative to matching directly on covariate values, Rosenbaum and Rubin [14] show that conditioning on the one-dimensional propensity score, the conditional probability of treatment assignment e(x) = P(W = 1 | X = x), is equivalent to conditioning on X = x directly. Therefore, stratifying units such that e(Xi) = e(Xj) for all i and j in the same stratum achieves the same removal of confounding as stratification on X itself. A natural distance is the gap between propensity scores on the logit scale:

d(Xi, Xj) = |log(e(Xi)/(1 − e(Xi))) − log(e(Xj)/(1 − e(Xj)))|.   (5.3)
With propensity score matching, units sharing a stratum may have quite different observed charac-
teristics, even if the probability of treatment is comparable. In practice, the true propensity score
will not be known and must be estimated, typically using logistic regression, but errors induced by
this estimation have not been found to be problematic [18, 19]. Of course, it may still be difficult to
find units with exactly the same (estimated) propensity scores. Practical propensity score matching generally allows pairs (i, j) with e(Xi) ≠ e(Xj), with the intention that e(X)-variation will be far greater between than within strata.
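A short sketch of the estimation step and the distance (5.3); dat, with a 0/1 column treat and covariates x1 and x2, is hypothetical.

```r
# Sketch: propensity scores by logistic regression, then the logit
# distance of Equation (5.3) between every treated and control unit.
ps_fit <- glm(treat ~ x1 + x2, data = dat, family = binomial)
lp <- predict(ps_fit, type = "link")     # log(e(X) / (1 - e(X)))
d53 <- abs(outer(lp[dat$treat == 1], lp[dat$treat == 0], "-"))
# rows index treated units, columns index controls
```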
If matched sets are homogeneous in X or e(X) and strong ignorability holds, treatment effect
estimates will be unconfounded. It may, however, be the case that one or both of these conditions are
not met: sets may have variation in X and e(X) in a way that correlates with treatment status, and
conditioning on X may not be sufficient to make the potential outcomes independent of the treatment
assignment. The degree to which the study design is robust to violations of these assumptions is
termed the design sensitivity [20]. Observational studies supporting inferences that are stable under
modest deviations from homogeneity in treatment probabilities are said to be less sensitive. An
important factor of design sensitivity rests on minimizing variation in the units under study [21], so
even in the presence of unmeasured confounders and difficult to compare subjects, finding ways to
arrange subjects such that Equation 5.1 is minimized remains an important goal.
(There are, in addition, techniques for matching without explicitly computing all distances when d(Xi, Xj) is suitably well behaved.)
TABLE 5.1
A distance matrix for three treated units (A, B, C) and three control units (X, Y, Z).

      X   Y   Z
  A   8   1   3
  B   5   0   1
  C   9   2   9
A greedy algorithm might begin by matching B to Y, the pair at the smallest distance, zero. Doing so, however, forces A and C to be matched either to X and Z or to Z and X, respectively,
with corresponding distances 8 and 9 or 3 and 9, respectively. The resulting total distance (5.1) is no
less than 12. However, foregoing the choice to begin “greedily” – joining the most favorable pair
without regard to its effect on later pairings – allows for matching C to Y, A to Z and B to X, for a
total distance of 10. While an improvement over the greedy match, this optimal pair match is less
desirable than the match that places A, C, and Y in one group and B, X, and Z in another. The total
distance of this match is only 9. With two treated matched to one control in one set and two controls
matched to one treated unit in the second set, the best solution in Table 5.1 represents an optimal full
match.
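The comparison is easy to verify by brute force; the optimal pair match below uses the assignment solver in the clue package, an assumed but widely available dependency. The full match {{A, C, Y}, {B, X, Z}}, with total distance 9, is not an assignment and so is out of reach of both methods.

```r
# Verify Table 5.1: greedy pairing versus the optimal pair match.
D <- rbind(A = c(X = 8, Y = 1, Z = 3),
           B = c(X = 5, Y = 0, Z = 1),
           C = c(X = 9, Y = 2, Z = 9))
greedy_total <- function(D) {            # repeatedly take the smallest distance
  tot <- 0
  while (nrow(D) > 0) {
    ij <- which(D == min(D), arr.ind = TRUE)[1, ]
    tot <- tot + D[ij[1], ij[2]]
    D <- D[-ij[1], -ij[2], drop = FALSE]
  }
  tot
}
greedy_total(D)                          # 0 + 3 + 9 = 12
sol <- as.integer(clue::solve_LSAP(D))   # optimal assignment of pairs
sum(D[cbind(1:3, sol)])                  # 3 + 5 + 2 = 10
```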
The earliest matching literature employed greedy or nearest neighbor matching as a necessity [e.g., 16]. Rosenbaum [24] introduced a tractable algorithm for finding guaranteed optimal solutions based on a minimum cost network flow (MCNF) that provided either a pair match, necessarily discarding some units, or a match with one treated unit per set and multiple controls. Hansen and Klopfer [25] updated the algorithm to allow variable matching ratios for both treated and control units – that is, strata composed of one treated unit matched to several control units or one control unit matched to several treated units – and restrictions on the magnitudes of these ratios. This
has been found to decrease the bias of the match [26]. Kallus [27] considers matching as a method
of creating linear estimators for treatment effects and finds that optimal matching, as compared to
greedy approaches, is optimal in the sense of minimizing mean squared error under a particular
model.
Optimal full matching has been shown to work quite well in practice. Simulation studies on survival, binary, and continuous outcomes find that optimal full matching performs better than competing techniques with respect to mean squared error [28–30]. Chen and Kaplan [31] report similar findings in the context of a Bayesian framework. Augurzky and Kluve [32] compare optimal full matching to greedy full matching and find that optimal matching is generally preferred. Hill et al. [33] investigate several matching techniques and find that full matching performs particularly well in achieving balance in observed covariates.
FIGURE 5.1
Schematic of the full matching network: a source node with supply n1 u0, treated nodes T1, . . . , Tn1, control nodes C1, . . . , Cn0, a sink node absorbing m units, and an overflow node absorbing n1 u0 − m units. Edges are labeled by [cost, capacity]: source-to-treated edges [0, u0], treated-to-control edges [d(xi, xj), 1], control-to-sink edges [0, 1], treated-to-overflow edges [0, u0 − 1], and control-to-overflow edges [0, u1 − 1]. Edges to/from boxes represent one edge to or from each node in the box.
A minimum cost flow is a feasible flow where the total cost to move all units of supply is minimized,
though this solution need not be unique. Throughout, we consider only integer capacities and costs.
The problem of finding a minimum distance full match can be shown to be equivalent to the
problem of finding a minimal cost flow in a network [24, 25]. Nodes in the network can be classified either as representing the units in the study (treated and control nodes) or as nodes required to make the solution a valid matching problem (bookkeeping nodes). The edges can be similarly classified as
having direct reference to the distance matrix of the matching problem and edges required to achieve
a feasible flow. Each edge has a capacity, the maximum amount of flow allowed to pass over the
edge, and a cost per unit of flow carried.
Figure 5.1 provides an overview of the network. Broadly, we see treatment nodes, control nodes,
and three bookkeeping nodes. The method of Rosenbaum [24] only included the source and sink
nodes and was restricted to solutions of pair matches or single treated units matched to fixed numbers
of control units. Hansen and Klopfer [25] introduced the overflow node, which permits varying
numbers of treated and control units, subject to full match constraints, and discarding a specified
number of treated or control units. In this version of the algorithm, in addition to the distances
d(Xi, Xj), the researcher must supply the number of control units to match, m; the maximum number of control units that can share a single treated unit, u0; and the maximum number of treated units that can share a control unit, u1. There are some natural constraints on these values to maintain
feasibility of the solution, though we omit a detailed discussion.
Flow starts at the source node. As each treated unit can match up to u0 control units, the flow
contains n1 u0 total flow, which gets divided across the n1 treated units equally. All edges connecting
treated units to control units have capacity one and cost equal to the distance d(Xi , Xj ), assumed
positive for each i and j. (This can be ensured by adding a small increment to each distance.) Edges
that move flow from treated units to control units indicate units matched in the optimal solution. Any
flow not used to match treated units to control units goes to the overflow node. By allowing no more
than u0 − 1 flow over the edges to the overflow, all treated units must be matched to at least one
control unit, but no more than u0 control units.
To ensure that m control units are matched, each control unit is connected by a capacity-1 edge
to the sink node, which absorbs m total flow. For control units matched to more than one treated unit, there will be additional flow that cannot be sent to the sink, so it must go to the overflow node. By limiting the amount of such overflow to u1 − 1, control nodes can never be matched to more than u1
treated units.
FIGURE 5.2
Network flow diagram for the toy matching problem described in Table 5.1, with n1 = 3, m = 3, and u0 = u1 = 2, so the source supplies six units of flow and the sink and overflow nodes each absorb three. Edges are labeled with (cost, flow/capacity) for the optimal full match {{A, C, Y}, {B, X, Z}}.
Because all of the edges connecting the bookkeeping nodes to the treated and control nodes have a cost of zero, the only edges with a positive cost are the edges between treated and control nodes.
Therefore, the total cost of the flow is equal to the total cost of pairing each treated and control
unit joined by an edge carrying positive flow. It can be shown that for any minimum cost flow, the arrangement of pairs induced in this way must be a full matching; Hansen and Klopfer [25]
provide details. Pair matches and matches that restrict sets to have no more than a given number of
treated or control units can be enforced using careful selection of m, u1 , and u0 , and solutions will
be optimal within these restrictions.
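A hedged sketch with the optmatch package, which implements this network; dat with columns treat, x1, and x2 is hypothetical, and arguments should be checked against the package documentation.

```r
# Hedged sketch: pair matching and restricted full matching via optmatch.
library(optmatch)
d  <- match_on(treat ~ x1 + x2, data = dat)  # Mahalanobis distance by default
pm <- pairmatch(d, data = dat)               # 1:1 match; surplus controls dropped
fm <- fullmatch(d,
                min.controls = 1/2,          # at most 2 treated per control
                max.controls = 3,            # at most 3 controls per treated
                data = dat)
table(table(fm))                             # distribution of matched-set sizes
```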
To be more concrete, Figure 5.2 shows the network corresponding to the matching problem of
Table 5.1. Each edge shows its cost and the amount of flow, out of the maximum, that flows along the
edge. Overall, six units of flow enter the network and must pass through the nodes for the treated
units A, B, and C, though to ensure all treated units are matched, each treated node receives two units
of flow. The edges between the treated and control nodes have cost equal to the distance between the units and a maximum capacity of one. Since the sink absorbs m = 3 units through capacity-one edges from the controls, precisely three units of flow pass from the control nodes to the sink. Since A and C share control node Y, both
A and C send one unit of flow to the overflow node. Likewise, Y must also send one unit of flow to
the overflow.
The network based algorithms have several practical implications that can be useful for practi-
tioners to understand. First, as flow can only travel in one direction along an edge, there is a subtle
asymmetry between units designated as treatment and units designated as control. Software interfaces
may require stating restrictions with respect to one group, typically the “treatment” group, even if
they apply to both. For methods that allow variable matching ratios of treated to control units [26], the
maximum allowed number of shared units for one group is the minimal number of allowed controls
for the other group. In some settings it may be useful to swap the treated and control labels to make
these calculations easier to parse.
Second, many of the algorithms used to find minimum cost network flows require integer
costs [34]. If distances between nodes are provided in floating point values, they must be discretized,
inducing some error. Specified tolerance levels can be used to control how close the integer approximation is to the floating point distance, but smaller tolerances will negatively impact the run time of the algorithm.
Third, it is also valuable to understand how the algorithmic complexity of these algorithms
scales with larger input sizes. For each new treated unit (control unit) added to a study, there are n0
(n1 ) new edges to connect the new node to the opposite assignment nodes, along with some small
number of additional edges to properly connect to sinks and sources. As the algorithm must consider
these edges, it should be intuitive that the number of steps to find a MCNF grows much faster than linearly with the number of units in the study. In fact, the earliest solutions for MCNFs have an algorithmic complexity of O(|V|^3), and newer approaches frequently do no better than O(|V||E|). In the settings discussed so far, |E| ≈ |V|^2, so again the cubic rate of growth is typical. Researchers
are cautioned that the difference between a modest sized problem, for which an optimal match
can be found in a reasonable amount of time, and a large problem, which is quite difficult to solve,
can be smaller than one might initially anticipate. We discuss some techniques for reducing the size
of the matching problem in the next section. Additional discussion of the particular algorithms for
solving MCNFs and the associated algorithmic complexity can be found in the books by Ahuja et
al. [35], Bertsekas [34], Korte and Vygen [36], and Williamson [37], as well as the recent survey
paper by Sifaleras [38]. For empirical comparisons on a variety of networks, see Kovács [39].
5.4 Controlling Local and Global Features of Matches

A caliper forbids matches between units that are too far apart on some criterion. A caliper may be implemented using the distance function and a fixed threshold for all units, d(Xi, Xj) < c, or using other measures and unit-specific thresholds. Calipers can often improve the quality of the match while
simultaneously improving computational efficiency. In the example of Table 5.1, neither the greedy
nor the optimal solution considered matching A and X. Little would be lost if distances as large as 8
were eliminated from consideration. Some attention must be given to ensuring that selected caliper values do not eliminate some units from consideration entirely (or, if they do, that discarding those units is acceptable to the researcher). For some well-behaved distance functions, finding the minimal feasible caliper
can be quite efficient [23, 40–42]. Calipers need not restrict matches based only on d(Xi , Xj ). A
common technique blends Mahalanobis distance and propensity score matching by matching units
primarily on the propensity score, but forbidding matches for which Equation (5.2) exceeds a given
value [43]. For propensity scores, a caliper width of 1/5 of a standard deviation on the scale of the linear predictor has been found to work well in simulation studies [44].
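A hedged optmatch sketch of the blended approach, with illustrative caliper widths and hypothetical data names:

```r
# Hedged sketch: match on the propensity score, with a 0.2 SD propensity
# caliper (linear-predictor scale) and a Mahalanobis caliper of 3.
library(optmatch)
ps   <- glm(treat ~ x1 + x2, data = dat, family = binomial)
d_ps <- match_on(ps, data = dat)         # standardized linear-predictor distance
d_mh <- match_on(treat ~ x1 + x2, data = dat)
d    <- d_ps + caliper(d_ps, 0.2) + caliper(d_mh, 3)
fm   <- fullmatch(d, data = dat)
```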
The network approach also makes additional global constraints relatively straightforward. Examples of extending the network structure include constraints on the order statistics of the within-matched-set distances [45] and requiring certain covariates to have precisely the same marginal
distributions in the treated and control groups [46–48]. For clustered studies, the multilevel structure
can be accommodated by matching both clusters and units within clusters [49].
To illustrate the use of both local and global restrictions, we consider an example originally
presented in Fredrickson et al. [42]. This analysis recreated a propensity score analysis of vascular
closure devices (VCD) on patients undergoing percutaneous coronary intervention compared with
patients with whom VCD was not used [50]. The data set included approximately 31,000 subjects in
the VCD (treatment) group and 54,000 potential controls. Even on modern computing hardware, this
problem was sufficiently large as to be difficult to solve in a reasonable amount of time.
The original analysis of Gurm et al. [50] included three propensity score models and 192 exact
matching categories. Even with this structure to remove potential matches across exact match
categories, the problem remained difficult due to its size. Additionally, treated and control units were
quite different on estimated propensity scores, with a particular outlying treated observation having
no control units within 17 standard deviations on the logit scale. Any caliper that applied to all treated
units would either eliminate the outlying treatment member or fail to reduce the size of the matching problem. Using the flexibility of the full matching approach, we did not place a caliper on the 52 (out of 31,000) most extreme treated units. Using an algorithm to select the largest feasible caliper for the remaining units (details are given in Fredrickson et al. [42]) reduced the number of arcs in the underlying network flow problem by 99%, making the solution relatively quick to compute.
Additionally, our solution enforced global constraints on the maximum set sizes. Initial matches produced a significant number of sets with more than five members of a treatment level. Restricting every match to be a pair match would be infeasible with the selected caliper constraints. We iteratively increased the maximum number of units per set until a feasible full match could be constructed, resulting in at most one treated unit to three controls and vice versa. Compared to the best possible pair
matching, the full matched sets had an average within set distance that was smaller by a factor of ten,
illustrating the gap in optimality between pair matching and full matching.
5.5 Inference for Treatment Effects After Matching

Let FM denote the stratification of units given by the matched sets. Conditioning on FM leads to precisely the same structure as a blocked RCT, opening the possibility of drawing on the strength of randomization-based analysis, either estimating average treatment effects [5, 51, 52] or testing sharp null hypotheses using permutation techniques [53–56], permuting treatment assignments uniformly within strata.
For estimating the average treatment effect ∆ = n^{-1} Σ_{i=1}^{n} (Y1i − Y0i), if all strata contain units with either identical covariates or identical propensity scores, a stratified estimator is unbiased for the average treatment effect:

∆̂M | FM = Σ_{b=1}^{s} (nb / n) Σ_{i=1}^{nb} ( Wbi Yi / nb1 − (1 − Wbi) Yi / nb0 ).   (5.4)
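A direct transcription of (5.4) into R; fm is a factor of matched-set labels, with treat and y hypothetical 0/1 treatment and outcome vectors.

```r
# Sketch of the stratified estimator (5.4).
est_strat <- function(y, treat, fm) {
  sets <- split(seq_along(y), fm)          # indices of units in each stratum
  nb   <- vapply(sets, length, numeric(1))
  dif  <- vapply(sets, function(idx)
            mean(y[idx][treat[idx] == 1]) - mean(y[idx][treat[idx] == 0]),
            numeric(1))
  sum(nb * dif) / sum(nb)                  # strata weighted by n_b / n
}
```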
Alternatively, researchers can use a linear regression of the outcome on the treatment assignment and matched set fixed effects. While the coefficient for the treatment assignment could be understood as an estimated treatment effect, in general it does not precisely coincide with ∆̂M; it combines treatment effect estimates for strata b using weights proportional to nb1 nb0 / nb, not nb [57]. Some authors suggest including interaction terms between the treatment indicator and the stratum indicators [58, 59].
Covariate adjustment can also be implemented by interacting covariates with the treatment assignment.
As units within matched sets typically are more similar on covariates, matching plays a role in increasing the
precision of estimates of ∆, though simulation studies find that additional covariance adjustment
after matching can be quite beneficial [60]. Inference can proceed using sandwich estimators for the
variance, clustered at the level of strata [61].
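A sketch of the regression route; the cluster-robust variance uses sandwich::vcovCL, an assumed dependency, and all object names are hypothetical.

```r
# Sketch: regression with matched-set fixed effects, variance clustered
# on matched sets.
md  <- data.frame(y = y, treat = treat, set = factor(fm))
fit <- lm(y ~ treat + set, data = md)
coef(fit)["treat"]                           # fixed-effects treatment estimate
vc  <- sandwich::vcovCL(fit, cluster = ~ set)
sqrt(vc["treat", "treat"])                   # cluster-robust standard error
```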
In matching scenarios that discard units, the population for which the average treatment effect is defined changes from all units to only those units retained in the match. In many
cases the researcher will only be discarding units from one of the two treatment levels, as when a
very large pool of controls exists to compare to a smaller treatment group. Recognizing that the entire
treatment group will be retained, it may be sensible to select the conditional ATE for the treatment
group only, the average treatment effect on the treated (ATT),

∆_{W=1} = E(Y1 − Y0 | W = 1).   (5.5)

In this situation, weights w_{bi} = nb1 / (nb1 Wbi + nb0 (1 − Wbi)) estimate ∆_{W=1} in a weighted regression of the outcome on the treatment assignment vector. (This is “weighting by the odds” [62], because control group members receive case weights of nb1/nb0, the odds that a random selection from the matched set belongs to the treatment group.)
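A sketch of these ATT weights, computed with base R's ave(); object names are hypothetical.

```r
# Sketch: "weighting by the odds" for the ATT. Treated units get weight 1;
# controls in set b get weight n_b1 / n_b0.
nb1 <- ave(treat, fm, FUN = sum)             # treated count in each unit's set
nb0 <- ave(1 - treat, fm, FUN = sum)         # control count in each unit's set
w_att <- ifelse(treat == 1, 1, nb1 / nb0)
fit_att <- lm(y ~ treat, weights = w_att)    # weighted regression for the ATT
```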
While the estimator of Equation (5.4) and related variations for the ATT are unbiased when (Y1, Y0) ⊥⊥ W | FM, the variance of the estimator can be quite large, particularly when the matched set sizes vary greatly [63]. Returning to the linear regression estimator, its inherent weighting, while not necessarily unbiased for average treatment effects, greatly improves the variance of the estimator and frequently contributes minimal bias, as demonstrated in the simulation study below.
Both estimators’ variance is improved by full matching with structural restrictions, that is, upper and lower limits on matched sets’ ratios of treatment group members to controls. Using the linear
regression estimator after full matching on the propensity score, Hansen [64] explored various
restrictions on this ratio, finding that allowing ratios between half and twice the ratio that obtained
prior to matching preserved essential benefits achieved by unrestricted full matching, while furnishing
standard errors some 17% smaller. Analyzing a distinct example, Stuart and Green [65] also report
good performance from full matching with treatment to control ratios restricted to fall between half
and twice the ratios that obtained prior to matching. (In either instance, full matching was combined with exact matching, with ratios of treatment to control units varying across levels of the exact matching variables; by “ratios that obtained prior to exact matching” we mean the ratios of treatment to control units by exact matching category.) Yoon [66] explored setting upper and lower limits on the number
of matched controls per treatment in terms of the “entire number,” a transformation of the estimated
propensity score; see also Pimentel et al. [67]. Fredrickson et al. [42] suggested structural restrictions
allowing a sufficiently narrow range of matching ratios that it becomes feasible for the statistician
to present a compact table giving the numbers of each matched configuration ℓ:1, . . . , 1:u that were ultimately created, along with the corresponding weights assigned to strata of each of those types by the selected estimator. (For matching configurations (m1, m0) = ℓ:1, . . . , 1:u, the average treatment effect estimator (5.4) has stratum weights proportional to m1 + m0, whereas the ATT estimator’s are proportional to m1 and the linear regression estimator’s are proportional to m1 m0 / (m1 + m0).) Limiting the number of entries in such a table can reduce estimates’ standard
errors even as it enhances their interpretability.
5. Full matching, without the exact matching constraint but with a penalty for matching across the
levels of X. As before, all units will be matched.
6. Pair matching, with a penalty for cross-category matching. In this pair match, typically fewer units will be discarded, as the number of treated and control units will be close in the overall sample.
7. One-to-many matching, with the penalty. As with pair matching, more units will be matched, though some treated units may be discarded if treated units outnumber control units in the entire sample.
After matching, we estimate treatment effects using a linear regression of the outcome on the
treatment assignment, with fixed effects for matched sets. We compare the estimated value to the true
average treatment effect. Table 5.2 provides the key results of the simulations, the bias and mean
squared error (MSE) of each of the methods. Unsurprisingly, without addressing the confounding, the
treatment effect estimates are significantly biased. Additionally, as the variance of potential outcomes
within groups defined by X is smaller than the pooled variance, the difference of means estimate is
both biased and suffers from a higher variance. With exact matching restrictions, units with different values of X are never matched, so all three matches are effectively unbiased when estimating treatment effects and have much lower variance.
When the exact matching restriction is removed and replaced with a penalty for matching across
levels of X, it is no longer the case that matches must be made between units with identical values
of X. For full matching, however, since set sizes can be either many treated to one control or many controls to one treated, the minimum distance match continues to be one in which all sets have the same value of X. For pair matching and one-to-many matching, the requirements on the set sizes force some matches between units with different values of X. Consequently, treatment effect estimates based on pair matching and one-to-many matching become biased, as demonstrated in Table 5.2.
5.7 Software
To solve the MCNF problem, a variety of packages exist. See Kovács [39] for a comparison
of several open source and commercial implementations. The optmatch package [25] in R combines
the RELAX-IV implementation from Bertsekas and Tseng [68] with a set of routines for creating
distance matrices, including methods for Mahalanobis matching, propensity score matching based on
logistic regression, exact matching, calipers, and combining distance matrices from different distance
functions. Several of the extension methods mentioned in this chapter have implementations in R that rely in part on optmatch. The MatchIt package [69] for R provides a simplified interface to optmatch with several useful features, including calculating appropriate weights for estimating the ATE and ATT.
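As a brief illustration, a hedged sketch of requesting a full match through MatchIt's interface; dat with columns treat, x1, and x2 is hypothetical, and arguments should be checked against the installed version.

```r
# Hedged sketch: optimal full matching via MatchIt (which calls optmatch).
library(MatchIt)
m  <- matchit(treat ~ x1 + x2, data = dat,
              method = "full",        # optimal full matching
              distance = "glm")       # propensity score by logistic regression
md <- match.data(m)                   # data plus matched-set ids and weights
summary(m)                            # covariate balance before and after
```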
5.9 Summary
For observational studies of treatment effects, the distributions of the potential outcomes for the
observed treatment and control group, Y1 | W = 1 and Y0 | W = 0, may differ greatly from
the marginal distributions, Y1 and Y0. Provided observable covariates X are sufficient to make W ⊥⊥ (Y1, Y0) | X = x for all x, units could be stratified to achieve valid hypothesis tests for, and unbiased estimates of, treatment effects. In practice, stratification can be difficult to implement,
so close approximations are necessary in the form of matches that ensure at least one treated and
control unit per stratum and strive to minimize differences in either observed characteristics or the
probability of treatment assignment.
Optimal full matching provides solutions that have guaranteed minimum distance strata. For any minimum distance match, a full match exists with the same total distance, so nothing is lost by considering only matches composed of pairs, sets of one treated unit with many controls, or sets of one control unit with many treated units. Optimal full matches can be found using algorithms for minimum cost network flows.
The flexibility of these approaches allows both local control of strata through the use of distance matrices and global control of several properties of the match overall, such as restricting to pair matches or one-to-many matches. Software packages make these algorithms widely available and straightforward to use. After matching, inference for treatment effects can follow techniques for analyzing randomized controlled trials, either estimating treatment effects or testing hypotheses using permutation tests.
References
[1] Jasjeet S. Sekhon. Opiates for the matches: Matching methods for causal inference. Annual
Review of Political Science, 12(1):487–508, 2009.
[2] Elizabeth A. Stuart. Matching Methods for Causal Inference: A Review and a Look Forward.
Statistical Science, 25(1):1–21, 2010.
[3] Markus Gangl. Matching estimators for treatment effects. In Henning Best and Christof Wolf,
editors, The SAGE Handbook of Regression Analysis and Causal Inference. Sage Publications,
London, 2013.
[4] Paul R. Rosenbaum. Modern algorithms for matching in observational studies. Annual Review
of Statistics and Its Application, 7(1):143–176, 2020.
[5] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical
Sciences: An Introduction. Cambridge University Press, New York, NY, 2015.
[6] Paul R. Rosenbaum. Observational Studies. Springer, 2nd edition, 2002.
[7] Paul R. Rosenbaum. Design of Observational Studies. Springer, New York, second edition,
2020.
[8] Donald B. Rubin. Randomization analysis of experimental data: The Fisher randomization test
comment. Journal of the American Statistical Association, 75(371):591–593, 1980.
[9] Paul R. Rosenbaum. Interference between units in randomized experiments. Journal of the
American Statistical Association, 102(477):191–200, 2007.
[10] M. G. Hudgens and M. E. Halloran. Toward causal inference with interference. Journal of the
American Statistical Association, 103(482):832–842, 2008.
[11] Jake Bowers, Mark M. Fredrickson, and Costas Panagopoulos. Reasoning about interference
between units: A general framework. Political Analysis, 21(1):97–124, 2013.
[12] Peter M. Aronow and Cyrus Samii. Estimating average causal effects under general interference,
with application to a social network experiment. Annals of Applied Statistics, 11(4):1912–1947,
12 2017.
[13] Susan Athey, Dean Eckles, and Guido W. Imbens. Exact p-values for network interference.
Journal of the American Statistical Association, 113(521):230–240, 2018.
[14] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observa-
tional studies for causal effects. Biometrika, 70(1):41–55, 1983.
[15] W. G. Cochran. The planning of observational studies of human populations. Journal of the
Royal Statistical Society. Series A (General), 128(2):234–266, 1965.
[16] Donald B. Rubin. Bias reduction using Mahalanobis-metric matching. Biometrics, 36(2):293–
298, 1980.
[17] Donald B. Rubin. Multivariate matching methods that are equal percent bias reducing, II: Maximums on bias reduction for fixed sample sizes. Biometrics, 32(1):121–132, 1976.
[18] Donald B. Rubin and Neal Thomas. Characterizing the effect of matching using linear propen-
sity score methods with normal distributions. Biometrika, 79(4):797–809, 1992.
[19] Donald B. Rubin and Neal Thomas. Matching using estimated propensity scores: Relating
theory to practice. Biometrics, 52(1):249–264, 1996.
[20] Paul R. Rosenbaum. Design sensitivity in observational studies. Biometrika, 91(1):153–164,
2004.
[21] Paul R. Rosenbaum. Heterogeneity and causality. The American Statistician, 59(2):147–152,
2005.
[22] Paul R. Rosenbaum. A characterization of optimal designs for observational studies. Journal
of the Royal Statistical Society. Series B (Methodological), 53(3):597–610, 1991.
[23] Ruoqi Yu, Jeffrey H. Silber, and Paul R. Rosenbaum. Matching methods for observational
studies derived from large administrative databases. Statistical Science, 35(3):338–355, 2020.
[24] Paul R. Rosenbaum. Optimal matching for observational studies. Journal of the American
Statistical Association, 84(408):1024–1032, 1989.
[25] Ben B. Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via
network flows. Journal of Computational and Graphical Statistics, 15(3), 2006.
[26] Kewei Ming and Paul R. Rosenbaum. Substantial gains in bias reduction from matching with a
variable number of controls. Biometrics, 56(1):118–124, 2000.
[27] Nathan Kallus. Generalized optimal matching methods for causal inference. Journal of Machine
Learning Research, 21(62):1–54, 2020.
[28] Peter C. Austin and Elizabeth A. Stuart. Optimal full matching for survival outcomes: A method
that merits more widespread use. Statistics in Medicine, 34(30):3949–3967, 2015.
[29] Peter C. Austin and Elizabeth A. Stuart. Estimating the effect of treatment on binary out-
comes using full matching on the propensity score. Statistical Methods in Medical Research,
26(6):2505–2525, 2017.
[30] Peter C. Austin and Elizabeth A. Stuart. The effect of a constraint on the maximum number of
controls matched to each treated subject on the performance of full matching on the propensity
score when estimating risk differences. Statistics in Medicine, 40(1):101–118, 2021.
[31] Jianshen Chen and David Kaplan. Covariate balance in Bayesian propensity score approaches
for observational studies. Journal of Research on Educational Effectiveness, 8(2):280–302,
2015.
[32] Boris Augurzky and Jochen Kluve. Assessing the performance of matching algorithms when
selection into treatment is strong. Journal of Applied Econometrics, 22(3):533–557, 2007.
[33] Jennifer L. Hill, Jane Waldfogel, Jeanne Brooks-Gunn, and Wen-Jui Han. Maternal employ-
ment and child development: A fresh look using newer methods. Developmental Psychology,
41(6):833, 2005.
[34] Dimitri Bertsekas. Network Optimization: Continuous and Discrete Models. Athena Scientific,
Belmont, MA, 1998.
[35] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory,
Algorithms, and Applications. Prentice Hall, Englewood Cliffs, New Jersey, 1993.
[36] Bernhard Korte and Jens Vygen. Combinatorial Optimization: Theory and Algorithms. Springer,
Berlin, 2012.
[37] David P. Williamson. Network Flow Algorithms. Cambridge University Press, New York, NY,
2019.
[38] Angelo Sifaleras. Minimum cost network flows: Problems, algorithms, and software. Yugoslav
Journal of Operations Research, 23(1), 2016.
[39] Péter Kovács. Minimum-cost flow algorithms: An experimental evaluation. Optimization
Methods and Software, 30(1):94–127, 2015.
[40] Pavel S. Ruzankin. A fast algorithm for maximal propensity score matching. Methodology and
Computing in Applied Probability, May 2019.
[41] Fredrik Sävje. Comment: Matching methods for observational studies derived from large
administrative databases. Statistical Science, 35(3):356–360, 2020.
[42] Mark M. Fredrickson, Josh Errickson, and Ben B. Hansen. Comment: Matching methods
for observational studies derived from large administrative databases. Statistical Science,
35(3):361–366, 2020.
[43] Paul R. Rosenbaum and Donald B. Rubin. Constructing a control group using multivariate
matched sampling methods that incorporate the propensity score. The American Statistician,
39(1):33–38, 1985.
[44] Peter C. Austin. Optimal caliper widths for propensity-score matching when estimating
differences in means and differences in proportions in observational studies. Pharmaceutical
Statistics, 10(2):150–161, 2011.
[45] Paul R. Rosenbaum. Imposing minimax and quantile constraints on optimal matching in
observational studies. Journal of Computational and Graphical Statistics, 26(1):66–78, 2017.
[46] Paul R. Rosenbaum, Richard N. Ross, and Jeffrey H. Silber. Minimum distance matched sampling
with fine balance in an observational study of treatment for ovarian cancer. Journal of the
American Statistical Association, 102(477):75–83, 2007.
[47] Dan Yang, Dylan S. Small, Jeffrey H. Silber, and Paul R. Rosenbaum. Optimal matching with
minimal deviation from fine balance in a study of obesity and surgical outcomes. Biometrics,
68(2):628–636, 2012.
[48] Samuel D. Pimentel, Rachel R. Kelz, Jeffrey H. Silber, and Paul R. Rosenbaum. Large, sparse
optimal matching with refined covariate balance in an observational study of the health outcomes
produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527,
2015. PMID: 26273117.
[49] Samuel D. Pimentel, Lindsay C. Page, Matthew Lenard, and Luke Keele. Optimal multilevel
matching using network flows: An application to a summer reading intervention. Annals
of Applied Statistics, 12(3):1479–1505, 2018.
[50] Hitinder S. Gurm, Carrie Hosman, David Share, Mauro Moscucci, and Ben B. Hansen. Compar-
ative safety of vascular closure devices and manual closure among patients having percutaneous
coronary intervention. Annals of Internal Medicine, 159(10):660–666, 2013.
[51] Jerzy Splawa-Neyman. On the application of probability theory to agricultural experiments.
Essay on principles. Section 9. Statistical Science, 5(4):465–480, 1990. (Originally published in
Roczniki Nauk Rolniczych (Annals of Agricultural Sciences), Tom X (1923), 1–51; translated
from the Polish by D. M. Dabrowska and T. P. Speed.)
[52] Guido W. Imbens. Nonparametric estimation of average treatment effects under exogeneity: A
review. Review of Economics and Statistics, 86(1):4–29, Feb 2004.
[53] Ronald A. Fisher. The Design of Experiments. Oliver and Boyd, Edinburgh, 1935.
[54] J. S. Maritz. Distribution-Free Statistical Methods. Chapman and Hall, London, 1981.
[55] Paul R. Rosenbaum. Conditional permutation tests and the propensity score in observational
studies. Journal of the American Statistical Association, 79(387):565–574, 1984.
[56] Phillip I. Good. Permutation, Parametric and Bootstrap Tests of Hypotheses. Springer, New
York, third edition, 2005.
[57] G. Kalton. Standardization: A technique to control for extraneous variables. Journal of the
Royal Statistical Society. Series C (Applied Statistics), 17(2):118–136, 1968.
[58] Winston Lin. Agnostic notes on regression adjustments to experimental data: Reexamining
Freedman's critique. The Annals of Applied Statistics, 7(1):295–318, 2013.
[59] Peter Z. Schochet. Is regression adjustment supported by the neyman model for causal infer-
ence? Journal of Statistical Planning and Inference, 140(1):246–259, 2010.
[60] K. Ellicott Colson, Kara E. Rudolph, Scott C. Zimmerman, Dana E. Goin, Elizabeth A. Stuart,
Mark van der Laan, and Jennifer Ahern. Optimizing matching and analysis combinations for
estimating causal effects. Scientific Reports, 6(1):23222, 2016.
[61] Cyrus Samii and Peter M. Aronow. On equivalencies between design-based and regression-
based variance estimators for randomized experiments. Statistics and Probability Letters,
82(2):365–370, 2012.
[62] Valerie S. Harder, Elizabeth A. Stuart, and James C. Anthony. Propensity score techniques
and the assessment of measured covariate balance to test causal associations in psychological
research. Psychological Methods, 15(3):234, 2010.
[63] Ben B. Hansen. Propensity score matching to extract latent experiments from nonexperimental
data: A case study. In Neil Dorans and Sandip Sinharay, editors, Looking Back: Proceedings of
a Conference in Honor of Paul W. Holland, chapter 9, pages 149–181. Springer, 2011.
[64] Ben B. Hansen. Full matching in an observational study of coaching for the SAT. Journal of the
American Statistical Association, 99(467):609–618, 2004.
[65] Elizabeth A. Stuart and Kerry M. Green. Using full matching to estimate causal effects in
nonexperimental studies: Examining the relationship between adolescent marijuana use and
adult outcomes. Developmental Psychology, 44(2):395, 2008.
[66] Frank B. Yoon. New methods for the design and analysis of observational studies. PhD thesis,
University of Pennsylvania, 2008.
[67] Samuel D. Pimentel, Frank Yoon, and Luke Keele. Variable-ratio matching with fine balance in
a study of the peer health exchange. Statistics in Medicine, 34(30):4070–4082, 2015.
[68] Dimitri P. Bertsekas and Paul Tseng. RELAX-IV: A faster version of the RELAX code for
solving minimum cost flow problems. Technical Report LIDS-P-2276, Massachusetts Institute of
Technology, November 1994.
[69] Daniel E. Ho, Kosuke Imai, Gary King, and Elizabeth A. Stuart. MatchIt: Nonparametric
preprocessing for parametric causal inference. Journal of Statistical Software, 42(8):1–28,
2011.
[70] Justin Colannino, Mirela Damian, Ferran Hurtado, John Iacono, Henk Meijer, Suneeta Ra-
maswami, and Godfried Toussaint. An O(n log n)-time algorithm for the restriction scaffold
assignment problem. Journal of Computational Biology, 13(4):979–989, 2006.
[71] Justin Colannino, Mirela Damian, Ferran Hurtado, Stefan Langerman, Henk Meijer, Suneeta
Ramaswami, Diane Souvaine, and Godfried Toussaint. Efficient many-to-many point matching
in one dimension. Graphs and Combinatorics, 23(1):169–178, 2007.
[72] Fatemeh Rajabi-Alni and Alireza Bagheri. An O(n²) algorithm for the limited-capacity many-
to-many point matching in one dimension. Algorithmica, 76(2):381–400, 2016.
[73] José R. Zubizarreta. Using mixed integer programming for matching in an observational study
of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–
1371, 2012.
[74] José R. Zubizarreta and Luke Keele. Optimal multilevel matching in clustered observational
studies: A case study of the effectiveness of private schools under a large-scale voucher system.
Journal of the American Statistical Association, 112(518):547–560, 2017.
[75] Magdalena Bennett, Juan Pablo Vielma, and José R. Zubizarreta. Building representative
matched samples with multi-valued treatments in large observational studies. Journal of
Computational and Graphical Statistics, 0(0):1–29, 2020.
[76] Stefano M. Iacus, Gary King, and Giuseppe Porro. Causal inference without balance checking:
Coarsened exact matching. Political Analysis, 20(1):1–24, 2012.
[77] Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to
produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
[78] Wendy K. Tam Cho, Jason J. Sauppe, Alexander G. Nikolaev, Sheldon H. Jacobson, and Ed-
ward C. Sewell. An optimization approach for making causal inferences. Statistica Neerlandica,
67(2):211–226, 2013.
[79] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.
[80] Alexis Diamond and Jasjeet S. Sekhon. Genetic matching for estimating causal effects: A
general multivariate matching method for achieving balance in observational studies. The
Review of Economics and Statistics, 95(3):932–945, 2013.
[81] Fredrik Sävje, Michael J. Higgins, and Jasjeet S. Sekhon. Generalized full matching. Political
Analysis, 29(4):423–447, 2020.
[82] Awa Dieng, Yameng Liu, Sudeepa Roy, Cynthia Rudin, and Alexander Volfovsky. Interpretable
almost-exact matching for causal inference. In Kamalika Chaudhuri and Masashi Sugiyama,
editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence
and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2445–2453.
PMLR, 16–18 Apr 2019.
[83] Bo Lu, Robert Greevy, Xinyi Xu, and Cole Beck. Optimal nonbipartite matching and its
statistical applications. The American Statistician, 65(1):21–30, 2011. PMID: 23175567.
[84] Giovanni Nattino, Chi Song, and Bo Lu. Polymatching algorithm in observational studies with
multiple treatment groups. Computational Statistics & Data Analysis, 167:107364, 2022.
[85] Alberto Abadie and Guido W. Imbens. Matching on the estimated propensity score. Economet-
rica, 84(2):781–807, 2016.
[86] Alberto Abadie and Jann Spiess. Robust post-matching inference. Journal of the American
Statistical Association, 0(0):1–13, 2021.
[87] Stefano M. Iacus, Gary King, and Giuseppe Porro. A theory of statistical inference for matching
methods in causal research. Political Analysis, 27(1):46–68, 2019.
6
Fine Balance and Its Variations in Modern Optimal Matching
Samuel D. Pimentel
CONTENTS
6.1 The Role of Balance in Observational Studies and Randomized Experiments . . . . . . . 105
6.1.1 Example: comparing patients of internationally trained and U.S.-trained
surgeons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.2 Covariate Balance and Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.3 Fine Balance and Its Variants Defined . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.3.1 Matching as an optimization problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.3.2 Fine balance and near-fine balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3.3 Refined balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.3.4 Strength-k matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.3.5 Controlled deviations from balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.4 Solving Matching Problems under Balance Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4.1 Assignment method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4.2 Network flow method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.4.3 Integer programming method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.4.4 Balance and other aspects of the matching problem . . . . . . . . . . . . . . . . . . . . . . . 122
6.4.5 Computational complexity theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.5 Balancing to an External Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.6 Practical Recommendations for Matching with Balance Constraints . . . . . . . . . . . . . . . . 126
6.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.8 R Code Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
distributions of covariates in the treated group and in the control group will be very similar. This
similarity in distribution is known as covariate balance and is frequently demonstrated via a table
of summary statistics for each covariate in the two groups, along with standardized differences in
means between the groups on individual variables and the results of two-sample hypothesis tests.
Here balance is a confirmatory signal that randomization has been conducted properly and that the
particular allocation of treatments has avoided low-probability large discrepancies between groups.
In observational studies the study designer has no control over who in the original sample receives
treatment and who receives control. As a result, observed groups may differ substantially on variables
besides treatment. In this setting the key problem is to design, by appropriately reweighting or taking
subsets from the raw data, a new set of comparison groups that are comparable on observed variables,
mimicking the type of data that might have arisen had a randomized experiment been performed [1].
In this case covariate balance takes on a much more central role: it is the primary goal of the matching
or weighting procedure used to transform the raw data. Of course, covariate balance on observed
variables in an observational study is not informative about the similarity of groups on unobserved
covariates, and substantial bias may arise from assuming a matched or weighted design is identical
to a randomized study when unobserved covariate differences are present.
The most common form of matching, propensity score matching, targets overall covariate balance
only indirectly. When many pairs of subjects, each with identical propensity scores but opposite
treatment status, are matched to one another, and when no unobserved covariates are present, the
resulting study is equivalent to a paired randomized trial in which treatment is assigned uniformly
at random within pairs [2, §3]. However, paired individuals need not have identical covariates, and
balance is only guaranteed approximately and in large samples. Indeed, completely randomized trials
themselves do not guarantee balance in any particular finite sample, since an unlucky random draw
of treatment assignment can induce large imbalances.
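This can be seen directly by simulation. The short R sketch below (entirely hypothetical data) draws repeated complete randomizations of a single covariate and records the standardized difference in means; even under correct randomization, sizable imbalances arise by chance in modest samples.

set.seed(1)
x <- rnorm(50)                          # one covariate, n = 50 subjects
smd <- replicate(1000, {                # 1000 complete randomizations, 25:25
  z <- sample(rep(c(1, 0), each = 25))
  (mean(x[z == 1]) - mean(x[z == 0])) / sd(x)
})
quantile(abs(smd), c(0.50, 0.95))       # median and occasionally large imbalance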
A balance constraint is a requirement imposed on a matching problem, specifying that the match
produced must guarantee some level of balance between the selected treated and control samples. In
particular, a fine balance constraint requires that the empirical distributions of a nominal covariate be
identical in the matched treated sample and the matched control sample. In contrast to approaches
such as exact matching and propensity score calipers, which place constraints on the similarity of
subjects who are paired to one another, fine balance and other balance constraints describe similarity
between marginal distributions of covariates. This means that fine balance may be achieved on a
variable even when the values of this variable differ between matched individuals in many cases.
to systematic differences in the type of patients that IMGs and USMGs tend to treat. The outcome of
interest is death within 30 days of surgery.
Table 6.1 describes two matches constructed in these data, and their relative success in
balancing the rates of emergency room (ER) admission between IMG and USMG patients. ER
admission is an important predictor of 30-day mortality for general surgery patients, and it is initially
substantially imbalanced between the groups, with 61% of IMG patients admitted through the ER
and only 50% of USMG patients so admitted. In both matches we retain all IMG patients and select
a subset of USMG patients, so the ER admission rate of IMG patients remains the same across
comparisons. In addition, following Pimentel & Kelz [4] we match patients only within hospitals.
Match 1 in Table 6.1 focuses purely on minimizing the robust Mahalanobis distance between
paired units; for more discussion of the use of Mahalanobis distance in matching see Rubin [5] and
Rubin & Thomas [6]. ER admission is one of eight variables included in the Mahalanobis distance
and the resulting match improves balance on ER admission substantially, but the new distribution of
ER admission still differs between groups, with 58% of matched USMG patients admitted through
the ER compared to 61% of IMG patients. In contrast, Match 2 in Table 6.1 minimizes the robust
Mahalanobis distance among all matches achieving fine balance on ER admission. The result is a
match with identical empirical cumulative distribution functions of ER admission in the two groups:
146/241 patients admitted through the ER, 95/241 patients not admitted through the ER.
We note also that the match with fine balance does not match exactly on ER admission. In 17
cases, an IMG patient not admitted through the ER is matched to a USMG patient who was; however,
in exactly 17 other cases, an IMG patient admitted through the ER is matched to a USMG patient
who was not, ensuring that the marginal distributions balance. Matching exactly on ER admission is
sufficient for fine balance, but not necessary. This is important because of situations where exact
matching is not feasible but fine balance is. For instance, if one of the hospitals in our data contained
more IMG ER admissions than USMG ER admissions, it would not be possible to match each IMG
patient to a within-hospital USMG counterpart with the same ER admission status. However, if
total USMG ER admissions in other hospitals still exceeded IMG ER admissions, it might still be
possible to achieve fine balance on ER admissions over the match as a whole. Figure 6.1 provides a
toy illustration of such a setting.
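This distinction is easy to verify directly. The following R sketch uses small hypothetical vectors (not the IMG-USMG data) recording ER admission for five treated units and their matched controls: two pairs disagree on ER status, yet the marginal counts, and hence fine balance, agree exactly.

er_treated <- c(1, 1, 0, 1, 0)   # ER admission for treated units in pairs 1-5
er_control <- c(0, 1, 1, 1, 0)   # ER admission for their matched controls
sum(er_treated != er_control)    # 2: two pairs are not exactly matched on ER
table(er_treated)                # two 0s and three 1s
table(er_control)                # identical marginal counts: fine balance holds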
6.1.2 Outline
The IMG-USMG matches in Table 6.1 illustrate the basic definition of fine balance and hint at its
value in practice. In what follows we describe fine balance and closely related balance constraints and their
implementation and application in greater detail. First, in Section 6.2 we provide more precise
motivation for covariate balance as a method for bias reduction under a statistical model. In Section
6.3 we present a formulation of matching as a discrete optimization problem, and give formal
definitions of several different balance constraints closely related to fine balance; these address
many common situations in real observational studies, including cases in which variables cannot
be balanced exactly, in which many variables must be balanced in a prioritized order, in which
FIGURE 6.1
A hypothetical setting in which exact matching on ER admission is not possible within hospitals,
but fine balance is still achieved. Lines indicate which IMG patients are matched to which USMG
patients. In Hospital 1 exact matching is not possible since there are no USMG patients admitted
through the ER, and the IMG ER patient must be matched to a non-ER USMG patient. However, in
Hospital 2 this imbalance can be offset by matching an IMG non-ER patient to a USMG ER patient
so that the overall counts of ER and non-ER patients in both matched samples are identical.
interactions between variables of a certain order must be balanced, and in which continuous variables
must be balanced. In Section 6.4 we discuss algorithms for solving the matching optimization
problem under the constraints of Section 6.3 and related tradeoffs between flexible constraints and
efficient computation; we also review the compatibility of balance constraints with other alterations
to the original matching problems such as increasing the number of control units matched to treated
units, calipers and exact matching constraints, trimming of treated units, and changing the objective
function. In Section 6.5 we consider how balance constraints can be useful in settings where the
goal is not to achieve balance between the two samples at hand but to bring two or more samples
into balance with an external population or sample. In Section 6.6 we discuss practical aspects
of observational study design with balance constraints, including selection of tuning parameters.
Finally, in Section 6.7 we conclude with discussion of connections between matching under balance
constraints and closely related weighting methods, as well as a survey of outstanding open problems
related to matching under balance constraints.
This heuristic explanation also has a more precise formulation in terms of a statistical model. The
explicit link between balance after matching and covariate bias in a statistical model goes back at
least to Rubin [5, 7], who discussed balance as a path to reduced bias in the context of a simple
“mean-matching” algorithm that uses marginal covariate balance as an objective function rather than
in constraints and that searches for solutions in a greedy manner rather than finding a global optimum.
Here we consider the specific case of balance constraints in greater detail.
Suppose there are I matched pairs, with individual j in matched pair i characterized by treatment
indicator $Z_{ij}$, potential outcomes $Y_{1ij}$ and $Y_{0ij}$, and a p-dimensional vector of observed binary or
continuous covariates $X_{ij} = (X_{1,ij}, \ldots, X_{p,ij})$. In matched studies it is common to estimate the
average effect of the treatment on the treated units using the average matched pair difference. We
denote this estimand by SATT (where the S stands for "sample" and indicates that the average is
taken over the units in the observed sample rather than over an infinite population) and denote the
estimator by $\widehat{\mathrm{SATT}}$. Formally:
$$\mathrm{SATT} = \frac{1}{I}\sum_{i=1}^{I}\sum_{j=1}^{2} Z_{ij}\,(Y_{1ij} - Y_{0ij}),$$
$$\widehat{\mathrm{SATT}} = \frac{1}{I}\sum_{i=1}^{I} (Z_{i1} - Z_{i2})(Y_{i1} - Y_{i2}) = \mathrm{SATT} + \frac{1}{I}\sum_{i=1}^{I} (Z_{i1} - Z_{i2})(Y_{0i1} - Y_{0i2}).$$
If, for instance, the potential outcomes under control follow an additive model $Y_{0ij} = \sum_{k=1}^{p} f_k(X_{k,ij}) + \varepsilon_{ij}$ with mean-zero errors, then the expected difference between $\widehat{\mathrm{SATT}}$ and SATT is
$$\frac{1}{I}\sum_{k=1}^{p}\left\{\sum_{i=1}^{I}(Z_{i1} - Z_{i2})\bigl(f_k(X_{k,i1}) - f_k(X_{k,i2})\bigr)\right\}.$$
The quantity in the braces is a measure of treatment-control covariate balance on some transformation
$f_k(\cdot)$ of the covariate $X_{k,ij}$. When $X_{k,ij}$ is discrete and fine balance has been achieved, this difference
will be zero for any choice of $f_k$, meaning that these observed variables contribute no bias on average
over possible sampled realizations of the potential outcomes. When interactions of discrete variables
are balanced, this same argument applies to a more flexible additive model that allows arbitrary
nonlinear interactions between the discrete variables which are jointly balanced.
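The following R sketch (toy data, finely balanced on a binary covariate) evaluates the quantity in braces for an arbitrary nonlinear transformation; because the two groups have identical category counts, the treated and control sums of f agree for any choice of f.

x_treated <- c(1, 1, 0, 1, 0)            # binary covariate, treated units
x_control <- c(0, 1, 1, 1, 0)            # matched controls, finely balanced
f <- function(x) exp(2 * x) - 1          # an arbitrary transformation f_k
sum(f(x_treated)) - sum(f(x_control))    # exactly 0 under fine balance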
As such we can understand fine balance constraints and their variants as a method of bias
reduction (at least for a particular estimator of a particular causal effect). Imposing these constraints
in a problem such as cardinality matching, where more stringent balance constraints lead to smaller
matched designs, can result in a form of bias-variance tradeoff, since smaller matched designs tend
to produce test statistics with higher sampling variability (although for discussion of other factors
influencing bias and variance of matched estimators see Aikens et al. [8]). Wang & Zubizarreta [9]
studied the asymptotic impact of balance jointly on the bias and variance of the paired difference
in means estimator under nearest-neighbor matching with replacement and found that imposing
balance constraints removes sufficient bias to render the estimator $\sqrt{n}$-consistent for a population
average causal effect, in contrast to previous results showing that the estimator fails to achieve
$\sqrt{n}$-consistency in the absence of balancing constraints. Sauppe et al. [10] and Kallus [11] provide
alternative motivations for balance-constrained matching and closely related methods as minimax
solutions to a bias-reduction problem in which the goal is to control the worst possible bias obtained
under a given family of models for potential outcomes under control.
Resa & Zubizarreta [12] also explored the bias-reduction potential of balancing constraints in
finite samples through an extensive simulation study comparing several of the balance constraints
described in Section 6.3. Their findings show that methods involving fine balance performed espe-
cially well relative to competitor matching methods without balance constraints in cases where the
outcome was a nonlinear function of covariates, achieving root mean squared estimation error 5–10
times smaller.
$$\min_{x} \sum_{(\tau_i, \gamma_j) \in \mathcal{A}} D(\tau_i, \gamma_j)\, x_{ij} \tag{6.1}$$
$$\text{s.t.} \quad \sum_{j=1}^{C} x_{ij} = 1 \;\; \forall i, \qquad \sum_{i=1}^{T} x_{ij} \le 1 \;\; \forall j, \qquad x_{ij} \in \{0, 1\} \;\; \forall (i, j).$$
The first two constraints specify that each treated unit must be matched to exactly one control, and
that each control may be matched to at most one treated unit. The D(τi , γj ) terms in the objective
function refer to some distance or dissimilarity between treated unit τi and control unit γj , typically
calculated based on their respective covariate values and considered fixed; the decision variables xij
are indicators for whether treated unit i is matched to control unit j. In addition, the set A ⊆ T × C
describes which treatment-control pairings are permitted; while sometimes A = T × C, it may
instead be chosen as a strict subset, ruling out some possible pairings, in order to forbid matches
differing by more than a fixed caliper on the propensity score or forbid matches that are not exactly
identical on some important variable (such as hospital in the IMG-USMG match). In what follows
we will denote the case A = T × C as a fully dense matching, referencing the general classes of
dense and sparse matches described in greater detail in Section 6.4.2. In Sections 6.4.1-6.4.3 we will
discuss algorithms for solving this problem and in Section 6.4.4 we will discuss the impact of various
changes to the constraints and the objective function.
Although we are primarily interested in the optimal solution to problem (6.1), any value of the
decision variables $x = (x_{11}, x_{12}, \ldots, x_{1C}, x_{21}, \ldots, x_{TC})$ in the feasible set represents a possible
match, and can equivalently be represented as a subset of treated-control ordered pairs $\mathcal{M} =
\{(\tau_i, \gamma_j) \in \mathcal{T} \times \mathcal{C} : x_{ij} = 1\}$. For convenience, we also define $\mathcal{C}_{\mathcal{M}} = \{\gamma_j \in \mathcal{C} : \sum_{i=1}^{T} x_{ij} = 1\}$ as
the set of controls selected to be in the match, and we define $\mathcal{F}$ as the set of all matches $\mathcal{M}$ feasible
for problem (6.1).
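For concreteness, a feasible match and the associated sets can be represented directly in R; the sketch below uses a toy problem with T = 3 treated units, C = 5 controls, and hypothetical pair assignments.

x <- matrix(0, nrow = 3, ncol = 5)   # decision variables x_ij
x[cbind(1:3, c(2, 4, 5))] <- 1       # a feasible pair match
M <- which(x == 1, arr.ind = TRUE)   # the ordered pairs (tau_i, gamma_j)
C_M <- which(colSums(x) == 1)        # matched control set C_M: controls 2, 4, 5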
The term fine balance was first introduced in Rosenbaum et al. [13] in the context of a patient
outcomes study involving patients from 61 different sites; although patients were not always matched
to counterparts from the same site, each site contributed identical numbers of patients to each matched
sample.
Fine balance on ν is not always feasible, even when a feasible match $\mathcal{M}$ exists for problem (6.1).
For example, suppose that the treated sample contains more subjects in some category $\lambda_k$ of ν than
are present in the entire control sample:
$$|\{\tau \in \mathcal{T} : \nu(\tau) = \lambda_k\}| > |\{\gamma \in \mathcal{C} : \nu(\gamma) = \lambda_k\}|.$$
Since $\mathcal{C}_{\mathcal{M}} \subseteq \mathcal{C}$, there is no choice of $\mathcal{M}$ for which fine balance holds. This phenomenon occurs in
the synthetic patient outcomes data with the procedure type variable. IMG surgeons performed 8
surgeries of procedure type 10 while USMG surgeons performed only 5. No matter how we select
USMG surgeries in our match, we can obtain at most 5 in procedure type 10 which will be less than
the total of 8 for that category in the IMG surgery group.
Motivated by a similar patient outcomes study, Yang et al. [14] defined a more general type of
balance constraint, near-fine balance, that remains feasible even when treated and control counts in
each category cannot be made identical. For any $\mathcal{M} \in \mathcal{F}$ and any $k \in \{1, \ldots, K\}$, let
$$\beta_k(\mathcal{M}) = \bigl|\, |\{\tau_i \in \mathcal{T} : \nu(\tau_i) = \lambda_k\}| - |\{\gamma_j \in \mathcal{C}_{\mathcal{M}} : \nu(\gamma_j) = \lambda_k\}| \,\bigr|.$$
Essentially $\beta_k$ measures the absolute deviation from fine balance in category $\lambda_k$, or more intuitively
the amount of "overflow" by which the count in category $\lambda_k$ for one of the groups exceeds that of the
other. In fact the quantity $\sum_{k=1}^{K} \beta_k/(2T)$ measures the total variation distance between empirical
cumulative distribution functions of ν in the treated and the control group. Near-fine balance specifies
that this quantity must be minimized. In formal terms (writing $\beta_k$ as a function of $\mathcal{M}$ to emphasize
its dependence on the match selected):
$$\sum_{k=1}^{K} \beta_k(\mathcal{M}) = \min_{\mathcal{M}' \in \mathcal{F}} \sum_{k=1}^{K} \beta_k(\mathcal{M}'). \tag{6.3}$$
When problem (6.1) is fully dense, so that all treatment-control pairings are allowed, the right-hand
side of this constraint is equal to twice the sum of the "overflow" by which treated counts exceed
control counts in each category in the raw data (zero when the control count is larger):
$$2\sum_{k=1}^{K} \max\Bigl\{0,\; |\{\tau_i \in \mathcal{T} : \nu(\tau_i) = \lambda_k\}| - |\{\gamma_j \in \mathcal{C} : \nu(\gamma_j) = \lambda_k\}|\Bigr\}. \tag{6.4}$$
In cases where the match is not fully dense, this value provides a lower bound on the sum
$\sum_k \beta_k(\mathcal{M}')$. It can be easily calculated by inspecting tables of the original samples prior to matching and
identifying categories of the balance variable in which the number of treated units exceeds the
number of controls. For the procedure type variable in the surgical example, there are three categories
with more treated units than controls: category 4 (4 IMG patients, 2 USMG patients), category 10 (8
IMG, 5 USMG), and category 12 (7 IMG, 5 USMG). The total overflow is therefore (4 − 2) + (8 −
5) + (7 − 5) = 7. Thus, solving problem (6.1) in the context of the hospital data under a near-fine
balance constraint on procedure type ensures firstly that if possible all the USMG surgeries in each
of procedure types 4, 10, 12 will be selected, minimizing the treated overflow to exactly 7 in these
categories, and ensures secondly that the matched control overflow in all other procedure types is
also limited to 7 if possible, and to its minimum feasible value otherwise. The “if possible” qualifier
is necessary because the hospital match is not fully dense, forbidding matches across hospitals. In
this particular case it is possible to limit the total overflow in both treated and control categories to 7,
as shown in Table 6.2; however, if instead some hospital contained 4 controls with procedure type 10
and only 3 treated units in total, it would not be possible to include all the controls of procedure type
10 in the match and the best achievable overflow would be at least 8.
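This inspection is a one-line calculation. The following R lines compute the total treated overflow and the value of (6.4) from the IMG and USMG counts by procedure type reported in Table 6.2.

img  <- c(1, 2, 0, 4, 1, 1, 0, 4, 4, 8, 4, 7, 5, 12, 4, 7, 9,
          12, 11, 4, 12, 6, 8, 14, 11, 20, 21, 19, 8, 6, 8, 8)
usmg <- c(1, 2, 3, 2, 2, 4, 13, 13, 15, 5, 13, 5, 14, 35, 13, 20, 27,
          34, 21, 31, 28, 53, 17, 39, 29, 39, 35, 34, 58, 37, 40, 25)
sum(pmax(0, img - usmg))       # total treated overflow: 7
2 * sum(pmax(0, img - usmg))   # right-hand side of (6.4): 14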
Near-fine balance as described so far minimizes the total variation distance between the empirical
covariate distributions in the matched samples, essentially maximizing the overlap of the probability
mass between the two distributions. However, it does not specify how the control overflows are to be
allocated. In the context of the surgical example, a total of 7 extra controls must be distributed over
the otherwise-balanced procedure types excluding categories 4, 10, and 12, but the basic near-fine
balance constraint makes no distinction between distributing these controls relatively evenly across
many categories or choosing to select them all from the same category. As the former is likely more
desirable in practice than the latter, Yang et al. [14] define two stronger versions of near-fine balance
that encourage more even distribution of control overflow across categories. One version minimizes
the maximum overflow in any one control category; in the surgical example this corresponds to
allowing exactly one unit of overflow in 7 of the 29 categories initially containing more controls
than treated units. A second version minimizes the chi-squared distance between category counts in
the two groups, essentially allocating overflow across control categories proportionally according to
their relative prevalences. In what follows we use the term “near-fine balance” to refer specifically to
constraint (6.3) as is typical in the literature.
TABLE 6.2
Counts by procedure type of IMG surgeries, all USMG surgeries, and USMG surgeries matched
under the near-fine balance constraint, with the discrepancy (IMG count minus matched USMG
count) in each category.

Procedure Type    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
IMG               1   2   0   4   1   1   0   4   4   8   4   7   5  12   4   7   9
USMG              1   2   3   2   2   4  13  13  15   5  13   5  14  35  13  20  27
Matched USMG      1   2   0   2   1   1   0   4   4   5   4   5   5  12   4   7   9
Discrepancy       0   0   0   2   0   0   0   0   0   3   0   2   0   0   0   0   0

Procedure Type   18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
IMG              12  11   4  12   6   8  14  11  20  21  19   8   6   8   8
USMG             34  21  31  28  53  17  39  29  39  35  34  58  37  40  25
Matched USMG     12  11   4  12   7   8  14  11  20  21  19  12   8   8   8
Discrepancy       0   0   0   0  −1   0   0   0   0   0   0  −4  −2   0   0
One natural solution is to add multiple fine balance constraints to the matching problem. However, feasibility
issues arise, since even if fine balance is possible for each of two variables individually it may not
be simultaneously possible for both in the same match. For similar reasons, satisfying multiple
near-fine balance constraints, at least in the form defined in (6.3), may not be simultaneously
possible. Furthermore, even in settings where satisfying multiple constraints is possible, it may be
computationally difficult for reasons discussed in Section 6.4.5.
Another option is to impose a single fine balance constraint on an interaction of several nominal
covariates, i.e., a new nominal variable in which each unique combination of categories from the
component variables becomes a new category. When fine balance on the interaction is feasible, it
guarantees fine balance on all the individual component variables and the interactions involving any
subset of variables. However, interactions tend to involve very large numbers of categories, many
with small counts, and fine balance is unlikely to be possible in these cases. While near-fine balance
is well-defined for the interaction, near-fine balance on an interaction does not guarantee near-fine
balance on component variables. For example, matching in the IMG-USMG data with near-fine
balance on the interaction of ER admission and procedure type limits the total variation distance
between the two groups to 24/482, i.e., treated counts overflow control counts by a total of 12 units
in some subset of categories while control counts overflow treated units by a total of 12 units in the
complementary subset. However, in the match produced, there remains an overflow of 12 units on
each side for procedure type alone, when by our previous match with near-fine balance we know that
an overflow of at most 7 on each side is achievable for this variable, and there remains an overflow of
6 units on each side for emergency room, even though exact fine balance is possible in this dataset.
A second drawback to all the methods described above is that they necessarily treat all nominal
variables involved as equally important. In practice some variables are often known a priori to be more
important to balance than others, either because they initially exhibit a very high degree of imbalance
or because they are known to be prognostically important. In settings where not all variables can
be balanced exactly, it is desirable to prioritize balance on the important variables over balance on
others.
Refined balance is a modification of fine balance that offers a way to address balance on multiple
covariates with heterogeneous levels of importance in a prioritized way. Assume we have L nominal
covariates (possibly created by interacting other nominal covariates), ν1 , . . . , νL listed in decreasing
order of priority for balance. In addition, suppose that for any ` ∈ {1, . . . , L − 1}, covariate ν`+1 is
nested inside covariate ν` in the sense that any two subjects with the same level for covariate ν`+1
also share the same level for covariate ν` . We say that a match M satisfies refined balance on the
ordered list ν1 , . . . , νL if and only if:
1. $\sum_{k} \beta_k(\mathcal{M}) = \min_{\mathcal{M}' \in \mathcal{F}} \sum_{k} \beta_k(\mathcal{M}')$, with the $\beta_k$ computed on the categories of $\nu_1$; i.e., $\mathcal{M}$ satisfies near-fine balance on $\nu_1$.
2. $\sum_{k} \beta_k(\mathcal{M}) = \min_{\mathcal{M}' \in \mathcal{F}_1} \sum_{k} \beta_k(\mathcal{M}')$, with the $\beta_k$ computed on the categories of $\nu_2$, where $\mathcal{F}_1$ is the subset of matches in $\mathcal{F}$ also satisfying constraint 1.
3. $\sum_{k} \beta_k(\mathcal{M}) = \min_{\mathcal{M}' \in \mathcal{F}_2} \sum_{k} \beta_k(\mathcal{M}')$, with the $\beta_k$ computed on the categories of $\nu_3$, where $\mathcal{F}_2$ is the subset of matches in $\mathcal{F}$ also satisfying constraints 1 and 2.
⋮
L. $\sum_{k} \beta_k(\mathcal{M}) = \min_{\mathcal{M}' \in \mathcal{F}_{L-1}} \sum_{k} \beta_k(\mathcal{M}')$, with the $\beta_k$ computed on the categories of $\nu_L$, where $\mathcal{F}_{L-1}$ is the subset of matches in $\mathcal{F}$ also satisfying constraints 1 through $L - 1$.
Essentially, refined balance enforces the closest remaining balance (in the sense of the total
variation distance) on each variable subject to the closest possible balance on the previous variables
in the priority list. For matches that are fully dense, this is equivalent to achieving near-fine balance
on each level. Pimentel et al. [15] introduced refined balance and applied it in a study where the
finest interaction had over 2.9 million different categories.
Typically, the nested structure of the list ν1 , ν2 , . . . , νL is created by choosing either a single
variable or an interaction of a small number of variables as ν1, then choosing ν2 as an interaction of
ν1 with one or more additional variables, and so on, letting each new covariate be an interaction of
the previous variable with some additional variables. For example, we may run a new match in the
IMG-USMG data with refined balance on the following list of nested covariates in decreasing order
of priority: ER admission, ER admission × procedure, ER admission × procedure × patient sex. In
this match we achieve exact fine balance on ER admission so that all category counts agree exactly
in the two groups, an overflow of 12 units on each side for the second interaction (for an overall total
variation distance of 24/482), and an overflow of 19 units on each side in the third interaction (with
total variation distance 38/482). In this case, the balance on the second two levels can be verified to
be equivalent to near-fine balance by equation (6.4), although this is not guaranteed a priori in our
case because the match is not fully dense.
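Balance along a nested hierarchy can be summarized level by level. The R sketch below (toy matched samples of four pairs, not the IMG-USMG data) computes the total variation distance $\sum_k \beta_k/(2T)$ at each level of a two-level hierarchy.

tv <- function(t_cat, c_cat) {       # total variation distance between samples
  lv <- union(t_cat, c_cat)
  sum(abs(table(factor(t_cat, lv)) - table(factor(c_cat, lv)))) /
    (2 * length(t_cat))
}
er_t <- c(1, 1, 0, 0); proc_t <- c("a", "b", "a", "c")   # treated units
er_c <- c(1, 1, 0, 0); proc_c <- c("a", "a", "b", "c")   # matched controls
tv(er_t, er_c)                                  # level 1 (ER admission): 0
tv(paste(er_t, proc_t), paste(er_c, proc_c))    # level 2 (interaction): 0.5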
pairs used vary across tests of individual variables, but not the overall samples compared). Since
researchers may not want to rule out potential effect modifiers a priori and since interest in explaining
effect heterogeneity is usually limited to fairly simple models without very high-order interactions,
it is more important here to achieve marginal balance on each individual variable and possibly on
low-order interactions than to give attention to high-order interactions as refined balance tends to do.
Strength-k matching, introduced in Hsu et al. [16], is a variant of fine balance designed for this
setting. Assuming L nominal covariates are initially present in the data for balancing, fine balance
constraints are enforced on all interactions of exactly k < L variables. Strength-1 matching corre-
sponds to requiring fine balance only on each variable individually; strength-2 matching corresponds
to requiring fine balance on any interaction between two variables; and so on. While strength-k
matching’s relatively inflexible demands make it infeasible in many settings and appropriate only for
certain kinds of optimization algorithms (see Section 6.4.3), it is very useful for guaranteeing perfect
balance on many variables and interactions at once. In their simulation study Resa and Zubizarreta
[12] found that strength-2 matching performed better in terms of mean-squared estimation error
for causal effects than competitor methods such as distance-based matching and fine balance on
individual variables only in cases where strong interactions were present in the outcome model. It is
notable that the estimation error was smaller despite the fact that strength-2 matching also tended to
result in much smaller sample sizes, and hence higher sampling variability under the outcome model,
due to the stringency of the constraint.
Consider an example from the IMG-USMG data, in which we might wish to balance all of
the following four variables: ER admission, patient sex, patient race, and comorbidity count. All
of these variables are prognostically important and some could explain effect modification (if, for
instance, IMGs and USMGs perform similarly on patients with few comorbidities but differ in their
performance for patients with many). Matching with strength-k balance on these four variables
ensures that all four of them are balanced exactly. In contrast, if we match with near-fine balance
on the four-way interaction of these variables, we do not obtain perfect balance on any of them
individually; refined balance could be used to ensure at least one of the individual variables is
balanced perfectly, but not for all four variables at once since they are not nested.
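A strength-k constraint can also be audited after the fact by enumerating interactions. The R sketch below (hypothetical matched data frames Xt and Xc, one row per matched pair) checks fine balance on every interaction of exactly two nominal covariates; in the toy example each variable is finely balanced on its own, yet the strength-2 check fails.

strength2_ok <- function(Xt, Xc) {
  pairs <- combn(names(Xt), 2, simplify = FALSE)   # all two-way interactions
  all(vapply(pairs, function(p) {
    it <- interaction(Xt[p]); ic <- interaction(Xc[p])
    lv <- union(levels(it), levels(ic))
    all(table(factor(it, lv)) == table(factor(ic, lv)))
  }, logical(1)))
}
Xt <- data.frame(er = c(1, 0), sex = c("f", "m"))   # treated units
Xc <- data.frame(er = c(1, 0), sex = c("m", "f"))   # matched controls
strength2_ok(Xt, Xc)   # FALSE: the er x sex interaction is imbalanced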
$$\left|\, \sum_{\tau \in \mathcal{T}} f_j(X_i(\tau)) - \sum_{\gamma \in \mathcal{C}} f_j(X_i(\gamma)) \,\right| < \delta_{ij}, \qquad i = 1, \ldots, p, \;\; j = 1, \ldots, J_i.$$
We refer to this constraint as controlled deviation from balance. For continuous variables Xi , the
function fj can be chosen as the identity to balance sample means; fj can raise Xi to higher powers
to induce balance on higher moments, or can indicate whether Xi lies above a threshold to balance
distributions in the tails. For nominal covariates with K categories, one fj function can be added for
each category indicating whether the level for a given unit is in this category, so that the deviation
between treated and control counts in each category k is controlled at a corresponding level $\delta_{ik}$.
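As a quick check of such a constraint in R (toy data; the transformations and tolerances δ are chosen arbitrarily for illustration):

set.seed(2)
x_t <- rnorm(100, mean = 0.1)          # continuous covariate, treated units
x_c <- rnorm(100)                      # continuous covariate, matched controls
delta <- c(15, 25)                     # tolerances for f_1 and f_2
dev1 <- abs(sum(x_t) - sum(x_c))       # f_1(x) = x: balance on means
dev2 <- abs(sum(x_t^2) - sum(x_c^2))   # f_2(x) = x^2: balance on second moments
c(dev1, dev2) < delta                  # does the match satisfy the constraint?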
In certain cases controlled deviation from balance, as described above for nominal covariates,
is equivalent to fine or near-fine balance. When all δik values are set to zero for a given nominal
covariate, the K deviation constraints are equivalent to a fine-balance constraint. Furthermore,
controlled deviation from balance is sufficient for near-fine balance (at least in a fully dense match)
when
$$\sum_{k=1}^{K} \delta_{ik} = \min_{\mathcal{M}' \in \mathcal{F}} \sum_{k=1}^{K} \beta_k(\mathcal{M}'),$$
where the $\beta_k(\mathcal{M}')$ terms are defined on variable $i$. Note that this constraint is not quite equivalent to
near-fine balance, however, since while near-fine balance places no constraint on the allocation of
the overflow, controlled deviation dictates exactly where overflows are allowed. Beyond these special
cases controlled deviation from balance need not guarantee fine or near-fine balance.
Controlled deviation from balance offers several advantages relative to fine and near-fine balance
constraints. First and most importantly, it provides natural handling of continuous variables. Second,
in large studies where tiny deviations in balance are unlikely to introduce bias of substantive
importance (e.g. a deviation of 1 subject in category counts in a study with 100,000 matched pairs),
controlled deviation can free the optimization algorithm from overly stringent restrictions, allowing
users to specify a tolerance of similarity. On the other hand, controlled deviation requires users
to specify tolerances δij for each constraint, a practically tedious process, while near-fine balance
and refined balance are designed to adaptively identify the best achievable balance subject to other
constraints on the match. For discussion of the choice of δij parameters in a closely related weighting
problem, see Wang & Zubizarreta [18].
the problem and ensuring they contribute nothing to the objective function, leaving only the optimal
set of T controls to be matched to the treated units so as to minimize the objective function.
The assignment method can be easily adapted to enforce a fine balance or a near-fine balance
constraint. This is done by placing restrictions on which controls can be matched to the false treated
units ρ1 , . . . , ρC−T . In particular, for each level λk of fine balance covariate ν, define treated and
control counts as follows:
$$n_k = |\{\tau \in \mathcal{T} : \nu(\tau) = \lambda_k\}|, \qquad M_k = |\{\gamma \in \mathcal{C} : \nu(\gamma) = \lambda_k\}|.$$
Fine balance is possible exactly when $n_k \le M_k$ for all $k = 1, \ldots, K$; in this setting it is also the case
that
$$\sum_{k=1}^{K} (M_k - n_k) = C - T.$$
To modify the assignment algorithm to enforce the fine balance constraint, instead of adding C − T
interchangeable false treated units ρi with zero distances to all controls, we add Mk − nk false treated
units ρik for each category k = 1, . . . , K, each of which have zero distance to controls γ satisfying
ν(γ) = λk and infinite distance to all other controls (in practice infinite distances can be replaced by
any large distance exceeding the maximum of the other distances in the matrix). This modification
forces exactly Mk − nk controls to be removed from consideration in each category, guaranteeing
fine balance on the match produced at the end.
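A minimal R sketch of this construction on toy data (three treated units, five controls, two categories of ν, so Mk − nk = 1 in each category), using the linear assignment solver solve_LSAP from the clue package; a large finite distance stands in for infinity, and the two added rows play the role of the false treated units ρik.

library(clue)
set.seed(3)
nu_t <- c("a", "a", "b")                    # categories of nu, treated units
nu_c <- c("a", "a", "a", "b", "b")          # categories of nu, controls
D <- matrix(runif(15), nrow = 3, ncol = 5)  # distances D(tau_i, gamma_j)
big <- max(D) + 1                           # stands in for infinite distance
surplus <- c(a = 1, b = 1)                  # one false treated unit per category
F_rows <- t(vapply(names(surplus),
                   function(k) ifelse(nu_c == k, 0, big), numeric(5)))
sol <- solve_LSAP(rbind(D, F_rows))         # optimal 5 x 5 assignment
cbind(treated = 1:3, control = sol[1:3])    # a finely balanced pair match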
The strategy for enforcing near-fine balance is similar. As before, in all categories where Mk ≥
nk , we add Mk − nk false treated units ρik to each category k = 1, . . . , K, each with zero distance
only to control units in the same category. However, if fine balance is not possible then we have
nk > Mk in at least one category, and this also means that the total number of false treated units
added satisfies:
$$\sum_{k:\, M_k \ge n_k} (M_k - n_k) = C - T + \left[\sum_{k:\, n_k > M_k} (n_k - M_k)\right] > C - T.$$
We now have more rows than columns in our matrix D and so we must add some additional columns
to ensure that it is square. In particular, for each category k in which nk > Mk we add nk − Mk
false control units κj , each of which has zero distance to false treated units and infinite distance to
real treated units. Essentially, these false controls absorb excess false treated units to ensure that all
treated units are matched.
The method just described enforces near-fine balance with respect to the total variation distance
alone, but Yang et al. [14] describe further tweaks that can achieve near-fine balance in the sense of
minimizing both total variation distance and maximum overflow in any category or of minimizing
chi-squared distance in category counts. Essentially these tweaks involve changing a certain number
of the zero distances for false controls back to infinities to ensure that the false treated units excluded
by these false controls do not come disproportionately from certain groups rather than others.
Notice that the modifications described to enforce near-fine balance assume implicitly that the
right-hand side of constraint (6.3) is known a priori. As such the assignment algorithm is best used
for near-fine balance in cases where the match is fully dense (as is indeed assumed throughout [14]).
For cases where the match is not fully dense, the network flow algorithm of the following section is
often a more convenient approach.
present, with nodes N and directed edges E. Certain “source” nodes supply a commodity in positive
integral quantities, other “sink” nodes exhibit positive integral demand for this commodity; the
problem is to transport the commodity from sources to sinks over edges in the graph while paying the
minimum transportation cost. In particular each edge has a positive integral upper bound governing
how much commodity can be sent across it, as well as a nonnegative-real-valued cost per unit flow
across the edge. Formally, we write the problem as follows where integer bn represents the supply or
demand (supplies being positive values and demands being negative values) at node n, ue and ce
represent upper capacity and cost respectively for edge e, Ein (n) and Eout (n) represent the subsets of
E directed into node n and out of node n, respectively, and xe are the flow or decision variables.
$$\min \sum_{e \in E} c_e x_e \tag{6.5}$$
$$\text{s.t.} \quad \sum_{e \in E_{\mathrm{out}}(n)} x_e - \sum_{e \in E_{\mathrm{in}}(n)} x_e = b_n \quad \text{for all } n \in N, \qquad 0 \le x_e \le u_e.$$
The first of these constraints, which ensures that the amount of flow coming into a node differs from the
amount flowing out only by the demand or supply, is known as the "preservation of flow" constraint.
One feature of problem (6.5) that is important for matching is that although the $x_e$ decision variables are
real-valued, some optimal solution with integer-valued decision variables must exist, and can be produced via
polynomial-time algorithms. Formally, this derives from the fact that network incidence matrices are
totally unimodular; for more discussion, see Papadimitriou & Stieglitz [21, §13.2]. Algorithms for
solving the minimum-cost network flow problem are reviewed by Bertsekas [22], and the RELAX-IV
network flow solver [23] is accessible in R via the package optmatch [24]. Although optmatch
itself does not implement any balance constraints, related R packages do so within the network flow
framework. Package bigmatch [25] implements near-fine balance on a single nominal variable,
and packages rcbalance and rcbsubset implement refined balance (as well as near-fine and
fine balance on a single nominal variable as special cases). Pimentel [26] provides a user-friendly
introduction to the rcbalance package.
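As a brief illustration of these tools, the sketch below uses optmatch to pair treated and control units within blocks, in the spirit of the within-hospital IMG-USMG match; the data frame is simulated, and while match_on, exactMatch, and pairmatch are functions in optmatch, argument details may vary across package versions.

library(optmatch)
set.seed(4)
d <- data.frame(z = rep(c(1, 0), c(20, 40)),       # 20 treated, 40 controls
                x1 = rnorm(60), x2 = rnorm(60),
                hospital = rep(c("A", "B"), 30))   # two blocks
dists <- match_on(z ~ x1 + x2, data = d,           # Mahalanobis distances,
                  within = exactMatch(z ~ hospital, data = d))  # within blocks
pm <- pairmatch(dists, data = d)                   # optimal pair match
summary(pm)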
To represent matching as a network flow problem (Figure 6.2(a)), first take a directed bipartite
graph with nodes for each treated unit and each control unit, in which each treated unit τi connected
to all control units to which it can be matched, or all γj such that (τi , γj ) ∈ A in the language of
problem (6.1). Each edge is given a cost associated with the covariate distance between the units it
connects and a capacity of one. Next, add an edge with zero cost and capacity one from each control
node to an additional sink node. Finally, assign a supply of one to each treated node and a demand
equal to the number of treated units T to the sink node. Intuitively, asking for the best network flow
solution now asks for the best way to route the commodity produced at each treated node through a
distinct control node to its destination at the sink; the choice of flow across the treated-control edges
is bijective with the choice of an optimal match in problem (6.1).
How can balance constraints be represented in a network flow problem? The method described
in the previous paragraph can be adapted to solve any assignment problem, including the methods
for fine and near-fine balance described in Section 6.3. Here we describe another approach, more
elegant in that it uses a graph with fewer nodes and edges. To implement a fine balance constraint,
take the graph in the previous paragraph, remove the control-sink edges, and add a node for each
category of the fine balance variable. Finally, add edges of cost zero and upper capacity one from
each control node to the fine balance node for the category to which it belongs, and add edges of
cost zero and capacity nk from each category node k to the sink node, where the nk are the counts
for each category k in the treated group (Figure 6.2(b)). These last edges enforce fine balance by
ensuring that exactly the same number of controls as treated units are chosen in each category. To impose
FIGURE 6.2
Network flow algorithms for optimal pair matching within blocks: without balance constraints (a),
with a fine balance constraint (b), and with a near-fine balance constraint (c). Each algorithm is based
on a directed bipartite graph containing nodes for each treated unit and each control unit, with edges
connecting treated units only to control units in the same block; additional sink and category nodes are
added for housekeeping and do not refer to any unit in the original problem. Short boldface arrows
represent a supply when directed into nodes and a demand when directed out; solid edges connecting
nodes have some nonzero cost while dashed edges have zero cost.
near-fine balance instead, the capacity constraints on the category-sink edges can be changed from
nk to max(nk , Mk ) if the match is fully dense. More generally, if the match is not necessarily fully
dense, an additional edge may be added between each category node and the sink node with a very
large capacity but a high cost (Figure 6.2(c)). In cases where restrictions on who can be matched to
whom make the minimal achievable imbalance larger than the nk and Mk values suggest, a solution
is still feasible in this network, since additional controls supplied in some category can always be
routed through the new high-cost edge. However, for sufficiently high cost the number of units
sent across these edges will be minimized. A generalization of this latter strategy is also useful for
implementing refined balance. Here the network described above for near-fine balance is generalized
by adding multiple layers of fine balance category nodes, each associated with a different variable in
the refined balance hierarchy. The most granular layer with the most categories receives flow directly
from the control nodes, then channels into category nodes of the next layer up in the hierarchy and so
on. The costs associated with not meeting the balance constraints exactly increase dramatically with
each additional level, enforcing the relative priority among the balance levels. For more discussion of
this method and of the choice of costs to enforce balance, see Pimentel et al. [15].
In what settings will a match not be fully dense? In the IMG-USMG example, we have already
seen one case, where matches are restricted to appear within certain blocks in the data; for additional
discussion of matching with exact agreement on discrete variables and its statistical benefits see
Iacus et al. [27], Hsu et al. [28], and Pimentel et al. [15]. Another common constraint on pairings
is a propensity score caliper, which requires that paired individuals’ estimated propensity scores differ
by no more than a fixed amount; such calipers are important for ensuring randomization inference
after matching has a clear basis [29]. Calipers are an especially important case for network flow
methods, since rather than partitioning the overall match into fully dense bipartite subgraphs they
induce a complex structure of presence and absence among the edges that can make it very difficult
to calculate the right-hand sides of constraints like (6.3) a priori. One workaround to the problems
of non-fully dense matches is to assign very large edge costs D(τ, γ) to edges associated with
pairings the researcher wishes to forbid; under this strategy the match essentially remains fully dense
and the assignment algorithm can be used in place of network flow to solve for near-fine balance.
However, there are two important drawbacks to this strategy. Firstly, the near-fine balance constraint,
if implemented via the assignment algorithm described above or by setting exact edge capacities, will
take precedence over the constraint forbidding certain edges to be used, so that propensity calipers or
requirements to match exactly will be violated if necessary to achieve balance.
Secondly, the number of edges in a network flow problem has computational implications. The
number of edges is the primary driver of worst-case complexity for state-of-the-art network flow
solution algorithms and large edge counts tend also to be associated with long computation times in
practice even when the worst case is not realized. As discussed in detail by Pimentel et al. [15], there
is a major difference in worst-case complexity between dense problems, in which the number of
edges scales quadratically in the number of nodes, and sparse problems, in which the scaling is linear.
The latter case can be achieved by matching within natural blocks of a bounded size, among other
ways. Sparse matches are a setting where balance constraints add an especially large amount of value;
while the absence of most treatment-control edges greatly limits the options for achieving small
within-pair covariate distances (and, in the case of blocks of bounded size, eliminates the possibility
of within-pair covariate distances shrinking to zero asymptotically as is assumed in studies such as
Abadie & Imbens [30]), the impact of edge deletions on balance tends to be much milder, and high
degrees of covariate balance can often be achieved even in highly sparse matches [15]. Balance
constraints in turn motivate the development of methods that can simultaneously match very large
treatment and control pools; while in the absence of balance constraints exact matching within small
blocks could be implemented by matching within each block separately in parallel, assessing balance
across the many blocks requires doing all the within-block matches together as part of a coordinated
algorithm.
Just as any assignment problem can be represented as a network flow problem via the trans-
formation described several paragraphs ago, any network problem can also be transformed into an
assignment problem [31, §4.1.4], so from a certain abstract perspective these two solution approaches
are equivalent. In practice, however, it is meaningful to distinguish them for computational reasons,
since dedicated network solvers produce solutions faster and in more convenient formats than could
be achieved by first transforming network problems into assignment problems.
Network flow methods cannot generally implement strength-k matching except in very special cases (such as when only k variables are to be balanced
and fine balance is possible on their full interaction).
To solve matching problems under such constraints, integer programming and mixed integer programming methods may instead be adopted [17]. For example, letting $\beta_1, \ldots, \beta_{K_1}$ and $\beta'_1, \ldots, \beta'_{K_2}$ represent the absolute discrepancies in counts after matching for the categories in the two variables to be balanced, we may write the problem as:
$$
\begin{aligned}
\min\ & \sum_{(\tau_i, \gamma_j) \in \mathcal{A}} D(\tau_i, \gamma_j)\, x_{ij} && (6.6)\\
\text{s.t.}\ & \sum_{j=1}^{C} x_{ij} = 1 \quad \forall i\\
& \sum_{i=1}^{T} x_{ij} \leq 1 \quad \forall j\\
& \beta_k = 0 \quad \forall k \in \{1, \ldots, K_1\}\\
& \beta'_k = 0 \quad \forall k \in \{1, \ldots, K_2\}\\
& x_{ij} \in \{0, 1\} \quad \forall (i, j)
\end{aligned}
$$
Alternatively, for sufficiently large costs $\omega_1$ and $\omega_2$ and additional decision variables $z_1, z_2$, we can represent the problem as follows [17]:
$$
\begin{aligned}
\min\ & \sum_{(\tau_i, \gamma_j) \in \mathcal{A}} D(\tau_i, \gamma_j)\, x_{ij} + \omega_1 z_1 + \omega_2 z_2 && (6.7)\\
\text{s.t.}\ & \sum_{j=1}^{C} x_{ij} = 1 \quad \forall i\\
& \sum_{i=1}^{T} x_{ij} \leq 1 \quad \forall j\\
& \sum_{k=1}^{K_1} \beta_k \leq z_1\\
& \sum_{k=1}^{K_2} \beta'_k \leq z_2\\
& z_1, z_2 \geq 0\\
& x_{ij} \in \{0, 1\} \quad \forall (i, j)
\end{aligned}
$$
Problem (6.6) is an integer program (all decision variables are integers) while problem (6.7) is a
mixed integer program (having both real and integer-valued decision variables). The former problem
explicitly represents the balance requirements as constraints, while the latter enforces them indirectly
via a penalty term in the objective function. The differences between these two formulations recall the
differences between the two networks for near-fine balance described earlier. The integer program has
the advantage of simplicity, not requiring the penalties to be chosen, but the mixed integer program
has the advantage of feasibility even in settings where fine balance is not achievable. Either of these
problems may be solved by the application of a standard integer-program solver such as Gurobi,
which implements integer programming techniques such as cutting planes and branch-and-bound
algorithms. Unlike linear programming and network flow methods, these approaches lack strong
guarantees of limited computation time in the worst case; more precisely, they do not generally have
polynomial-time guarantees (see Section 6.4.5 for more discussion). However, in many settings in
practice highly optimized solvers can be competitive with network flow methods, especially if one
is willing to accept an approximation to the correct solution in place of an exactly optimal solution
[32]. The R package designmatch [33] implements matching based on integer programming
methods and supports all the varieties of balance constraint described in Section 6.3.
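To make formulation (6.6) concrete, the following sketch solves a tiny illustrative instance, two treated units, three controls, and one binary fine balance variable, with the Rglpk solver; the data values and helper names are our own. Note how the fine balance constraint forces the solution away from the unconstrained nearest-neighbor match, which would use two category-A controls.

# a minimal worked instance of the integer program (6.6); the instance is
# illustrative and assumes the Rglpk package is installed
library(Rglpk)

n.t <- 2; n.c <- 3
D <- matrix(c(1, 4, 2,
              3, 1, 5), nrow = n.t, byrow = TRUE)   # distances D(tau_i, gamma_j)
treated.cat <- c('A', 'B')                          # categories of treated units
control.cat <- c('A', 'A', 'B')                     # categories of controls
n.var <- n.t * n.c
pos <- function(i, j) (j - 1) * n.t + i             # x[i,j] in the stacked vector

rows <- list(); dirs <- character(0); rhs <- numeric(0)
for (i in 1:n.t) {                # each treated unit is matched exactly once
  r <- rep(0, n.var); r[pos(i, 1:n.c)] <- 1
  rows[[length(rows) + 1]] <- r; dirs <- c(dirs, '=='); rhs <- c(rhs, 1)
}
for (j in 1:n.c) {                # each control is used at most once
  r <- rep(0, n.var); r[pos(1:n.t, j)] <- 1
  rows[[length(rows) + 1]] <- r; dirs <- c(dirs, '<='); rhs <- c(rhs, 1)
}
for (k in unique(treated.cat)) {  # fine balance: beta_k = 0 in each category
  r <- rep(0, n.var)
  for (j in which(control.cat == k)) r[pos(1:n.t, j)] <- 1
  rows[[length(rows) + 1]] <- r; dirs <- c(dirs, '==')
  rhs <- c(rhs, sum(treated.cat == k))
}
sol <- Rglpk_solve_LP(obj = as.vector(D), mat = do.call(rbind, rows),
                      dir = dirs, rhs = rhs, types = rep('B', n.var))
matrix(sol$solution, nrow = n.t)  # optimal x[i,j]; total distance in sol$optimum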
Matching constraints can sometimes make a problem infeasible: a block may contain more IMG patients than USMG patients, so that not all IMG patients can be paired to distinct USMG patients. Similar problems may arise with calipers. As described by
Pimentel & Kelz [4], network flow methods may be modified to exclude the minimal number of
treated individuals necessary to render the problem feasible by adding “bypass” edges to the network
that connect treated units directly to the sink node; when these edges have sufficiently high cost
(much higher than the highest individual treated-control dissimilarity in the matching problem) the
number of treated units retained will be maximized. This innovation is also compatible with fine,
near-fine, and refined balance; in this case the bypass edges go not directly to the sink but to the most
granular fine balance category node. The penalties for the bypass edges must not only be much larger
than the treated-control dissimilarities but also much larger than the near-fine balance penalties, in
order to ensure that the maximal number of treated units is retained.
The problem as just described prioritizes keeping the largest sample possible and achieves
optimal balance subject to this constraint. A different idea is to require exact balance, and maximize
the sample size subject to this constraint. This method is called cardinality matching. Formally, it
involves adding fine balance constraints to problem (6.1), relaxing the first constraint from an equality
to an inequality, and replacing the cost-minimization objective with the following:
$$\max \sum_{i=1}^{T} \sum_{j=1}^{C} x_{ij}.$$
Cardinality matching, which must generally be solved by integer or mixed integer programming,
always guarantees a specified degree of balance, at the potential cost of sacrificing sample size.
In addition it is always feasible, although constraints that are too difficult to satisfy may result in
unusably tiny matches. It works especially well with hard balance constraints, such as strength-k
balance and controlled deviations from balance, which can frequently lead to infeasibility when
enforced while also requiring all treated units to be retained. As described, cardinality matching does
not make any direct use of pair costs, but these can be optimized subject to balance and maximal
sample size, sometimes as a second step after an initial balanced set of samples is obtained. In this
latter case when the match is also fully dense, the initial optimization problem can use a much smaller
set of decision variables, indicating simply whether each treated or control unit is in the matched
sample and not which other unit is matched to it. For discussion of these ideas and some associated
computational advantages see Zubizarreta et al. [39] and Bennett et al. [32].
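In software, a minimal cardinality match might look as follows, assuming designmatch's cardmatch function with arguments analogous to the bmatch call in the R code appendix; the data frame dat, treatment indicator z, and covariates x1 and x2 are hypothetical.

library(designmatch)
# sort so treated units come first, as designmatch requires
dat <- dat[order(dat$z, decreasing = TRUE), ]
# maximize the number of matched pairs subject to exact marginal (fine)
# balance on x1 and x2; pair costs play no role at this stage
card <- cardmatch(t_ind = dat$z,
                  fine = list(covs = dat[c('x1', 'x2')]))
matched <- dat[c(card$t_id, card$c_id), ]   # the maximal balanced sample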
Both cardinality matching and the network-flow-based method for excluding treated units just described work by imposing some strict order of priority among the competing goals of minimizing
marginal imbalance, maximizing sample size, and minimizing pair distances. The network flow method
places top priority on sample size (retaining as many treated units as possible), followed by balance,
followed by pair costs; cardinality matching prioritizes balance first, then sample size, then pair costs.
In practice some happy medium among these different goals is desired, and strictly prioritizing them
in some order may not manage the tradeoffs in an ideal manner. In these settings one may instead
seek to construct matches that are Pareto optimal for multiple objective functions, not achieving
optimality on any one of the objectives but neither being dominated by any other solution on all
objective functions. For example, consider the procedure variable in the IMG-USMG study, which
cannot be balanced exactly when all treated units are retained. Matching with near-fine balance
produces a solution in which all 241 treated units are retained and in which treated categories overflow
control categories by only 7 units (for a total variation distance of 14/482 ≈ 0.029); cardinality
matching under a fine balance constraint would select a match with only 234 treated units but in
which procedure is exactly balanced. There also exists a match with 236 treated units and a total
variation distance of 4/482 ≈ 0.008. This match is neither quite as large as the near-fine balanced
match nor quite as well-balanced as the cardinality match but is not dominated on both objectives
by either option, so it is Pareto optimal with respect to sample size and total variation imbalance.
In particular, if we are comfortable with any total variation distance below 0.01 this match may be
preferable to either.
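The total variation distances quoted above are simple to compute from a matched sample; the small helper below (our own sketch, with hypothetical factor inputs) halves the sum of absolute differences in category proportions, so 7 overflowing units among 241 matched pairs gives 7/241 = 14/482 ≈ 0.029.

# total variation distance between the category distributions of treated
# units (cat.t) and matched controls (cat.c)
tv.dist <- function(cat.t, cat.c) {
  lev <- union(unique(as.character(cat.t)), unique(as.character(cat.c)))
  p <- table(factor(cat.t, levels = lev)) / length(cat.t)
  q <- table(factor(cat.c, levels = lev)) / length(cat.c)
  sum(abs(p - q)) / 2
}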
Rosenbaum [40] introduced Pareto optimality in the context of optimal matched designs with
a specific formulation exploring tradeoffs between achieving minimal average pair distances and
maximal sample size, and King et al. [41] explored the set of matches Pareto optimal for sample
size and covariate balance, although only in the special case of matching with replacement. Pimentel
& Kelz [4] generalized the notion to include any objective function that can be represented as a
linear function in a network flow problem and provided a network flow algorithm that finds Pareto
optimal solutions in settings where total variation imbalance on some variable is one of the objective
functions; this method could in principle be used to recover the 236-pair match mentioned above
for the IMG-USMG data. More generally, matches with controlled deviations from balance may be
understood as Pareto optimal solutions for the balance measure controlled. Using this approach, the
236-pair match could be recovered via a cardinality match with a constraint specifying total variation
distance must not exceed 0.01.
When the number of covariates in play may be treated as a constant, Hochbaum et al. [42] argue that polynomial time bounds are
available over a much broader array of settings. In particular, minimum-cost pair matching with
balance constraints and cardinality matching without pair costs both generally have polynomial time
algorithms for any fixed number of covariates. However, cardinality matching with minimum-cost
pairing is still NP-hard with κ ≥ 3 covariates and remains open at κ = 2.
Of course, all of these results refer only to asymptotic worst-case performance, and may not
provide perfect guidance about computational performance in practice. On the one hand, pathological
worst-case settings are not always particularly plausible or common in problems arising from real-
world data, which may be quicker to solve on average. On the other hand, polynomial time bounds
may involve large constants which are ignored in the complexity analysis but may lead to long
computation in practice. Furthermore, the worst-case results given here apply only to exact solutions,
and near-optimal solutions may require very different levels of computational effort. We note in
particular that the R package designmatch has been shown to solve certain integer programming
formulations of matching problems with high worst-case complexity efficiently in practice, and can
also produce approximate solutions for reduced effort [32].
Template matching [44] proceeds by constructing a template of subjects, drawn as a random sample from a larger population of interest (such as all the subjects across all the
institutions), and matching subjects in each institution to those in the template, thereby comparing
performance in each institution on a common set of subjects with a representative profile.
How can standard fine, near-fine, and related balance constraints be adapted to this setting? In
the case of template matching or balancing both groups to a general population, this is simply done
by repeated use of the standard bipartite matching approach, treating the template or a sample from
the external population as the treated group and treating each of the other comparison groups in
turn as the control pool. Bennett et al. [32] give a helpful mathematical formulation of this repeated
optimization process. The situation is slightly different in the case where a difference on some
variable is desired, since the treated group is held fixed, and since the desired target distribution to
which to make controls similar is not necessarily observed directly. In this case the researcher first
defines a perturbed version of the treated sample in which most covariates are held fixed but values
of the innocuous variable are changed in the direction of anticipated bias reduction. Then standard
fine or near-fine balance constraints are defined with respect to the new perturbed target population,
by choosing appropriate constraints in a network flow problem; see online appendix to Pimentel et al.
[43] for full details.
In the network flow setting, tradeoffs may be explored by adjusting the penalties that encode the priority on one discrete variable in the refined balance framework relative to balance on another
nested variable. Similar principles may be applied in an integer programming context, for example
by gradually increasing the tolerance on one set of controlled-deviation-from-balance constraints
while holding other aspects of the matching problem fixed.
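As a sketch of such a tuning loop, assuming designmatch's near_fine argument with a dev field for the allowed deviation (the data objects are hypothetical, as in the earlier sketches):

# gradually loosen a controlled-deviation-from-balance constraint on x1
# while holding the rest of the matching problem fixed
for (dev in c(0, 2, 5, 10)) {
  m <- bmatch(t_ind = dat$z, dist_mat = dmat,
              total_groups = sum(dat$z),
              near_fine = list(covs = dat[c('x1')], dev = dev))
  # inspect the sample size and total distance at each tolerance,
  # e.g. via length(m$t_id) and m$obj_total
}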
6.7 Discussion
Balance constraints of some kind are virtually always a good idea in matching. In the author’s
experience, a single balance constraint or two can almost always be added to a problem with only
minimal impact on the average similarity of pairs. This practice both guarantees improvements in
observed balance (as shown in a balance table) and tends to reduce bias under a wide variety of
statistical models.
While this review has focused on the role of balance constraints in observational studies, covariate
balance is also a topic of central interest in modern experimental design. In particular, rerandomized
experimental designs offer balance guarantees for randomized trials similar to those achieved by
balance constraints [46]. Algorithms for constructing rerandomized designs differ substantially from
those for creating optimal matches because treatment assignments are not fixed in value, but results
about inference under the two settings may be mutually relevant. Balance is also important in the
gray area between classical observational studies and experiments occupied by natural experiments
and instrumental variable designs; for an application of refined balance as a way to strengthen the
credibility of an instrumental variable, see Keele et al. [47].
Matching methods that focus almost exclusively, or with top priority, on multiple balance
constraints, such as cardinality matching as described in [32], have many similarities with modern
optimal weighting methods such as entropy balancing [48], covariate balancing propensity scores [49],
stable balancing weights [50], and overlap weights [51]. All of these methods find a continuous re-
weighting of one group to match the distribution of the other under constraints that covariates or their
transformations exhibit exact balance or controlled or minimized deviation from balance. Loosely
speaking, matching can be understood as a restricted form of weighting in which weights are restricted
to take on values of 0 and 1 only (for more in-depth discussion of this connection, see Kallus [11]).
Indeed, theoretical analyses such as those in Wang & Zubizarreta [9] exploit this connection to prove
asymptotic results about matching. As a general rule, we expect many of the documented benefits
of matching under balance constraints to extend to optimal balancing weights approaches and vice
versa.
Many important methodological problems involving matching with balance constraints remain
to be solved. One direction is in providing solution algorithms and complexity analyses for various
balance constraints in more complicated matching settings, such as nonbipartite matching [52],
multilevel matching [53, 54], and matching in longitudinal settings including risk-set matching [55]
and designs with rolling enrollment [56]. While individual studies have used balance constraints of
some kind in these settings in an incidental manner, no comprehensive study of balance constraints
in these settings yet exists.
In the area of statistical practice, there is also a need for improved software and guidance for
practitioners in the use of balance constraints. For instance, most of the work reviewed above focuses
on matching without replacement. While balance constraints are known to have theoretical benefits
for matching with replacement as well, applied examples and software adding balance constraints to
methods for matching with replacement are currently lacking.
At a more abstract level, another important question is understanding the limits and value of
balance constraints in settings where the number of covariates is very large. One major selling point
of balance constraints relative to near-exact matching in any given finite sample is that they are
easier to satisfy; even in settings where it is impossible to pair individuals exactly on all measured
covariates, it may still be possible to balance the marginal distributions of all or most measured
covariates exactly by aligning inexact matches so that their discrepancies cancel. In this way balance
constraints offer a partial solution, or at least a mitigation, for the curse of dimensionality. However,
satisfying marginal balance on multiple covariates also becomes increasingly difficult as the number
of covariates to be balanced grows, and it is not clear how much the balancing approach buys us in
asymptotic regimes where p grows quickly with n.
Finally, matches with balance constraints pose interesting questions for matched randomization
inference. While simulation studies and theoretical analyses such as those of Resa & Zubizarreta [12]
and Wang & Zubizarreta [9] have focused on statistical guarantees over samples from a population
model on the potential outcomes, a more common way to analyze matched designs with balance
constraints in practice is to condition on the potential outcomes and the pairs selected and compare
observed test statistics to those obtained by permuting treatment labels within pairs. This method,
proposed originally by Fisher [57] for randomized experiments and adapted by Rosenbaum [2]
for matched studies, frees researchers from the need to make strong assumptions about sampling
or outcome models, and admits attractive methods of sensitivity analysis to assess the role of
unobserved confounding. The key requirement needed to develop these methods in the absence of
actual randomization is sufficiently similar propensity scores within matched pairs [29]. Balance
constraints alone provide no such guarantee, so that the validity of randomization inference depends
on other features of the match such as the presence of a propensity score caliper. Furthermore, since
imbalance is a function of both treatment status and covariates, it is not preserved under permutations
of treatment status within pairs; this means that when randomization inference is performed in a
matched design with a balance constraint, the constraint need not hold in the null distribution draws
obtained by permuting treatment labels. Li et al. [58] consider an analogous discrepancy between
data configurations permitted by the design and those considered by the inferential approach in the
randomized case and find that analyzing a rerandomized design while ignoring the rerandomization
preserves Type I error control and suffers only in terms of precision. However, it is not yet clear how
the story plays out in the observational case.
library(designmatch)
# sort data so that treated (IMG) units come first, as designmatch requires
mini.sort <- mini.dat[order(mini.dat$img, decreasing = TRUE), ]
# convert the distance list my.dist to a dense matrix; pairings absent from
# my.dist are given a large placeholder distance of 1000
dist.mat <- matrix(1000, nrow = sum(mini.sort$img), ncol = sum(1 - mini.sort$img))
for (i in 1:length(my.dist)) {
  dist.mat[i, as.numeric(names(my.dist[[i]]))] <- my.dist[[i]]
}
# strength-1 matching: fine balance on each of sex, ER admission, race, and
# comorbidity count separately, matching exactly within hospital
my.match <- bmatch(t_ind = mini.sort$img, dist_mat = dist.mat,
                   total_groups = sum(mini.sort$img),
                   exact = list(covs = mini.sort[c('hospid')]),
                   fine = list(covs = mini.sort[c('mf', 'emergency',
                                                  'race_cat', 'comorb')]))
# for comparison: near-fine balance on the full four-way interaction,
# via the rcbalance package
library(rcbalance)
my.match.alt <- rcbalance(my.dist,
                          fb.list = list(c('emergency', 'mf',
                                           'race_cat', 'comorb')),
                          treated.info = mini.dat[mini.dat$img == 1, ],
                          control.info = mini.dat[mini.dat$img == 0, ])
# assemble the matched sample from the treated and control row indices
matched.data <- rbind(
  mini.dat[mini.dat$img == 1, ][as.numeric(rownames(my.match.alt$matches)), ],
  mini.dat[mini.dat$img == 0, ][my.match.alt$matches, ])
# balance tables for each fine balance variable in the matched sample
table(matched.data$mf, matched.data$img)
table(matched.data$emergency, matched.data$img)
table(matched.data$race_cat, matched.data$img)
table(matched.data$comorb, matched.data$img)
References
[1] Paul R Rosenbaum. Design of Observational Studies. Springer, New York, NY, 2010.
[2] Paul R Rosenbaum. Observational Studies. Springer, New York, NY, 2002.
[3] Salman Zaheer, Samuel D Pimentel, Kristina D Simmons, Lindsay E Kuo, Jashodeep Datta,
Noel Williams, Douglas L Fraker, and Rachel R Kelz. Comparing international and United
States undergraduate medical education and surgical outcomes using a refined balance matching
methodology. Annals of Surgery, 265(5):916–922, 2017.
[4] Samuel D Pimentel and Rachel R Kelz. Optimal tradeoffs in matched designs comparing
US-trained and internationally trained surgeons. Journal of the American Statistical Association,
115(532):1675–1688, 2020.
[5] Donald B Rubin. Bias reduction using Mahalanobis-metric matching. Biometrics, 36(2):293–
298, 1980.
[6] Donald B Rubin and Neal Thomas. Affinely invariant matching methods with ellipsoidal
distributions. The Annals of Statistics, 20(2):1079–1093, 1992.
[7] Donald B Rubin. Matching to remove bias in observational studies. Biometrics, 29:159–183,
1973.
[8] Rachael C Aikens, Dylan Greaves, and Michael Baiocchi. A pilot design for observational
studies: Using abundant data thoughtfully. Statistics in Medicine, 39(30):4821–4840, 2020.
[9] Yixin Wang and José R Zubizarreta. Large sample properties of matching for balance. arXiv
preprint arXiv:1905.11386, 2019.
[10] Jason J Sauppe, Sheldon H Jacobson, and Edward C Sewell. Complexity and approximation
results for the balance optimization subset selection model for causal inference in observational
studies. INFORMS Journal on Computing, 26(3):547–566, 2014.
[11] Nathan Kallus. Generalized optimal matching methods for causal inference. Journal of Machine
Learning Research, 21(62):1–54, 2020.
[12] María de los Angeles Resa and José R Zubizarreta. Evaluation of subset matching methods and
forms of covariate balance. Statistics in Medicine, 35(27):4961–4979, 2016.
[13] Paul R Rosenbaum, Richard N Ross, and Jeffrey H Silber. Minimum distance matched sampling
with fine balance in an observational study of treatment for ovarian cancer. Journal of the
American Statistical Association, 102(477):75–83, 2007.
[14] Dan Yang, Dylan S Small, Jeffrey H Silber, and Paul R Rosenbaum. Optimal matching with
minimal deviation from fine balance in a study of obesity and surgical outcomes. Biometrics,
68(2):628–636, 2012.
[15] Samuel D Pimentel, Rachel R Kelz, Jeffrey H Silber, and Paul R Rosenbaum. Large, sparse
optimal matching with refined covariate balance in an observational study of the health outcomes
produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527,
2015.
[16] Jesse Y Hsu, José R Zubizarreta, Dylan S Small, and Paul R Rosenbaum. Strong control of the
familywise error rate in observational studies that discover effect modification by exploratory
methods. Biometrika, 102(4):767–782, 2015.
[17] José R Zubizarreta. Using mixed integer programming for matching in an observational study
of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–
1371, 2012.
[18] Yixin Wang and Jose R Zubizarreta. Minimal dispersion approximately balancing weights:
Asymptotic properties and practical considerations. Biometrika, 107(1):93–105, 2020.
[19] Rainer Burkard, Mauro Dell’Amico, and Silvano Martello. Assignment problems: Revised
reprint. SIAM, 2012.
[20] Lester Randolph Ford, Jr and Delbert Ray Fulkerson. Flows in Networks. Princeton University
Press, Princeton, NJ, 1962.
[21] Christos H Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and
Complexity. Courier Corporation, North Chelmsford, MA, 1982.
[22] Dimitri P Bertsekas. Linear Network Optimization: Algorithms and Codes. MIT Press,
Cambridge, MA, 1991.
[23] Dimitri P Bertsekas and Paul Tseng. Relax-iv: A faster version of the relax code for solv-
ing minimum cost flow problems. Technical report, Massachusetts Institute of Technology,
Laboratory for Information and Decision Systems, Cambridge, MA, 1994.
[24] Ben B. Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via
network flows. Journal of Computational and Graphical Statistics, 15(3):609–627, 2006.
[25] Ruoqi Yu. bigmatch: Making Optimal Matching Size-Scalable Using Optimal Calipers, 2020.
R package version 0.6.2.
[26] Samuel D Pimentel. Large, sparse optimal matching with r package rcbalance. Observational
Studies, 2:4–23, 2016.
[27] Stefano M Iacus, Gary King, and Giuseppe Porro. Multivariate matching methods that are
monotonic imbalance bounding. Journal of the American Statistical Association, 106(493):
345–361, 2011.
[28] Jesse Y Hsu, Dylan S Small, and Paul R Rosenbaum. Effect modification and design sensitivity
in observational studies. Journal of the American Statistical Association, 108(501):135–148,
2013.
[29] Ben Hansen. Propensity score matching to recover latent experiments: Diagnostics and asymp-
totics. Technical Report 486, University of Michigan, 2009.
[30] Alberto Abadie and Guido W Imbens. Large sample properties of matching estimators for
average treatment effects. Econometrica, 74(1):235–267, 2006.
[31] Dimitri P Bertsekas. Network Optimization: Continuous and Discrete Models. Athena Scientific,
Belmont, MA, 1998.
[32] Magdalena Bennett, Juan Pablo Vielma, and José R Zubizarreta. Building representative
matched samples with multi-valued treatments in large observational studies. Journal of Com-
putational and Graphical Statistics, 29(4):744–757, 2020.
[33] Jose R. Zubizarreta, Cinar Kilcioglu, and Juan P. Vielma. designmatch: Matched Samples that
are Balanced and Representative by Design, 2018. R package version 0.3.1.
[34] Kewei Ming and Paul R Rosenbaum. Substantial gains in bias reduction from matching with a
variable number of controls. Biometrics, 56(1):118–124, 2000.
[35] Kewei Ming and Paul R Rosenbaum. A note on optimal matching with variable controls using
the assignment algorithm. Journal of Computational and Graphical Statistics, 10(3):455–463,
2001.
[36] Paul R Rosenbaum. Optimal matching for observational studies. Journal of the American
Statistical Association, 84(408):1024–1032, 1989.
[37] Samuel D Pimentel, Frank Yoon, and Luke Keele. Variable-ratio matching with fine balance in
a study of the peer health exchange. Statistics in Medicine, 34(30):4070–4082, 2015.
[38] Cinar Kilcioglu and José R Zubizarreta. Maximizing the information content of a balanced
matched sample in a study of the economic performance of green buildings. The Annals of
Applied Statistics, 10(4):1997–2020, 2016.
[39] José R Zubizarreta, Ricardo D Paredes, and Paul R Rosenbaum. Matching for balance, pairing
for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit
high schools in Chile. The Annals of Applied Statistics, 8(1):204–231, 2014.
[40] Paul R. Rosenbaum. Optimal Matching of an Optimally Chosen Subset in Observational
Studies. Journal of Computational and Graphical Statistics, 21(1):57–71, 2012.
[41] Gary King, Christopher Lucas, and Richard A Nielsen. The balance-sample size frontier in
matching methods for causal inference. American Journal of Political Science, 61(2):473–489,
2017.
[42] Dorit S Hochbaum, Asaf Levin, and Xu Rao. Algorithms and complexity for variants of
covariates fine balance. arXiv preprint arXiv:2009.08172, 2020.
[43] Samuel D Pimentel, Dylan S Small, and Paul R Rosenbaum. Constructed second control
groups and attenuation of unmeasured biases. Journal of the American Statistical Association,
111(515):1157–1167, 2016.
[44] Jeffrey H Silber, Paul R Rosenbaum, Richard N Ross, Justin M Ludwig, Wei Wang, Bijan A
Niknam, Nabanita Mukherjee, Philip A Saynisch, Orit Even-Shoshan, Rachel R Kelz, et al.
Template matching for auditing hospital cost and quality. Health Services Research, 49(5):1446–
1474, 2014.
[45] Adam C Sales, Ben B Hansen, and Brian Rowan. Rebar: Reinforcing a matching estimator
with predictions from high-dimensional covariates. Journal of Educational and Behavioral
Statistics, 43(1):3–31, 2018.
[46] Kari Lock Morgan and Donald B Rubin. Rerandomization to improve covariate balance in
experiments. The Annals of Statistics, 40(2):1263–1282, 2012.
[47] Luke Keele, Steve Harris, Samuel D Pimentel, and Richard Grieve. Stronger instruments and
refined covariate balance in an observational study of the effectiveness of prompt admission to
intensive care units. Journal of the Royal Statistical Society: Series A (Statistics in Society),
183(4):1501–1521, 2020.
[48] Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to
produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
[49] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal
Statistical Society: Series B: Statistical Methodology, 76(1):243–263, 2014.
[50] José R Zubizarreta. Stable weights that balance covariates for estimation with incomplete
outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.
[51] Fan Li, Kari Lock Morgan, and Alan M Zaslavsky. Balancing covariates via propensity score
weighting. Journal of the American Statistical Association, 113(521):390–400, 2018.
[52] Bo Lu, Robert Greevy, Xinyi Xu, and Cole Beck. Optimal nonbipartite matching and its
statistical applications. The American Statistician, 65(1):21–30, 2011.
[53] José R Zubizarreta and Luke Keele. Optimal multilevel matching in clustered observational
studies: A case study of the effectiveness of private schools under a large-scale voucher system.
Journal of the American Statistical Association, 112(518):547–560, 2017.
[54] Samuel D Pimentel, Lindsay C Page, Matthew Lenard, Luke Keele, et al. Optimal multilevel
matching using network flows: An application to a summer reading intervention. The Annals of
Applied Statistics, 12(3):1479–1505, 2018.
[55] Yunfei Paul Li, Kathleen J Propert, and Paul R Rosenbaum. Balanced risk set matching. Journal
of the American Statistical Association, 96(455):870–882, 2001.
[56] Samuel D Pimentel, Lauren V Forrow, Jonathan Gellar, and Jiaqi Li. Optimal matching
approaches in health policy evaluations under rolling enrollment. Journal of the Royal Statistical
Society: Series A (Statistics in Society), 2020. in press.
[57] Ronald A Fisher. The Design of Experiments. Oliver and Boyd, Edinburgh, 1935.
[58] Xinran Li, Peng Ding, and Donald B Rubin. Asymptotic theory of rerandomization in treatment–
control experiments. Proceedings of the National Academy of Sciences, 115(37):9157–9162,
2018.
7
Matching with Instrumental Variables
CONTENTS
7.1 A Brief Background on Instrumental Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.1.1 Motivation: The problem of unmeasured confounding in observational studies 136
7.1.2 Instrumental variables: Assumptions, definitions, and examples . . . . . . . . . . . 136
7.2 Instruments and Optimal Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2.2 Matching algorithm and covariate balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.2.3 Parameter of interest: Effect ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.2.4 Inference for effect ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2.5 Sensitivity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2.6 A continuous instrument . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.3 Application: The Causal Effect of Malaria on Stunting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.3.1 Data background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.3.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.3.3 Conditions for sickle cell trait as a valid instrument . . . . . . . . . . . . . . . . . . . . . . . 145
7.3.4 Estimate and inference of causal effect of malaria on stunting . . . . . . . . . . . . . 147
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
In this chapter we first introduce the assumptions and frequently used terminology in instrumental
variables (IV). The introduction is meant to illustrate the basics of how an instrumental variables
analysis works in the context of optimal matching. The literature on instrumental variables is large,
spanning nearly 100 years and intersecting multiple disciplines, and we recommend the following
literature for a more thorough perspective: [1–5] and [6].
After the introduction to instrumental variables, we explore how to design matched sets that
incorporate IVs. At a high level, a matching-based instrumental variables analysis replaces the
treatment variable discussed in previous chapters with an instrumental variable. The goal in IV
matching is to group units with different values of the instrument but similar values of the observed
covariates. This design produces sets wherein the salient difference between the units is their values of
the instrument [7, 8]. We then compare the relative differences in both the treatment and the outcome
variables within each matched set to assess the causal effect of the treatment on the outcome.
To make the discussions more concrete, throughout the chapter, we use a real example concerning
the effects of malaria among children; the dataset was generously provided by Dr. Benno Kreuels and
Professor Jürgen May. The example studies the effect of malaria on stunted growth among children in
sub-Saharan Africa using a binary instrumental variable based on a sickle cell genotype. In addition
to highlighting how to use matching-based instrumental variables, the real example raises some
common but important considerations in using an instrumental variables analysis, notably how to
justify an instrumental variable’s plausibility.
FIGURE 7.1
Causal directed acyclic graph with instrument Z, treatment D, outcome Y , measured confounder
X, and unmeasured confounder U . Solid lines are non-zero causal paths. The graph represents
the “prototypical” instrumental variable setting where there are no confounders associated with the
instrument Z, say if Z is based on a randomized encouragement design.
TABLE 7.1
Compliance classes in a randomized encouragement design trial.
Outside of a randomized encouragement design, an investigator faces the additional burden of justifying an instrument derived from an observational study. For example, in the malaria example
above, genotypic variations are used as instruments, specifically the presence of a sickle cell genotype
(HbAS) versus carrying the normal hemoglobin type (HbAA) as an instrument. Briefly, the sickle
cell genotype (HbAS) is a condition where a person inherits from one parent a mutated copy of the
hemoglobin beta (HBB) gene called the sickle cell gene mutation, but inherits a normal copy of
the HBB gene from the other parent. We discuss in detail the validity of the sickle cell trait as an
instrument in Section 7.3.3.
The observed values Rij, Dij, and Zij are related to the potential outcomes by the following equation:
$$R_{ij} = r_{ij}^{(1, d_{ij}^{(1)})} Z_{ij} + r_{ij}^{(0, d_{ij}^{(0)})} (1 - Z_{ij}), \qquad D_{ij} = d_{ij}^{(1)} Z_{ij} + d_{ij}^{(0)} (1 - Z_{ij}) \tag{7.1}$$
For individual j in matched set i, let Xij be a vector of observed covariates and uij be the
unobserved covariates. For example, in the malaria data, Xij represents each child’s covariates
listed in Table 7.2 while uij is an unmeasured confounder, like diet. We define the set
$$\mathcal{F} = \left\{ \left( r_{ij}^{(1,k)}, r_{ij}^{(0,k)}, d_{ij}^{(1)}, d_{ij}^{(0)}, X_{ij}, u_{ij} \right) : i = 1, \ldots, I,\ j = 1, \ldots, n_i,\ k \in \mathcal{K} \right\}$$
to be the collection of potential outcomes and all confounders, observed and unobserved.
Roughly speaking, the effect ratio is the change in the outcome caused by the instrument divided
by the change in the exposure caused by the instrument. Note that this is a parameter of the finite
population of $N = \sum_{i=1}^{I} n_i$ individuals.
The effect ratio admits a well-known interpretation in the IV literature if the core IV assumptions
(A1)-(A4) are satisfied. Specifically, in the context of matching, assumptions (A1)-(A4) formally
translate to
$$
\begin{aligned}
&\text{(A1)}\quad \sum_{i=1}^{I} \sum_{j=1}^{n_i} \left( d_{ij}^{(1)} - d_{ij}^{(0)} \right) \neq 0,\\
&\text{(A2)}\quad \forall k \in \mathcal{K},\quad r_{ij}^{(1,k)} = r_{ij}^{(0,k)} = r_{ij}^{(k)},\\
&\text{(A3)}\quad \forall \mathbf{z} \in \mathcal{Z},\quad P(\mathbf{Z} = \mathbf{z} \mid \mathcal{F}) = \prod_{i=1}^{I} P(Z_{i1} = z_{i1}, \ldots, Z_{i n_i} = z_{i n_i} \mid \mathcal{F}) = \prod_{i=1}^{I} \binom{n_i}{m_i}^{-1},\\
&\text{(A4)}\quad d_{ij}^{(0)} \leq d_{ij}^{(1)}.
\end{aligned}
$$
Additionally, suppose $d_{ij}^{(1)}$ and $d_{ij}^{(0)}$ are discrete values from 0 to $M$ (i.e., $\mathcal{K} = \{0, 1, \ldots, M\}$),
which is the case with the malaria example where $d_{ij}^{(1)}$ and $d_{ij}^{(0)}$ are the number of malaria episodes.
Then, [24] showed that
$$\lambda = \sum_{i=1}^{I} \sum_{j=1}^{n_i} \sum_{k=1}^{M} \left( r_{ij}^{(k)} - r_{ij}^{(k-1)} \right) w_{ijk} \tag{7.3}$$
where
$$w_{ijk} = \frac{\chi\!\left( d_{ij}^{(1)} \geq k > d_{ij}^{(0)} \right)}{\sum_{i=1}^{I} \sum_{j=1}^{n_i} \sum_{l=1}^{M} \chi\!\left( d_{ij}^{(1)} \geq l > d_{ij}^{(0)} \right)}$$
and $\chi(\cdot)$ is an indicator function. In words, with the core IV assumptions (A1)–(A3) and the
monotonicity assumption (A4), the effect ratio is interpreted as the weighted average of the causal
effect of a one unit change in the exposure among individuals in the study population whose exposure
would be affected by a change in the instrument. Each weight wijk represents whether an individual j
in stratum i’s exposure would be moved from below k to at or above k by the instrument, relative to the
number of people in the study population whose exposure would be changed by the instrument. For
example, if λ = 0.1 in the malaria example and we assume the said assumptions, 0.1 is the weighted
average reduction in stunting from a one-unit reduction in malaria episodes among individuals who
were protected from malaria by the sickle cell trait. Similarly, each weight wijk represents the jth
individual in ith stratum’s protection from at least k malaria episodes by carrying the sickle cell trait
compared to the overall number of individuals who are protected from varying degrees of malaria
episodes by carrying the sickle cell trait. We remark that the interpretation of λ is akin to Theorem 1
in [25], except that our result is for the finite-sample case and is specific to matching.
We conclude with some remarks about the subtle nuances of interpreting the effect ratio. First,
only assumptions (A1) and (A3) are necessary to identify the “bare-bone” interpretation of λ in
(7.2), the ratio of causal effects of the instrument on the outcome (numerator) and on the exposure
(denominator). This interpretation can be useful, especially in the setting where the exposure is
continuous. However, this interpretation based on the ratio of differences cannot identify the weighted
average of effects of the treatment on the outcome as described in (7.3). Second, irrespective of
whether the investigator prefers the “bare-bone” interpretation of λ or the usual interpretation of λ
based on the complier average causal effect, λ can be identified from the observed data by the following
formula:
$$\lambda = \frac{\sum_{i=1}^{I} \sum_{j=1}^{n_i} \left\{ E(R_{ij} \mid Z_{ij} = 1, \mathcal{F}, \mathcal{Z}) - E(R_{ij} \mid Z_{ij} = 0, \mathcal{F}, \mathcal{Z}) \right\}}{\sum_{i=1}^{I} \sum_{j=1}^{n_i} \left\{ E(D_{ij} \mid Z_{ij} = 1, \mathcal{F}, \mathcal{Z}) - E(D_{ij} \mid Z_{ij} = 0, \mathcal{F}, \mathcal{Z}) \right\}} \tag{7.4}$$
In other words, assuming assumption (A2) or assumption (A4) does not change the identification
strategy of λ. Third, recent works by [26] and [27] discuss heterogeneous effect ratios, i.e. the effect
ratio among a subgroup of individuals defined by pre-instrument covariates. Heterogeneous effect
ratios are useful to discover heterogeneity of the effect ratio within the study population.
Each variable Vi(λ0) is the difference in adjusted responses, Rij − λ0Dij, of those individuals with
Zij = 1 and those with Zij = 0. Under the null hypothesis in (7.5), these adjusted responses have
the same expected value for Zij = 1 and Zij = 0, and thus deviation of T(λ0) from zero suggests
H0 is not true.
[24] states that under regularity conditions, the asymptotic null distribution of T(λ0) is standard
Normal. This result can be used to derive a point estimate as well as a confidence interval for the
effect ratio. For the point estimate, in the spirit of [28], we find the value of λ that maximizes the
p-value. Specifically, setting T(λ) = 0 and solving for λ gives an estimate for the effect ratio,
$$\hat{\lambda} = \frac{\sum_{i=1}^{I} \frac{n_i^2}{m_i (n_i - m_i)} \sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_{i\cdot})(R_{ij} - \bar{R}_{i\cdot})}{\sum_{i=1}^{I} \frac{n_i^2}{m_i (n_i - m_i)} \sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_{i\cdot})(D_{ij} - \bar{D}_{i\cdot})}$$
where Z̄i. , R̄i. , and D̄i. are averages of the instrument, response, and exposure, respectively, within
each matched set i. For confidence interval estimation, say a two-sided, 95% confidence interval for
the effect ratio, we can solve the equation T (λ) = ±1.96 for λ to get the two endpoints of the 95%
confidence interval. In general, for any two-sided, 1 − α level confidence interval, we can solve the
equation T (λ) = ±z1−α/2 where z1−α/2 is the 1 − α/2 quantile of the standard Normal distribution.
A closed form solution for the confidence interval is provided in [24] and [18]. The software to
implement the inferential procedure is described in [29].
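To make the closed form concrete, a small R function (our own sketch; in practice the software described in [29] would be used) can compute λ̂ from vectors Z, R, and D of instrument, outcome, and exposure values and a vector set of matched-set labels:

# point estimate of the effect ratio: weighted within-set covariances of the
# instrument with the outcome (numerator) and with the exposure (denominator)
effect.ratio.hat <- function(Z, R, D, set) {
  num <- 0; den <- 0
  for (s in unique(set)) {
    idx <- set == s
    n.i <- sum(idx); m.i <- sum(Z[idx])
    w.i <- n.i^2 / (m.i * (n.i - m.i))
    num <- num + w.i * sum((Z[idx] - mean(Z[idx])) * (R[idx] - mean(R[idx])))
    den <- den + w.i * sum((Z[idx] - mean(Z[idx])) * (D[idx] - mean(D[idx])))
  }
  num / den
}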
Finally, we remark that the inference procedure we develop for the effect ratio allows for
binary, discrete, or continuous outcomes and exposures, even though our malaria data have binary
outcomes and whole-number exposures. However, as remarked earlier, if the exposure is continuous,
interpreting λ based on the complier average causal effect can be challenging; see [30] for additional
discussions.
Even after matching closely on observed covariates, unmeasured confounders may lead to violations of
assumption (A3). For example, with the malaria study, within a matched set i, two children, j and j′, may have
the same birth weights, be from the same village, and have the same covariate values (xij = xij′), but
have different probabilities of carrying the HbAS genotype, P(Zij = 1|F) ≠ P(Zij′ = 1|F), due
to unmeasured confounders, denoted as uij and uij′ for the jth and j′th unit, respectively. Despite
our best efforts to minimize the observed differences in covariates and to adhere to assumption (A3)
after conditioning on the matched sets, unmeasured confounders could still differ between the
jth and j′th child, and this difference could make the inheritance of the sickle cell trait depart from
randomized assignment, violating assumption (A3).
To model this deviation from randomized assignment due to unmeasured confounders, let
πij = P(Zij = 1|F) and πij′ = P(Zij′ = 1|F) for each pair of units j and j′ in the ith matched set. The
odds that unit j will receive Zij = 1 instead of Zij = 0 is πij/(1 − πij); similarly, the odds for unit
j′ is πij′/(1 − πij′). Suppose the ratio of these odds is bounded by Γ ≥ 1:
$$\frac{1}{\Gamma} \leq \frac{\pi_{ij}(1 - \pi_{ij'})}{\pi_{ij'}(1 - \pi_{ij})} \leq \Gamma \tag{7.7}$$
If unmeasured confounders play no role in the assignment of Zij, then πij = πij′ and Γ = 1.
That is, children j and j′ have the same probability of receiving Zij = 1 in matched set i. If there
are unmeasured confounders that affect the distribution of Zij, then πij ≠ πij′ and Γ > 1. For a
fixed Γ > 1, we can obtain lower and upper bounds on πij, which can be used to derive the null
distribution of T (0) under H0 : λ = 0 in the presence of unmeasured confounding and be used
to compute a range of possible p-values for the hypothesis H0 : λ = 0 [9]. The range of p-values
indicates the effect of unmeasured confounders on the conclusions reached by the inference on λ. If
the range contains α, the significance level, then we cannot reject the null hypothesis at the α level
when there is an unmeasured confounder with an effect quantified by Γ. In addition, we can amplify
the interpretation of Γ using [32] to get a better understanding of the impact of the unmeasured
confounding on the outcome and the instrument; see [24] for the derivation of the sensitivity analysis
and the amplification of Γ.
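For intuition, in a matched pair (ni = 2, mi = 1), the bound (7.7) implies that the conditional probability that a given child is the one carrying HbAS lies between 1/(1 + Γ) and Γ/(1 + Γ); the short sketch below (our own) tabulates these bounds for a few values of Γ.

# assignment-probability bounds within a matched pair implied by (7.7);
# Gamma = 1 recovers the randomized value of 1/2
gamma <- c(1, 1.5, 2, 3)
rbind(Gamma = gamma,
      lower = 1 / (1 + gamma),
      upper = gamma / (1 + gamma))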
One simple approach is to dichotomize the instrument at its pth percentile: all observations at or above the pth percentile of the
IV are now “high-IV” and all observations below the pth percentile are “low-IV.” The advantage of
doing this is that investigators can use existing software as described earlier to conduct an IV analysis.
Also, in non-matching contexts, [35] provides some justification for the dichotomization of a
continuous instrument.
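In code, the dichotomization is a one-liner; here iv is a hypothetical vector of continuous instrument values and p the chosen percentile (e.g., 0.5 for a median split).

# split a continuous instrument at its pth percentile
z.bin <- as.numeric(iv >= quantile(iv, probs = p))  # 1 = high-IV, 0 = low-IV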
However, in some situations, it may be unwise to dichotomize the continuous instrument into a
binary variable, notably when the values of the instrument are correlated with observed covariates,
but are still uncorrelated with unobserved covariates. In that case, it might be useful to simulta-
neously consider the IV values and covariate balance within each matched set. That is, in some
situations, we may accept slightly less than ideal matched sets in terms of covariate balance if we can
achieve excellent differences in the IV values because the instrument’s independence to unmeasured
confounders is useful for achieving unbiased estimation. This kind of consideration arose in the
analysis of neonatal intensive care units in [34] and technical aspects of choosing the strength of
the instrument are well-described in [36]. Computationally, using a continuous IV in matching may
require software that is, at the time of writing this, not standard. But, case-specific implementations
are discussed in [34], [37], and [36].
TABLE 7.2 (continued)
Birth weight (Mean (SD))              3112.44 (381.9) (32 missing)      2978.7 (467.9) (239 missing) ***
Sex (Male/Female)                     46.4% Male                        51.0% Male
Birth season (Dry/Rainy)              56.4% Dry                         55.3% Dry
Ethnic group (Akan/Northerner)        86.4% Akan                        88.8% Akan (4 missing)
α-globin genotype (Norm/Hetero/Homo)  75.7% / 21.5% / 2.8% (3 missing)  74.4% / 23.1% / 2.6% (29 missing)
Village of residence:
  Afamanso                            4.6%                              4.8%
  Agona                               10.0%                             13.6%
  Asamang                             13.6%                             11.1%
  Bedomase                            5.5%                              4.5%
  Bipoa                               14.5%                             10.7%
  Jamasi                              15.5%                             13.8%
Covariate imbalances like these can contribute to the violation of IV assumption (A3) if we do not control for these differences. For
instance, it is possible that children with low birth weights were malnourished at birth, making them
more prone to malarial episodes and stunted growth compared to children with high birth weights.
We must control for these differences to eliminate this possibility, which we do through matching.
FIGURE 7.2
Absolute standardized differences before and after full matching for the birth covariates (birth weight,
gender, birth season, ethnicity, and α-globin genotype), village of residence, mother and household
covariates (mother’s occupation, education, and financial status, and mosquito protection), sulphadoxine
pyrimethamine, and indicators of missing covariate values. Unfilled circles indicate differences
before matching and filled circles indicate differences after matching.
One might worry, for example, that there are other variables besides HbAA that differ between the village Tano-Odumasi and other
villages and affect stunting. Hence, assumption (A3) is more plausible if we control for observed
variables like village of birth. Specifically, within the framework of full matching, for each matched
set i, if the observed variables xij are similar among all ni individuals, it may be more plausible
that the unobserved variable uij plays no role in the distribution of Zij among the ni children. If
(A3) exactly holds and subjects are exactly matched for Xij , then within each matched set i, Zij
TABLE 7.3
Estimates of the causal effect using full matching compared to two-stage least squares and multiple
regression.
is simply a result of random assignment where Zij = 1 with probability mi /ni and Zij = 0 with
probability (ni − mi )/ni when we condition on the number of units in the matched set with Zij = 1
being mi . In Section 7.2.5, we discuss a sensitivity analysis that allows for the possibility that even
after matching for observed variables, the unobserved variable uij may still influence the assignment
of Zij in each matched set i, meaning that assumption (A3) is violated.
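As a small numerical illustration of this conditional distribution, consider one matched set with ni = 4 and mi = 1 (say, one HbAS child and three HbAA children); under (A3) each of the choose(4, 1) = 4 possible assignments is equally likely.

# under (A3), conditional on m_i of the n_i units receiving Z = 1, every
# assignment within a matched set has probability 1 / choose(n_i, m_i)
n.i <- 4; m.i <- 1
1 / choose(n.i, m.i)                      # here 1/4
sample(rep(c(1, 0), c(m.i, n.i - m.i)))   # one random draw of Z for the set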
For assumption (A4), there are known biological mechanisms by which the sickle cell genotype
protects against malaria [44, 46, 53, 54] and no known mechanisms by which the sickle cell genotype
increases the risk of malaria. Hence, there is no biological or epidemiological evidence to suggest that
not carrying the sickle cell genotype would decrease the risk of getting malaria episodes compared to
carrying it.
7.4 Summary
Incorporating an IV into a matched design provides an opportunity to quantify the effect of the treatment in
the presence of unmeasured confounders. In particular, by using matching to adjust for measured
confounders, a matching-based IV analysis offers a design-based approach to studying the causal
effect of the treatment on an outcome. Similar to other matching-based designs discussed in the book,
this IV design makes it explicitly clear how these covariates were adjusted by stratifying individuals
based on similar covariate values. Importantly, as with other design-based studies, this analysis
only looked at the outcome data once the balance was acceptable, i.e., once the differences in birth
weight, village of residence, and mosquito protection between children with HbAS and HbAA were
controlled for. If the balance was unacceptable, then comparing the outcomes between the two groups
would not provide reliable causal inference, since any differences in the outcome could be attributed to
the differences in the covariates.
Of course, similar to a SITA-based analysis, an IV analysis carries untestable assumptions, which
must be carefully assessed on a case-by-case basis; Section 6 of [6] discusses additional approaches
to assessing IV assumptions in application. Also, in the binary setup discussed above, we are only
able to identify the treatment effect among a subgroup of individuals known as the compliers. Other
formulations of instrumental variables discussed in the overview allow identification of different
treatment effects, including the treatment effect for the entire population or among those who are
treated. However, in general, all formulations of instrumental variables must contend with the core IV
assumptions (A1)-(A3). All investigators using an instrumental variables analysis must be attentive
to these assumptions, providing readers with both empirical and application-specific arguments for
their plausibility. If done well then an IV analysis can yield useful evidence about the causal effect of
the treatment in the presence of unmeasured confounding.
References
[1] Joshua D. Angrist, Guido W. Imbens, and Donald B. Rubin. Identification of causal effects
using instrumental variables. Journal of the American Statistical Association, 91(434):444–455,
1996.
[2] Miguel A. Hernán and James M. Robins. Instruments for causal inference: An epidemiologist’s
dream? Epidemiology, 17(4):360–372, 2006.
[3] M. Alan Brookhart and Sebastian Schneeweiss. Preference-based instrumental variable meth-
ods for the estimation of treatment effects: assessing validity and interpreting results. The
International Journal of Biostatistics, 3(1):14, 2007.
[4] Jing Cheng, Jing Qin, and Biao Zhang. Semiparametric estimation and inference for distribu-
tional and general treatment effects. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 71(4):881–904, 2009.
[5] Sonja A. Swanson and Miguel A. Hernán. Commentary: How to report instrumental variable
analyses (suggestions welcome). Epidemiology, 24(3):370–374, 2013.
[6] Michael Baiocchi, Jing Cheng, and Dylan S. Small. Instrumental variable methods for causal
inference. Statistics in Medicine, 33(13):2297–2340, 2014.
[7] Amelia Haviland, Daniel S. Nagin, and Paul R. Rosenbaum. Combining propensity score
matching and group-based trajectory analysis in an observational study. Psychological Methods,
12(3):247, 2007.
[8] Paul R. Rosenbaum. Design of Observational Studies. Springer Series in Statistics. Springer,
New York, 2010.
[9] Paul R. Rosenbaum. Observational Studies. Springer Series in Statistics. Springer-Verlag, New
York, second edition, 2002.
[10] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observa-
tional studies for causal effects. Biometrika, 70(1):41–55, 1983.
[11] Hyunseung Kang, Benno Kreuels, Ohene Adjei, Ralf Krumkamp, Jürgen May, and Dylan S.
Small. The causal effect of malaria on stunting: A Mendelian randomization and matching
approach. International Journal of Epidemiology, 42(5):1390–1398, 2013.
[12] Hyunseung Kang, Benno Kreuels, Jürgen May, and Dylan S Small. Full matching approach
to instrumental variables estimation with application to the effect of malaria on stunting. The
Annals of Applied Statistics, 10(1):335–364, 2016.
[13] World Health Organization. World malaria report 2014. World Health Organization, 2014.
[14] WHO Multicentre Growth Reference Study Group. WHO child growth standards based on
length/height, weight and age. Acta Paediatrica. Supplement, 450:76–85, 2006.
[15] Florie Fillol, Jean B. Sarr, Denis Boulanger, Badara Cisse, Cheikh Sokhna, Gilles Riveau,
Kirsten B. Simondon, and Franck Remoué. Impact of child malnutrition on the specific
anti-Plasmodium falciparum antibody response. Malaria Journal, 8(1):116, 2009.
[16] Amare Deribew, Fessehaye Alemseged, Fasil Tessema, Lelisa Sena, Zewdie Birhanu, Ahmed
Zeynudin, Morankar Sudhakar, Nasir Abdo, Kebede Deribe, and Sibhatu Biadgilign. Malaria
and under-nutrition: A community based study among under-five children at risk of malaria,
south-west Ethiopia. PLoS One, 5(5):e10775, 2010.
[17] Paul W. Holland. Causal inference, path analysis, and recursive structural equations models.
Sociological Methodology, 18(1):449–484, 1988.
[18] Hyunseung Kang and Luke Keele. Estimation methods for cluster randomized trials with
noncompliance: A study of a biometric smartcard payment system in india. arXiv preprint
arXiv:1805.03744, 2018.
[19] Bo Zhang, Siyu Heng, Emily J MacKay, and Ting Ye. Bridging preference-based instrumental
variable studies and cluster-randomized encouragement experiments: Study design, noncompli-
ance, and average cluster effect ratio. Biometrics, 2021. https://doi.org/10.1111/biom.13500
150 Matching with Instrumental Variables
[20] Donald B. Rubin. Comment on “randomized analysis of experimental data: The fisher random-
ization test”. Journal of the American Statistical Association, 75(371):591–593, 1980.
[21] Ben B. Hansen. Full matching in an observational study of coaching for the sat. Journal of the
American Statistical Association, 99(467):609–618, 2004.
[22] Elizabeth A. Stuart. Matching methods for causal inference: A review and a look forward.
Statistical Science, 25(1):1, 2010.
[23] Ben B. Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via
network flows. Journal of Computational and Graphical Statistics, 15(3):609–627, 2006.
[24] Hyunseung Kang, Laura Peck, and Luke Keele. Inference for Instrumental Variables. Jour-
nal of the Royal Statistical Society: Series A (Statistics in Society), 181(4):1231–1254.
https://doi.org/10.1111/rssa.12353
[25] Joshua D. Angrist and Guido W. Imbens. Two-stage least squares estimation of average
causal effects in models with variable treatment intensity. Journal of the American Statistical
Association, 90(430):431–442, 1995.
[26] Michael Johnson, Jiongyi Cao, and Hyunseung Kang. Detecting heterogeneous treatment
effects with instrumental variables and application to the oregon health insurance experiment.
Annals of Applied Statistics, 16 (2):1111–1129, 2022.
[27] Colin B. Fogarty, Kwonsang Lee, Rachel R. Kelz, and Luke J. Keele. Biased encouragements
and heterogeneous effects in an instrumental variable study of emergency general surgical
outcomes. Journal of the American Statistical Association, 116(536):1625–1636, 2021.
[28] J. L. Hodges and E. L. Lehmann. Estimation of location based on ranks. Annals of Mathematical
Statistics, 34(2):598–611, 1963.
[29] Hyunseung Kang, Yang Jiang, Qingyuan Zhao, and Dylan S Small. Ivmodel: an r package
for inference and sensitivity analysis of instrumental variables models with one endogenous
variable. Observational Studies, 7(2):1–24, 2021.
[30] Joshua D. Angrist, Kathryn Graddy, and Guido W. Imbens. The interpretation of instrumental
variables estimators in simultaneous equations models with an application to the demand for
fish. The Review of Economic Studies, 67(3):499–527, 2000.
[31] Dylan S. Small. Sensitivity analysis for instrumental variables regression with overidentifying
restrictions. Journal of the American Statistical Association, 102(479):1049–1058, 2007.
[32] Paul R. Rosenbaum and Jeffrey H. Silber. Amplification of sensitivity analysis in matched
observational studies. Journal of the American Statistical Association, 104(488):1398–1405,
2009.
[33] Mark McClellan, Barbara J McNeil, and Joseph P Newhouse. Does more intensive treatment
of acute myocardial infarction in the elderly reduce mortality?: Analysis using instrumental
variables. JAMA, 272(11):859–866, 1994.
[34] Mike Baiocchi, Dylan S. Small, Scott Lorch, and Paul R. Rosenbaum. Building a stronger
instrument in an observational study of perinatal care for premature infants. Journal of the
American Statistical Association, 105(492):1285–1296, 2010.
Summary 151
[35] Joy Shi, Sonja A Swanson, Peter Kraft, Bernard Rosner, Immaculata De Vivo, and Miguel A
Hernán. Instrumental variable estimation for a time-varying treatment and a time-to-event
outcome via structural nested cumulative failure time models. BMC Medical Research Method-
ology, 21(1):1–12, 2021.
[36] Luke Keele and Jason W Morgan. How strong is strong enough? strengthening instruments
through matching and weak instrument tests. The Annals of Applied Statistics, 10(2):1086–1106,
2016.
[37] José R. Zubizarreta, Dylan S. Small, Neera K. Goyal, Scott Lorch, and Paul R. Rosenbaum.
Stronger instruments via integer programming in an observational study of late preterm birth
outcomes. The Annals of Applied Statistics, 7(1):25–50, 2013.
[38] Benno Kreuels, Stephan Ehrhardt, Christina Kreuzberg, Samuel Adjei, Robin Kobbe, Gerd
Burchard, Christa Ehmen, Matilda Ayim, Ohene Adjei, and Jürgen May. Sickle cell trait (hbas)
and stunting in children below two years of age in an area of high malaria transmission. Malaria
Journal, 8(1):16, 2009.
[39] Paul R. Rosenbaum. Heterogeneity and causality: Unit heterogeneity and design sensitivity in
observational studies. American Statistician, 59(2):147–152, 2005.
[40] José R. Zubizarreta, Ricardo D. Paredes, and Paul R. Rosenbaum. Matching for balance, pairing
for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit
high schools in chile. Annals of Applied Statistics, (1):204–231, 2014.
[41] Donald B. Rubin. Should observational studies be designed to allow lack of balance in covariate
distributions across treatment groups? Statistics in Medicine, 28(9):1420–1423, 2009.
[42] Sharon-Lise T. Normand, Mary Beth Landrum, Edward Guadagnoli, John Z. Ayanian, Thomas J.
Ryan, Paul D. Cleary, and Barbara J. McNeil. Validating recommendations for coronary
angiography following acute myocardial infarction in the elderly: A matched analysis using
propensity scores. Journal of Clinical Epidemiology, 54(4):387–398, 2001.
[43] Michael Aidoo, Dianne J. Terlouw, Margarette S. Kolczak, Peter D. McElroy, Feiko O. ter
Kuile, Simon Kariuki, Bernard L. Nahlen, Altaf A. Lal, and Venkatachalam Udhayakumar.
Protective effects of the sickle cell gene against malaria morbidity and mortality. The Lancet,
359(9314):1311–1312, 2002.
[44] Thomas N. Williams, Tabitha W. Mwangi, David J. Roberts, Neal D. Alexander, David J.
Weatherall, Sammy Wambua, Moses Kortok, Robert W. Snow, and Kevin Marsh. An immune
basis for malaria protection by the sickle cell trait. PLoS Medicine, 2(5):e128, 2005.
[45] Jürgen May, Jennifer A. Evans, Christian Timmann, Christa Ehmen, Wibke Busch, Thorsten
Thye, Tsiri Agbenyega, and Rolf D. Horstmann. Hemoglobin variants and disease manifesta-
tions in severe falciparum malaria. Journal of the American Medical Association, 297(20):2220–
2226, 2007.
[46] Rushina Cholera, Nathaniel J. Brittain, Mark R. Gillrie, Tatiana M. Lopera-Mesa, Séidina A. S.
Diakité, Takayuki Arie, Michael A. Krause, Aldiouma Guindo, Abby Tubman, Hisashi Fujioka,
Dapa A. Diallo, Ogobara K. Doumbo, May Ho, Thomas E. Wellems, and Rick M. Fairhurst.
Impaired cytoadherence of plasmodium falciparum-infected erythrocytes containing sickle
hemoglobin. Proceedings of the National Academy of Sciences, 105(3):991–996, 2008.
152 Matching with Instrumental Variables
[47] Benno Kreuels, Christina Kreuzberg, Robin Kobbe, Matilda Ayim-Akonor, Peter Apiah-
Thompson, Benedicta Thompson, Christa Ehmen, Samuel Adjei, Iris Langefeld, Ohene Adjei,
and Jürgen May. Differing effects of hbs and hbc traits on uncomplicated falciparum malaria,
anemia, and child growth. Blood, 115(22):4551–4558, 2010.
[48] George Davey Smith and Shah Ebrahim. ‘mendelian randomization’: can genetic epidemiology
contribute to understanding environmental determinants of disease? International Journal of
Epidemiology, 32(1):1–22, 2003.
[49] Michael T. Ashcroft, Patricia Desai, and Stephen A. Richardson. Growth, behaviour, and
educational achievement of jamaican children with sickle-cell trait. British Medical Journal,
1(6022):1371–1373, 1976.
[50] Michael S. Kramer, Yolanda Rooks, and Howard A. Pearson. Growth and development in
children with sickle-cell trait: a prospective study of matched pairs. New England Journal of
Medicine, 299(13):686–689, 1978.
[51] Michael T. Ashcroft, Patricia Desai, G. A. Grell, Beryl E. Serjeant, and Graham R. Serjeant.
Heights and weights of west indian children with the sickle cell trait. Archives of Disease in
Childhood, 53(7):596–598, 1978.
[52] N. Rehan. Growth status of children with and without sickle cell trait. Clinical Pediatrics,
20(11):705–709, 1981.
[53] Milton J. Friedman. Erythrocytic mechanism of sickle cell resistance to malaria. Proceedings
of the National Academy of Sciences, 75(4):1994–1997, 1978.
[54] Milton J. Friedman and William Trager. The biochemistry of resistance to malaria. Scientific
American, 244:154–155, 1981.
[55] Jeffrey M. Wooldridge. Econometric Analysis of Cross Section and Panel Data. MIT Press,
Cambridge, Massachusetts, 2nd edition, 2010.
8
Covariate Adjustment in Regression Discontinuity Designs
CONTENTS
8.1 Introduction
8.2 Covariate Adjustment in RD Designs
8.2.1 Overview of the canonical RD design
8.2.2 Efficiency and power improvements
8.2.3 Auxiliary information
8.2.4 Treatment effect heterogeneity
8.2.5 Other parameters of interest and extrapolation
8.3 Can Covariates Fix a Broken RD Design?
8.4 Recommendations for Practice
8.5 Conclusion
8.6 Acknowledgments
References
8.1 Introduction
In causal inference and program evaluation, a central goal is to learn about the causal effect of a
policy or treatment on an outcome of interest. There is a variety of methodological approaches to
identification, estimation, and inference for causal treatment effects, depending on the specific
application. In experimental settings, where the treatment assignment mechanism is known, studying
treatment effects is relatively straightforward because the methods are based on assumptions that
are known to be true by virtue of the treatment assignment rule. In contrast, in observational
studies, the assignment mechanism is unknown and researchers are forced to invoke identifying
assumptions whose validity is not guaranteed by design. Well-known observational study methods
include selection on observables, instrumental variables, difference-in-differences, synthetic controls,
and regression discontinuity designs.1
In experimental and observational studies, baseline covariates generally serve different purposes.
In experimental settings covariate adjustment based on pre-intervention measures is often used either
for efficiency gains or for the evaluation of treatment effect heterogeneity. On the other hand, in many
observational studies, the primary purpose of covariate adjustment is for the identification of causal
effects. For example, under selection on observables, the researcher posits that, after conditioning
on observed pre-treatment covariates, treatment assignment is unrelated to the potential outcomes,
so that covariate-adjusted comparisons between treated and control units identify causal effects.
1 See, for example, [1], [2], and [3] for reviews on causal inference and program evaluation, and [4] for further discussion
of natural experiments.
The rest of the chapter is organized as follows. The next section reviews the different roles of
covariate adjustment in RD designs. Then, in Section 8.3, we discuss the important question of
whether covariate adjustment, such as adding fixed effects for groups of units, can “restore”
the validity of an RD design where the key identifying assumptions do not hold (e.g., settings where
units are suspected to have sorted around the cutoff). Section 8.4 offers generic recommendations for
practice, and Section 8.5 concludes.
8.2 Covariate Adjustment in RD Designs

8.2.1 Overview of the canonical RD design

In the canonical sharp RD design, each unit i receives the treatment (Ti = 1) when its score Xi is at
or above a known cutoff c, and does not receive it (Ti = 0) otherwise. The key identifying assumption
is the continuity of the conditional expectations of the potential outcomes given the score:
E[Yi (1) | Xi = x] and E[Yi (0) | Xi = x] are continuous in x at c, which
leads to the well-known continuity-based identification result [11]:

$$\tau_{\mathrm{SRD}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i = c] = \lim_{x \downarrow c} \mathbb{E}[Y_i \mid X_i = x] - \lim_{x \uparrow c} \mathbb{E}[Y_i \mid X_i = x].$$
Many methods are available for estimation, inference, and validation of RD designs within the
continuity framework. The most common approach is to use (non-parametric) local polynomial
methods to approximate the two regression functions near the cutoff. The implementation requires
choosing a bandwidth around the cutoff, and then fitting two polynomial regressions of the outcome
on the score – one above and the other below the cutoff – using only observations with scores within
the chosen region around the cutoff as determined by the bandwidth. The bandwidth is typically
chosen in an optimal and data-driven fashion to ensure transparency and to avoid specification
searching. The most common strategy is to choose a bandwidth that minimizes the asymptotic mean
squared error (MSE) of the RD point estimator (see [12], and references therein).
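In stylized form, writing B and V for the leading smoothing-bias and variance constants of the local
linear fit (schematic quantities determined by the data generating process and the kernel), the
trade-off behind this choice is

$$\mathrm{MSE}(h) \approx B^2 h^4 + \frac{V}{nh}, \qquad h_{\mathrm{MSE}} = \left(\frac{V}{4B^2}\right)^{1/5} n^{-1/5},$$

so the MSE-optimal bandwidth shrinks at the rate $n^{-1/5}$.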
Although the implementation of local polynomial methods reduces to fitting two local, lower-
order polynomials using least-squares methods, the interpretation and properties are different from the
typical parametric least squares framework. Local polynomials are meant to provide a non-parametric
approximation to the regression functions of the potential outcomes given the score. Because these
functions are fundamentally unknown, the local polynomial approximation has an unavoidable
misspecification error. When the bandwidth is chosen to be MSE-optimal, this approximation error
appears in the standard distributional approximation and renders conventional least squares
confidence intervals invalid, because conventional inference assumes that the approximation error is
zero. To address this problem, [13] developed valid confidence intervals based on a novel distributional approximation for a test
statistic in which the RD point estimator is adjusted using bias correction and also the standard error
is modified to account for the additional variability introduced by the bias estimation step. These
robust bias-corrected confidence intervals lead to valid inferences when using the MSE-optimal
bandwidth, and remain valid for other bandwidth choices, in addition to having several other optimal
properties [12].
The canonical implementation of local polynomial estimation and inference in RD designs uses
only the outcome and the score, without additional covariates. We discuss below how the local
polynomial approach can be augmented to include pre-determined covariates in the two polynomial
fits. In addition, we discuss how the local polynomial estimation and inference methods are adapted
to incorporate covariates for other purposes such as heterogeneity analysis or extrapolation.
In the local randomization framework for RD analysis, the parameter of interest is defined
differently because the core identifying assumptions are different from those in the continuity
framework [10, 14]. Because in the local randomization approach the central assumption is that
the RD design creates conditions that resemble a randomized experiment in a neighborhood
near the cutoff, the parameter is defined within this neighborhood—as opposed to at the cutoff
as in the continuity approach. The local randomization parameter can be generically written as
$$\tau_{\mathrm{SLR}} = \frac{1}{N_W} \sum_{i: X_i \in W} \mathbb{E}_W[Y_i(1) - Y_i(0)],$$

which can equivalently be written as

$$\tau_{\mathrm{SLR}} = \frac{1}{N_W} \sum_{i: X_i \in W} \mathbb{E}_W\!\left[\frac{T_i Y_i}{P_W[T_i = 1]}\right] - \frac{1}{N_W} \sum_{i: X_i \in W} \mathbb{E}_W\!\left[\frac{(1 - T_i) Y_i}{1 - P_W[T_i = 1]}\right],$$
where PW denotes a probability computed conditionally for all units with Xi ∈ W, which indicates a
natural estimator by virtue of the known assignment mechanism within W and the other assumptions
imposed.
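For instance, when treatment is taken to be assigned as in a completely randomized experiment within
W, that natural estimator reduces to a difference in means. The following is a minimal sketch under
that assumption (the function and argument names are ours, not from any package), with the score
normalized so that the cutoff is zero:

```python
import numpy as np

def local_randomization_estimate(y, x, t, w):
    """Difference-in-means estimate of tau_SLR inside the symmetric window
    W = [-w, w], with the score x normalized so the cutoff is zero and the
    treatment t assumed as-if randomly assigned within W."""
    in_w = np.abs(x) <= w                      # units with scores inside W
    yw, tw = y[in_w], t[in_w]
    return yw[tw == 1].mean() - yw[tw == 0].mean()
```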
The window W where the local randomization assumptions are assumed to hold is unknown
in most applications. The recommended approach to choose this window in practice is based on
balance tests performed on one or more pre-determined covariates [14]. The main idea behind this
method is that the treatment effect on any pre-determined covariate is known to be zero, and thus
a non-zero effect is evidence that the local randomization assumptions do not hold. Assuming that
the covariate is correlated with the score outside of W but not inside, a test of the hypothesis that
the mean (or other feature of the distribution) of the pre-determined covariate is the same in the
treated and control groups should be rejected in all windows larger than W, and not rejected in W or
any window contained in it. The method thus consists of performing balance tests in nested (often
symmetric) windows, starting with the largest possible window and continuing until the hypothesis
of covariate balance fails to be rejected in one window and in all windows contained in it.
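A stylized sketch of this nested-window procedure follows, under simplifying assumptions of our own
choosing: a grid of symmetric half-widths, a two-sided rank-sum balance test for each covariate, an
illustrative threshold alpha = 0.15, and candidate windows that each contain both treated and control
units. Equivalently to shrinking from the largest window, the code enlarges the window while balance
is not rejected and stops at the first rejection:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def select_window(x, t, covariates, half_widths, alpha=0.15):
    """Pick the largest symmetric window around the cutoff (score normalized
    to zero) such that every pre-determined covariate in `covariates` (a list
    of length-n arrays) passes a balance test in that window and in all
    smaller nested windows on the grid."""
    chosen = None
    for w in sorted(half_widths):              # smallest window first
        in_w = np.abs(x) <= w
        # smallest balance p-value across the covariates inside this window
        p = min(mannwhitneyu(z[in_w & (t == 1)], z[in_w & (t == 0)],
                             alternative="two-sided").pvalue
                for z in covariates)
        if p >= alpha:
            chosen = w                         # balance not rejected: enlarge
        else:
            break                              # first rejection: stop
    return chosen
```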
The window W in the local randomization approach is analogous to the bandwidth in the
continuity-based local polynomial analysis. An important distinction, however, is that the bandwidth
can be chosen in a data-driven and optimal way using only data on the outcome and the score, while
choosing the local randomization window W requires data on auxiliary pre-determined covariates.
In this sense, the local polynomial analysis can be fully implemented without additional covariates,
while the local randomization analysis typically cannot—unless W is known or chosen in an arbitrary
manner. For further details on RD estimation and inference under the continuity-based and local
randomization approaches, we refer the reader to [5–9], and references therein.
8.2.2 Efficiency and power improvements

In the analysis of randomized experiments, adjustment for pre-intervention covariates is a standard
tool for increasing efficiency and statistical power [15].2
2 The RD literature has methodological developments only for simple power calculations and survey sample design [16];
no work is available on how to use ex-ante survey information on covariates to increase efficiency in the ex-post
analysis.
These ideas can be applied to the analysis of RD designs with suitable modifications. In the local
randomization RD approach, the particular properties of covariate adjustment follow directly from
applying the results from the literature on experiments to the window around the cutoff where the local
randomization assumptions are assumed to hold. Depending on the assumptions imposed (Fisherian,
Neyman, or super-population), and on the specific estimation and inference methods considered, the
efficiency gains may be more or less important.
Furthermore, regression-based covariate adjustment can also be used in the local randomization
approach to relax some of the underlying assumptions. For example, [10] model the potential
outcomes as polynomials of the running variable, to allow for a more flexible functional form of the
potential outcomes inside the window W where the local randomization assumptions are assumed to
hold. This model can be extended directly to include pre-determined covariates in the polynomial
model, in addition to the running variable.
In the continuity-based framework, applying the same arguments is not immediate because,
although the local polynomial RD estimator is based on linear regressions above and below the cutoff,
these regressions are non-parametric and the average treatment effect is estimated by extrapolating to
the cutoff point. The properties of regression-based covariate adjustment in this context were first
studied by [17], who show that the inclusion of covariates other than the score in a local polynomial
analysis can lead to asymptotic efficiency gains, if carefully implemented. The authors also show
that covariate adjustment can result in unintended changes to the parameter that is being estimated
depending on how the covariates are introduced in the estimation, a topic we revisit in Section 8.3.
The standard local linear estimator of the RD treatment effect is implemented by running the weighted
least squares regression of Yi on a constant, Ti , Xi , and Xi Ti using only units with scores inside the
chosen bandwidth, Xi ∈ [c − h, c + h], and applying weights based on some kernel function. This
leads to a point estimator, which can be interpreted as MSE-optimal if the bandwidth employed is
chosen to minimize the mean squared error of the RD point estimator. This estimator includes only
the score Xi , and thus it is said to be “unadjusted” because it does not incorporate any additional
pre-treatment characteristics of the units. As discussed above, under standard continuity assumptions,
the unadjusted local linear estimator τ̂SRD is consistent for the continuity-based RD treatment effect
τSRD , and robust bias-corrected inference can be employed as is now standard in the literature.
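To fix ideas, here is a minimal sketch of this unadjusted estimator, assuming a triangular kernel and
a user-supplied bandwidth h (in practice chosen in the data-driven, MSE-optimal way discussed above);
it is a stylized illustration only, not a substitute for the robust bias-corrected inference methods
referenced in the text:

```python
import numpy as np

def local_linear_rd(y, x, c, h):
    """Unadjusted local linear sharp RD estimate: weighted least squares of y
    on (1, T, X - c, T*(X - c)) for |X - c| <= h with triangular kernel
    weights; the coefficient on T is the estimated jump at the cutoff."""
    u = x - c                                  # centered score
    keep = np.abs(u) <= h
    u, yk = u[keep], y[keep]
    t = (u >= 0).astype(float)                 # sharp rule T = 1{X >= c}
    sw = np.sqrt(1.0 - np.abs(u) / h)          # square root of kernel weights
    X = np.column_stack([np.ones_like(u), t, u, t * u])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], yk * sw, rcond=None)
    return beta[1]
```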
In practice, it is common for researchers to adjust the RD estimator by augmenting the local
polynomial specification with the additional pre-determined covariates Zi . [17] show that augmenting
the local polynomial specification by adding covariates in an additive-separable, linear-in-parameters
way that imposes the same common coefficient for treated and control groups leads to a covariate-adjusted
RD estimator that remains consistent for the canonical sharp RD treatment effect τSRD, while
offering a reduction in its variance in large samples (see also [18]). The key required restriction, of
course, is that the covariates are pre-intervention, and that they have non-zero correlation with the
outcome. More specifically, whenever the effect of the additional pre-intervention covariates on the
potential outcomes near the cutoff is (approximately) the same in the control and treatment groups,
augmenting the specification with these covariates can lead to efficiency gains. Furthermore, the
authors also show that other approaches commonly used in the applied RD literature for regression-
based covariate adjustment can lead to inconsistent estimators of the parameter of interest τSRD .
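A corresponding sketch of this additive-separable adjustment, under the same stylized assumptions as
the unadjusted estimator above, simply appends the covariate block to the local linear design matrix
with a single coefficient common to treated and control units:

```python
import numpy as np

def covariate_adjusted_rd(y, x, z, c, h):
    """Covariate-adjusted local linear RD estimate: the covariates z (an
    array with n rows) enter additively with one common coefficient, so the
    coefficient on T still targets the canonical effect tau_SRD."""
    u = x - c
    keep = np.abs(u) <= h
    u, yk = u[keep], y[keep]
    zk = np.asarray(z)[keep].reshape(keep.sum(), -1)   # (n_h, k) covariates
    t = (u >= 0).astype(float)
    sw = np.sqrt(1.0 - np.abs(u) / h)                  # triangular kernel
    X = np.column_stack([np.ones_like(u), t, u, t * u, zk])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], yk * sw, rcond=None)
    return beta[1]
```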
More recently, [19] proposed a high-dimensional implementation of the covariate-adjusted local
polynomial regression approach of [17], which allows for selecting a subset of pre-intervention
covariates out of a potentially large covariate pool. Their method is based on penalization techniques
(Lasso) and can lead to further efficiency improvements in practice. More generally, combining
modern high-dimensional and machine learning methods can be a fruitful avenue for further efficiency
improvements via covariate adjustment in RD settings. For example, non-parametric or machine
learning adjustments can be applied directly to the outcome variable to then employ the residualized
outcomes in the subsequent local polynomial analysis, a procedure that can lead to further efficiency
and power improvements.
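As a sketch of this residualization idea (our illustration, not the specific proposal of [19]), one
might fit a cross-validated Lasso of the outcome on the covariate pool and pass the residuals to the
local polynomial analysis:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def residualize_outcome(y, z):
    """Remove the covariate-predictable part of the outcome with a
    cross-validated Lasso fit on the pre-intervention covariate pool z
    (an n-by-k array); the residuals then replace y in the local
    polynomial RD analysis."""
    fit = LassoCV(cv=5).fit(z, y)
    return y - fit.predict(z)

# e.g., tau_hat = local_linear_rd(residualize_outcome(y, Z), x, c, h)
```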
In sum, our discussion highlights several important issues. First, standard methods for covariate
adjustment in the analysis of experiments can in principle be applied to RD designs with suitable
modifications depending on the specific conceptual framework used. These methods and their
properties apply immediately to RD estimation and inference in a local randomization framework,
within the window W where the local randomization assumption is assumed to hold. The extension to
continuity-based RD analysis is less immediate, as it requires considering the properties of the specific
techniques at boundary points in a non-parametric local polynomial regression setup. However, there
is a critical conceptual point: when using covariates for efficiency gains in RD designs, adjusting for
those covariates should not change the RD point estimate. Regardless of the methods considered,
covariate adjustment should not change the parameter of interest, and therefore unadjusted and
adjusted point estimators should be similar in applications.
Second, pre-intervention covariates can also affect tuning parameter selection and implementation
of the RD design more broadly. For example, in the local randomization framework, covariates can
be used to select the window where the key assumptions hold and to relax those assumptions, while
in the continuity-based framework they can be used to further refine bandwidth selection for local
polynomial estimation and robust bias-corrected inference.
Finally, while augmenting RD analysis with covariates to increase efficiency is a principled
goal, researchers should be careful to avoid using covariate adjustment for specification searches.
In empirical studies, researchers should always report unadjusted RD results first, and covariate-
adjusted results second. Covariate adjustment can be implemented in many different ways, and
different adjustments might lead to different results and conclusions. Instead of trying multiple
ways of covariate adjustment at the time of analysis, researchers should pre-register their pre-
ferred adjustment method, and perform only that specification as opposed to trying multiple regres-
sion models. We return to this important point when discussing recommendations for practice in
Section 8.4.
8.2.3 Auxiliary information

Covariates also play a role in Bayesian approaches to RD estimation. In the Bayesian nonparametric
causal model of [23], for example, the outcome model specifications include the running variable and
the treatment indicator, but other covariates can be added to the
specification. A somewhat more general use of pre-determined covariates in Bayesian approaches
to RD estimation is to use them to inform priors. Here, researchers sometimes use pre-treatment
information rather than specific covariates. For example, in their study of the effect of prescribing
statins to patients with high cardiovascular disease risk scores on their future LDL cholesterol
levels, [24] use data on previous experimental studies to formulate informative priors in a Bayesian
approach to estimate RD effects. Finally, in a related approach, [25] use principal stratification in
a Bayesian framework to estimate complier average treatment effects in fuzzy RD designs. They
model the imperfect compliance near the cutoff using a discrete confounder variable that captures
different types of subjects: compliers, never-takers, and always-takers. Their setup leads to a mixture
model averaged over the unknown subject types. The compliers’ potential outcomes are modeled as
a function of observed pre-determined covariates and the running variable, plus an idiosyncratic
shock that follows a t-distribution. Given a prior distribution on the type probabilities, the posterior
distribution of the complier average treatment effect is obtained by Markov chain Monte Carlo
(MCMC) methods.
These types of methods have not been widely adopted in practice, but they provide useful
examples of the different roles that covariate adjustment methods can have in RD applications.
Importantly, in all these methods, the goal is ultimately to identify, estimate and conduct inference
on the canonical (sharp and other) RD treatment effects, that is, without changing the parameter
of interest. In the upcoming sections, we discuss covariate adjustment methods that do change the
estimand and therefore require careful interpretation.
8.2.4 Treatment effect heterogeneity

Pre-determined covariates can also be used to study heterogeneous RD treatment effects. In the sharp
continuity-based framework, the conditional RD treatment effect can be defined as

$$\tau_{\mathrm{CSRD}}(z) = \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i = c, Z_i = z],$$

where Zi denotes the additional pre-intervention covariate. The parameter τCSRD (z) corresponds
to the (conditional, sharp) RD treatment effect at the cutoff for the subpopulation with covariates
Zi = z. Naturally, under appropriate assumptions, the canonical sharp RD treatment effect τSRD
becomes a weighted average of the conditional RD treatment effects, with weights related to the
conditional distribution of Zi |Xi near the cutoff. Similar ideas apply to the local randomization
framework, depending on the specific assumptions invoked.
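In symbols, under the continuity assumptions above, one such aggregation is

$$\tau_{\mathrm{SRD}} = \mathbb{E}[\tau_{\mathrm{CSRD}}(Z_i) \mid X_i = c] = \int \tau_{\mathrm{CSRD}}(z)\, dF_{Z \mid X = c}(z),$$

where $F_{Z \mid X = c}$ denotes the conditional distribution of Zi given Xi near the cutoff.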
When the covariates Zi are few and discrete, the simplest strategy to explore heterogeneity is
to conduct the RD analysis for each subgroup defined by Zi = z, for each value z, separately. For
example, in a medical experiment, researchers may be interested in separately estimating treatment
effects for patients in different age groups, e.g., patients between 45 and 64 versus patients 65
and above. In such cases separate analyses by subgroups have the advantage of being entirely
non-parametric and involving no additional assumptions. One disadvantage is that the number
of observations in each subgroup is lower than in the overall analysis, which reduces statistical
power. Whether this is a limitation will depend on the number of observations and other features
of the data generating process in each particular application. Practically, this type of analysis can
also be implemented by generating indicator variables for each category of the pre-intervention
covariates, and then conducting estimation and inference using a fully saturated, interacted local
polynomial regression model; if the indicators cover all possible subgroups, this strategy is equivalent
to estimating effects separately by subgroup. This approach is also valid with respect to estimation
and inference, and can be deployed in both the continuity-based and local randomization frameworks.
However, if the covariates are multi-valued or continuous, using an interactive model imposes
additional parametric assumptions, local to the cutoff, and the validity of inferences will depend on
the validity of these assumptions.
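To illustrate the equivalence noted above for discrete covariates, and reusing the local_linear_rd
sketch from earlier (with the same stylized assumptions), the subgroup-by-subgroup analysis can be
written as:

```python
import numpy as np

def subgroup_rd_effects(y, x, g, c, h):
    """RD effect within each subgroup defined by the discrete covariate g;
    numerically equivalent to a fully saturated, interacted local
    polynomial model over the subgroup indicators."""
    return {val: local_linear_rd(y[g == val], x[g == val], c, h)
            for val in np.unique(g)}
```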
If the number of covariates is large and researchers wish to explore heterogeneity in multiple
dimensions, estimating separate average treatment effects by subgroups may be infeasible or imprac-
tical. In this case, researchers are usually interested in exploring heterogeneity in a large number
of covariate partitions without specifying subgroups a priori. Standard non-parametric or machine
learning methods can be useful in this situation, as they allow researchers to learn the relevant
subgroups from the data. Machine learning methods in the continuity-based RD framework require
modifications and have begun to be explored only recently. In particular, [26] explores heterogeneity
of the RD effect in subpopulations defined by levels of pre-determined covariates, creating a tree
where each leaf contains an RD effect estimated on an independent sample. The approach assumes
a parametric q-th order polynomial in each leaf of the tree; under this parametric assumption and
additional regularity assumptions, the method leads to subgroup point estimates and standard errors
that can be used for inference.
A different continuity-based approach to examine RD treatment effect heterogeneity on different
subpopulations is proposed by [27]. The authors impose stronger continuity conditions than in the
standard RD design, requiring (in the sharp RD case) continuity of the expectation of the potential
outcomes conditional on both the running variable and the additional covariates, and continuity
of the conditional distribution of the additional covariates given the running variable. Under these
conditions, they derive conditional average treatment effects given the covariates, and propose tests
for three hypotheses: (i) that the treatment is beneficial for at least some subpopulations defined
by values of the pre-determined covariates, (ii) that the treatment has any impact on at least some
subpopulations, and (iii) that the treatment effect is heterogeneous across all subpopulations.
[27] define the null and alternative hypotheses by conditional moment inequalities given both
the running variable and the additional covariates, and convert these conditional moment inequalities
into an infinite number of unconditional-on-covariates moment inequalities (that is, inequalities that
are conditional only on the running variable and not on the additional covariates). Once the null and
the alternative hypotheses are re-defined as instrumented moment conditions, these are estimated
with local linear polynomials. Under regularity conditions, the asymptotic distribution of the local
polynomial estimators of the moment conditions can then be used to derive the distribution of test
statistics under the null hypothesis. In a follow-up paper, [28] employ similar methods to develop
a monotonicity test to assess whether a conditional local average treatment effect in a sharp RD
design or a conditional local average treatment effect for compliers in a fuzzy RD design has a
monotonic relationship with an observed pre-determined covariate – that is, the null hypothesis is that
the conditional local average treatment effect, seen as a function of a covariate z, is non-decreasing
in z.
Covariate adjustments for heterogeneity analysis within the local randomization framework may
be more challenging due to limited sample sizes. Since the local randomization framework is generi-
cally more appropriate for very small windows W around the cutoff, and these windows typically
contain only a few observations, conducting estimation and inference for subsets of observations
according to Zi = z can be difficult in practice, leading to limited statistical power.
In sum, there are several important issues when using covariate adjustment for the analysis of
heterogeneous RD treatment effects. First, focusing on treatment effect heterogeneity requires that
researchers redefine the parameter of interest. Second, estimating different RD treatment effects for
different subgroups defined by covariate values can be done in a non-parametric way by estimating
fully saturated models – or equivalently, by analyzing each sub-group separately. When the covariate
dimensionality is too high and sub-group analysis is not possible, machine learning methods may
offer an attractive strategy to discover relevant subgroups. Finally, augmented models that include
interactions between the RD treatment and non-binary covariates require additional parametric
assumptions for identification and inference.
8.2.5 Other parameters of interest and extrapolation

Pre-determined covariates can also be used to extrapolate RD treatment effects away from the cutoff.
[34] propose a conditional independence assumption under which covariates allow the researcher to
study the effect of the treatment for observations far from the cutoff. This is formalized by assuming
that, conditional on the covariates, the potential outcomes are mean-independent of the running
variable. This “selection on observables” assumption (together with the standard common support
assumption) immediately allows for extrapolation of the RD treatment effect, since it is assumed to
hold for the entire support of the running variable. In this case covariates allow for the identification
of new parameters that capture the effect of the treatment at different values or regions of the running
variable.
A similar conditional independence assumption has been used in the context of geographic and
multi-score RD designs. In a geographic RD design, a treatment is assigned to units in a geographic
area and withheld from units in an adjacent geographic area. In this setup, units can be thought of as
having a score (such as a latitude-longitude pair) that uniquely defines their geographic location and
allows the researcher to calculate their distance to any point on the boundary between the treated and
control areas. The assignment of the treatment can then be viewed as a deterministic function of this
score, and the probability of receiving treatment jumps discontinuously at the border that separates
the treated and control areas. As discussed by [35], seen in this way, the geographic RD design can be
analyzed with the same continuity-based and local randomization tools of standard, one-dimensional
RD designs [36] consider a geographic RD application where units appear to choose their location on
either side of the boundary in a strategic way. They propose a conditional independence assumption
according to which, for each point on the boundary, the potential outcomes are independent of the
treatment assignment conditional on a set of covariates for units located in a neighborhood of that
point (see also [37] for an extension to multiple dimensions). An important difference between their
assumption and the conditional independence assumption proposed by [34] is that the former imposes
conditional independence in a neighborhood of the cutoff and thus allows for extrapolation only
within a neighborhood, while the latter imposes (mean) conditional independence along the entire
support of the running variable and thus allows for extrapolation in the entire support of the running
variable. Despite the differences, in both cases covariates are used for identification purposes, to
define new parameters of interest that capture the average treatment effect at values of the running
variable that are different from the cutoff.
Another use of auxiliary covariates for extrapolation purposes is proposed by [38], who augment
the usual RD design with an exogenous measure of the outcome variable. Under the assumption that
the exogenous outcome data can be used to consistently estimate the regression function of the actual
outcome on the score in the absence of the treatment, treatment effects can be extrapolated to values
of the score other than the actual cutoff used for assignment.
In sum, additional covariates can be used in the RD design to define new parameters of interest,
including parameters capturing treatment effects for units with score values different from the original
cutoff. These covariates are sometimes part of the original RD design, as in the case of multiple
cutoffs or multiple time periods, while in other settings they are external to the design and must be
collected separately, as in the case of auxiliary unit characteristics or exogenous outcomes.
8.3 Can Covariates Fix a Broken RD Design?

In observational studies analyzed under selection on observables, researchers often assume that
(Yi (1), Yi (0)) ⊥⊥ Di | Zi in order to identify treatment effects. This assumption
directly justifies covariate adjustment via regression models, inverse probability weighting, and other
matching methods.
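For completeness, the standard identification identity that these adjustments exploit is, under the
conditional independence and common support assumptions,

$$\mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}_{Z}\big[\, \mathbb{E}[Y_i \mid D_i = 1, Z_i] - \mathbb{E}[Y_i \mid D_i = 0, Z_i] \,\big].$$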
This observation has motivated some researchers to employ covariate adjustment to “fix” invalid
RD designs where the identifying assumptions are suspected not to hold. In these cases, it is common
to encounter RD analyses that include fixed effects or other covariates via additive linear adjustments
in (local) polynomial regression estimation. In the methodological RD literature, adjustment for
covariate imbalances has been proposed using multi-step non-parametric regression [39] and inverse
probability weighting [40].
In the continuity-based framework, covariate adjustment for imbalances in RD designs is done
within a framework where E[Yi (t)|Xi = x] is assumed to be discontinuous at the cutoff but
E[Yi (t)|Xi = x, Zi = z], seen as a function of x, is assumed continuous in x, for t = 0, 1. These
assumptions combined imply that the conditional distribution of Zi |Xi must be discontinuous at
the cutoff, which in turn implies that canonical average RD treatment effects are necessarily not
identifiable in general. Rather, in the absence of additional strong assumptions, the estimand that
can be identified in this case is a weighted average of treatment effects where the weights depend on
the conditional distributions of Zi |Xi for control and treatment units, which are different from each
other [17, 39]. As a consequence, in imbalanced RD designs, canonical RD treatment effects are not
identifiable in general, and by implication covariate adjustment cannot fix a broken RD design. In the
local randomization framework, the situation is analogous.
In conclusion, for invalid RD designs where important covariates are imbalanced at the cut-
off, covariate adjustment cannot restore identification of the canonical RD treatment effect. At
best, covariate adjustment can identify some weighted average of the conditional treatment effects,
E[Yi (1)|Xi = x, Zi = z] − E[Yi (0)|Xi = x, Zi = z], with weights determined by the properties
of the two distinct conditional distributions of Zi |Xi below and above the cutoff. Unfortunately, in
most applications, this covariate-adjusted estimand is not a parameter of interest and requires strong,
additional assumptions on the underlying data generating process to deliver a useful parameter of
interest. In other words, the main advantages of canonical RD designs in terms of identification are
lost once the key identifying assumptions fail, and cannot be restored by simply using covariate
adjustment methods without additional untestable assumptions.
8.4 Recommendations for Practice

A recurring concern is specification searching and multiple hypothesis testing when incorporating
covariate adjustment in the RD analysis. One possible solution to address
this concern is pre-registration of the analysis, which requires practitioners to declare ex-ante the
covariate adjustment approach to be used in the subsequent RD analysis in order to avoid ex-post
specification searching.
Pre-intervention covariates can also be employed for identification of other useful RD treatment
effects. In Section 8.2, we discussed different examples such as treatment effect heterogeneity, multi-
cutoff, geographic and multi-score designs, and extrapolation. These approaches offer relatively
principled ways of incorporating covariates into an RD analysis. In the case of treatment effect
heterogeneity, no substantial additional assumptions are needed, particularly in cases where the
covariates are discrete and low-dimensional, in which case subset analysis is a natural way to
proceed empirically. In the other cases, additional assumptions are needed in order to exploit
covariate adjustment for identification of other meaningful RD treatment effects. These covariate
adjustment practices are reasonable, and can be used in applications whenever the necessary additional
assumptions are clearly stated and judged to be plausible. Again, it is important to guard against
model specification searching, just like when employing pre-intervention covariates for efficiency
and power improvements.
Finally, an important misconception among some practitioners and methodologists is that covari-
ates can be used to restore identification or “fix” an RD design where observations just above the
cutoff are very different from observations just below it – in other words, a design where there is
covariate imbalance at or near the cutoff and the RD assumptions are not supported by the empirical
evidence. However, as we discussed in Section 8.3, covariate adjustment in such cases requires strong
additional assumptions, and cannot in general recover canonical RD treatment effects. In the absence
of additional assumptions, adjusting for imbalanced covariates will at best recover other estimands at
the cutoff that are unlikely to be of substantive interest in applications.
8.5 Conclusion
We provided a conceptual overview of the different approaches for covariate adjustment in RD
designs. We discussed benign approaches based on pre-intervention covariates, including methods
for efficiency and power improvements, heterogeneity analysis, and extrapolation. Those methods
are principled and generally valid under additional reasonable assumptions on the data generating
process. However, we also highlighted a natural tension between incorporating covariates in an RD
analysis and issues related to p-hacking and specification searching. As a consequence, researchers
employing covariate adjustment in RD designs should always be cognizant of issues related to
model/covariate selection when implementing those methods in practice. Last but not least, we
addressed the common misconception that covariate adjustment can fix an otherwise invalid RD
design, and discussed how those methods can at best recover other estimands that may not be of
interest.
In conclusion, covariate adjustment can be a useful additional tool for the analysis and inter-
pretation of RD designs when implemented carefully and in a principled way. In canonical RD
settings it should never replace, only complement, the basic RD analysis based on the score and
outcome variables alone. In other settings, covariate adjustment can offer additional insights in terms
of heterogeneous or other treatment effects of interest, under additional design-specific assumptions.
8.6 Acknowledgments
We thank our collaborators Sebastian Calonico, Max Farrell, Michael Jansson, Xinwei Ma, Gonzalo
Vazquez-Bare, and Jose Zubizarreta for stimulating discussions on the topic of this chapter. Cattaneo
and Titiunik gratefully acknowledge financial support from the National Science Foundation (SES-
2019432), and Cattaneo gratefully acknowledges financial support from the National Institutes of
Health (R01 GM072611-16).
References
[1] Paul R. Rosenbaum. Design of Observational Studies. Springer-Verlag, 2010.
[2] Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical
Sciences. Cambridge University Press, 2015.
[3] Alberto Abadie and Matias D. Cattaneo. Econometric methods for program evaluation. Annual
Review of Economics, 10:465–503, 2018.
[4] Rocio Titiunik. Natural experiments. In J. N. Druckman and D. P. Green, editors, Advances in
Experimental Political Science, chapter 6, pages 103–129. Cambridge University Press, 2021.
[5] Matias D. Cattaneo, Rocio Titiunik, and Gonzalo Vazquez-Bare. The regression discontinuity
design. In L. Curini and R. J. Franzese, editors, Handbook of Research Methods in Political
Science and International Relations, chapter 44, pages 835–857. Sage Publications, 2020.
[6] Matias D. Cattaneo, Nicolás Idrobo, and Rocio Titiunik. A Practical Introduction to Regression
Discontinuity Designs: Foundations. Cambridge University Press, 2020.
[7] Matias D. Cattaneo, Nicolás Idrobo, and Rocio Titiunik. A Practical Introduction to Regression
Discontinuity Designs: Extensions. Cambridge University Press (to appear), 2023.
[8] Matias D. Cattaneo, Luke Keele, and Rocio Titiunik. A guide to regression discontinuity
designs in medical applications. Working Paper, 2022.
[9] Matias D. Cattaneo and Rocio Titiunik. Regression discontinuity designs. Annual Review of
Economics, 14:821–851, 2022.
[10] Matias D. Cattaneo, Rocio Titiunik, and Gonzalo Vazquez-Bare. Comparing inference approaches
for RD designs: A reexamination of the effect of Head Start on child mortality. Journal
of Policy Analysis and Management, 36(3):643–681, 2017.
[11] Jinyong Hahn, Petra Todd, and Wilbert van der Klaauw. Identification and estimation of
treatment effects with a regression-discontinuity design. Econometrica, 69(1):201–209, 2001.
[12] Sebastian Calonico, Matias D. Cattaneo, and Max H. Farrell. Optimal bandwidth choice for
robust bias corrected inference in regression discontinuity designs. Econometrics Journal,
23(2):192–210, 2020.
[13] Sebastian Calonico, Matias D. Cattaneo, and Rocio Titiunik. Robust nonparametric confidence
intervals for regression-discontinuity designs. Econometrica, 82(6):2295–2326, 2014.
[14] Matias D. Cattaneo, Brigham Frandsen, and Rocio Titiunik. Randomization inference in the
regression discontinuity design: An application to party advantages in the U.S. Senate. Journal
of Causal Inference, 3(1):1–24, 2015.
[15] Donald P. Green and Alan Gerber. Field Experiments: Design, Analysis, and Interpreta-
tion. W. W. Norton & Company, 2012.
[16] Matias D. Cattaneo, Rocio Titiunik, and Gonzalo Vazquez-Bare. Power calculations for
regression discontinuity designs. Stata Journal, 19(1):210–245, 2019.
[17] Sebastian Calonico, Matias D. Cattaneo, Max H. Farrell, and Rocio Titiunik. Regression
discontinuity designs using covariates. Review of Economics and Statistics, 101(3):442–451,
2019.
[18] Jun Ma and Zhengfei Yu. Empirical likelihood covariate adjustment for regression discontinuity
designs. arXiv:2008.09263, 2022.
[19] Yoichi Arai, Taisuke Otsu, and Myung Hwan Seo. Regression discontinuity design with
potentially many covariates. arXiv:2109.08351, 2021.
[20] Zhuan Pei and Yi Shen. The devil is in the tails: Regression discontinuity design with mea-
surement error in the assignment variable. In Matias D. Cattaneo and Juan Carlos Escanciano,
editors, Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics,
volume 38), pages 455–502. Emerald Group Publishing, 2017.
[21] Otavio Bartalotti, Quentin Brummet, and Steven Dieterle. A correction for regression disconti-
nuity designs with group-specific mismeasurement of the running variable. Journal of Business
& Economic Statistics, 39(3):833–848, 2021.
[22] Laurent Davezies and Thomas Le Barbanchon. Regression discontinuity design with continuous
measurement error in the running variable. Journal of Econometrics, 200(2):260–281, 2017.
[23] George Karabatsos and Stephen G Walker. A Bayesian nonparametric causal model for
regression discontinuity designs. In R. Mitra and P. Müller, editors, Nonparametric Bayesian
Inference in Biostatistics, pages 403–421, 2015.
[24] Sara Geneletti, Aidan G O’Keeffe, Linda D Sharples, Sylvia Richardson, and Gianluca Baio.
Bayesian regression discontinuity designs: Incorporating clinical knowledge in the causal
analysis of primary care data. Statistics in Medicine, 34(15):2334–2352, 2015.
[25] Siddhartha Chib and Liana Jacobi. Bayesian fuzzy regression discontinuity analysis and returns
to compulsory schooling. Journal of Applied Econometrics, 31(6):1026–1047, 2016.
[26] Agoston Reguly. Heterogeneous treatment effects in regression discontinuity designs.
arXiv:2106.11640, 2021.
[27] Yu-Chin Hsu and Shu Shen. Testing treatment effect heterogeneity in regression discontinuity
designs. Journal of Econometrics, 208(2):468–486, 2019.
[28] Yu-Chin Hsu and Shu Shen. Testing monotonicity of conditional treatment effects under
regression discontinuity designs. Journal of Applied Econometrics, 36(3):346–366, 2021.
[29] Matias D. Cattaneo, Luke Keele, Rocio Titiunik, and Gonzalo Vazquez-Bare. Interpreting
regression discontinuity designs with multiple cutoffs. Journal of Politics, 78(4):1229–1248,
2016.
[30] Yasin Kursat Onder and Mrittika Shamsuddin. Heterogeneous treatment under regression dis-
continuity design: Application to female high school enrolment. Oxford Bulletin of Economics
and Statistics, 81(4):744–767, 2019.
[31] Marinho Bertanha. Regression discontinuity design with many thresholds. Journal of Econo-
metrics, 218(1):216–241, 2020.
[32] Matias D Cattaneo, Luke Keele, Rocio Titiunik, and Gonzalo Vazquez-Bare. Extrapolating
treatment effects in multi-cutoff regression discontinuity designs. Journal of the American
Statistical Association, 116(536):1941–1952, 2021.
[33] Veronica Grembi, Tommaso Nannicini, and Ugo Troiano. Do fiscal rules matter? American
Economic Journal: Applied Economics, 8(3):1–30, 2016.
[34] Joshua D Angrist and Miikka Rokkanen. Wanna get away? Regression discontinuity estimation
of exam school effects away from the cutoff. Journal of the American Statistical Association,
110(512):1331–1344, 2015.
[35] Luke J Keele and Rocio Titiunik. Geographic boundaries as regression discontinuities. Political
Analysis, 23(1):127–155, 2015.
[36] Luke Keele, Rocio Titiunik, and Jose R Zubizarreta. Enhancing a geographic regression
discontinuity design through matching to estimate the effect of ballot initiatives on voter turnout.
Journal of the Royal Statistical Society. Series A (Statistics in Society), 178(1):223–239, 2015.
[37] Juan D Diaz and Jose R Zubizarreta. Complex discontinuity designs using covariates for policy
impact evaluation. Annals of Applied Statistics, forthcoming, 2022.
[38] Coady Wing and Thomas D Cook. Strengthening the regression discontinuity design using
additional design elements: A within-study comparison. Journal of Policy Analysis and
Management, 32(4):853–877, 2013.
[39] Markus Frolich and Martin Huber. Including covariates in the regression discontinuity design.
Journal of Business & Economic Statistics, 37(4):736–748, 2019.
[40] Sida Peng and Yang Ning. Regression discontinuity design under self-selection. In International
Conference on Artificial Intelligence and Statistics, pages 118–126. PMLR, 2021.
9
Risk Set Matching
CONTENTS
9.1 Treatment at Different Time Points
9.2 Methodology Overview
9.3 Implementation of Risk Set Matching
9.3.1 Sequential matching
9.3.2 Simultaneous matching
9.3.3 A toy example
9.4 Illustrative Real-World Studies
9.4.1 Evaluation of surgery for interstitial cystitis
9.4.2 Impact of premature infants staying longer on subsequent health care costs and outcomes
9.4.3 Drug effect for pregnant women with recurrent pre-term birth
9.5 Summary
9.6 Acknowledgement
9.7 Glossary
References
When a treatment may be given at different time points, it is natural to examine the treatment delay
effect, i.e., the impact of giving the treatment at an earlier time point rather than at some later
time point. In such situations, it is crucial to form matched pairs or sets in which patients have
similar covariate histories up to the time point of treatment, without matching on any post-treatment
event. Risk set matching is a powerful design for matching a cohort of subjects in which the treatment
was given at various times. The inference following such a design may reveal the causal effect of a
delayed treatment.
9.1 Treatment at Different Time Points

In some studies, most or all subjects may eventually receive the treatment if the follow-up is long
enough. A naive approach is to split the subjects into early- and late-treatment groups at an
intermediate time point and then contrast their outcomes. But again, this is a mistake, because the
timing of treatment can be substantially affected by underlying health status. For example, those who
receive treatment earlier may be much sicker, their rapid disease progression demanding early
treatment, while those who receive treatment later are healthier and can afford a longer wait.
A good illustration of this problem comes from early studies of the effects of transplantation [1].
A patient who needed an organ transplant was immediately enrolled into the study but had to wait
until a suitable organ became available, depending on many factors including blood type and severity
of illness. The early studies usually compared the duration of survival from entry into the study
for patients who were transplanted with that for patients who were not transplanted. Patients who
died before a matching organ became available were considered “controls” by definition. In a critique
of the early studies evaluating cardiac transplantation [2], Gail pointed out that “this assignment
method biases the results in favor of the transplanted group.” Patients with more severe conditions
were likely to die before a heart became available and thus were classified as controls. In contrast,
transplanted patients survived at least long enough to receive a heart. This way of assigning
transplant and control groups introduces bias into the comparison, as the control group was expected
to have shorter overall survival because many of them died before reaching the time point when a
transplant could have been conducted.
To address the difficulty introduced by the potential treatment delay of waiting for a matching heart,
Gail suggested a paired randomized experimental design that removes this bias. When a heart becomes
available, the two waiting patients most compatible with that heart are identified, and one
is chosen at random to receive it, while the other becomes a control. To assess the treatment effect
on survival, survival time is measured from the time of randomization. This design effectively
removes the biases associated with the fact that hearts are not immediately available to patients upon
entry into the study. Li and colleagues [3] suggested a slightly modified version of the design,
allowing the control subject, who did not receive the available heart, to remain eligible to receive a
suitable heart at a later time.
This modified design addresses a different problem, that of estimating the effect of a treatment delay, that
is, the effect of treating now as contrasted with waiting and possibly treating at a later time. In fact, the
conventional treated-versus-control comparison can be regarded as a special case of the treatment delay
problem in which the waiting time is forever. The effect of a treatment delay has important
practical implications for treatment management. Does treating immediately improve patient
outcomes? Or should we wait to see whether there are better options later? If immediate treatment is deemed
unnecessary, how long can the treatment be delayed? Conceivably, longer delays are likely to have
different effects than shorter delays. A randomized experiment with pre-specified delay times could
address those questions. Risk set matching is a design tool for observational studies analogous to
such randomized experiments. It takes advantage of the varying treatment times in an observational
dataset to create matched pairs of similar subjects, one treated immediately and the other
delayed. At the moment when someone receives treatment, that person is paired with another
who has not yet been treated and whose observed covariates are similar up to the treatment time point.
It is crucial to emphasize that their similarity is determined solely by the observed covariate history
before treatment. The pairing should not match on any post-treatment variable, which could
remove part of the treatment effect. As seen in the old way of analyzing transplant patients, controls
were defined as those who never received a transplant. Knowing who was a control is nearly the same as
knowing the future outcome, namely, that they died before a heart became available. This would
likely bias the treatment effect estimate, whereas risk set matching matches on the past, not the future.
Another important feature of risk set matching is that it can balance time-varying covariates, whereas
conventional two-group matching usually does not address temporal changes in covariates.
The chapter is organized as follows: Section 2 introduces some theoretical background on
risk set matching; Section 3 describes the implementation of risk set matching via both bipartite and
non-bipartite algorithms; Section 4 presents three illustrative examples from the literature, including
the evaluation of surgery for interstitial cystitis, the impact of premature infants staying longer on subsequent
health care costs and outcomes, and the effect of a drug for pregnant women with recurrent pre-term birth,
followed by a summary of the chapter.
To match on the observed histories, Li et al. [3] first picked important variables at different
time points (prior to treatment) and stratified them by quantiles, then conducted optimal matching
within each stratum using a Mahalanobis distance computed from relevant covariates. They also
pointed out that exact matching on the entire covariate history is not needed, and some dimension-reduction
strategy can be applied to obtain the same result. Lu [5] extended the idea by conducting
risk set matching on time-dependent propensity scores. The propensity score is estimated with a
Cox proportional hazards model with time-varying covariates. It can be shown that matching on the
covariate-related hazard component is sufficient to balance the covariate distributions, and hence to
justify randomization-based inference.
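As an illustration of this approach, here is a minimal sketch (not the authors' code) of fitting the treatment hazard model and extracting the linear propensity score in Python with the lifelines package; the data file and all column names are hypothetical.

# Sketch: time-dependent propensity score via a Cox model for the hazard
# of *treatment*, fit on counting-process (start/stop) data, in the
# spirit of Lu [5]. Hypothetical file and column names.
import pandas as pd
from lifelines import CoxTimeVaryingFitter

# One row per patient per interval; `treated` is 1 on the interval in
# which the patient initiates treatment, 0 otherwise.
df = pd.read_csv("long_format_cohort.csv")

ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="patient_id", event_col="treated",
        start_col="start", stop_col="stop")

# The linear hazard component beta'x(t) acts as a linear propensity
# score; risk-set distances can be taken as differences in this score.
covs = list(ctv.params_.index)
df["lin_ps"] = df[covs].to_numpy() @ ctv.params_.to_numpy()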
and late groups is 0.33, indicating a slightly beneficial effect (probably not significant) of delaying
the treatment.
It does not, however, take into account the fact that patients receiving treatment earlier are often
those who are sicker and need the treatment immediately. As is apparent in Table 9.1, patients A and B,
who receive treatment at week 1, have severity scores of 6 or higher in the first two weeks, while
patients K and L, who receive treatment much later, have scores of 5 or lower. Therefore, a more valid
comparison is to match patients on their scores up to the moment of treatment, then contrast the outcomes
of the one treated immediately and the one whose treatment is delayed.
We first use the sequential matching design to construct matched pairs at each week. Starting
from week 1, the risk set consists of two treated and ten not-yet-treated patients. For simplicity,
the distance metric between any two patients is the average absolute difference of their severity scores
prior to treatment. For example, since the treatment is given after week 1, the first two scores from each
patient (baseline and week 1) are used to calculate the distances for the week 1 risk set. The distance
between A and C is zero since they have exactly the same scores at baseline and week 1. The distance
between A and D is $(|6-6| + |7-6|)/2 = 0.5$. The distance between any two treated patients, or any
two not-yet-treated patients, is set to $\infty$ to prohibit matching within the same group. The full
distance matrix is shown below, and a conventional optimal bipartite matching algorithm produces the
pairs (A, C) and (B, D).
\[
\begin{array}{c|cccccccccccc}
 & A & B & C & D & E & F & G & H & I & J & K & L \\
\hline
A & \infty & \infty & 0 & 0.5 & 1 & 2.5 & 1 & 2.5 & 4 & 9 & 4 & 9 \\
B & \infty & \infty & 0.5 & 0 & 0.5 & 1 & 0.5 & 1 & 2.5 & 6.5 & 2.5 & 6.5 \\
C & 0 & 0.5 & \infty & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \infty \\
D & 0.5 & 0 & \infty & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \infty \\
E & 1 & 0.5 & \infty & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \infty \\
F & 2.5 & 1 & \infty & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \infty \\
G & 1 & 0.5 & \infty & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \infty \\
H & 2.5 & 1 & \infty & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \infty \\
I & 4 & 2.5 & \infty & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \infty \\
J & 9 & 6.5 & \infty & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \infty \\
K & 4 & 2.5 & \infty & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \infty \\
L & 9 & 6.5 & \infty & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \infty
\end{array}
\]
(Entries shown as $\cdots$ are all $\infty$, since C through L are all not-yet-treated at week 1.)
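The week 1 computation can be reproduced with a generic assignment solver. Below is a minimal sketch; the score histories are hypothetical stand-ins consistent with the distances stated above (Table 9.1 itself gives the full data), and only patients A through D are shown.

# Sketch: week-1 risk set match via optimal bipartite matching.
# Score histories (baseline, week 1) are hypothetical stand-ins chosen
# to be consistent with the distances stated in the text.
import numpy as np
from scipy.optimize import linear_sum_assignment

treated = {"A": [6, 7], "B": [6, 6]}
not_yet = {"C": [6, 7], "D": [6, 6]}  # remaining patients E-L omitted here

def avg_abs_diff(x, y):
    # average absolute difference of pre-treatment severity scores
    return float(np.mean(np.abs(np.asarray(x) - np.asarray(y))))

dist = np.array([[avg_abs_diff(t, c) for c in not_yet.values()]
                 for t in treated.values()])
rows, cols = linear_sum_assignment(dist)  # minimizes total distance
t_ids, c_ids = list(treated), list(not_yet)
print([(t_ids[r], c_ids[c]) for r, c in zip(rows, cols)])  # [('A','C'), ('B','D')]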
Prior to conducting the match for the week 2 risk set, the matched patients (A, B, C, D) are removed to
prevent matching the same patient more than once. Therefore, the week 2 risk set consists of eight
patients, two treated and six not-yet-treated. Now three pre-treatment severity scores (baseline, week 1,
and week 2) can be used in the distance calculation. For example, the distance between patients E and
H is $(|5-5| + |6-5| + |7-6|)/3 = 0.67$. The full distance matrix for the week 2 risk set is shown below,
and the optimal bipartite matching algorithm produces the pairs (E, G) and (F, H).
TABLE 9.2
Risk set matching process: sequential vs. simultaneous.

Sequential matching
Week 1: Treated: A, B; Not-yet-treated: C, D, E, F, G, H, I, J, K, L. Pairs: (A, C), (B, D)
Week 2: Treated: E, F; Not-yet-treated: G, H, I, J, K, L. Pairs: (E, G), (F, H)
Week 3: Treated: I, J; Not-yet-treated: K, L. Pairs: (I, K), (J, L)

Simultaneous matching
Weeks 1-3: All patients A, B, C, D, E, F, G, H, I, J, K, L matched in one step.
Pairs: (A, C), (B, D), (E, G), (F, H), (I, K), (J, L)
\[
\begin{array}{c|cccccccc}
 & E & F & G & H & I & J & K & L \\
\hline
E & \infty & \infty & 0 & 0.67 & 2 & 4 & 2 & 4 \\
F & \infty & \infty & 0.67 & 0 & 0.67 & 2 & 0.67 & 2 \\
G & 0 & 0.67 & \infty & \cdots & \cdots & \cdots & \cdots & \infty \\
H & 0.67 & 0 & \infty & \cdots & \cdots & \cdots & \cdots & \infty \\
I & 2 & 0.67 & \infty & \cdots & \cdots & \cdots & \cdots & \infty \\
J & 4 & 2 & \infty & \cdots & \cdots & \cdots & \cdots & \infty \\
K & 2 & 0.67 & \infty & \cdots & \cdots & \cdots & \cdots & \infty \\
L & 4 & 2 & \infty & \cdots & \cdots & \cdots & \cdots & \infty
\end{array}
\]
After removing the matched patients from the week 2 risk set, the week 3 risk set consists of only four
patients, two treated and two not-yet-treated. Four pre-treatment severity scores can be used in the
distance calculation at week 3. The full distance matrix for the week 3 risk set is shown below, and the
optimal bipartite matching algorithm produces the pairs (I, K) and (J, L).
\[
\begin{array}{c|cccc}
 & I & J & K & L \\
\hline
I & \infty & \infty & 0 & 0.5 \\
J & \infty & \infty & 0.5 & 0 \\
K & 0 & 0.5 & \infty & \infty \\
L & 0.5 & 0 & \infty & \infty
\end{array}
\]
The whole process of sequential matching is summarized in the upper panel of Table 9.2. The
bottom panel shows the process of simultaneous matching, which can be done in one step. This is
possible because simultaneous matching uses a nonbipartite algorithm, which regards patients treated
at different time points as different groups rather than being confined to only two treatment groups. The
key to implementing nonbipartite matching is to calculate the distance correctly at each time point. For
example, patient C may be used as a control if he or she is matched to A or B, or C may be used as a
treated subject if matched to G, H, I, J, K, or L. When matching with A or B, the distance is calculated
from the first two time points, since their treatment occurs at week 1. When matching with G through L,
the distance is calculated from the first three time points, since C's treatment occurs at week 2. The
distance between C and D, E, or F is set to $\infty$, because they are treated at the same time.
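A minimal sketch of this distance rule and of simultaneous matching via a general nonbipartite (graph) matching routine follows; the mini-cohort is hypothetical, and this is an illustration rather than the optimal nonbipartite software used in the literature.

# Sketch: simultaneous risk set matching as nonbipartite matching.
# Hypothetical mini-cohort: weekly severity scores (index 0 = baseline)
# and treatment weeks. A, B are treated at week 1; C, D at week 2.
import itertools
import networkx as nx

hist = {"A": [6, 7, 7], "B": [6, 6, 6], "C": [6, 7, 8], "D": [6, 6, 7]}
trt_week = {"A": 1, "B": 1, "C": 2, "D": 2}

def risk_set_distance(p, q):
    """Average absolute score difference up to the earlier treatment week;
    None (standing in for an infinite distance) if treated at the same time."""
    if trt_week[p] == trt_week[q]:
        return None
    k = min(trt_week[p], trt_week[q]) + 1  # scores through the earlier treatment week
    return sum(abs(a - b) for a, b in zip(hist[p][:k], hist[q][:k])) / k

G = nx.Graph()
for p, q in itertools.combinations(hist, 2):
    d = risk_set_distance(p, q)
    if d is not None:
        G.add_edge(p, q, weight=-d)  # negate: max-weight <=> min-distance

pairs = nx.max_weight_matching(G, maxcardinality=True)
print(pairs)  # e.g., {('A', 'C'), ('B', 'D')}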
like flipping a coin. It is important to point out that the treatment assignment looks random only at
the moment when the first patient receives surgery, as we do not know what will happen afterwards.
The control patient may receive surgery at the next follow-up time point, or not at all before the end of
the study. This paired design estimates the effect of having surgery now versus delaying surgery into
the future. This is a typical question that physicians and patients face in practice for chronic diseases.
Unlike many drug or behavioral therapies [10], C/H, as an invasive procedure, was given
at most once during the entire study period. This sets the stage for answering the causal question
about treatment delay, since there are no multiple treatment assignments for the same patient. In this C/H study
of IC patients, the time intervals between visits were approximately three months. By design, the
control patient did not receive surgery in the three-month interval immediately following the matched
treated patient's surgery. So, naturally, we can estimate the three-month treatment delay effect
using all matched pairs.
There are many patient-level characteristics to be balanced by matching, both fixed and
time-varying. At baseline, there are five binary demographic variables: Race, Education Level,
Full-Time Employment Status, Part-Time Employment Status, and Income. The symptom measures
are arguably more important because of their potentially strong association with the surgery decision.
Three measurements taken repeatedly every three months are Nocturnal Frequency of Voiding, Pain, and
Urgency. Nocturnal Frequency of Voiding is a daily count of nocturnal voids. Pain and Urgency are
scores on a 0-to-9 scale, with 9 indicating the most severe rating. These three measurements are
treated as time-varying covariates. After removing cases with missing covariate information,
the final analytical dataset consists of 424 patients: 273 controls and 151 patients who received
surgery at some point during the course of the study.
A Cox proportional hazards model with time-varying covariates is used to estimate each patient's
hazard of being treated at a given time point. Nocturnal Frequency of Voiding, Pain, and
Urgency measured at entry into the study are treated as baseline characteristics, along with the five
demographic variables. Symptom measures after baseline are treated as time-varying covariates.
Since the goal is to balance all covariates up to the moment of surgery, all predictors are kept in the
model regardless of their statistical significance. For the distance metric in matching, the Euclidean
distance between any two patients is calculated from the linear hazard component $\beta^T X_m(t)$,
which serves as a linear propensity score [11]. The distance is computed at each follow-up time point
at which a surgery occurred. If two patients were treated at the same time, their distance is
set to $\infty$.
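A sketch of this distance computation, with hypothetical data structures, is below; it takes the absolute difference of the linear propensity score at the follow-up time in question.

# Sketch: risk-set distances at a follow-up time t, based on the linear
# propensity score lin_ps[p][t] = beta'X_p(t). Hypothetical containers:
# surgery_month[p] is the month of surgery, or None if never treated.
import numpy as np

def risk_set_distances(t, lin_ps, surgery_month, patients):
    treated_now = [p for p in patients if surgery_month[p] == t]
    not_yet = [p for p in patients
               if surgery_month[p] is None or surgery_month[p] > t]
    d = np.empty((len(treated_now), len(not_yet)))
    for i, p in enumerate(treated_now):
        for j, q in enumerate(not_yet):
            d[i, j] = abs(lin_ps[p][t] - lin_ps[q][t])
    return treated_now, not_yet, d
# In the simultaneous (nonbipartite) version, a pair of patients treated
# at the same time would instead receive an infinite distance.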
The goal of risk set matching is to balance the distributions of both fixed and time-varying
covariates, and hence to remove observed confounding bias. For time-varying symptoms, it is desirable
to track their measurements at different time points. Two time points are selected for balance
checking: baseline and the month of surgery for the treated patient in each pair (denoted
"At Trt" in Table 1 of the original paper [5]; refer to the original paper for the table content). For
instance, if in one pair the treated patient received surgery six months after entry into the study, then
for both patients in this pair the "At Trt" month is six months post baseline. If in another pair
the treated patient received surgery 12 months after entry into the study, then for both patients the
"At Trt" month is 12 months post baseline. Because all covariates are categorical, the Mantel-
Haenszel test or Mantel's extension test was applied to the stratified 2 × 2 or 2 × k tables to check the
independence between each covariate and treatment group. The strata were the risk sets at different
time points. To measure the balance of covariates without matching, the Mantel-Haenszel test and
Mantel's extension test with only one stratum were applied to the data before matching, as shown
in the first column of Table 1 in the original paper [5]. Without matching, there is no "At Trt"
time point, so the last three rows are marked as not available. P-values from the corresponding tests are
reported for each scenario. Taking a p-value of less than 0.1 to indicate some imbalance,
Education, Part-Time Employment, and Frequency appear unbalanced before matching. After
either sequential or simultaneous matching, all covariates look balanced. This implies that, barring
any unmeasured confounder, within each matched pair the decision to have surgery is more or less
at random, and the outcome difference three months after the at-treatment time should reflect the
effect of surgery.
Li et al. [3] reported effect estimates based on a paired risk set matching design in which only
100 pairs of patients were used. Lu [5] included more patients with a 1-3 (one treated to three controls)
design. Ninety-one matched sets were formed, including 364 patients, 86% of the full data. Matching
with multiple controls is more efficient than pair matching in terms of the precision
of treatment effect estimates; specifically, the 1-3 design is about 50% more efficient than the pair
design. Following [5], the patients' outcomes for each symptom are compared at three different time
points: at baseline, at the time of treatment, and 3 months after the treatment. Each matched set
yields one value of the contrast for each measurement, defined using all three time points:
\[
\left( T_3 - \frac{T_b + T_0}{2} \right) - \left( C_3 - \frac{C_b + C_0}{2} \right),
\]
where $T_b$ is the measurement at baseline for the treated patient in the matched set, $T_0$ is the
measurement at the moment before the treatment was given, and $T_3$ is the measurement 3 months after
the treatment occurred. The same notation applies to the control patients. To accommodate multiple
controls, the average of the measurements of the three controls is used. The Wilcoxon signed rank test
is applied to the contrasts to test the hypothesis of no treatment effect. In Table 3 of the original
paper [5], the measures of Frequency, Pain, and Urgency are compared at different time points. As a
group, the treated and control patients are quite comparable both at baseline and at the
time of treatment. A significant drop is observed for Frequency, with a magnitude of about 0.75,
which equals one quarter of the Nocturnal Frequency of Voiding before the treatment. No significant
differences are observed for the Pain or Urgency scores. These findings are consistent with the results in
Li et al. [3].
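For concreteness, a minimal sketch of this contrast and test in Python follows; the arrays are illustrative numbers only (one entry per matched set), not data from the study.

# Sketch: the matched-set contrast and signed rank test described above.
import numpy as np
from scipy.stats import wilcoxon

Tb = np.array([3.0, 2.5, 4.0])   # treated: baseline
T0 = np.array([3.5, 3.0, 4.5])   # treated: just before treatment
T3 = np.array([2.0, 2.5, 3.0])   # treated: 3 months after treatment
Cb = np.array([3.0, 2.0, 4.0])   # controls: averages over the 3 controls
C0 = np.array([3.5, 2.5, 4.0])
C3 = np.array([3.0, 2.5, 4.5])

contrasts = (T3 - (Tb + T0) / 2) - (C3 - (Cb + C0) / 2)
stat, p = wilcoxon(contrasts)  # test of no treatment (delay) effect
print(stat, p)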
9.4.2 Impact of premature infants staying longer on subsequent health care costs
and outcomes
Premature babies are prone to complications, such as a weakened immune system and bleeding in
the brain, because babies mature best within the nurturing environment of the mother's womb. An infant born
2 or 3 months premature may need to spend a substantial amount of time in the neonatal intensive
care unit (NICU) to mature until additional physiologic functioning makes discharge appropriate. This
also carries a hefty price tag, as the estimated annual medical cost approaches $33,200 per
premature infant. So it is important to know: when a premature baby stays a few days longer in the
hospital, does the accompanying increased physiologic maturity reduce expenditures after discharge?
Silber et al. [12] provided an answer to this question using the risk set matching design, and their
study is summarized below. This study serves as a perfect example of one type of longitudinal
observational study in which everyone receives the treatment eventually, as all infants are
discharged sooner or later.
The Infant Functional Status (IFS) Study dataset was used in this evaluation; it includes
eligible infants born at one of five Kaiser Permanente Medical Care Program (KPMCP) hospitals
between 1998 and 2002. All infants surviving to discharge who were born at a gestational age (GA)
of 32 weeks or less were included in the study cohort, plus a random sample of infants with a GA
of 33 or 34 weeks. After removing infants meeting exclusion criteria or with missing data, the final
IFS cohort included 1402 infants, 246 of whom had a GA of 28 weeks or below. Information on
physiologic maturity, including respirator and incubator settings, body temperature, notations of
apnea and bradycardia, use of caffeine or methylxanthines, weight, feeding method, and requirements
for intravenous fluids, was obtained through chart abstraction. KPMCP resource estimates were used to
estimate costs related to hospital stays.
To answer the question of the effect of a longer stay on subsequent health care costs, a risk set matching design
was used to form 701 pairs of "Early" and "Late" babies. Early babies were those discharged
first within the pair, and Late babies were those who looked very similar on the day
each Early baby went home but who were actually discharged between 2 and 7 days later (using
postmenstrual age, PMA, as the time scale). The choice of 2-7 days represents a period of discretion
on the part of neonatologists that has economic significance. Because the two babies were very alike
at the time of the early discharge, in terms of multiple maturity and risk factors, the decision to
let one go home but not the other could be viewed as random. This served as the basis for a
causal comparison addressing whether the extra hospital stay would benefit the babies who received
it. Specifically, Silber and colleagues tried to answer two questions: (1) Are 6-month total costs
comparable between the Early and Late babies? and (2) Are 6-month clinical outcomes similar or
not?
Five types of costs were compared in their paper, and only one of them, "Total Cost" (TC), is
introduced here for illustrative purposes. TC is 180 days' worth of resource consumption starting
from the Early baby's discharge. TC is an adequate measure for cost comparison between Early and
Late babies in terms of PMA, as both babies were of the same age when the Early baby went home;
therefore, both babies had the same 180 days in terms of PMA. There were five deaths among the
1402 babies, and deaths were counted as infinite costs. Silber and colleagues converted a variety of
post-discharge clinical outcomes into coherent rank scores, with higher scores for babies with worse
outcomes. Because of the dimensionality of the clinical outcomes, when two babies were compared, one
might have uniformly worse outcomes than the other, or each might be worse on some outcomes. To
break ties, the score viewed death as the worst outcome; days in the ICU and total hospitalized
days as the second and third most serious; then the number of visits to the emergency department; and
lastly, sick visits to a physician.
A large number of covariates, listed in Table 1 of the original paper [12] (refer to the
original paper for the table content), were considered to ensure that matched babies were comparable in every
conceivable way. The distance metric was defined as a Mahalanobis distance on key covariates plus a
time-dependent propensity score caliper. The propensity score was fitted with a Cox proportional
hazards model with two time-varying covariates (daily maturity score and current weight) and other
fixed covariates. The optimal nonbipartite matching algorithm was implemented to minimize the
total covariate distance between babies discharged at different times. Table 1 of the original paper
presents covariate summaries for Early and Late babies at the time the Early babies were discharged,
together with their standardized differences (DIFFAVE, mean differences in units of standard deviations). The
standardized difference is a commonly used measure for assessing balance between two groups, with
values smaller than 0.1 indicating good balance. All covariates are well balanced, implying
that the matched babies were indeed comparable on the day one baby went home and the other stayed
in the hospital.
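A common form of this statistic is sketched below (mean difference in units of a pooled standard deviation); the exact DIFFAVE formula used in the original paper may differ slightly.

# Sketch: a standardized difference in the DIFFAVE spirit.
import numpy as np

def std_diff(x_treated, x_control):
    num = np.mean(x_treated) - np.mean(x_control)
    pooled_sd = np.sqrt((np.var(x_treated, ddof=1) +
                         np.var(x_control, ddof=1)) / 2)
    return num / pooled_sd

# Illustrative call; |std_diff| < 0.1 is conventionally read as good balance.
print(std_diff(np.array([34.1, 35.0, 33.8]), np.array([34.0, 34.8, 34.2])))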
As the matching algorithm achieved its goal of recreating a randomization-like scenario within
each pair, outcomes could then be brought in for analysis. The median and its 95% nonparametric confidence
interval were reported for each outcome by group in Table 2 of the original paper [12]. Wilcoxon's
signed rank test was applied to compare outcomes in matched pairs, with the associated Hodges-Lehmann
point estimate and confidence interval [13]. It turns out that TCs were higher for those who stayed
longer in the hospital, with a typical Late-minus-Early difference of $5016, which was highly significant.
On the other hand, no significant differences were found for clinical outcomes. Within Late-Early
matched pairs, a typical Late baby had outcomes that ranked slightly, but not significantly, worse
than the Early one (p = 0.21). Therefore, there is no evidence that early discharge was harmful.
Overall, using the risk set matching design, the findings suggest that delaying discharge significantly
increases hospital costs, and that this increase is not counterbalanced by subsequent savings
derived from babies being more mature at discharge.
9.4.3 Drug effect for pregnant women with recurrent pre-term birth
Survival time is an important type of outcome in clinical studies; it measures the time until a
specific event occurs, e.g., time until death after a certain treatment, or time until tumor relapse. In
an observational longitudinal cohort, people may receive treatments at different time points, and the
timing of treatment may affect the survival outcomes. Patients with more rapid disease progression,
and hence shorter survival time without treatment, may choose to receive the treatment at an early
time point. On the other hand, patients who are healthier may have longer survival times regardless of
treatment. Hade et al. [6] examined the effect of a preterm-birth risk-reducing drug on pregnant women
with a history of preterm birth (birth before 37 weeks' gestation), applying the risk set matching design
to balance covariates at the different treatment weeks during pregnancy.
For several decades, 17-alpha hydroxyprogesterone caproate (17P), a synthetic progestin, was
used to treat female hormone disturbances, to prevent recurrent abortion, and to treat uterine cancer,
but the drug was withdrawn in 2000 because better treatments for these conditions became available
[14]. A randomized trial in 2003 found that weekly injections of 250 mg of 17P beginning between 16
and 20 weeks' gestation reduced the risk of preterm birth and birth complications in a select group of
women with a history of preterm birth [15]. Since then, 17P has been available through
local and national compounding specialty pharmacies, although it was never approved by the FDA for the
prevention of preterm labor. 17P has been the standard of care for more than ten years, but its use is
not without controversy, because the mechanisms by which it works have not been fully determined
and it remains unclear why prophylactic treatment is not more widely effective in high-risk women. It
has been suggested that the inconsistent findings for 17P in trials and observational studies could be
due to variability in the timing of 17P initiation during pregnancy [16, 17].
Inconsistent effectiveness of 17P by timing of initiation has been reported previously; however,
no studies had fully examined the treatment delay effect beyond a naive "early" versus "late" two-group
comparison. Recent work by Ning et al. [17] reported improved outcomes with earlier initiation,
before 17 weeks' gestation, whereas other published results suggested no detrimental effect of
late 17P initiation [16]. Hade and colleagues [6] implemented the risk set matching design to carefully
examine the effect of delayed timing of 17P treatment on the time to delivery, in a cohort of women who
received prenatal care at a single academic medical center between 2011 and 2016.
Four hundred and twenty-one women with singleton pregnancies and a history of spontaneous
preterm birth who initiated 17P therapy were included in the cohort. All women had cervical length
(CL) measurements prior to the initiation of 17P. All patient characteristics were measured at the time
of the initial visit to the prematurity clinic (baseline), which occurred one or more weeks prior to 17P
initiation, with the exception of CL. The time-fixed covariates include demographics, insurance, and
some clinical measures (the full list is shown in Table 5 of the original paper [6]; refer to the original
paper for the table content). The time-varying CL measurements were obtained approximately every two
weeks, most often beginning at week 16. Initiation of 17P is recommended between 16 and
20 weeks' gestation and generally does not occur before 14 weeks' gestation or after 24 weeks.
To answer the causal question of whether delayed 17P initiation had a detrimental effect (a shorter
time to delivery), the risk set matching design was implemented to create pairs of treated and not-yet-
treated patients with similar covariate histories up to the moment that one woman received 17P first.
Patients in this cohort initiated 17P between 14 and 22 weeks' gestation, creating
nine possible risk sets (one per week). The time-dependent propensity score was estimated via a Cox
PH model with the time-varying CL measurement and the time-fixed covariates, to estimate the
hazard of receiving 17P at each gestational week. Pairs were required to have estimated
propensity scores within a caliper of 20% of the propensity score standard deviation. Moreover, to
mimic a clinically meaningful delay pattern, treated and not-yet-treated patients were paired only
if their initiation times were at least two weeks apart. At 14 weeks' gestation, the first possible week
of receiving treatment, 30 patients were treated with 17P (and 391 were not yet treated), and all 30
patients were paired with a not-yet-treated patient. Matching continued for each subsequent
week among unused patients until no more pairs could be formed. Overall, a total of 126 matched
pairs, 256 total patients, were created. As an alternative measure of covariate balance
(in addition to the standardized difference of means), the effect of each covariate on the hazard of
treatment, with the associated Wald test p-value, was calculated through a Cox PH model for the
time to treatment. This measure was reported both before and after matching in Table 5 of the
original paper. Ideally, if covariates are well balanced, significant effects of covariates on the hazard
of treatment (i.e., a HR different from 1 and a p-value < 0.10) are not expected. Before matching,
insurance status and the earliest gestational age of a prior preterm birth were strongly
associated with time to treatment. After matching, all covariates are sufficiently balanced.
The outcome of interest was the time to delivery in weeks from the moment that the first treatment
with 17P occurred in each pair. For survival outcomes under a matched design, the stratified log-rank (SLR)
test or the paired Prentice-Wilcoxon (PPW) test may be used [18, 19]. For estimating hazard ratios, a
stratified Cox PH model may be used, as long as the PH assumption holds [18]. All three tests failed
to reject the null hypothesis of no treatment delay effect on the time to delivery, as shown in Table 6 of
the original paper. The stratified Cox PH model also reported a non-significant hazard ratio of 0.95.
Therefore, delaying 17P initiation in women with a history of preterm birth did not appear to change
the subsequent time to delivery.
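A minimal sketch of such a stratified Cox analysis using the lifelines package follows; the data file, DataFrame, and column names are hypothetical.

# Sketch: stratified Cox PH for matched survival data, with the matched-
# pair identifier as the stratum. lifelines' CoxPHFitter accepts `strata`.
import pandas as pd
from lifelines import CoxPHFitter

matched = pd.read_csv("matched_pairs.csv")  # hypothetical file
cph = CoxPHFitter()
cph.fit(matched[["weeks_to_delivery", "delivered", "treated_first", "pair_id"]],
        duration_col="weeks_to_delivery", event_col="delivered",
        strata=["pair_id"])
print(cph.hazard_ratios_)  # HR for immediate vs. delayed 17P initiation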
9.5 Summary
In longitudinal studies with time-varying covariates, conventional fixed two-group matching designs
may not produce matched sets that adequately incorporate time-related information. Because time-varying
covariates may have a substantial impact on the treatment decision over time, it is important
to balance them at the time treatment is received. Risk set matching provides a unique design for
forming matched pairs in which patients have similar covariate histories up to the moment of treatment,
mimicking a randomization mechanism for delaying treatment at certain time points. It is particularly
useful in situations where the majority of the cohort would eventually receive the treatment, but
the treatment might be delayed because of individual characteristics. The conventional pair design might
instead compare a large group of ever-treated patients to a small group of never-treated patients, leading to
biased treatment effect estimates.
There are several practical issues in applying the risk set matching design to real data. First,
the design can incorporate time-varying covariates, but it cannot handle repeated treatments: an
individual may change status from not-treated to treated at most once in the entire study period.
This is reasonable for many surgical interventions, but may not be a reasonable framework
for some drug studies in which individuals' drug-taking behavior changes often over time. Second, to
obtain reasonable results, the number of treated subjects in each risk set should not vary dramatically
over time. If the majority of treatments occur at one particular time point, the matching will result
in one huge risk set alongside many small ones; the effect estimate is then probably a more accurate
reflection of that particular time point than of the entire study period. Third, like any study with
time-varying exposure, it requires a stability assumption about the treatment effect. For example,
pooling matched risk sets to estimate an overall causal effect may require an underlying
assumption of a homogeneous treatment delay effect, i.e., the same treatment effect
regardless of whether the delay occurs early or late in time.
It is well known that matching only balances observed confounders. Unmeasured confounding is
a major concern in observational studies because of the lack of randomization. Rosenbaum [20] proposed
a comprehensive framework, based on the matched design, for assessing the impact of potential hidden
bias on an observed significant association. Li et al. [3] specifically discussed sensitivity analysis for
potentially unmeasured time-varying confounding with continuous outcomes. Lu et al. [19] implemented a
sensitivity analysis for survival outcomes based on the PPW test in matched data.
More applications of risk set matching can be found in other subject areas. Haviland and
colleagues [21] used a simpler form of risk set matching to balance the covariate trajectory prior to
age 14 in order to examine whether joining a gang at 14 initiates a violent career for boys. Nieuwbeerta et al. [22]
combined group-based trajectory modeling with risk set matching to balance a variety of measurable
indicators of criminal propensity, in order to assess the impact of first-time imprisonment on offenders'
subsequent criminal career development. Apel et al. [23] adopted the risk set matching design to study
the impact of imprisonment on marriage and divorce, as a means of learning whether imprisonment
may impede people's ability to reintegrate into society. Zubizarreta et al. [24] linked risk set matching
to a general class of devices that can extract a natural experiment from an observational study. Yoo and
colleagues [25] created risk set matched pairs of kidney transplant patients and not-yet-treated
patients to compare their survival outcomes. More recent uses of risk set matching can be found
in [26, 27].
9.6 Acknowledgement
This work was partially supported by grant DMS-2015552 from the National Science Foundation. It
was also supported, in part, by the National Center for Advancing Translational Sciences of
the National Institutes of Health under grant number UL1TR002733. The content is solely the
responsibility of the authors and does not necessarily represent the official views of the National
Institutes of Health.
9.7 Glossary
Confounding: A causal inference concept. Generally speaking, confounding is present when a
common variable affects both the treatment assignment and the outcome, inducing a spurious
association.
Longitudinal Study: In a longitudinal study, subjects are measured repeatedly over an extended period
of time. Subjects usually have more than one observation; hence, the data are correlated over time.
Mahalanobis Distance: A multivariate measure of the distance between two subjects, in units of
standard deviations, that takes into account the correlations among the variables.
Nonbipartite Matching: An unconventional matching algorithm that pairs subjects who do not
come from a fixed two-group setup. It is often used for matching with multiple groups or with no
clearly defined groups.
Potential Outcomes: An outcome is a variable measured after treatment. In causal inference with
dichotomous treatment options, an outcome exists in two versions: one seen under treatment
and the other seen under control. The two cannot both be observed simultaneously; hence, they
are known as potential outcomes, sometimes also referred to as counterfactual outcomes.
Propensity Score: The conditional probability of receiving treatment given the observed covariates.
Proper adjustment for the propensity score leads to unbiased estimates of the treatment effect.
Risk Set Matching: When people receive treatments at different times, risk set matching pairs
people who were similar up to the moment that one of them received treatment. Risk set
matching controls for the past, not for the future.
Time-varying Variable: A variable whose values may change over time. It could be a covariate or
an outcome.
References
[1] P.R. Rosenbaum. Observation and Experiment: An Introduction to Causal Inference. Cambridge,
Massachusetts: Harvard University Press, 2017.
[2] M.H. Gail. Does cardiac transplantation prolong life? A reassessment. Annals of Internal
Medicine, 76:815–817, 1972.
[3] Y. Li, K.J. Propert, and P.R. Rosenbaum. Balanced risk set matching. Journal of the American
Statistical Association, 96(455):870–882, 2001.
[4] P.R. Rosenbaum and D.B. Rubin. The central role of the propensity score in observational studies
for causal effects. Biometrika, 70:41-55, 1983.
[5] B. Lu. Propensity score matching with time-dependent covariates. Biometrics, 61:721–728,
2005.
[6] E.M. Hade, G. Nattino, H.A. Frey, and B. Lu. Propensity score matching for treatment delay
effects with observational survival data. Statistical Methods in Medical Research, 29(3):695–
708, 2020.
[7] B.B. Hansen and J. Bowers. Covariate balance in simple, stratified and clustered comparative
studies. Statistical Science, 23(2):219-236, 2008.
[8] B. Lu, R. Greevy, X. Xu, and C. Beck. Optimal nonbipartite matching and its statistical
applications. The American Statistician, 65:21–30, 2011.
[9] K.J. Propert, A. Schaeffer, C. Brensinger, J.W. Kusek, L.M. Nyberg, J.R. Landis, and the
ICDB Study Group. A prospective study of interstitial cystitis: Results of longitudinal follow-up
of the Interstitial Cystitis Data Base cohort. Journal of Urology, 163:1434-1439, 2000.
[10] M. Joffe, D. Hoover, L. Jacobson, L. Kingsley, J. Chmiel, B. Visscher, and J. Robins. Estimating
the effect of zidovudine on Kaposi's sarcoma from observational data using a rank preserving
structural failure time model. Statistics in Medicine, 17:1073-1102, 1998.
[11] G.W. Imbens and D.B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences:
An Introduction. Cambridge University Press, 2015.
[12] J.H. Silber, S.A. Lorch, P.R. Rosenbaum, B. Medoff-Cooper, S. Bakewell-Sachs, A. Millman,
L. Mi, O. Even-Shoshan, and G.J. Escobar. Time to send the preemie home? Additional maturity
at discharge and subsequent health care costs and outcomes. Health Services Research,
44:444-463, 2009.
[13] M. Hollander and D.A. Wolfe. Nonparametric Statistical Methods. John Wiley & Sons, 1999.
[14] Y. Patel and M.M. Rumore. Hydroxyprogesterone caproate injection (Makena) one year later.
Pharmacy and Therapeutics, 37:405-411, 2012.
[15] P.J. Meis, M. Klebanoff, and E. Thom. Prevention of recurrent preterm delivery by 17 alpha-
hydroxyprogesterone caproate. New England Journal of Medicine, 348:2379–2385, 2003.
[16] H.Y. How, J.R. Barton, N.B. Istwan, D.J. Rhea, and G.J. Stanziano. Prophylaxis with 17 alpha-
hydroxyprogesterone caproate for prevention of recurrent preterm delivery: does gestational age
at initiation of treatment matter? American Journal of Obstetrics and Gynecology, 197(260):e1–
e4, 2007.
[17] A. Ning, C.J. Vladutiu, S.K. Dotters-Katz, and W.H. Goodnight. Gestational age at initiation
of 17-alpha hydroxyprogesterone caproate and recurrent preterm birth. American Journal of
Obstetrics and Gynecology, 217(371):e1–e7, 2017.
[18] P.C. Austin. The use of propensity score methods with survival or time-to-event outcomes:
reporting measures of effect similar to those used in randomized experiments. Statistics in
Medicine, 33(7):1242–1258, 2014.
[19] B. Lu, D. Cai, and X. Tong. Testing causal effects in observational survival data using propensity
score matching design. Statistics in Medicine, 37(11):1846–1858, 2018.
[20] P.R. Rosenbaum. Observational Studies. New York: Springer, 2002.
[21] A.M. Haviland, D.S. Nagin, P.R. Rosenbaum, and R.E. Tremblay. Combining group-based
trajectory modeling and propensity score matching for causal inferences in nonexperimental
longitudinal data. Developmental Psychology, 44:422–436, 2008.
[22] P. Nieuwbeerta, D.S. Nagin, and A.A.J. Blokland. Assessing the impact of first-time imprison-
ment on offenders’ subsequent criminal career development: A matched samples comparison.
Journal of Quantitative Criminology, 25:227–257, 2009.
[23] R. Apel, A.A.J. Blokland, P. Nieuwbeerta, and M. van Schellen. The impact of imprisonment
on marriage and divorce:a risk set matching approach. Journal of Quantitative Criminology,
26:269–300, 2010.
[24] J.R. Zubizarreta, D.S. Small, and P.R. Rosenbaum. Isolation in the construction of natural
experiments. The Annals of Applied Statistics, 8(4):2096-2121, 2014.
[25] K.D. Yoo, C.T. Kim, and J.P. Lee. Superior outcomes of kidney transplantation compared with
dialysis. Medicine, 95:33, 2016.
[26] D. Watson, A.B. Spaulding, and J. Dreyfus. Risk-set matching to assess the impact of hospital-
acquired bloodstream infections. American Journal of Epidemiology, 188(2):461–466, 2019.
[27] J.H. Silber, P.R. Rosenbaum, J.G. Reiter, A.S. Hill, S. Jain, D.A. Wolk, D.S. Small, S. Hashemi,
B.A. Niknam, M.D. Neuman, L.A. Fleisher, and R. Eckenhoff. Alzheimer's dementia after
exposure to anesthesia and surgery in the elderly: A matched natural experiment using appendicitis.
Annals of Surgery, 2020.
10
Matching with Multilevel Data
CONTENTS
10.1 Multilevel Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
10.2 Models for treatment assignment in multilevel studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.3 Matching with Individualistic Treatment Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.3.1 Key assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.3.2 Propensity score formulation and matching approaches . . . . . . . . . . . . . . . . . . . 187
10.3.3 Inference and sensitivity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
10.4 Matching with Clustered Treatment Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
10.4.1 Causal assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
10.4.2 An aggregated design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10.4.3 Multilevel matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
10.4.4 Multilevel cardinality matching and implementation via integer programming 193
10.4.5 Multilevel minimum-distance matching and implementation via network flow
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
10.4.6 Additional features and practical advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.4.6.1 Balance prioritization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.4.6.2 Trimming treated clusters and individuals . . . . . . . . . . . . . . . . . . . . . 196
10.4.6.3 Control:treatment ratios and additional constraints . . . . . . . . . . . . . 196
10.4.7 Clustered randomization inference and sensitivity analysis . . . . . . . . . . . . . . . . 197
10.4.8 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
10.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
data depends critically on the treatment assignment mechanism. Different treatment assignment
mechanisms lead to very different applications of matching with multilevel data. In what follows we
describe two primary classes of multilevel observational studies, those with treatment assignment
at the individual level, and those with treatment at the cluster level, the latter class designated as
clustered observational studies (COSs). We briefly review matching in settings with individualistic
treatment assignment, which is closely related to existing matching methods for single-level data.
The balance of the chapter is devoted to a specific form of matching tailored to clustered treatment
assignment, where two matches are typically performed within the same dataset, one at the cluster
level and one at the individual level. We also discuss implications for inference, sensitivity analysis, and
open problems in this general area.
If outcomes are reported at the student level and treatment is assigned at the classroom level, then
this is a clustered study at the classroom level. The school-level groupings may be important to
consider but are not fundamental to the design in the same way as the classroom-level groupings.
The school labels play a role similar to that of other covariates explaining variation in student outcomes, such
as socioeconomic status or past performance. If instead we observe similar data in which treatments
are assigned to individual students, then treatment assignment is individualistic, although classroom
and school may both play important roles as covariates.
One reasonable question to ask is why analysts should apply multilevel matching rather than other
forms of statistical adjustment, such as regression. One answer is that matching tends to be more
robust, especially when the treated and control covariate distributions do not overlap well [2].
COSs in education often have relatively small numbers of treated and control schools, which may
exacerbate this problem. Recent research found that multilevel matching generally outperformed
regression models [3]. Another potential solution might be inverse probability weighting.
However, propensity score estimation with multilevel data faces special challenges in modeling the
hierarchical relationships, and often suffers from convergence problems when the number of clusters
is large. Multilevel matching provides an alternative that avoids the need to explicitly estimate and
invert propensity scores in these settings.
We also assume that each individual's propensity score $\Pr(Z_{ij} = 1 \mid x_{ij}, w_j)$ lies in the interval
(0, 1). These latter assumptions are similar to other ignorability or no-unmeasured-confounding
assumptions; in particular, they differ from the assumptions stated by Rosenbaum [23, §3.2] only in
explicitly representing the individual-level covariates $x_{ij}$ and the cluster-level covariates $w_j$ separately.
The primary methodological questions are how this special covariate structure shapes the causal
estimands and the matched estimation strategy. We discuss the implications of the presence of both $x_{ij}$
and $w_j$ for estimation in the following section.
on cluster-level covariates. As such, it is natural to include cluster-level covariates among the
variables for matching. Applying matching to the cluster-level covariates seeks to ensure that
cluster-level characteristics are similar across the treated and control groups. A match of this type
allows treated and control matches to occur across clusters.
With data of this type, however, we can implement an alternative type of match. If investigators
have collected $w_j$, then there are group indicators recording which units are nested within each
cluster. We denote these group indicators by $G_{ij}$: if $G_{ij} = 1$ then unit $i$ is a member of cluster $j$,
and if $G_{ij} = 0$ it is not. In this alternative match, the investigator matches on $x_{ij}$ and matches
exactly on $G_{ij}$, ensuring that treated and control units are matched only within the same cluster.
For example, suppose we are interested in the effect of minimally invasive surgery compared to
standard surgical methods, and our sample of patients is grouped within hospitals. Here, we would
match on patient characteristics like age and diagnosis, but we would also match exactly on hospital;
that is, we would ensure that we only compare treated and control patients within the same hospital.
Intuitively, such a match is appealing, since it keeps the cluster-level factors fixed for all treated
and control units: all matched pairs of patients are within the same hospital. In addition, if the
confounders are suspected to be correlated within clusters, this type of cluster-exact matching may
remove the effects of unobserved cluster-level confounders. How can this be the case? Controlling
for $G_{ij}$ via an exact match removes the effect of unobserved cluster-level confounders when the true
model for the potential outcomes under the control condition is additive and linear in $G_{ij}$.
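To make the mechanics concrete, here is a minimal sketch of cluster-exact matching (hypothetical column names, not any package's built-in routine): optimal pair matching on individual covariates, carried out separately inside each cluster so that every matched pair shares a hospital.

# Sketch: within-cluster (exact-on-cluster) pair matching.
import pandas as pd
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def within_cluster_pairs(df, cluster_col, treat_col, covariate_cols):
    pairs = []
    for _, grp in df.groupby(cluster_col):
        t = grp[grp[treat_col] == 1]
        c = grp[grp[treat_col] == 0]
        if t.empty or c.empty:
            continue  # this cluster contributes no pairs
        # Euclidean distance on covariates; a Mahalanobis distance or
        # propensity score distance could be substituted here.
        d = cdist(t[covariate_cols].to_numpy(), c[covariate_cols].to_numpy())
        rows, cols = linear_sum_assignment(d)  # optimal within this cluster
        pairs += [(t.index[r], c.index[k]) for r, k in zip(rows, cols)]
    return pairs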
The relative desirability of across-cluster matching and within-cluster matching in any specific
setting depends on several factors, including both the form of the true model for treatment and the
number and size of the clusters. In general when clusters are large the consensus is in favor of within-
cluster matching, since this method controls so strongly for the presence of cluster indicators or
cluster-level variables in the true treatment model. Kim and Seltzer [7] emphasize that within-cluster
matching is especially important when cluster-covariate interactions are present so that the form
of the true propensity score differs across clusters in a manner more complex than a fixed shift in
intercept, or in settings where there is interest in measuring treatment effect heterogeneity across
clusters. When many small clusters are present, however, within-cluster matching may struggle to
make full use of the data effectively. That is, when matching within clusters, it may be difficult to
balance individual level characteristics. When conducting across-cluster matching on a propensity
score in these settings, Arpino and Mealli [8] and Thoemmes and West [9] both emphasize the
importance of accounting for clusters in the propensity score model by including either random
or fixed cluster effects. In cases where fixed effects models cannot be estimated well, because of the
large number of parameters involved, or where standard random effects assumptions are deemed
implausible, Kim and Steiner [10] and Lee et al. [11] suggest a middle-ground approach: groups of
small clusters with similar treatment prevalences are formed, and the resulting groups are used
to structure propensity score estimation, either by estimating separate propensity score models for
each group or by using group-level random effects in an overarching model.
Finally, Zubizarreta et al. [12] present an example of conducting both within-cluster and across-
cluster matching in the same dataset, yielding two independent tests of the null hypothesis. A major
advantage of this approach is its implications for robustness to unmeasured bias. If across-cluster
comparisons and within-cluster comparisons are subject to different types of unmeasured bias,
then agreement between the answers produced by the different methods builds confidence that the
observed effects are due to a common underlying true effect of treatment; in contrast, if the estimated
effects disagree sharply there is clear evidence that one of the comparisons is biased.
$(y_{Tij}, y_{Cij}, x_{ij}, w_j)$ as a random vector sampled from some population, or by considering only
the conditional distribution $P(Z_{ij} \mid y_{Tij}, y_{Cij}, x_{ij}, w_j)$. The latter approach is often described as
randomization inference because of its close connections to Fisher randomization tests for designed
experiments. Sampling inference is the more prevalent approach, but in the multilevel case it is
complicated by the need to account for intracluster correlation of outcomes; typically, some kind
of hierarchical outcome model must be assumed; see, for example, the outcome models discussed
by Kim and Seltzer [7] and Arpino and Mealli [8]. Furthermore, although sampling inference for
a variety of matching designs has been characterized in the classical setting without
multilevel structure, sampling distributions have not been worked out for the hierarchical case.
In contrast, since the randomization inference framework conditions on the constructed match
and on the potential outcomes, methods of inference are essentially identical in the multilevel and
classical settings. Randomization inference can be implemented by repeatedly permuting treatment
and control labels within each matched pair independently of other pairs, holding outcomes fixed. As
described in detail by Rosenbaum [23, §2], the values of the test statistic under all such permutations
constitute a null distribution under the sharp null hypothesis of no treatment effect for any individual,
and comparing the observed test statistic to this null distribution produces exact finite-sample
p-values (under the model of equal probability of treatment within pairs). In practice, randomization
inference can be conducted either by Monte Carlo sampling of within-pair permutations or by using
an appropriate large-sample approximation to the randomization distribution of the test statistic.
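A minimal Monte Carlo sketch of this procedure for paired differences follows (variable names are hypothetical); permuting labels within a pair simply flips the sign of that pair's difference.

# Sketch: Monte Carlo randomization inference for a paired match.
import numpy as np

def pair_randomization_pvalue(diffs, n_perm=10_000, seed=0):
    """diffs: treated-minus-control outcome differences, one per pair."""
    diffs = np.asarray(diffs, dtype=float)
    rng = np.random.default_rng(seed)
    obs = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)  # permutation distribution
    return float(np.mean(np.abs(null) >= abs(obs)))  # two-sided p-value

print(pair_randomization_pvalue([1.2, 0.4, -0.3, 0.8, 0.9]))  # toy data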
Randomization inference for matched designs also offers a natural associated method of sensitivity
analysis. The initial model assumes that probabilities of treatment are equal within matched pairs; this
assumption is plausible only if the paired units are matched closely on observed covariates (or at least
on a propensity score estimated from observed covariates) and also share similar values of unobserved
covariates. The latter condition cannot be checked empirically; if it fails, probabilities of treatment may
differ by some amount within pairs. A sensitivity analysis asks whether unobserved confounding of a
certain degree would be sufficient to reverse the results observed under the equal-probability assumption.
Specifically, for unit $i$ in cluster $j$ define
\[
\pi_{ij} = \Pr(Z_{ij} = 1 \mid y_{Tij}, y_{Cij}, x_{ij}, w_j).
\]
The sensitivity analysis of Rosenbaum [23, §4] posits that, for paired units $(i, j)$ and $(i', j')$ and
some fixed parameter $\Gamma > 1$,
\[
\frac{1}{\Gamma} \le \frac{\pi_{ij}/(1 - \pi_{ij})}{\pi_{i'j'}/(1 - \pi_{i'j'})} \le \Gamma. \tag{10.2}
\]
While this model admits a range of possible distributions with which to permute the treatment
labels, the worst-case distribution (i.e., the one producing the largest p-value) can be determined and
used for inference; if the test still rejects, then unobserved confounding of strength $\Gamma$ is not sufficient
to explain the observed effect. This procedure may be repeated for larger and larger values of $\Gamma$ until
a value is found at which the test ceases to reject, which can be reported as a summary of the test's
robustness to unmeasured bias. Again, for individualistic treatment assignment these methods apply
directly to matched datasets in multilevel settings, with no special adaptations from the classical
setting.
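For the simplest case, the sign test, the worst-case distribution has a closed form: under model (10.2), the count of pairs with a positive treated-minus-control difference is stochastically bounded by a Binomial(n, Γ/(1+Γ)) distribution. A sketch follows, with illustrative pair counts only.

# Sketch: Rosenbaum-style sensitivity analysis for the sign test.
# T = #{pairs with a positive treated-minus-control difference}; its
# worst-case null distribution is Binomial(n, Gamma/(1+Gamma)).
from scipy.stats import binom

def worst_case_pvalue(t_pos, n_pairs, gamma):
    p_plus = gamma / (1.0 + gamma)
    return binom.sf(t_pos - 1, n_pairs, p_plus)  # P(T >= t_pos)

gamma = 1.0
while worst_case_pvalue(t_pos=70, n_pairs=100, gamma=gamma) < 0.05:
    gamma += 0.05
print(round(gamma, 2))  # first Gamma on this grid at which we fail to reject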
is, treatment is allocated to either schools or hospitals such that all the units within those clusters
are treated or none are. We refer to this study design as a clustered observational study (COS). One can
view the COS as the observational-study counterpart of the cluster randomized trial. In a COS,
the difference in the treatment assignment mechanism is reflected in the notation, in that assignment
no longer depends on the units. In general, in a COS, assignment will occur such that there are $N_1$
treated clusters and $N_2$ control clusters. In an education application, the clusters would be entire
schools. The key assumptions needed for the identification of causal effects in a COS mirror those
for individualistic treatment assignment. In parallel to assumption (10.1) for the individualistic
treatment model, we require that there is some set of covariates such that treatment assignment is
random conditional on these covariates. We write that assumption in the following way:
\[
\Pr(Z_j = 1 \mid y_{Tij}, y_{Cij}, u_{ij}, x_{ij}, w_j) = \Pr(Z_j = 1 \mid x_{ij}, w_j) = \pi_j.
\]
In words, this says that after conditioning on the observed characteristics, $x_{ij}$ and $w_j$, a given
cluster's probability of assignment to treatment is related neither to the potential outcomes of its units
$(y_{Tij}, y_{Cij})$ nor to unobservables $(u_{ij})$. That is, we assume there are no unobservable differences
between the treated and control groups. Next, we assume that every cluster has some probability of
being treated or untreated, so that $0 < \pi_j < 1$.
Finally, we assume a cluster-level version of SUTVA, in which an individual’s outcome is affected
only by treatments given to individuals in the same cluster and cannot be influenced by treatments
assigned to individuals in other clusters. In contrast to the assumptions of Section 10.3.1, these
assumptions do not require an absence of interference between units in the same cluster, merely
across clusters. In principle this assumption allows for many more than two potential outcomes
for each individual, one for each of the different possible vectors of treatment across individuals
in a cluster. However, in a COS only two possible vectors of individual treatments are ever given
within a cluster (all zeroes for a cluster assigned to control, and all ones for a cluster assigned to
treatment), so it is still sufficient to express each observed outcome as a treatment-weighted sum of
two potential outcomes for the individual in question: $Y_{ij} = Z_j y_{Tij} + (1 - Z_j) y_{Cij}$. In summary,
arbitrary within-cluster interference is not problematic in a COS because the restricted nature of
treatment assignment limits the number of potential outcomes that can be observed. For a
thorough discussion of the implications of relaxing SUTVA in more complicated versions of the
COS, where not all units in a cluster receive the same treatment, see Hong and Raudenbush [13].
Design 2, it may be critical to adjust for individual-level covariates to ensure that differences in
selection within clusters do not introduce bias into causal comparisons. Moreover, under Design 2, it
is possible that the intervention spills over from the within-school treated subset of students to the
untreated within-school subgroup. The investigator may also be interested in understanding whether
such spillover effects occur; see Hong and Raudenbush [13] for a discussion of spillovers of this type.
In general, Design 2 may require critical refinements or tailoring of the research design depending
on the specifics of the application. However, even under Design 1, one may opt to match on individual-
level characteristics, since those covariates may be strongly correlated with cluster-level covariates,
which tends to induce a pattern in which the distributions of individuals also differ across treated
and control clusters.
Let $S_j$ be the set of $ij$ index tuples for individuals in cluster $j$. We will represent a cluster-level match as a set $M \subset T \times C$ of paired indices, each giving one treated cluster and its paired control counterpart. Similarly, we can represent an individual-level match compatible with this cluster-level match as $\bigcup_{(j,j') \in M} M_{jj'}$, where $M_{jj'} \subset S_j \times S_{j'}$. Let $g: \{W : W \subset T \times C\} \longrightarrow \mathbb{R}$ and $f_{jj'}: \{W : W \subset S_j \times S_{j'}\} \longrightarrow \mathbb{R}$ for any $j, j' \in \{1, \ldots, J\}$, $j \neq j'$, be objective functions measuring the desirability of the cluster-level and individual-level matches respectively, and let the overall objective value for match $M$ be given by:
$$g(M) + \sum_{(j,j') \in M} f_{jj'}(M_{jj'}). \tag{10.3}$$
In words, this says that the overall quality of the match can be represented as a sum of some measure
of quality in the cluster-level pairs plus some measure of quality in the individual-level pairs. For
other, more detailed examples of formulating matching as an optimization problem, see Bennett et al. [19], Hansen and Klopfer [16], Pimentel et al. [18], Rosenbaum [24], and Zubizarreta [17].
Under this setup, the sequential matching procedure first searches over all cluster-level matches and chooses the configuration of pairs $M$ minimizing $g(M)$. Then, for each $(j, j') \in M$, individual-level pairs $M_{jj'}$ are chosen to minimize $f_{jj'}$. This procedure may not find the optimum, since the cluster-level match $M^*$ that minimizes $g(\cdot)$ alone may not contain the cluster pairs $(j, j')$ needed to minimize the overall objective function. In brief, because the cluster-level match rules out many possible individual-level matches without paying attention to their relative quality as characterized by the $f_{jj'}$ functions, it may miss the optimum. Keele and Zubizarreta [20] offer a formal proof of this sub-optimality, and Pimentel et al. [14] provide additional simulation evidence.
In contrast, consider the following procedure. For each unique pair of clusters $(j, j')$, compute the optimal match $M_{jj'}$ with respect to objective function $f_{jj'}$, storing the match and the objective value obtained. Then select the optimal match $M$ with respect to objective function (10.3), which is now known since all of the individual-level matches were pre-computed. Formally, this procedure guarantees optimality because of the identity:
$$\min_{M,\,\{M_{jj'} : (j,j') \in M\}} \left\{ g(M) + \sum_{(j,j') \in M} f_{jj'}(M_{jj'}) \right\} = \min_{M} \left\{ g(M) + \sum_{(j,j') \in M} \min_{M_{jj'}} f_{jj'}(M_{jj'}) \right\}.$$
Each of the terms in the sum is computed in advance, which works because the problem is separable: the decision variables for individual-level matching within each cluster pair are disjoint from the decision variables for individual-level matching within any other cluster pair in the same match.
For example, suppose that in an educational COS there are 3 treated schools and 5 control schools, indexed by $k_t \in K_t = \{1, 2, 3\}$ and $k_c \in K_c = \{1, \ldots, 5\}$, respectively. We first evaluate the student-pair matches across all possible pairs of treated and control schools. Since there are 3 treated schools and 5 control schools, we conduct $3 \times 5 = 15$ student-level matches. In general, where there are $N_1$ treated schools and $N_2$ control schools, the number of such possible pairings will be $N_1 \times N_2$. Although this involves assessing a large number of possible matches, the form of these matches is straightforward, since we are simply conducting standard student-to-student matches within each potential school pair. Next, we score the quality of each of these student matches by recording the value of the function $f_{jj'}$ achieved in each, and the cluster-level match can now be conducted, searching over all possible pairings of the three treated schools to three of the five control schools and optimizing the combination of cluster-level quality $g(M)$ and the individual-level similarity of the associated student pairings $f_{jj'}(M_{jj'})$.
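To illustrate the two-step computation on this example, the following R sketch uses simulated covariates, Euclidean distances in place of a Mahalanobis distance, and g(M) taken to be zero; both assignment problems are solved with solve_LSAP from the clue package, and all names and sizes are illustrative.

library(clue)   # solve_LSAP solves the linear sum assignment problem
set.seed(1)
treated_schools <- paste0("T", 1:3)
control_schools <- paste0("C", 1:5)
# Hypothetical student covariates: one (students x covariates) matrix per school
students <- setNames(lapply(1:8, function(i) matrix(rnorm(20 * 2), ncol = 2)),
                     c(treated_schools, control_schools))

# Step 1: optimal student-level match within every treated-control school
# pair, recording the achieved objective value f_jj'
f <- matrix(NA, 3, 5, dimnames = list(treated_schools, control_schools))
for (j in treated_schools) {
  for (jp in control_schools) {
    d_all <- as.matrix(dist(rbind(students[[j]], students[[jp]])))
    d <- d_all[1:20, 21:40]                # treated-by-control distances
    a <- solve_LSAP(d)                     # optimal student pairing
    f[j, jp] <- sum(d[cbind(seq_len(20), a)])
  }
}

# Step 2: school-level match minimizing the sum of the precomputed f_jj'
# (a cluster-level term g could be added to f before this step)
m <- solve_LSAP(f)
data.frame(treated = treated_schools, control = control_schools[m])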
One critical point to understand about multilevel matching is that student-level covariates are
incorporated into the match in an unconventional fashion. Given the multilevel structure of the data,
student-level covariates can be included as either student-level measures or as aggregate measures.
For example, a covariate such as student sex can be used as a binary, student-level measure or as the
proportion of female students in the school. The first stage of a multilevel match avoids the need to create aggregate measures from student-level covariates: all student-level covariates inform the individual-level match quality score $f_{jj'}$, and this score is used directly in the school-level match in the next step. However, one can
also include aggregate covariates during the school matching process. To make these concepts more
concrete and address practical aspects of carrying out this procedure, we now consider two specific
versions of this problem, each with a different choice of objective function and solution algorithm.
See Hansen & Klopfer [16], Pimentel & Kelz [22], Pimentel et al. [14], and Rosenbaum [24] for a more in-depth discussion of network flow methods and their value in matching.
Many other choices are possible. In particular, Pimentel et al. [14] define $f_{jj'}$ as a linear combination of a measure of matched sample size and a fixed balance penalty multiplied by the number of pre-specified covariates that do not meet predefined balance benchmarks in the individual-level match. This mimics the cardinality matching method of Keele and Zubizarreta [20] by emphasizing sample size and balance rather than matched pair quality; following this same idea, they implement the cluster-level match without any cluster distance, instead using the $f_{jj'}$s alone as the distances while imposing refined balance constraints [18] on the cluster-level covariates. As discussed by Pimentel et al. [14], this choice of $f_{jj'}$s offers the benefit of being applicable in Design 1 settings where individual-level matches are not conducted but individual-level comparisons may still inform overall match quality: if one simply replaces the matched sample size with the harmonic mean of the raw treated and control sample sizes, $f_{jj'}$ is calculable even without a pairing.
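A hedged sketch of an objective of this general type is below; the penalty lambda, the 0.2 benchmark, and the sign convention are illustrative choices rather than the published tuning.

# Objective for one treated-control cluster pair: harmonic mean of the raw
# sample sizes, penalized for each covariate failing a balance benchmark,
# and negated so that smaller values indicate more desirable pairs.
f_pair <- function(x_t, x_c, lambda = 100, benchmark = 0.2) {
  harm <- 2 / (1 / nrow(x_t) + 1 / nrow(x_c))        # harmonic mean of sizes
  std_diff <- abs(colMeans(x_t) - colMeans(x_c)) /
    sqrt((apply(x_t, 2, var) + apply(x_c, 2, var)) / 2)
  n_fail <- sum(std_diff > benchmark)                # covariates out of balance
  -(harm - lambda * n_fail)
}

f_pair(matrix(rnorm(40), 20), matrix(rnorm(50), 25))  # illustrative call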
In fact, Pimentel et al. [14] conduct the individual-level matches as minimum-cost Mahalanobis distance matches, which means they may not succeed in fully optimizing the distinct objective function $f_{jj'}$ actually used for the overall objective criterion. However, the minimum-cost Mahalanobis distance offers major computational advantages in exchange for this deviation from perfect optimality, since directly optimizing $f_{jj'}$ as defined would likely require an integer programming method.
Notably, multilevel matching based on network flows is available in the R package matchMulti. Keele et al. [3] evaluate this form of multilevel matching in several different scenarios and present an applied example.
Prioritizing balance on the most important covariates may cause increased imbalance in other, less important covariates. In the case study presented later, we demonstrate key aspects of balance prioritization in the context of a COS.
Essentially, this test statistic takes the matched treated-control difference within each cluster pair and computes a weighted average of these differences across cluster pairs, with cluster-pair weights $q_{jj'}$. Common choices of weights are equal weights or weights proportional to the total number of individuals in the cluster pair; for more discussion of the best choice of weights under different settings, see Hansen et al. [30, §3.5].
To conduct randomization inference, we consider the distribution of this test statistic under the
null hypothesis, conditional on the configuration of matched cluster pairs and individual pairs, the
values of the cluster-level and individual-level covariates, and the individual-level potential outcomes.
Since the only random quantity remaining is treatment, and since under the sharp null hypothesis the observed outcomes $Y_{jk}$ are invariant to treatment status, it suffices to permute the $Z_j, Z_{j'}$ values independently within each cluster pair, with equal probabilities, and recompute test statistic (10.4). This repeated process generates a null distribution to which the observed test statistic can be compared.
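A minimal R sketch of this permutation scheme follows, with hypothetical inputs: diffs holds the treated-minus-control difference within each matched cluster pair and q the cluster-pair weights of (10.4).

# Randomization inference for a weighted paired comparison: flipping the
# treatment labels Z_j, Z_j' within a cluster pair flips the sign of that
# pair's treated-minus-control difference.
perm_test <- function(diffs, q = rep(1 / length(diffs), length(diffs)),
                      n_perm = 10000) {
  obs <- sum(q * diffs)
  null <- replicate(n_perm,
                    sum(q * sample(c(-1, 1), length(diffs), TRUE) * diffs))
  mean(null >= obs)                  # one-sided p-value
}

set.seed(1)
perm_test(rnorm(15, mean = 0.3))     # illustrative pair differences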
While randomization inference methods can be applied after matching, using regression models with the matched data is also a useful strategy for estimating treatment effects. Under this approach, the outcome is regressed on the treatment indicator using the matched data set. While standard regression models can be used at this stage, one might opt for a random intercept model, since it will account for within-school correlations in the standard error estimates [31, 32]. Treatment effect estimation via regression models after matching is also useful because it allows for additional bias correction. That is, any covariates that are not fully balanced by the match can be included in the regression model to reduce bias while estimating treatment effects [33]. Using regression models to further reduce imbalance after matching is a well-known idea [34, 35]. Moreover, when regression is applied after matching, it is applied to a subsample of the data that is well balanced and in which covariate distributions overlap. Since regression is used locally in the covariate space, the corresponding results should be less sensitive to minor changes in the specification of the regression function [36].
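For concreteness, here is a sketch with simulated data; the variable names, sample sizes, and effect sizes are illustrative and not from the case study.

library(lme4)
set.seed(2)
# Hypothetical matched COS: 20 schools (10 matched pairs), 50 students each
matched <- data.frame(
  school  = factor(rep(1:20, each = 50)),
  treated = rep(rep(0:1, 10), each = 50),
  prior   = rnorm(1000)
)
matched$y <- 0.2 * matched$treated + 0.5 * matched$prior +
  rep(rnorm(20, sd = 0.3), each = 50) + rnorm(1000)

# The random intercept for school accounts for within-school correlation;
# 'prior' is an incompletely balanced covariate included for bias correction
fit <- lmer(y ~ treated + prior + (1 | school), data = matched)
summary(fit)$coefficients["treated", ]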
For sensitivity analysis, one simply applies model (10.2) to the cluster-level treatment indicators and replaces the uniform randomization inference just described with a worst-case randomization inference under which probabilities of treatment differ by up to a factor of Γ within matched pairs.
The details of the worst-case analysis for test statistic (10.4) are slightly different from typical
worst-case calculations in cases with individualistic treatment assignment, in ways that tend to make
the inference less sensitive to unmeasured bias than would have been the case if treatments had been
assigned individually. For more details, see Hansen et al. [30].
TABLE 10.2
Balance in the Constructed Observational Data. p-values are from either the Fisher exact test or the Wilcoxon rank test.
A common rule of thumb is that matched standardized differences should be less than 0.20 and preferably 0.10 [38]. Table 10.2 contains the balance for the constructed observational data set. The differences are generally large, with six standardized differences exceeding 0.50. As such, the control schools differ from the treated schools in a number of observable ways.
The next step is to use multilevel matching to find the set of control schools that are most similar
to the treated schools in terms of the observed covariates. However, as we outlined above, analysts
have a number of different options in terms of how they implement multilevel matching. First, the
investigator should decide whether students should be matched. In a WSC, we would argue, it makes
little sense to match students, since we have no reason to suspect there was variation in terms of
further student selection. That is, the treatment was applied to all students in each treated school.
The next key question is whether or not to include any covariate prioritization. That is, should
we seek to prioritize balance on some school level covariates? In the CRT, Achieve3000 was a
literacy intervention, and the primary outcome measure is student level reading scores. As such,
one might seek to prioritize balance for the school level reading scores. Alternatively, one might
simply prioritize the school level covariates with the largest imbalances prior to matching. To that
end, we implemented three different matches. The first match used the defaults in the matchMulti
package. This match does not include any balance prioritization. In the second match, we
TABLE 10.3
Balance in RCT, Unmatched, and 3 Matched Groups. Cell entries are standardized differences in
means. Match 1 includes no balance refinements. Match 2 prioritizes balance on prior reading test
scores. Match 3 prioritizes overall balance.
prioritized school-level reading scores. That is, we sought to balance reading scores more than any
other covariate. The final match prioritized balance on the covariates with the worst imbalances
to produce the best overall balance. In the two final matches, we gave the lowest level of priority
to the school enrollment covariate. For this covariate, the standardized difference is large, but the
difference in school sizes across treated and control is not expected to matter as much for differences
in outcomes.
Table 10.3 contains balance statistics for the RCT, the unmatched sample, and the three matched samples. First, matching with the defaults improves balance substantially. In the second match, where we prioritized balance on reading scores, we find that the standardized differences for both reading score covariates are now less than the 0.10 threshold. In the final match, one of the reading scores displays higher imbalance, but we reduce imbalances on other covariates. Figure 10.1 includes a graphical comparison of the distribution of standardized differences. Here, it is clear that the match based on defaults does not improve on the balance from the RCT. Second, it is obvious that the two matches with balance prioritization improve on the first match and on the balance from the RCT.
Finally, in Table 10.4 we include outcome estimates. The RCT estimate shows a positive but modest effect of the intervention that is not statistically significant. The estimate from the unmatched sample, without any statistical adjustment, is negative and significant at the 0.10 level. This is not
surprising, since the treated group was targeted for enrollment in the RCT. This result is consistent
with a pattern where the treated group generally has poorer reading performance than many schools
in the district. We find that for Match 1, the bias between the COS and RCT estimates is reduced but
not substantially. However, the estimates from the two matches based on balance prioritization are
much closer to the RCT estimates. See Keele et al. [3] for a more complete outcome analysis.
10.5 Discussion
Multilevel matched designs may be viewed as an alternative to specialized regression models fre-
quently used to analyze multilevel data. For example, fixed-effects regression models are commonly
employed with multilevel data. A fixed-effects model includes an intercept term for each of the
clusters in the data. Certain similarities between the methods are present. For instance, within-cluster
FIGURE 10.1
Distribution of Standardized Differences for RCT, Unmatched, and 3 Matches.
TABLE 10.4
Estimates of the effect of Achieve 3000 Intervention on Reading Test Scores. Outcome is standardized
reading score. Results are from a mixed model with random intercept. Match 1 includes no balance
refinements. Match 2 prioritizes balance on prior reading test scores. Match 3 prioritizes overall
balance.
matching in cases with individualistic assignment only compares treated and control units within the
same cluster, much as the regression coefficient for treatment in a fixed-effects regression is a partial
coefficient for the adjusted regression in which each cluster receives its own intercept. However,
multilevel matching offers several advantages. Most importantly, constructing a matched design in a
multilevel setting involves thinking primarily about the distribution of treatment assignment rather
than the distribution of the outcome variable, and whenever treatment assignment is easier to model or
more plausibly random than outcomes, matching may provide substantial benefits relative to outcome
regression. Constructing a matched design in multilevel data also requires careful consideration of
whether treatment is at the individual or the cluster level and provides clear guidance for inference
in either case, unlike outcome regression where distinctions between treatment mechanisms and
handling of clustering in inference can become murky [39]. Matched designs also avoid some of
the parametric assumptions associated with linear modeling, offer transparent balance summaries
describing their success in adjusting for measured variables, and offer support for randomization
inference and associated methods of sensitivity analysis. We find them to be a compelling design
option in a wide variety of practical settings in the biomedical and social sciences.
One frontier worthy of exploration is the use of multilevel matching in two-stage observational
studies. Two-stage studies involve treatment assignments at both levels; for instance, a school may
decide to participate in an experimental treatment, and then individual teachers within that school
may decide on an individual basis whether to participate. Such two-stage studies essentially involve
two separate treatment processes. Another version of this setting arises frequently in clustered ob-
servational studies where individuals may fail to comply with treatment in treated clusters. While
weighting methods have been introduced for two-stage observational studies [40, 41], and randomiza-
tion inference methods similar to those introduced for matched estimators here have been introduced
for two-stage randomized trials [42–44], multilevel matched designs for this setting remain to be
introduced.
References
[1] Lindsay C Page, Matthew Lenard, and Luke Keele. The design of clustered observational
studies in education. AERA Open, 6(3):1–14, July-Sept 2020.
[2] Guido W Imbens. Matching methods in practice: Three examples. Journal of Human Resources,
50(2):373–419, 2015.
[3] Luke J Keele, Matthew Lenard, and Lindsay Page. Matching methods for clustered observational studies in education. Journal of Research on Educational Effectiveness, 14(3):696–725, 2021.
[4] Donald B Rubin. Comment: Which ifs have causal answers. Journal of the American Statistical Association, 81(396):961–962, 1986.
[5] David Roxbee Cox. Planning of experiments. Wiley, New York, 1958.
[6] Paul R Rosenbaum. Observational Studies. Springer, New York, NY, 2002.
[7] Junyeop Kim and Michael Seltzer. Causal inference in multilevel settings in which selection processes vary across schools. CSE Technical Report 708, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), 2007.
[8] Bruno Arpino and Fabrizia Mealli. The specification of the propensity score in multilevel
observational studies. Computational Statistics & Data Analysis, 55(4):1770–1780, 2011.
[9] Felix J Thoemmes and Stephen G West. The use of propensity scores for nonrandomized
designs with clustered data. Multivariate Behavioral Research, 46(3):514–543, 2011.
[10] Jee-Seon Kim and Peter M Steiner. Multilevel propensity score methods for estimating causal
effects: A latent class modeling strategy. In Quantitative psychology research, pages 293–306.
Springer, Switzerland, 2015.
[11] Youjin Lee, Trang Q Nguyen, and Elizabeth A Stuart. Partially pooled propensity score models
for average treatment effect estimation with multilevel data. arXiv preprint arXiv:1910.05600,
2019.
[12] José R Zubizarreta, Mark Neuman, Jeffrey H Silber, and Paul R Rosenbaum. Contrasting
evidence within and between institutions that provide treatment in an observational study of
[16] Ben B. Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via
network flows. Journal of Computational and Graphical Statistics, 15(3):609–627, 2006.
[17] José R Zubizarreta. Using mixed integer programming for matching in an observational study
of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–
1371, 2012.
[18] Samuel D Pimentel, Rachel R Kelz, Jeffrey H Silber, and Paul R Rosenbaum. Large, sparse
optimal matching with refined covariate balance in an observational study of the health outcomes
produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527,
2015.
[19] Magdalena Bennett, Juan Pablo Vielma, and José R Zubizarreta. Building representative
matched samples with multi-valued treatments in large observational studies. Journal of
Computational and Graphical Statistics, 29(4):744–757, 2020.
[20] Luke J. Keele and José Zubizarreta. Optimal multilevel matching in clustered observational
studies: A case study of the effectiveness of private schools under a large-scale voucher system.
Journal of the American Statistical Association, 112(518):547–560, 2017.
[21] José R Zubizarreta, Ricardo D Paredes, and Paul R Rosenbaum. Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile. The Annals of Applied Statistics, 8(1):204–231, 2014.
[22] Samuel D Pimentel and Rachel R Kelz. Optimal tradeoffs in matched designs comparing
us-trained and internationally trained surgeons. Journal of the American Statistical Association,
115(532):1675–1688, 2020.
[23] Paul R Rosenbaum and Donald B Rubin. Constructing a control group using multivariate
matched sampling methods that incorporate the propensity score. The American Statistician,
39(1):33–38, 1985.
[24] Paul R. Rosenbaum. Optimal Matching of an Optimally Chosen Subset in Observational
Studies. Journal of Computational and Graphical Statistics, 21(1):57–71, 2012.
[25] José R Zubizarreta, Caroline E Reinke, Rachel R Kelz, Jeffrey H Silber, and Paul R Rosenbaum.
Matching for several sparse nominal variables in a case-control study of readmission following
surgery. The American Statistician, 65(4):229–238, 2011.
[26] Kewei Ming and Paul R Rosenbaum. Substantial gains in bias reduction from matching with a
variable number of controls. Biometrics, 56(1):118–124, 2000.
[27] Ben B Hansen. Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467):609–618, 2004.
[28] Samuel D Pimentel, Frank Yoon, and Luke Keele. Variable-ratio matching with fine balance in
a study of the peer health exchange. Statistics in medicine, 34(30):4070–4082, 2015.
[29] Ronald A Fisher. The Design of Experiments. Oliver and Boyd, Edinburgh, 1935.
[30] Ben B Hansen, Paul R Rosenbaum, and Dylan S Small. Clustered treatment assignments and
sensitivity to unmeasured biases in observational studies. Journal of the American Statistical
Association, 109(505):133–144, 2014.
[31] Richard J Murnane and John B Willett. Methods matter: Improving causal inference in
educational and social science research. Oxford University Press, Oxford, UK, 2010.
[32] Stephen W Raudenbush. Statistical analysis and optimal design for cluster randomized trials.
Psychological Methods, 2(2):173, 1997.
[33] Alberto Abadie and Guido W Imbens. Bias-corrected matching estimators for average treatment
effects. Journal of Business & Economic Statistics, 29(1):1–11, 2011.
[34] Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1):1–21, 2010.
[35] Daniel E Ho, Kosuke Imai, Gary King, and Elizabeth A Stuart. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15(3):199–236, 2007.
[36] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, Cambridge, UK, 2015.
[37] Vivian C Wong, Peter M Steiner, and Kylie L Anglin. What can be learned from empirical evaluations of nonexperimental methods? Evaluation Review, 42(2):147–175, 2018.
[38] Paul R. Rosenbaum. Design of Observational Studies. Springer-Verlag, New York, 2010.
[39] Alberto Abadie, Susan Athey, Guido W Imbens, and Jeffrey Wooldridge. When should you
adjust standard errors for clustering? Technical report, National Bureau of Economic Research,
2017.
[40] Brian G Barkley, Michael G Hudgens, John D Clemens, Mohammad Ali, Michael E Emch,
et al. Causal inference from observational studies with clustered interference, with application
to a cholera vaccine study. Annals of Applied Statistics, 14(3):1432–1448, 2020.
[41] Lan Liu, Michael G Hudgens, Bradley Saul, John D Clemens, Mohammad Ali, and Michael E
Emch. Doubly robust estimation in observational studies with partial interference. Stat,
8(1):e214, 2019.
[42] Hyunseung Kang and Luke Keele. Spillover effects in cluster randomized trials with noncom-
pliance. arXiv preprint arXiv:1808.06418, 2018.
[43] Hyunseung Kang and Luke Keele. Estimation methods for cluster randomized trials with noncompliance: A study of a biometric smartcard payment system in India. arXiv preprint arXiv:1805.03744, 2018.
[44] Guillaume W Basse, Avi Feller, and Panos Toulis. Randomization tests of causal effects under interference. Biometrika, 106(2):487–494, 2019.
11
Effect Modification in Observational Studies
CONTENTS
11.1 Introduction 205
11.1.1 Motivating example: Malaria in West Africa 206
11.2 Review of Effect Modification in Observational Studies 208
11.2.1 Notation and review: experiments and observational studies 208
11.2.2 Sensitivity analysis 209
11.2.3 Design sensitivity 210
11.3 Effect Modification in a Few Nonoverlapping Prespecified Groups 210
11.3.1 Combining independent P-values using their truncated product 210
11.4 Discovering Effect Modification in Matched Observational Studies 211
11.4.1 CART method 212
11.4.2 Submax method 214
11.4.3 Comparison between CART and submax methods 217
11.5 Discovering Effect Modification Using Sample-Splitting: Denovo 218
11.5.1 Discovery step 218
11.5.2 Inference step 220
11.6 Discussion 222
References 222
11.1 Introduction
In an observational study of treatment effects, subjects are not assigned at random to treatment or
control, so they may differ visibly with respect to measured pretreatment covariates, x, and may also
differ with respect to a covariate not measured, u. Visible differences in x are removed by adjustments,
such as matching, but there is invariably concern that adjustments failed to compare comparable
individuals, that differing outcomes in treated and control groups reflect neither a treatment effect
nor chance but rather a systematic bias from failure to control some unmeasured covariate, u. A
sensitivity analysis asks: What would u have to be like to materially and substantively alter the
conclusions of an analysis that presumes adjustments for the observed x suffice to eliminate bias?
Once one can measure sensitivity to bias, it is natural to ask: What aspects of design and analysis
affect sensitivity to bias? An aid to answering this question is the power of a sensitivity analysis
and a number, the design sensitivity, that characterizes the power in large samples [1]. Some test
statistics tend to exaggerate the reported sensitivity to unmeasured biases [2], whereas some design
elements tend to make studies less sensitive to bias [1, 3]. Generally, larger effects are less sensitive
than smaller ones. This last point suggests that effect modification – that is, an interaction between
a pretreatment covariate and the magnitude of a treatment effect – might matter for sensitivity to
FIGURE 11.1
Age, sex, and parasite density in 1560 treated/control pairs matched for age and sex. After matching,
the distribution of age and sex are similar, whereas the after-minus-before changes in parasite density
exhibit a greater decline in the treated group. The treated-minus-control pair differences in changes
in parasite density are typically negative, though many are near zero, with a long thick negative tail
to their density.
unmeasured biases. Unfortunately, such an interaction may be uncertain or unexpected. How should
one conduct a sensitivity analysis in the absence of a priori knowledge of where the effect will turn
out to be large or small? Before developing the technical aspects, it is helpful to consider a motivating
example.
FIGURE 11.2
Densities of the treated-minus-control differences in changes in parasite density separately for pairs
of (i) children older than 10 years and children 10 years old or younger; and (ii) female and male
children.
Figure 11.1 displays (i) the close match for age and sex, (ii) after-minus-before changes in parasite
frequency in treated and control groups, ignoring the matching, (iii) the matched pair treated-minus-
control difference in after-minus-before changes in parasite frequencies, and (iv) a density estimate
for this difference in changes. Density estimates use the default settings in R but with double the
default bandwidth. Although declines in parasite frequency are more common in the treated group,
many differences in changes are close to zero.
Figure 11.2 splits the 1560 pairs into (i) two age groups, 447 pairs of young children aged 10
or less, and 1113 pairs of children older than 10 years; and (ii) two sex groups, 766 pairs of female
children, and 794 pairs of male children. The impression from Figure 11.2 is that the treatment was
of much greater benefit to young children than to older ones but no different among female and male
children.
Table 11.1 displays four sensitivity analyses and reports the upper bound on the one-sided P-values testing no treatment effect using Wilcoxon's signed rank test. The sensitivity parameter, Γ, quantifies the departure from random assignment: within a matched pair, one unit may have up to Γ times higher odds of being treated than the other because of an unmeasured confounding variable. In column I, all 1560 matched pair differences were used. Using Wilcoxon's statistic, we would judge the results to be sensitive to a bias of magnitude Γ = 2, because the upper bound on the one-sided P-value testing no treatment effect exceeds the conventional 0.05 level. Columns II-III of Table 11.1 repeat the sensitivity analyses separately for younger and older children, respectively. Despite the reduced sample size, the 447 pairs of young children exhibit an association with treatment that is far less sensitive to unmeasured bias than the full sample of 1560 pairs, with the test statistic being insensitive at Γ = 3.8.
TABLE 11.1
Various sensitivity analyses for the treated-minus-control difference in after-minus-before changes in
parasite frequency in blood samples.
where $\gamma = \log(\Gamma)$. Let $\overline{T}$ be the sum of $I$ independent random variables, the $i$th taking the value 0 with probability 1 if $q_i = 0$, and otherwise taking the value $q_i$ with probability $\Gamma/(1+\Gamma)$ and the value 0 with probability $1/(1+\Gamma)$; define $\underline{T}$ similarly but with $\Gamma/(1+\Gamma)$ and $1/(1+\Gamma)$ interchanged. It is straightforward to show that under (11.2), if $H_0$ is true, then
$$\Pr(\underline{T} \ge v \mid \mathcal{F}, \mathcal{W}) \le \Pr(T \ge v \mid \mathcal{F}, \mathcal{W}) \le \Pr(\overline{T} \ge v \mid \mathcal{F}, \mathcal{W}) \quad \text{for all } u \in U, \tag{11.3}$$
and, as $I \to \infty$, the upper bound $\Pr(\overline{T} \ge v \mid \mathcal{F}, \mathcal{W})$ in (11.3) may be approximated by
$$\Pr(\overline{T} \ge v \mid \mathcal{F}, \mathcal{W}) \doteq 1 - \Phi\!\left[\frac{v - \{\Gamma/(1+\Gamma)\}\sum_i q_i}{\sqrt{\{\Gamma/(1+\Gamma)^2\}\sum_i q_i^2}}\right]; \tag{11.4}$$
see, for instance, [29, 30] for the case of matched binary responses. In practice, a sensitivity analysis may report the range of possible P-values or point estimates that are consistent with the data and a bias of at most Γ, for several values of Γ.
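A direct R translation of the bound (11.4) is sketched below; the function name and the simulated pair differences are illustrative, not code from the chapter.

# Normal approximation (11.4) to the upper bound on the one-sided P-value
# for Wilcoxon's signed rank statistic under a bias of at most Gamma
sens_bound <- function(d, gamma) {
  d <- d[d != 0]
  q <- rank(abs(d))                       # scores q_i
  t_obs <- sum(q[d > 0])                  # observed statistic
  kappa <- gamma / (1 + gamma)
  mu <- kappa * sum(q)                    # worst-case expectation
  v  <- kappa * (1 - kappa) * sum(q^2)    # = {Gamma/(1+Gamma)^2} sum q_i^2
  1 - pnorm((t_obs - mu) / sqrt(v))
}

set.seed(1)
d <- rnorm(300, mean = 0.3)
sapply(c(1, 2, 3), function(g) sens_bound(d, g))   # bounds increase with Gamma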
Let $F_k(\cdot)$ denote the distribution of the sum of $k$ independent standard exponential random variables, with $F_k(w) = 0$ for $w < 0$. If the $P_\ell$, $\ell = 1, \ldots, L$, are independent uniform random variables, then the truncated product $P_\wedge = \prod_{\ell=1}^{L} P_\ell^{\,\chi(P_\ell \le \tilde{\alpha})}$ with truncation point $\tilde{\alpha}$ satisfies, for $0 < w \le 1$,
$$\Pr(P_\wedge \le w) = \sum_{k=1}^{L} \binom{L}{k}\, \tilde{\alpha}^{k} (1-\tilde{\alpha})^{L-k} \left[ 1 - F_k\!\left\{ -\log\!\left( \frac{w}{\tilde{\alpha}^{k}} \right) \right\} \right], \tag{11.5}$$
or, in R,
Pr(P∧ ≤ w) = sum(dbinom(1:L, L, a) * (1 - pgamma(-log(w / a^(1:L)), 1:L))),  (11.6)
where a stands for $\tilde{\alpha}$.
In a sensitivity analysis, suppose that the $I$ pairs are divided into $L$ groups, and let $P_\ell$ be a P-value testing $H_\ell$ using the pairs in group $\ell$, computed from (11.2) for a specific unknown $u$ and $\gamma = \log(\Gamma)$; let $\overline{P}_{\Gamma\ell}$ be the corresponding upper bound in (11.3). If the bias is at most Γ, then (11.3) implies $P_\ell \le \overline{P}_{\Gamma\ell}$ and $\Pr(\overline{P}_{\Gamma\ell} \le \alpha \mid \mathcal{F}, \mathcal{W}) \le \Pr(P_\ell \le \alpha \mid \mathcal{F}, \mathcal{W}) \le \alpha$ for $\alpha \in [0, 1]$, $\ell = 1, \ldots, L$. Because the truncated product is a monotone increasing function of its component P-values, it follows that $\overline{P}_{\Gamma\wedge} = \prod_{\ell=1}^{L} \overline{P}_{\Gamma\ell}^{\,\chi(\overline{P}_{\Gamma\ell} \le \tilde{\alpha})}$ is an upper bound for $P_\wedge = \prod_{\ell=1}^{L} P_\ell^{\,\chi(P_\ell \le \tilde{\alpha})}$. Combining these two facts, if the bias is at most Γ, then $\Pr(\overline{P}_{\Gamma\wedge} \le w \mid \mathcal{F}, \mathcal{W}) \le \Pr(P_\wedge \le w \mid \mathcal{F}, \mathcal{W})$, where $\Pr(P_\wedge \le w \mid \mathcal{F}, \mathcal{W})$ is at most (11.5). If we calculate $w$ such that (11.5) equals $\alpha$, conventionally $\alpha = 0.05$, and if we reject $H_0$ when $\overline{P}_{\Gamma\wedge} \le w$, then we will falsely reject $H_0$ with probability at most $\alpha$ if the bias is at most Γ. Column IV of Table 11.1 performs these calculations for the malaria data using $\tilde{\alpha} = 0.05$. In Table 11.1, $\overline{P}_{\Gamma\ell}$, $\ell = 1, 2$ are computed for young and old pairs, and these are combined into the truncated product $\overline{P}_{\Gamma\wedge}$, whose P-value is determined from (11.5). The results in column IV of Table 11.1 testing $H_0$ using $\overline{P}_{\Gamma\wedge}$ are much less sensitive to bias than the results in column I using all of the pairs in a single analysis. To emphasize: combining two independent sensitivity analyses yields less sensitivity to unmeasured bias than a single sensitivity analysis that uses all of the data, and this occurred because the treatment effect appears to be much larger for children aged 10 or less. Indeed, the sensitivity Γ for $\overline{P}_{\Gamma\wedge}$ is only slightly worse than knowing a priori that attention should focus on the young pairs in Table 11.1.
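The calculation just described fits in a few lines of R; the two per-group bounds below are illustrative numbers standing in for the worst-case bounds of the young and old pairs at some Γ.

# Distribution function (11.5)/(11.6) of the truncated product of L
# independent uniform P-values with truncation point a
pPwedge <- function(w, L, a = 0.05) {
  sum(dbinom(1:L, L, a) * (1 - pgamma(-log(w / a^(1:L)), 1:L)))
}

# Critical value w solving (11.5) = alpha = 0.05 for L = 2 groups
w_crit <- uniroot(function(w) pPwedge(w, L = 2) - 0.05,
                  interval = c(1e-12, 0.05))$root   # approximately 0.025

# Truncated product of the per-group bounds; an empty product equals 1
trunc_prod <- function(p, a = 0.05) prod(p[p <= a])
p_bounds <- c(0.001, 0.30)         # illustrative bounds for young and old pairs
trunc_prod(p_bounds) <= w_crit     # TRUE: reject H0 at this Gamma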
Rejecting $H_0 = H_\wedge$ suggests there is an effect in at least one subgroup $\ell$, but it does not provide an inference about specific subgroups. Of course, it would be interesting to know which subgroups are affected. Closed testing was proposed by Marcus et al. as a general method for converting a test of a global null hypothesis into a multiple inference procedure for subhypotheses [32]. Using Wilcoxon's signed rank statistic in Table 11.1 with Γ = 1, closed testing rejects no effect $H_0 = H_1 \wedge H_2$ and then rejects both $H_1$ and $H_2$. Using Wilcoxon's signed rank statistic in Table 11.1 with Γ = 3, closed testing rejects no effect $H_0 = H_1 \wedge H_2$ and then rejects $H_1$ but not $H_2$. In words, there is some evidence of a treatment effect both for those under and for those over ten years of age, with the evidence about the young children being insensitive to a large bias of Γ = 3, while the evidence for older individuals is sensitive to some biases smaller than Γ = 1.4.
In this section, we introduce two methods, CART and Submax, for discovering effect modification. The CART method is useful when there is large effect modification: by using part of the outcome information, promising groups can be discovered and then treated as pre-specified groups. For moderate effect modification, the Submax method can be considered as an alternative; instead of discovering promising groups, each variable is examined to determine whether or not it is an effect modifier.
FIGURE 11.3
Top: The regression tree formed four groups of matched pairs by fitting the ranks of $|PD_i|$ using age and gender. Bottom: Four hypotheses were tested simultaneously: (i) $H_1$, no effect for age < 7.5; (ii) $H_{12}$, no effect for age < 17.5; (iii) $H_{123}$, no effect for age < 32.5; (iv) $H_{1234} = H_0$, no effect for all ages.
TABLE 11.2
Sensitivity analysis for the treated-minus-control difference in after-minus-before changes in parasite
frequency in blood samples among subgroups identified by the regression tree.
test of $H_0$, while being uncertain as to which one test to perform, we perform four tests of one $H_0$, adjusting for multiple testing using the technique in [35]. The four tests concern hypotheses $H_1$, $H_{12}$, $H_{123}$, and $H_{1234}$ in the bottom portion of Figure 11.3, where $H_{1234}$ is the same as Fisher's hypothesis $H_0$ of no effect. Because these are four tests of one null hypothesis, all computed from the same data, the tests are highly correlated, and a correction for multiple testing that takes the high correlation into account is a small correction; see §4.3 in [9] for more details.
Though multiple testing corrections can be applied to any tree to control the familywise error rate, it is more important to choose an appropriate tree at the outset. A CART tree can be selected by some predetermined criterion or by cross-validation; however, there is no theoretical guarantee that such trees perform well. Instead, domain knowledge can be helpful for choosing a tree, giving guidance on whether we need to grow the tree further or trim it. For instance, in the study of surgical mortality at hospitals with superior nursing [36], 130 surgical procedures were considered to examine effect modification. As a tree grows, CART will create several clusters of the procedures. When irrelevant surgical procedures fall in the same cluster, we need to divide that cluster further, by growing the tree, to ensure each cluster contains only related surgical procedures. Conversely, some clusters can be fragmented, so we need to integrate them into one larger cluster by trimming the tree. See §3 in [36] for more details. To control the false discovery rate when searching for effect modification in observational studies, Karmakar et al. [37] developed an approach for a collection of adaptive hypotheses identified from the data in a matched-pairs design.
According to the levels of each covariate $x$, the entire population can be divided into two smaller subgroups. If one of the subgroups has a larger effect size than the other, effect modification is likely to be detected, which usually leads to a smaller P-value. Two tests are defined for each covariate, and thus $2p$ tests can be defined for all $p$ covariates. It is still possible that there is no effect modification at all; to account for this situation, we include one test for the entire population. In total, there are $2p + 1$ tests for testing $H_0$. As $p$ increases, the number of tests increases, which makes it more difficult to find effect modification when there are too many uninformative covariates. To mitigate this issue, we create tests that are associated with each other; association can be generated naturally when designing the test statistics.
Let us revisit the malaria example. We have $p = 2$ binary variables, age and sex. Therefore, we consider $2p + 1 = 5$ comparisons: (1) all, (2) young, (3) old, (4) female, and (5) male. The test for young is associated with the test for female since the two groups share young females; the group of young females can be understood as a two-way interaction group. However, tests for some pairs are not associated with each other. To construct the association, we take the two-way interaction groups, which are disjoint, as building blocks for constructing the comparisons. Let $(T_1, T_2, T_3, T_4)$ be the test statistics for young female, young male, old female, and old male. Then each comparison can be constructed from these four statistics. For instance, the test statistic for young is defined as $S_2 = T_1 + T_2$. Similarly, the statistic $S_4$ for female is defined as $S_4 = T_1 + T_3$. The statistic $S_1$ for all is defined as $S_1 = T_1 + T_2 + T_3 + T_4$. In brief, we test the hypothesis of no treatment effect at all (using $S_1$), plus the four hypotheses of no effect in the $2p = 4$ overlapping subgroups.
Before evaluating the comparisons Sk , k = 1, . . . , K in general, we describe sensitivity analysis
with parameter Γ ≥ 1. To compute the upper bound on the P -value for overlapping Sk , we compute
the upper P -value bound for non-overlapping Tg first. An approximation to the bound is obtained by
considering a G-dimensional multivariate normal distribution. Subject to (11.1) for a given Γ ≥ 1,
find the maximum expectation, µΓg , of Tg . Also, among all treatment assignment probabilities that
satisfy (11.1) and that achieve the maximum expectation µΓg , find the maximum variance, νΓg , of
$T_g$; see [38] for detailed discussion. If treatment assignments were governed by the probabilities satisfying (11.1) that yield $\mu_{\Gamma g}$ and $\nu_{\Gamma g}$, then, under $H_0$ and mild conditions on the $q_{gij}$, the joint distribution of the $G$ statistics $(T_g - \mu_{\Gamma g})/\nu_{\Gamma g}^{1/2}$ converges to a $G$-dimensional Normal distribution with expectation $\mathbf{0}$ and covariance matrix $\mathbf{I}$ as $\min(I_g) \to \infty$. Write $\mu_\Gamma = (\mu_{\Gamma 1}, \ldots, \mu_{\Gamma G})^T$ and $V_\Gamma$ for the $G \times G$ diagonal matrix with $g$th diagonal element $\nu_{\Gamma g}$.
For $p$ binary covariates, we consider $G = 2^p$ disjoint $p$-way interaction groups. Let $C$ be the $K \times G$ matrix whose $K$ rows are the $c_k^T = (c_{1k}, \ldots, c_{Gk})$, $k = 1, \ldots, K$. Thus, the statistic $S_k$ is represented as a linear combination of the $T_g$ using a 0-1 vector $c_k$: $S_k = \sum_{g=1}^{G} c_{gk} T_g$. In the malaria example, $c_1^T = (1, 1, 1, 1)$ for all and $c_2^T = (1, 1, 0, 0)$ for young. The $C$ matrix in this example is
$$C = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}.$$
Define $\theta_\Gamma = C\mu_\Gamma$ and $\Sigma_\Gamma = C V_\Gamma C^T$, noting that $\Sigma_\Gamma$ is not typically diagonal. Write $\theta_{\Gamma k}$ for the $k$th coordinate of $\theta_\Gamma$ and $\sigma^2_{\Gamma k}$ for the $k$th diagonal element of $\Sigma_\Gamma$. Define $D_{\Gamma k} = (S_k - \theta_{\Gamma k})/\sigma_{\Gamma k}$ and $D_\Gamma = (D_{\Gamma 1}, \ldots, D_{\Gamma K})^T$. Finally, write $\rho_\Gamma$ for the $K \times K$ correlation matrix formed by dividing the element of $\Sigma_\Gamma$ in row $k$ and column $k'$ by $\sigma_{\Gamma k}\sigma_{\Gamma k'}$. Subject to (11.1) under $H_0$, at the treatment assignment probabilities that yield the $\mu_{\Gamma g}$ and $\nu_{\Gamma g}$, the distribution of $D_\Gamma$ converges to a Normal distribution, $N_K(\mathbf{0}, \rho_\Gamma)$, with expectation $\mathbf{0}$ and covariance matrix $\rho_\Gamma$ as $\min(I_g) \to \infty$. Using this null distribution, the null hypothesis $H_0$ is tested using $D_{\Gamma\max} = \max_{1 \le k \le K} D_{\Gamma k}$. The α critical value
TABLE 11.4
Submax: Five standardized deviates DΓk , k = 1, . . . , 5 for Wilcoxon’s test and their maximum
DΓ max are shown. The critical value κΓ,α = 2.19 for Γ ≥ 1, α = 0.05. Deviates larger than 2.19
are in bold.
$\kappa_{\Gamma,\alpha}$ satisfies $\Pr(D_{\Gamma\max} \ge \kappa_{\Gamma,\alpha}) = \alpha$ under $H_0$. In general, $\kappa_{\Gamma,\alpha}$ depends upon both Γ and α. The multivariate Normal approximation
to κΓ,α is obtained using the qmvnorm function in the mvtnorm package in R, as applied to the
NK (0, ρΓ ) distribution; see [39]. Notice that this approximation to κΓ,α depends upon Γ only
through $\rho_\Gamma$, which in turn depends upon Γ only through $\nu_{\Gamma g}$. The critical value $\kappa_{\Gamma,\alpha}$ for $D_{\Gamma\max}$ is larger than $\Phi^{-1}(1-\alpha)$ because the largest of $K$ statistics $D_{\Gamma k}$ has been selected, and it reflects the correlations $\rho_\Gamma$ among the coordinates of $D_\Gamma$.
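For the five malaria comparisons, the critical value can be sketched in R as follows, under the simplifying assumption, not made in the actual analysis, that the nu_Gamma_g are equal across the four interaction groups, so that rho_Gamma reduces to the balanced-case correlation matrix.

library(mvtnorm)
# Rows of C: all, young, old, female, male; columns: the four disjoint
# interaction groups (young female, young male, old female, old male)
C <- rbind(c(1, 1, 1, 1),
           c(1, 1, 0, 0),
           c(0, 0, 1, 1),
           c(1, 0, 1, 0),
           c(0, 1, 0, 1))
V <- diag(4)                       # nu_Gamma_g, taken equal here
rho <- cov2cor(C %*% V %*% t(C))   # correlation matrix rho_Gamma
kappa <- qmvnorm(0.95, tail = "lower.tail", corr = rho)$quantile
kappa                              # roughly 2.2, as in Tables 11.4 and 11.5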
Table 11.4 performs the test for the malaria data using the Wilcoxon signed rank test statistic. The row of Table 11.4 for Γ = 1 consists of Normal approximations to randomization tests, while the rows with Γ > 1 examine sensitivity to bias from nonrandom treatment assignment. For Γ = 1, the test statistic $D_{\Gamma\max} = 12.30 \ge \kappa_{\Gamma,\alpha} = 2.19$. Thus, Fisher's hypothesis of no treatment effect would be rejected at level α = 0.05 if the data had come from a randomized experiment (Γ = 1). The maximum statistic is based on the 447 young pairs, $D_{\Gamma\max} = D_{\Gamma 2}$, but $D_{\Gamma k} \ge \kappa_{\Gamma,\alpha} = 2.19$ for every subgroup, $k = 1, \ldots, 5$. At Γ = 1.6, the deviates other than $D_{\Gamma 2}$ no longer exceed 2.19. The deviate $D_{\Gamma 2}$ far exceeds 2.19, providing strong evidence of effect modification even when random treatment assignment is violated. The maximum deviate $D_{\Gamma\max}$ is attained at $D_{\Gamma 2}$ for any value of Γ and exceeds $\kappa_{\Gamma,\alpha}$ up to Γ = 3.7. This indicates that the strongest and least sensitive evidence of a treatment effect on malaria parasites is for young children. Though we pay a price for multiple testing, the conclusion is no different from that of the CART method.
Unlike the CART method, when applying the Submax method the number of comparisons to be tested increases only linearly as the number of covariates increases: there are $2^p$ interaction groups, but only $2p + 1$ comparisons are used for testing. Consider a simple, balanced case under $H_0$ in which outcomes are continuously distributed and hence untied with probability one, and assume the same number of matched pairs in each group, $I_1 = \cdots = I_G = \bar{I}$. The Wilcoxon signed rank statistic $T_g$ is computed from the $\bar{I}$ pairs in each group $g$. In this case, $\mu_{\Gamma g} = \{\Gamma/(1+\Gamma)\}\,\bar{I}(\bar{I}+1)/2$ and $\nu_{\Gamma g} = \{\Gamma/(1+\Gamma)^2\}\,\bar{I}(\bar{I}+1)(2\bar{I}+1)/6$; see [40].
TABLE 11.5
The critical constant $\kappa_{\Gamma,\alpha}$ in a simple, balanced case with Wilcoxon's signed rank statistics $T_g$ and $I_1 = \cdots = I_G = \bar{I}$.
p K = 2p + 1 κΓ,α Bonferroni
0 1 1.64 1.64
1 3 2.03 2.13
2 5 2.20 2.33
3 7 2.32 2.45
4 9 2.40 2.54
7 15 2.55 2.71
12 25 2.70 2.88
Because of the symmetry, the correlation matrix $\rho_\Gamma$ of $D_\Gamma$ has a simple form: $D_{\Gamma 1}$ has correlation $1/\sqrt{2} \approx 0.707$ with $D_{\Gamma k}$ for $k \ge 2$, the two comparisons for the two categories of the same binary variable are uncorrelated, and all other pairs of comparisons have correlation 0.5. In this simple and balanced case, Table 11.5 shows the critical constant $\kappa_\alpha = \kappa_{\Gamma=1,\alpha}$ for various values of $p$. The key point in Table 11.5 is that $\kappa_\alpha$ increases very slowly for larger values of $p$. The Submax method for 7 potential effect modifiers has critical value $\kappa_\alpha = 2.55$, almost the same as the critical value 2.54 of the Bonferroni method for screening 4 potential effect modifiers. Similarly, the Submax method for 12 effect modifiers and the Bonferroni method for 7 effect modifiers require similar critical values.
The main goal is to discover effect modification by testing the global null hypothesis $H_0$, Fisher's hypothesis of no treatment effect in the study as a whole. However, we may also be interested in testing the hypothesis $H_k$ that asserts there is no effect in subgroup $S_k$, say no effect in the subgroup of young children. We would like to test all $K$ hypotheses $H_k$, $k = 1, \ldots, K$ simultaneously while controlling the familywise error rate at α. Simultaneous inference can be done using the closed testing method [32]. The precise meaning of Table 11.4 can be further examined with the closed testing method; see [41] for more details.
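For the two-subgroup case of Table 11.1, closed testing reduces to a simple rule, sketched below with illustrative worst-case bounds; w_crit is the truncated-product critical value of Section 11.3.1, hard-coded here for brevity.

# Closed testing for two subgroup hypotheses H1, H2: reject Hk at level
# alpha iff the global test of H1 ^ H2 rejects and Hk's own test rejects.
closed_test2 <- function(p1, p2, w_crit, alpha = 0.05, a = 0.05) {
  tp <- prod(c(p1, p2)[c(p1, p2) <= a])     # truncated product (empty = 1)
  global <- tp <= w_crit                    # global test of H1 ^ H2
  c(reject_H1 = global && p1 <= alpha,
    reject_H2 = global && p2 <= alpha)
}

w_crit <- 0.025   # solves (11.5) = 0.05 for L = 2, truncation 0.05
closed_test2(p1 = 0.001, p2 = 0.03, w_crit)   # both hypotheses rejected
closed_test2(p1 = 0.001, p2 = 0.30, w_crit)   # only H1 rejected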
Submax may have more power to detect moderate effect modification; more discussion of moderate effect modification can be found in [41]. The Submax and CART methods can be combined in several ways. For instance, an investigator may combine a few potential effect modifiers with a few subgroups suggested by CART and apply the Submax method to the considered subgroups. Alternatively, some potential effect modifiers can be chosen as the covariates appearing in the CART tree output, and the Submax method can then be run for the chosen covariates only. In the malaria example, there is evidence through CART that age is an effect modifier, but no evidence about sex. We may be interested in checking whether there is effect modification by sex within the young group; that is, a comparison between young females and young males. Instead of considering four two-way interaction groups, we can consider three groups (i.e., young female, young male, old) only. The Submax method can be applied with an appropriate $C$ matrix.
FIGURE 11.4
The discovered tree using the discovery sample of 0.25n = 390.
Splitting the sample in this way makes the inference step more efficient and thus leads to a causal conclusion more robust to unmeasured bias. We will show this with the malaria example in the following section.
The discovery step is straightforward. We randomly split the total (matched) sample into two subsamples: the discovery and inference subsamples. We suggest using $0.25n$ for the discovery subsample size. While the inference subsample is set aside, the discovery subsample is used to fit CART; in this step we use not just the signs of the paired differences but all of the information in the outcomes. The terminal nodes of the CART output define discovered subgroups, which are then examined using the remaining inference subsample. Any method that produces disjoint subgroups can be applied in this discovery step. After trying several methods, an investigator can choose one (or a combined structure), since the discovery subsample is not used for inference. Thus, this subsample plays a role akin to a test sample in the machine learning literature, and a method that minimizes the test error can be chosen.
There is one main issue in applying the sample-splitting strategy: how do we divide the total sample? The ratio between the discovery and inference samples is critical to the overall performance of discovery and inference; however, no theory is available yet. Through several simulation studies, we found that a (25%, 75%) splitting ratio performs best. The best performance can be defined in several ways. During the discovery step, the primary interest is to find the subgroups with heterogeneous treatment effects: we want to avoid missing true heterogeneous subgroups. However, highly sensitive searches have low specificity, and falsely discovered subgroups (those thought to have heterogeneous effects that actually do not) can reduce the power of a test during the inference step. The (25%, 75%) splitting ratio strikes a good balance between discovering true heterogeneous subgroups and preserving statistical power. In addition to the splitting ratio, randomness in splitting can make it difficult to obtain a consistent result: when a different subsample is drawn, a different CART output is discovered. Subsampling techniques are helpful for finding a stable effect modification structure. More discussion of the optimal ratios and of randomness in sample-splitting appears in [42].
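A minimal R sketch of the split-and-discover step, using simulated paired differences in place of the malaria data, is given below; the covariates, cutoffs, and effect sizes are illustrative.

library(rpart)
set.seed(3)
# Hypothetical matched-pair differences PD with pair-level covariates
pairs <- data.frame(age = sample(1:60, 1560, replace = TRUE),
                    sex = factor(sample(c("F", "M"), 1560, replace = TRUE)))
pairs$PD <- -0.8 * (pairs$age < 8) + rnorm(1560)   # larger effect for young pairs

disc <- sample(nrow(pairs), round(0.25 * nrow(pairs)))   # 25% discovery subsample
tree <- rpart(PD ~ age + sex, data = pairs[disc, ], method = "anova")
inference <- pairs[-disc, ]        # 75% reserved for the inference step
tree$frame[tree$frame$var == "<leaf>", c("n", "yval")]   # discovered subgroups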
Let us look at the malaria example again. The Denovo method is applied to the malaria example with the splitting ratio (25%, 75%). A random sample of $0.25n = 390$ is chosen from the original matched pair sample. Figure 11.4 shows the tree obtained by regressing the 390 treated-minus-control paired differences $PD_i$ on age and sex. There are three splits in the tree and four terminal nodes (groups). The four disjoint groups are (1) Age < 2.5, (2) Male & 2.5 < Age < 7.5, (3) Female & 2.5 < Age < 7.5, and (4) Age > 7.5. The tree is quite different from the one evaluated in the previous section: the cutoff values for age are 2.5 and 7.5 instead of the 10 used in the Submax method, though we can confirm that the two data-driven methods agree on the cutoff value 7.5. The second and third subgroups are constructed by splitting twice and are more specific than the previously considered subgroups. During the discovery step, these specific groups are suspected of having large effects. In the following subsection, we will see how to confirm effect modification with a tree like that in Figure 11.4.
$$\text{Reject } H_0 \text{ if } D_{\Gamma\max} = \max_{1 \le k \le 2G-2} \frac{S_k - \theta_{\Gamma k}}{\sigma_{\Gamma k}} \ge \kappa_{\Gamma,\alpha}, \tag{11.8}$$
where $\kappa_{\Gamma,\alpha}$ is the critical value obtained from a multivariate Normal distribution. In general, $\kappa_{\Gamma,\alpha}$ varies with Γ, but for Wilcoxon's statistic $\kappa_{\Gamma,\alpha}$ is constant for any Γ.
Now we are ready to confirm effect modification in the form of the tree in Figure 11.4. The inference subsample of size 1170 is used to test the global null hypothesis $H_0$. Table 11.6 shows the deviates for the considered comparisons. The critical value $\kappa_{\Gamma,\alpha} = 2.57$ is noticeably larger than the Submax method's $\kappa_{\Gamma,\alpha} = 2.19$: the Submax method actively exploits the correlation between the subgroups, whereas the Denovo method uses the correlation induced by the tree structure. Among the disjoint subgroups $\{D_{\Gamma 1}, D_{\Gamma 2}, D_{\Gamma 3}, D_{\Gamma 4}\}$, $D_{\Gamma 2}$ has the largest deviate, indicating that the
TABLE 11.6
Denovo: Six standardized deviates DΓk , k = 1, . . . , 6 for Wilcoxon’s test and their maximum DΓ max
are shown. The critical value κΓ,α = 2.57 for α = 0.05. Deviates larger than 2.57 are in bold.
group of males aged 3-7 is the least sensitive to unmeasured bias. Also, the large deviates of $D_{\Gamma 5}$ and $D_{\Gamma 6}$ are due to $D_{\Gamma 2}$. Interestingly, the sample size for $D_{\Gamma 2}$ is 96, which is much smaller than the young group's sample size of 447 in Table 11.4. Nevertheless, from the sensitivity analysis, we find the conclusion is unchanged up to Γ = 4.7, a large improvement over the Γ = 3.7 of the Submax method. We used only 75% of the sample to test $H_0$, yet the sensitivity analysis results improved substantially. In addition, we can check again that age is an important effect modifier, and we learn the additional important fact that sex is an effect modifier within 2.5 < Age < 7.5.
We can imagine several situations to better understand Table 11.6. First, consider a case with only one split, at the age cutoff value 7.5. In this case, we have two subgroups, $1 \cup 2 \cup 3$ and 4. Using the Bonferroni correction, the critical value is 1.96, and the sensitivity parameter can be summarized as Γ = 3.6. This Γ value is not too different from the Submax method's value Γ = 3.7. If we further divide $1 \cup 2 \cup 3$ into 1 and $2 \cup 3$, the comparisons are $D_{\Gamma 1}$, $D_{\Gamma 4}$, $D_{\Gamma 5}$. The Γ value is then 3.7 using a new critical value of 2.12, again not significantly improved. These two situations imply that discovering $D_{\Gamma 2}$ is critically important to improving the sensitivity analysis result. One more analysis is possible: if we use only the four terminal nodes (i.e., $\{D_{\Gamma 1}, D_{\Gamma 2}, D_{\Gamma 3}, D_{\Gamma 4}\}$) of the tree without the internal nodes, the critical value decreases to 2.24, and the sensitivity analysis can be further improved to Γ = 4.8.
As shown above, spending part of the sample to discover more specific subgroups is a good strategy, as such subgroups can easily be missed by the other methods. However, there is one thing to be careful about: a discovered tree often needs to be adjusted for the purpose of the study. Overly specific subgroups can be harmful if the discovered structure is needlessly more complicated than the true structure. An easy remedy is to include internal nodes to avoid such situations. An investigator can choose an appropriate tree by trimming or growing the discovered tree. For instance, for public policy, discovering overly complicated subgroups is undesirable; conversely, for personalized medicine, discovering overly simple subgroups is not informative. Selecting a suitable tree size can be critical for the inference step. See [42] for more discussion of choosing an appropriate tree.
11.6 Discussion
In this chapter, we provided an overview of studying effect modification in matched observational
studies. By discovering effect modification, we expect to report much firmer causal conclusions in
subgroups with larger effects. That is, we expect the design sensitivity and the power of the sensitivity
analysis to be larger, so we expect to report findings that are insensitive to larger unmeasured biases
in these subgroups. Such a discovery is important in three ways. First, the finding about the affected
subgroup is typically important in its own right as a description of that subgroup. Second, if there is
no evidence of an effect in the complementary subgroup, then that may be news as well. Third, if a
sensitivity analysis convinces us that the treatment does indeed cause effects in one subgroup, then
this fact demonstrates that the treatment does sometimes cause effects, making it somewhat more plausible that smaller and more sensitive effects in other subgroups are causal rather than spurious. This is analogous to the situation in which we discover that heavy smoking causes a great deal of lung cancer and are then more easily convinced that secondhand smoke causes some lung cancer, even though the latter effect is much smaller and more sensitive to unmeasured bias.
The methods we introduced have been applied to matched observational study data. In the malaria example, two covariates, age and gender, were exactly matched in the 1560 matched pairs. In studying effect modification, it is convenient to have treatment-control pairs matched for the same values of the effect modifiers. It is often difficult to match exactly for many covariates x, but not so difficult to match exactly for a few covariates and simply balance the rest. However, before we examined the matched pairs, we did not know which covariates would be suggested as possible effect modifiers, so we did not know which covariates should be exactly matched and which could merely be balanced. Strength-k matching provides a solution to this problem. Strength-k balance means that every subset of k covariates is exactly balanced. In a strength-k match, all covariates are exactly matched in the maximum possible number of pairs, and one can always rematch the inexact pairs to be exact for k or fewer covariates. This rematching does not alter the fact that all pairs together constitute a strength-k match of all covariates, because that property is unaffected by who is matched to whom. For more details, see §6 in [10].
The CART and Denovo methods use a tree structure to study effect modification. The size of the tree is important for the later analysis, which must justify a large-sample approximation. However, as the sample size grows, a larger tree with more specific subgroups is likely to be discovered. When a tree is discovered, its size is usually selected by cross-validation and so is not fixed; theoretical results on how cross-validation affects the performance of the methods are not yet available. To ensure the validity of the proposed methods, the maximum depth of the tree can be specified in advance, with the depth depending on the sample size and the research questions of interest. The R package for the Denovo method is available at https://github.com/kwonsang.
References
[1] Paul R Rosenbaum. Design sensitivity in observational studies. Biometrika, 91(1):153–164,
2004.
[2] Paul R Rosenbaum. Design sensitivity and efficiency in observational studies. Journal of the
American Statistical Association, 105(490):692–702, 2010.
[3] Paul R Rosenbaum. Design of Observational Studies. Springer, 2010.
[4] Louis Molineaux, Gabriele Gramiccia, World Health Organization, et al. The Garki project:
research on the epidemiology and control of malaria in the Sudan savanna of West Africa.
World Health Organization, 1980.
[5] Paul R Rosenbaum and Donald B Rubin. Constructing a control group using multivariate
matched sampling methods that incorporate the propensity score. The American Statistician,
39(1):33–38, 1985.
[6] Elizabeth A Stuart and David B Hanna. Commentary: Should epidemiologists be more sensitive
to design sensitivity? Epidemiology, 24(1):88–89, 2013.
[7] José R Zubizarreta, Magdalena Cerdá, and Paul R Rosenbaum. Effect of the 2010 chilean
earthquake on posttraumatic stress reducing sensitivity to unmeasured bias through study design.
Epidemiology (Cambridge, Mass.), 24(1):79, 2013.
[8] Paul R Rosenbaum. Heterogeneity and causality: Unit heterogeneity and design sensitivity in
observational studies. The American Statistician, 59(2):147–152, 2005.
[9] Jesse Y Hsu, Dylan S Small, and Paul R Rosenbaum. Effect modification and design sensitivity
in observational studies. Journal of the American Statistical Association, 108(501):135–148,
2013.
[10] Jesse Y Hsu, José R Zubizarreta, Dylan S Small, and Paul R Rosenbaum. Strong control of the
familywise error rate in observational studies that discover effect modification by exploratory
methods. Biometrika, 102(4):767–782, 2015.
[11] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects.
Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
[12] Andrew Chesher. Testing for neglected heterogeneity. Econometrica: Journal of the Economet-
ric Society, 52(4):865–872, 1984.
[13] Richard K Crump, V Joseph Hotz, Guido W Imbens, and Oscar A Mitnik. Nonparametric tests
for treatment effect heterogeneity. The Review of Economics and Statistics, 90(3):389–405,
2008.
[14] Peng Ding, Avi Feller, and Luke Miratrix. Randomization inference for treatment effect
variation. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pages
655–671, 2016.
[15] Steven F Lehrer, R Vincent Pohl, and Kyungchul Song. Targeting policies: Multiple testing
and distributional treatment effects. Technical report, National Bureau of Economic Research,
2016.
[16] Xun Lu and Halbert White. Testing for treatment dependence of effects of a continuous treatment.
Econometric Theory, 31(5):1016–1053, 2015.
[17] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects
using random forests. Journal of the American Statistical Association, 113(523):1228–1242,
2018.
[18] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66(5):688, 1974.
[19] Jerzy Splawa-Neyman, Dorota M Dabrowska, and TP Speed. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4):465–472, 1990.
[20] Ronald Aylmer Fisher. Statistical methods for research workers. In Breakthroughs in statistics,
pages 66–70. Springer, 1992.
[21] BL Welch. On the z-test in randomized blocks and Latin squares. Biometrika, 29(1/2):21–52, 1937.
[22] BM Brown. Symmetric quantile averages and related estimators. Biometrika, 68(1):235–242,
1981.
[23] Gottfried E Noether. Some simple distribution-free confidence intervals for the center of a
symmetric distribution. Journal of the American Statistical Association, 68(343):716–719,
1973.
[24] JS Maritz. A note on exact robust confidence intervals for location. Biometrika, 66(1):163–170,
1979.
[25] Paul R Rosenbaum. A new u-statistic with superior design sensitivity in matched observational
studies. Biometrics, 67(3):1017–1027, 2011.
[26] W Robert Stephenson. A general class of one-sample nonparametric test statistics based on
subsamples. Journal of the American Statistical Association, 76(376):960–966, 1981.
[27] Erich L Lehmann and Joseph P Romano. Unbiasedness: Applications to normal distributions.
In Testing statistical hypotheses, pages 150–211. Springer, 2005.
[28] Paul R Rosenbaum. Randomized experiments. In Observational studies, pages 19–70. Springer,
2002.
[29] Paul R Rosenbaum. Sensitivity to hidden bias. In Observational studies, pages 105–170.
Springer, 2002.
[30] Paul R Rosenbaum. Attributing effects to treatment in matched observational studies. Journal
of the American statistical Association, 97(457):183–192, 2002.
[31] Dmitri V Zaykin, Lev A Zhivotovsky, Peter H Westfall, and Bruce S Weir. Truncated prod-
uct method for combining p-values. Genetic Epidemiology: The Official Publication of the
International Genetic Epidemiology Society, 22(2):170–185, 2002.
[32] Ruth Marcus, Eric Peritz, and K Ruben Gabriel. On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655–660, 1976.
[33] Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and
regression trees. Routledge, 2017.
[34] Terry M Therneau, Elizabeth J Atkinson, et al. An introduction to recursive partitioning using
the rpart routines. Technical report, Technical report Mayo Foundation, 1997.
[35] Paul R Rosenbaum. Testing one hypothesis twice in observational studies. Biometrika,
99(4):763–774, 2012.
[36] Kwonsang Lee, Dylan S Small, Jesse Y Hsu, Jeffrey H Silber, and Paul R Rosenbaum. Dis-
covering effect modification in an observational study of surgical mortality at hospitals with
superior nursing. Journal of the Royal Statistical Society: Series A (Statistics in Society),
181(2):535–546, 2018.
[37] Bikram Karmakar, Ruth Heller, and Dylan S Small. False discovery rate control for effect
modification in observational studies. Electronic Journal of Statistics, 12(2):3232–3253, 2018.
[38] Joseph L Gastwirth, Abba M Krieger, and Paul R Rosenbaum. Asymptotic separability in
sensitivity analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
62(3):545–555, 2000.
[39] Alan Genz and Frank Bretz. Computation of multivariate normal and t probabilities, volume
195. Springer Science & Business Media, 2009.
[40] Paul R Rosenbaum. Observational studies. Springer, 2002.
[41] Kwonsang Lee, Dylan S Small, and Paul R Rosenbaum. A powerful approach to the study of
moderate effect modification in observational studies. Biometrics, 74(4):1161–1170, 2018.
[42] Kwonsang Lee, Dylan S Small, and Francesca Dominici. Discovering heterogeneous exposure
effects using randomization inference in air pollution studies. Journal of the American Statistical
Association, 116(534):569–580, 2021.
12
Optimal Nonbipartite Matching
CONTENTS
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
12.2 Optimal Nonbipartite Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.2.1 Bipartite matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.2.2 Nonbipartite matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
12.2.3 A small example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
12.3 Illustrative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
12.3.1 The impact of the office of national drug control policy media campaign . . . . 231
12.3.2 The cross-match test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.3.3 Strengthening instrumental variable analyses, near-far matching . . . . . . . . . . . 233
12.3.4 Matching before randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
12.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
12.5 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Matching is a powerful design tool for removing measured confounding in both observational studies and experiments. Conventional matching designs focus on a two-group setup, namely treated and control groups, which is known as bipartite matching. In more complex real-world scenarios, more than two treatment groups are not uncommon, either in the form of different dose levels or as several unordered intervention arms. This chapter reviews the methodology for creating matched pairs in the presence of multiple groups, or in situations without clear grouping, which is referred to as nonbipartite matching. Such a design may be used to match with doses of treatment, with multiple control groups, or with various time points in a longitudinal study, or as an aid to strengthening an instrumental variable analysis.
12.1 Introduction
We usually think of matching for settings with two distinct groups, an exposed group and an unexposed group; smokers versus non-smokers is a classic example. We call these settings "bipartite," referring to two separate parts. However, exposures are often more continuous in nature. Consider studying smokers with the exposure measured as self-reported average cigarettes smoked per day. Someone smoking 10 cigarettes per day is a heavy smoker compared to someone smoking 1 cigarette per day, but a light smoker compared to someone smoking 20 cigarettes per day. We could think of the levels of cigarettes per day as creating many groups, or as one big group, smokers, with varying levels of exposure. Studies without two distinct groups, having either one big group or many groups, are referred to as "nonbipartite" (sometimes hyphenated "non-bipartite"). This chapter discusses nonbipartite matching methods for such settings.
Matching is a popular tool for causal inference utilizing data with non-randomized treatment
assignment or exposure status. Matching forms groups that are similar in the distributions of their
observed covariates. For covariates to be confounders, they must be associated with both the outcome
and the exposure. By balancing the distributions of covariates between groups, matching breaks the
covariates’ associations with the exposure, thus controlling for confounding. The covariates remain
predictive of the outcome, but they no longer induce biased inference when comparing the outcome
between the matched groups [1].
Matching can be an appealing alternative to model-based analyses for several reasons. Matching
is akin to experimental trials in that the burden of controlling for confounding is placed on the
study’s design, rather than its analysis. This can facilitate a simple comparison of outcomes between
groups, similar to the unadjusted analysis of a randomized trial. The study design can allow using
methods with no parametric assumptions. Following Fisher’s suggestion of randomization being the
reasoned basis for causal inference, randomization-based testing and estimation strategies can be
readily implemented in a matched design [2]. Matching methods that create nonoverlapping sets, e.g.
matching without replacement, preserve an independence structure that facilitates using conventional
estimates of standard errors [3]. Additionally, matching does not preclude the use of model-based
analyses. Model-based analyses may still be performed on a matched cohort, as they are in some
randomized trials. Through balancing the covariate distributions, matching makes these adjusted
analyses more robust [4].
The control for confounding is only as good as the covariates being balanced. Imagine a study of
smoking and cardiovascular disease that did not include sex as a covariate when men are known to be
heavier smokers and to have greater cardiovascular risk. The application of matching methods places
clear focus on which covariates are being balanced and on the balance being achieved. The approach
encourages the integration of existing knowledge when designing and evaluating the study. The
quality of covariate balance may be assessed and iteratively improved without creating an opportunity
to cherry-pick the study’s results [5]. Unlike checking model fit and examining residuals, which are
generally performed with the analysis results unblinded, checking the performance of the matching
can and should be done fully blinded to the analysis results, i.e., prior to any analysis of the outcome.
Finally, many matching methods select the cohort where the data is most capable of providing
meaningful information. Imagine a study of the impact of smoking among school-aged children
in first through twelfth grade. A matched analysis would put little to no emphasis on the youngest
children, among whom smoking is extremely rare. Finding the cohort most suited for analysis may
not be trivial because of the interplay between the exposure and the covariates. Consider a common
dichotomization for smoking adults: <10 cigarettes per day is low and ≥10 is high. This would not
be a good choice for smoking children where smoking intensity is strongly correlated with age and
other covariates. What threshold is appropriate for a “heavy” smoker may vary across regions of the
covariate space. As will be described below, a nonbipartite matched design would seek out the high
and low smoking levels for each region of the multidimensional covariate space.
This chapter first briefly reviews optimal matching in the bipartite case and then extends it to the nonbipartite case, noting available statistical software. A small numerical example is presented to fix ideas. Three examples from the literature are then reviewed in more detail to illustrate different ways of using an optimal nonbipartite matching design.
For bipartite matching, subjects from the same group are not allowed to be matched to each other, so the distances between those subjects are set to infinity or, in practical terms, to a very large positive number. Taking these edges out of the matching effectively reduces the search space for the optimal solution and hence shortens the running time, but it also rules out otherwise feasible matches and makes the design less flexible. For example, in a cohort of smokers with different numbers of cigarettes smoked per day (CPD), a simple two-group split may classify 10 CPD or less as light smoking and more than 10 CPD as heavy smoking. Then matching someone smoking 15 CPD to another person smoking 10 CPD is allowed, but matching persons with 20 CPD and 15 CPD is not, even though both pairings have a dose difference of 5 units. Nonbipartite matching may be more appropriate in situations involving various doses of exposure, as it does not require grouping.
TABLE 12.1
A 6 × 6 distance matrix for nonbipartite matching.
ID   1    2    3    4    5    6
1    0    1    ∞    2    ∞   10
2    1    0    2   10  100    ∞
3    ∞    2    0    ∞   10    ∞
4    2   10    ∞    0   30  100
5    ∞  100   10   30    0  100
6   10    ∞    ∞  100  100    0
TABLE 12.2
An 8 × 8 distance matrix for nonbipartite matching with sinks.
ID   1    2    3    4    5    6   7*   8*
1    0    1    ∞    2    ∞   10    0    0
2    1    0    2   10  100    ∞    0    0
3    ∞    2    0    ∞   10    ∞    0    0
4    2   10    ∞    0   30  100    0    0
5    ∞  100   10   30    0  100    0    0
6   10    ∞    ∞  100  100    0    0    0
7*   0    0    0    0    0    0    0    ∞
8*   0    0    0    0    0    0    ∞    0
The two sinks (7* and 8*) have zero distance to every original unit and infinite distance between themselves. Because the distance from original units to sinks is 0, the total distance equals the total distance between the paired original units. The algorithm minimizes that total distance, so it makes the optimal choice of which two units to remove. As a result, four pairs are created under the optimal nonbipartite match, namely {[1, 4], [2, 3], [5, 7], [6, 8]}, with a total distance of 2 + 2 + 0 + 0 = 4. The two sinks take away two original units, so the final match is {[1, 4], [2, 3]}. This is the optimal set of four original units if two units must be removed. The matching quality is much improved, with a maximal individual edge distance of 2 versus 10 in the previous match.
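The toy example is small enough to verify by exhaustive search. The base-R sketch below, written for this chapter rather than taken from any package, enumerates all perfect matchings of the 8 × 8 matrix in Table 12.2 and reproduces the optimal match and its total distance of 4; real applications instead use polynomial-time nonbipartite matching algorithms [12, 13], such as those discussed in [11].

# Distance matrix of Table 12.2; Inf encodes forbidden matches and the
# two sinks (rows/columns 7 and 8) have zero distance to original units.
d <- matrix(c(
    0,   1, Inf,   2, Inf,  10,   0,   0,
    1,   0,   2,  10, 100, Inf,   0,   0,
  Inf,   2,   0, Inf,  10, Inf,   0,   0,
    2,  10, Inf,   0,  30, 100,   0,   0,
  Inf, 100,  10,  30,   0, 100,   0,   0,
   10, Inf, Inf, 100, 100,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0, Inf,
    0,   0,   0,   0,   0,   0, Inf,   0), nrow = 8, byrow = TRUE)

# Recursively enumerate all perfect matchings and keep the cheapest one.
best_match <- function(nodes, d) {
  if (length(nodes) == 0) return(list(pairs = list(), cost = 0))
  i <- nodes[1]; best <- list(pairs = NULL, cost = Inf)
  for (j in nodes[-1]) {
    if (!is.finite(d[i, j])) next
    sub <- best_match(setdiff(nodes, c(i, j)), d)
    if (d[i, j] + sub$cost < best$cost)
      best <- list(pairs = c(list(c(i, j)), sub$pairs),
                   cost = d[i, j] + sub$cost)
  }
  best
}

res <- best_match(1:8, d)
res$pairs  # [1,4], [2,3], [5,7], [6,8]
res$cost   # 4; dropping the sink pairs leaves the final match [1,4], [2,3]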
12.3.1 The impact of the office of national drug control policy media campaign
Evaluating media campaigns or marketing strategies is challenging because not only is exposure to media self-selected, but there is also no natural control group when a large-scale campaign is carried out across the country. An example of such a study is the nationwide media campaign launched by the United States Office of National Drug Control Policy (ONDCP) to reduce illegal drug use by young Americans. Since almost all teenagers with access to TV, radio, newspapers, etc., were exposed to the media campaign to some degree, a sensible evaluation approach is to compare teens who received different exposures to the campaign but who were similar in terms of baseline characteristics. This is referred to as a matching-with-doses design, which forms pairs with very different doses of treatment in such a way that the final contrast groups with high versus low dose have similar distributions of observed covariates. The design then resembles a randomized study with high/low dose assignment, barring any unmeasured confounding.
To evaluate the impact of this antidrug media campaign, Lu et al. classified 521 teenagers from a pilot dataset of this study into five ordinal dose groups based on how often they reported seeing antidrug commercials (1 denoting the least exposure and 5 the most) [10]. The following ordinal logit model was used to estimate the five-level dose distribution:
$$\log \frac{\Pr(Z_k \ge d)}{\Pr(Z_k < d)} = \theta_d + \beta^T x_k, \quad d = 2, 3, 4, 5,$$
where Zk is the dose level of subject k. Note that the distribution of doses depended on the observed covariates
only through the propensity score component, e(x_k) = β^T x_k. So the maximum likelihood estimate β̂^T x_k was used in the matching to balance covariate distributions. If a dose-response relationship
existed, subjects with similar covariates but very different dose exposures were more likely to show
significant effects. Therefore, in matching with doses, the goal was not only to balance the observed
covariates, but also to produce pairs with very different doses. The following distance metric was
constructed to combine both covariates and dose information:
$$\Delta(x_i, x_j) = \frac{(\hat{\beta}^T x_i - \hat{\beta}^T x_j)^2 + \epsilon}{(Z_i - Z_j)^2},$$
where ε > 0 was a sufficiently small positive number that would not affect the optimal matching result. It served two purposes: first, it ensured that the distance between two subjects with the same dose equaled ∞, regardless of covariate values; second, for subjects with identical covariates, the distance became smaller as the dose discrepancy increased.
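A minimal R sketch of this construction is given below, using simulated data purely for illustration (the variable names and sample size are ours, not from the original study); the proportional-odds model is fit with MASS::polr, and dividing by a squared dose difference of zero automatically yields the required infinite distance for same-dose pairs.

# Simulated toy data (for illustration only): covariates and a 5-level dose.
library(MASS)
set.seed(2)
n  <- 200
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)
z  <- as.integer(cut(x1 + x2 + rlogis(n), breaks = 5))

# Fit the ordinal logit (proportional odds) model; extract beta-hat^T x.
fit   <- polr(factor(z, ordered = TRUE) ~ x1 + x2, method = "logistic")
score <- drop(cbind(x1, x2) %*% coef(fit))

# Distance of Lu et al. [10]: a small eps makes same-dose distances
# infinite and rewards large dose gaps among subjects with similar scores.
eps <- 1e-6
num <- outer(score, score, function(a, b) (a - b)^2) + eps
den <- outer(z, z, function(a, b) (a - b)^2)
delta <- num / den      # division by zero yields Inf for equal doses
diag(delta) <- Inf      # a subject cannot be matched to itself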
Since there are multiple dose groups, the conventional bipartite matching algorithm does not apply, and optimal nonbipartite matching is used to create pairs with similar covariates but different dose levels. One technical issue is that, when the sample size is odd, a sink needs to be added to make the total size even and thus permit a perfect matching; the sink has zero distance to all 521 original subjects and infinite distance to itself. Following the procedures described in Lu et al. [10], 260 teen pairs were formed, and one subject was discarded because he was matched to the sink. All 22 covariates were balanced after matching, as no significant differences were observed between the high- and low-dose groups. Table 2 of the original paper (refer to the original paper for its content) presents the number of pairs for each high-low dose combination; e.g., there are 14 pairs in which the high-dose teen had a dose level of 5 and the low-dose teen a dose level of 1. About 40% of the pairs had a dose difference of one, and another 40% a dose difference of two.
Lu and colleagues analyzed the pilot data based on 251 pairs (those with complete information on outcomes) to illustrate the methodology, while acknowledging that these data might be too limited in scope and too unrepresentative of the whole sample. There were four questions about intention for drug use over the next year, two about marijuana and two about inhalants. All four questions used the same four-point scale, (1) I definitely will not, (2) I probably will not, (3) I probably will, and (4) I definitely will, coded 1–4 to indicate an increasing likelihood of drug use. For each question, they calculated the difference in response between the two subjects in each matched pair. If the high-dose teen reported "I definitely will not" (1) and the low-dose teen reported "I definitely will" (4), then the difference would be 1 − 4 = −3. Generally, negative values signified greater
intentions to use drugs by the low-dose teen in a pair, and positive values signified the opposite. The Wilcoxon signed rank test was applied to each of the four questions, and none showed significant results. Therefore, there was no evidence that dose level was associated with stated intention to use drugs. They also considered a strengthened test combining the results from the four questions by adding the corresponding signed rank statistics [15]; this combined test also showed no significance.
12.3.2 The cross-match test
Nonbipartite matching also underlies the cross-match test of Rosenbaum [16], an exact, distribution-free test comparing two multivariate distributions. All N = 2I subjects are pooled and paired into I non-overlapping pairs by an optimal nonbipartite matching on a covariate distance, and the statistic A1 counts the pairs containing one subject from each group. Under the null hypothesis that the two groups are drawn from the same distribution, the exact distribution of A1 is
$$\Pr(A_1 = a_1) = \frac{2^{a_1}\, I!}{\binom{N}{n}\, a_0!\, a_1!\, a_2!},$$
where ak ≥ 0 is the number of pairs of type Ak (a2 pairs contain two treated subjects, a0 pairs two controls, and a1 pairs one of each), n is the number of treated subjects, and a1 + 2a2 = n. Rosenbaum also derived a normal approximation to the exact test and compared it with the well-known Kolmogorov–Smirnov (KS) test in a simulation study, which showed that both tests had similar type-I error but that the cross-match test had higher power. The cross-match test was shown to be consistent for comparing any two discrete distributions with finitely many mass points, and it can be extended to continuous distributions that are well approximated by discrete distributions with finitely many mass points.
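The exact null distribution is easy to tabulate. The sketch below implements the probability formula in R and verifies that it sums to one for a small hypothetical example with N = 16 subjects, n = 8 of them treated; the function name and example sizes are ours.

# Exact null probability of a1 cross-matched pairs among I = N/2 pairs,
# with n treated subjects, a2 = (n - a1)/2 treated-treated pairs, and
# a0 = I - a1 - a2 control-control pairs.
crossmatch_prob <- function(a1, n, N) {
  I <- N / 2
  if (a1 < 0 || a1 > n || (n - a1) %% 2 != 0) return(0)
  a2 <- (n - a1) / 2
  a0 <- I - a1 - a2
  if (a0 < 0) return(0)
  2^a1 * factorial(I) /
    (choose(N, n) * factorial(a0) * factorial(a1) * factorial(a2))
}

# The probabilities over all feasible a1 sum to one:
sum(sapply(0:8, crossmatch_prob, n = 8, N = 16))  # 1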
12.3.3 Strengthening instrumental variable analyses, near-far matching
Even when treatment is not randomized, there may still exist some essentially random influences on treatment acceptance. This random component of treatment assignment can be extracted to obtain unbiased estimates of causal effects. An instrument is weak if such random influences barely affect treatment assignment, and strong if they are critical in determining the treatment. Weak instruments are more susceptible to bias when the assumptions of the IV analysis are violated [17].
Baiocchi et al. addressed a causal question regarding the regionalization of intensive care for premature infants using an IV approach. Hospitals varied in their ability to care for premature infants, so regionalization of care suggested that high-risk mothers deliver at hospitals with greater capabilities [18]. Under such a system, within a region, mothers would be sorted into hospitals of varied capability based on the risks faced by the newborn, rather than by affiliation or proximity. In practice, it was not clear whether delivering high-risk infants at more capable hospitals actually reduced mortality, as sorting by risk might be too inaccurate to affect health. Since a randomized study was deemed infeasible, Baiocchi and colleagues exploited the IV approach by considering proximity as the instrument. A high-risk mother was more likely to deliver at a high-level hospital if it was close to home, and the travel time to a high-level hospital was likely to affect the outcome only through the way it might alter whether the mother delivered at that hospital. If this was true, proximity would be an instrument for care at high-level hospitals. The mother's risk, however, might be related to geography, largely through socioeconomic factors that vary with geography, which would violate the assumptions of the IV approach. To tackle this challenge, the authors adopted a nonbipartite matching design to build a stronger instrument that was more robust to potential violations of the IV assumptions.
Baiocchi and colleagues used a Pennsylvania dataset consisting of nearly 200,000 premature births in the years 1995–2005. Hospitals were classified as high-level or low-level. Travel time was calculated from the centroid of the mother's zip code to the closest low-level and high-level hospitals. The degree of encouragement to deliver at a low-level hospital was the difference in travel times, high minus low, termed the excess travel time; distance strongly encouraged the mother to deliver at a low-level hospital if the excess travel time was positive and large. To pair similar mothers while making the within-pair difference in excess travel time as large as possible, a discrepancy metric (the term "discrepancy" avoids confusion with geographical distance to hospitals) was defined with two key components: the first measured the discrepancies between mothers' characteristics, and the second involved excess travel time. Constructing the discrepancy matrix took several steps, as the authors tried to balance many covariates, including year of birth, gestational age, socioeconomic status, number of congenital disorders, etc. (a detailed description can be found in their paper). To explore the potential impact of weak versus strong instruments, they considered two levels of excess travel time discrepancy: a substantial penalty was added to the discrepancy between any pair of babies whose excess travel times differed in absolute value by at most Λ, with Λ = 0 in the first match and Λ = 25 minutes in the second match.
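A minimal sketch of this penalty construction is given below; the function name and default penalty are illustrative assumptions, and covdist stands in for whatever multi-step covariate discrepancy matrix the investigator has built.

# Near-far discrepancy: add a large penalty to any pair whose excess
# travel times differ in absolute value by at most Lambda, discouraging
# pairs that are "near" on the instrument.
near_far <- function(covdist, excess, Lambda, penalty = 1000 * max(covdist)) {
  gap <- abs(outer(excess, excess, "-"))
  d <- covdist + penalty * (gap <= Lambda)
  diag(d) <- Inf  # no unit may be paired with itself
  d
}

With Lambda = 25, only pairs whose excess travel times are more than 25 minutes apart avoid the penalty; with Lambda = 0, only pairs with unequal excess travel times do.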
When Λ = 0, any pair of babies could be matched as long as their excess travel times were not the same. When Λ = 25, there was a more stringent requirement: only pairs whose excess travel times differed by 25 minutes or more could be matched. The first match was likely to yield a weaker instrument with a much larger sample size, since it imposed a minimal requirement on the instrument; the second match would yield a stronger instrument with a much smaller sample size, as the two babies in any pair needed to have excess travel times at least 25 minutes apart. With excess travel time being the "treatment," there was no clear cut-point defining two treatment groups. Instead, babies had various doses of "treatment," so optimal nonbipartite matching was used to form pairs, with sinks used to take away unmatchable babies in the second match. The post-matching balance check revealed that the first match (99,174 pairs of babies) produced pairs with very good balance but a small mean difference in excess travel time, implying weak encouragement to deliver at low-level hospitals. The second match showed a bigger mean difference in excess travel time, but its covariate balance was not acceptable. Therefore, they considered a third match, called Half-25, in which half of the babies were required to have a difference in excess travel time of at least 25 minutes. The Half-25 match (49,587 pairs of babies) turned out to work well, with very good covariate balance and a much bigger mean difference in excess travel time, indicating a stronger instrument. Table 2 of the original paper (refer to the original paper for its content) showed the magnitude of encouragement, use of low-level hospitals, and mortality in the two matched comparisons. The stronger instrument match produced more encouragement, as its travel time difference was bigger than in the weaker instrument match. The encouragement seemed to work, in the sense that more mothers delivered at the low-level hospital in the far group, where the excess travel time to a high-level hospital was large; but it did not seem to produce much effect on infant mortality.
As discussed by Angrist et al., a meaningful causal estimand is the effect ratio, the ratio of two average treatment effects: the effect of distance on mortality over the effect of distance on where the mother delivered [19]. It is interpreted as the average increase in mortality caused by delivering at a low-level hospital among mothers who would deliver at a low-level hospital if and only if no high-level hospital was close by. Baiocchi and colleagues developed an inferential procedure for this effect ratio in matched observational studies and applied it to the perinatal care study. The point estimates were quite similar under the weaker and stronger instruments (0.0092 vs. 0.009), and both were significant; but the 95% confidence interval under the stronger instrument was about half the length of that under the weaker instrument, making the estimate more robust. Overall, the evidence suggested a nearly one percent increase in mortality due to the use of low-level hospitals, provided that there were no unmeasured confounders and the IV assumptions held.
12.4 Discussion
Nonbipartite matching, also known as matching without groups [21], extends conventional matching between two groups to a more flexible structure, and hence provides many additional options for matched designs in both observational studies and experiments. In randomized studies in which all subjects are to be randomized at the same time, nonbipartite matching can be used to match before randomization to improve covariate balance, which results in efficiency gains in the analysis [20]. With observational data, nonbipartite matching can form pairs between multiple ordinal dose groups, multiple unordered groups, or within a dataset without specific grouping [22, 23]. Such a design can also be used in longitudinal data, when the treatment assignment may depend on time-varying covariates [24]; this is referred to as risk-set matching, which is covered in detail in Chapter 9.
A technical note is that nonbipartite matching still forms matched pairs, even when there are multiple treatment groups. To make a direct comparison across multiple groups, polygon-shaped matched sets with one subject from each treatment group may be more desirable. This is referred to as poly-matching in the literature [25], where finding the optimal solution is an NP-hard problem; approximately optimal solutions based on either bipartite or nonbipartite matching algorithms are available [26–28].
12.5 Acknowledgement
This work was partially supported by grant DMS-2015552 from the National Science Foundation and, in part, by the National Center for Advancing Translational Sciences of the National Institutes of Health under Grant Number UL1TR002733. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation or the National Institutes of Health.
References
[1] W.G. Cochran and S.P. Chambers. The planning of observational studies of human populations. Journal of the Royal Statistical Society, Series A, 128:234–255, 1965.
[2] R.A. Fisher. The Design of Experiments. Edinburgh: Oliver & Boyd, 1935.
[3] B.B. Hansen and S.O. Klopfer. Optimal full matching and related designs via network flows.
Journal of Computational and Graphical Statistics, 15:609–627, 2006.
[4] D. Ho, K. Imai, G. King, and E.A. Stuart. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15:199–236, 2007.
[5] E.A. Stuart and D.B. Rubin. Best practices in quasi–experimental designs: matching methods
for causal inference. In Osborne, J. (Ed.), Best Practices in Quantitative Methods, 155–176,
SAGE Publications, Inc., 2008.
[6] R.A. Greevy Jr, C.G. Grijalva, C.L. Roumie, C. Beck, A.M. Hung, H.J. Murff, X. Liu, and M.R. Griffin. Reweighted Mahalanobis distance matching for cluster-randomized trials with missing data. Pharmacoepidemiology and Drug Safety, 21(S2):148–154, 2012.
[7] P.R. Rosenbaum and D.B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55, 1983.
[8] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity.
Englewood Cliffs: Prentice Hall, 1982.
[9] K. Ming and P.R. Rosenbaum. A note on optimal matching with variable controls using the
assignment algorithm. Journal of Computational and Graphical Statistics, 10:455–463, 2001.
[10] B. Lu, E. Zanutto, R. Hornik, and P.R. Rosenbaum. Matching with doses in an observational
study of a media campaign against drug abuse. Journal of the American Statistical Association,
96(456):1245–1253, December 2001.
[11] B. Lu, R. Greevy, X. Xu, and C. Beck. Optimal nonbipartite matching and its statistical
applications. The American Statistician, 65:21–30, 2011.
[12] U. Derigs. Solving non-bipartite matching problems via shortest path techniques. Annals of Operations Research, 13:225–261, 1988.
[13] W. Cook and A. Rohe. Computing minimum-weight perfect matchings. INFORMS Journal on
Computing, 11:138–148, 1999.
[14] M. Baiocchi, D. Small, L. Yang, D. Polsky, and P. Groeneveld. Near/far matching: a study design
approach to instrumental variables. Health Services and Outcomes Research Methodology,
12:237–253, 2012.
[15] P.R. Rosenbaum. Signed rank statistics for coherent predictions. Biometrics, 53:556–566, 1997.
[16] P.R. Rosenbaum. An exact distribution-free test comparing two multivariate distributions based
on adjacency. Journal of the Royal Statistical Society, Series B, 67:515–530, 2005.
[17] J. Bound, D.A. Jaeger, and R.M. Baker. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90:443–450, 1995.
[18] M. Baiocchi, D. Small, S. Lorch, and P.R. Rosenbaum. Building a stronger instrument in an
observational study of perinatal care for premature infants. Journal of the American Statistical
Association, 105:1285–1296, 2010.
[19] J.D. Angrist, G.W. Imbens, and D.B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91:444–455, 1996.
[20] R. Greevy, B. Lu, J. Silber, and P.R. Rosenbaum. Optimal multivariate matching before
randomization: An algorithm and a case-study. Biostatistics, 5:263–275, 2004.
[21] P.R. Rosenbaum. Design of Observational Studies. New York: Springer, 2010.
[22] B. Lu and P.R. Rosenbaum. Optimal pair matching with two control groups. Journal of
Computational and Graphical Statistics, 13:422–434, 2004.
[23] I. Obsuth, A. L. Murray, T. Malti, P. Sulger, D. Ribeaud, and M. Eisner. A non-bipartite
propensity score analysis of the effects of teacher–student relationships on adolescent problem
and prosocial behavior. Journal of Youth and Adolescence, 46:1661–1687, 2017.
[24] B. Lu. Propensity score matching with time-dependent covariates. Biometrics, 61:721–728,
2005.
[25] G. Nattino, C. Song, and B. Lu. Poly-matching algorithm in observational studies with multiple
treatment groups. Computational Statistics & Data Analysis, 167:107364, 2022.
[26] B. Karmakar, D. Small, and P.R. Rosenbaum. Using approximation algorithms to build evidence
factors and related designs for observational studies. Journal of Computational and Graphical
Statistics, 28:698–709, 2019.
[27] M. Bennett, J.P. Vielma, and J.R. Zubizarreta. Building representative matched samples with multi-valued treatments in large observational studies. Journal of Computational and Graphical Statistics, 29:744–757, 2020.
[28] G. Nattino, B. Lu, J. Shi, S. Lemeshow, and H. Xiang. Triplet matching for estimating causal
effects with three treatment arms: A comparative study of mortality by trauma center level.
Journal of the American Statistical Association, 116:44–53, 2021.
13
Matching Methods for Large Observational Studies
Ruoqi Yu
CONTENTS
13.1 Introduction: Challenges and Opportunities for Large Datasets . . . . . . . . . . . . . . . . . . . . . 239
13.2 Optimal Pair Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
13.2.1 Challenges of working with dense graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
13.2.2 A practical solution with unattractive limitations: Dividing the large
population into smaller groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
13.2.3 A new approach: sparsifying the network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
13.2.3.1 Optimal caliper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
13.2.3.2 Optimal number of nearest neighbors inside a caliper . . . . . . . . . . 244
13.2.3.3 Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
13.3 Incorporating Other Matching Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
13.3.1 Exact matching for a nominal covariate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
13.3.2 Fine balance and related techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
13.3.3 Directional and non-directional penalties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
13.3.3.1 A simple illustration: Symmetric and asymmetric calipers for
matching on a single Normal covariate . . . . . . . . . . . . . . . . . . . . . . . . 248
13.3.3.2 Optimal matching with asymmetric adjustments to distances . . . 250
13.4 Other Matching Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
13.4.1 Extension to multiple controls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
13.4.2 Full matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
13.4.2.1 Sparsifying the network for full matching . . . . . . . . . . . . . . . . . . . . . 252
13.4.2.2 Generalized full matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
13.4.3 Matching without pairing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
13.4.4 Coarsened exact matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
13.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
13.6 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
matched pairs of two children, comparing the surgical outcomes of children on Medicaid and other
health insurance.
Increasing sample sizes have posed tremendous challenges to matching in observational studies in recent decades. Commonly used methods to construct matched observational studies include propensity scores [3, 4], prognostic scores, fine balance and related techniques [4–6], minimum cost flow algorithms that minimize the total distance within matched sets [7, 8], and mixed-integer programs that directly target covariate balance [9]. These techniques perform satisfactorily for small or moderate data sets with thousands of people but confront computational difficulties for larger data sets consisting of millions of individuals. How can we build matched samples for large-scale observational studies efficiently without detracting from their appealing balance properties?
A simple and common practice divides a large sample with millions of people into smaller groups by matching exactly on a few discrete or rounded covariates; each subgroup, consisting of thousands of individuals, is then matched separately. This approach is reasonable and practical, but it has some unattractive aspects. First, exact matching in groups gives overriding importance to the covariates that define the groups, which may have no scientific basis; owing to this hard prioritization, other covariates of equal importance may not have the chance to achieve adequate balance. In addition, cross-group matches can sometimes be closer than within-group matches: categorizing continuous covariates to form exact-match groups forbids close matches that cross the category boundaries, while risking larger gaps inside categories. Furthermore, without carefully dividing the whole sample, some groups may still be too large to match, whereas others may need to be merged before they are large enough for matching. As a result, creating groups of practical size can depend subjectively on the investigator who constructs the match, leading to analyses that are hard to reproduce. More importantly, matching everyone at once brings substantial statistical advantages: other matching techniques can be applied on a larger scale, improving their effectiveness. For instance, a matching technique called "fine balance" tries to make groups comparable by counter-balancing, without constraining individuals in each matched set to have the same covariate values [5]. Splitting people into groups unnecessarily limits the performance of fine balance, as the more people considered for matching simultaneously, the more opportunities fine balance has to work well [12]. Can we overcome these limitations?
This chapter is organized as follows. Section 13.2 focuses on size-scalable matching techniques for the common case of optimal pair matching. Section 13.3 is devoted to incorporating other matching techniques when the sample size is large. In Section 13.4, we briefly discuss several other matching designs: matching with multiple controls, full matching, matching without pairing, and coarsened exact matching.
[Figure 13.1 appears here: five panels, (i) Dense Graph, (ii) Cut Graph, (iii) Infeasible Caliper, (iv) Optimal Caliper, and (v) Caliper+Constant, each showing treated and control nodes plotted against the propensity score (vertical axis, 0.0–0.7).]
FIGURE 13.1
Five bipartite graphs with different numbers of edges, where the vertical axis is the propensity score.
Graph (i) is a dense graph containing all possible pairings. Graph (ii) cuts the graph into four parts
based on propensity score quartiles. Graph (iii) has a caliper that is too small, so pair matching is not
feasible. Graph (iv) uses the smallest caliper that permits a feasible pair matching. Graph (v) has the
smallest caliper and the smallest upper bound on the number of closest candidate controls for treated
units.
When the treatment W is binary, the main part of the network is a bipartite graph consisting of two sets of nodes corresponding to the treated and control groups. Each edge connects nodes in different sets and represents a binary decision variable: should this treated unit be matched to this connected control unit or to someone else? Each edge has an attached cost, a distance measuring the similarity between the treated-control pair in terms of observed covariates; the choice of distance may involve the propensity score, a Mahalanobis distance or a variant of it, and other considerations. Traditional optimal matching considers all possible candidate treated-control pairs and therefore deals with a dense graph with edges connecting each treated node to all control nodes. As a result, it takes O(C³) steps to find an optimal matching [10], which can be computationally intensive for large observational studies.
More concretely, consider the toy example in Figure 13.1, setting aside for the moment the incorporation of other matching techniques. The example uses a random sample of the public data from the 2005–2006 National Health and Nutrition Examination Survey (NHANES), consisting of 10 daily smokers as the treated group and 20 nonsmokers as potential controls. These data can be obtained from the R package bigmatch with the command "set.seed(79); nhs<-nh0506[sample(1:(dim(nh0506)[1]),30),]", using R version 4.1.0. For a much larger real application using data from US Medicaid, see a recent paper by Yu et al. [11].
In Figure 13.1, nodes represent the 30 people in the toy example, and edges represent candidate matches. Each panel of Figure 13.1 is a so-called bipartite graph: nodes are divided into two parts, for T = 10 treated and C = 20 controls, and each edge connects a treated node to a control node. Figure 13.1(i) contains all 200 = 10 × 20 possible edges, so it is a complete, dense bipartite graph. In other words, the edge set B is the direct product, B = T × C, so |B| = T × C. In the simplest case of pair matching, we would like to pick 10 edges in Figure 13.1(i) that do not share a node. The problem is not trivial because two treated nodes may want the same potential control, so pairing each treated node to its closest control may not produce a feasible match. Rather, Figure 13.1(i) entails an optimization problem with 200 = 10 × 20 binary decision variables subject to constraints requiring 10 non-overlapping treated-control pairs. Following this idea, optimal matching with 1000 treated units and 2000 potential controls utilizes a version of Figure 13.1(i) containing 2 × 10⁶ = 1000 × 2000 edges, a manageable size in practice. But for even a small administrative database with 30,000 treated units and 60,000 controls, Figure 13.1(i) would have 1.8 × 10⁹ = 30,000 × 60,000 edges, and optimal matching using Figure 13.1(i) would not be practical with current computational tools.
It is easy to see that many of the 200 = 10 × 20 possible pairings are of poor quality and therefore do not deserve serious consideration. For instance, we do not want to match the upper-left treated node to the lower-right control node, because their propensity scores are far apart. Motivated by this observation, one way to make matching methods size-scalable is to reformulate the optimal matching problem of Figure 13.1(i) as a different problem that is still reasonable but can be solved more efficiently. How can we accomplish this?
techniques. Consider fine balance of the indicator for being Hispanic as an example. As shown in Table 13.1, this toy dataset has two Hispanic and eight non-Hispanic units in the treated group, plus six Hispanic and 14 non-Hispanic units in the control group. Therefore, it is possible to select two Hispanic and eight non-Hispanic units from the 20 potential controls to finely balance the marginal distribution of the Hispanic indicator. However, in the group with propensity score in (0.080, 0.161], the only treated unit is Hispanic, whereas all six controls are not. As a result, fine balance of the Hispanic indicator is impossible within this quartile. Splitting into groups unnecessarily limits what fine balance can do [12].
Figure 13.1(iv) is more attractive than Figure 13.1(ii) because, unlike Figure 13.1(ii), there are
no boundaries in Figure 13.1(iv) to prevent us from matching similar individuals. By applying an
optimal caliper in a large dense bipartite graph, Figure 13.1(iv) removes the maximum number of
edges that a caliper of the form ±w can possibly remove without creating infeasibility issues. We
then use this new, sparser graph to minimize the total distance within matched pairs, constrained
by the optimal caliper plus additional matching constraints. This revised problem can be solved
more efficiently than the original optimal matching problem due to the reduced number of decision
variables.
Suppose each individual has a real-valued score, ρ : T ∪ C → R. How can we find an accurate estimate of the optimal caliper for ρ(·)? Yu et al. [11] proposed a technique that quickly finds a short interval containing the optimal caliper through iterative use of a variant of Glover's algorithm [14] for matching in a convex bipartite graph. In a doubly convex graph, defined below, Lipski and Preparata [15] used a double-ended queue to implement Glover's algorithm with complexity O(C), which runs much faster than minimum distance matching in either a dense or a sparse graph. Specifically, a bipartite graph (T ∪ C, B) with nodes T ∪ C and edges B ⊆ T × C is convex if it is possible to order the nodes of T ∪ C so that, for each treated node τ ∈ T, the control nodes γ ∈ C connected to τ are consecutive. A convex bipartite graph (T ∪ C, B) is doubly convex if it is also convex with the roles of T and C reversed.
To find the optimal caliper, first sort the nodes of T and C separately by the score ρ(·). Then any caliper κ > 0 generates a doubly convex graph (T ∪ C, B), with the edge set B including an edge (τ, γ) if and only if |ρ(τ) − ρ(γ)| ≤ κ. To determine the feasibility of a caliper κ > 0, we apply Glover's algorithm in the corresponding doubly convex graph (T ∪ C, B) to decide whether pair matching is feasible in B. Although Glover's algorithm can provide more information, this is all we need. More importantly, the time needed to find the optimal caliper is negligible compared with solving the minimum cost flow problem, since both sorting and Glover's algorithm are much faster than finding an optimal match. We determine the optimal caliper κ by binary search, with initial choices κmin = 0 and κmax = maxι∈T ∪C ρ(ι) − minι∈T ∪C ρ(ι), and a tolerance ε > 0. Let glover(κ) = 1 if pair matching is feasible in the bipartite graph (T ∪ C, B) with caliper κ on the score ρ(·); otherwise, let glover(κ) = 0. Then the optimal caliper can be estimated with error at most ε using the following algorithm:
1. If glover(κmin) = 1, stop; pair matching is feasible with the smallest caliper κ = κmin = 0.
2. If glover(κmax) = 0, stop; pair matching is infeasible for every κ.
3. If κmax − κmin < ε, stop; caliper κmax is feasible and within ε of the optimal caliper.
4. Define κ = (κmax + κmin)/2. If glover(κ) = 1, set κmax ← κ; if glover(κ) = 0, set κmin ← κ. Go to step 3.
For the toy example in Figure 13.1, a binary search gives [κmin, κmax] = [0.0763, 0.0764]. That is, pair matching is infeasible with caliper 0.0763 in Figure 13.1(iii) but feasible with caliper 0.0764 in Figure 13.1(iv). If ρ(·) is the propensity score, then κmax − κmin ≤ 1 initially, and the interval [κmin, κmax] has length at most 2^{-I} after I iterations of steps 3 and 4. For instance, after I = 10 iterations, the length of the interval [κmin, κmax] is no more than 2^{-10} ≈ 0.000977 < 0.001.
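The sketch below illustrates the idea in R under a simplifying assumption: with a single scalar score, each treated unit's eligible controls form an interval of the sorted controls, so a greedy pass that always takes the smallest eligible control decides feasibility, standing in for Glover's algorithm. The function names are ours, not those of the bigmatch package.

# Feasibility of pair matching with caliper kappa on a scalar score:
# process treated units in increasing score order and give each the
# smallest unmatched control within kappa (greedy works here because
# the candidate sets are sorted intervals of equal width).
feasible <- function(treated, controls, kappa) {
  treated <- sort(treated); controls <- sort(controls)
  used <- rep(FALSE, length(controls))
  for (t in treated) {
    ok <- which(!used & abs(controls - t) <= kappa)
    if (length(ok) == 0) return(FALSE)
    used[ok[1]] <- TRUE
  }
  TRUE
}

# Binary search for the optimal caliper, to tolerance tol.
optimal_caliper <- function(treated, controls, tol = 1e-4) {
  lo <- 0
  hi <- max(c(treated, controls)) - min(c(treated, controls))
  if (feasible(treated, controls, lo)) return(lo)
  if (!feasible(treated, controls, hi)) return(Inf)  # infeasible for all kappa
  while (hi - lo >= tol) {
    mid <- (lo + hi) / 2
    if (feasible(treated, controls, mid)) hi <- mid else lo <- mid
  }
  hi  # feasible and within tol of the optimal caliper
}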
Suppose that we would like to retain at most the ν nearest neighbors of each treated unit in Figure 13.1(iv). Formally speaking, for each treated subject τ ∈ T, sort the values |ρ(τ) − ρ(γ)| into increasing order, and define oν(τ) to be the νth of the C sorted values. We then define the bipartite graph (T ∪ C, B) to include edge (τ, γ) in B if and only if |ρ(τ) − ρ(γ)| ≤ min{κ, oν(τ)}. How small can ν be while pair matching within the optimal caliper of Figure 13.1(iv) remains feasible? It is clear from Figure 13.1(iii) that ν = 2 is too small, because the four treated units with propensity scores in the range (0.3, 0.4) are competing for three controls. A pair match is possible in Figure 13.1(v), indicating that ν = 3 is feasible. With the optimal caliper κ estimated as above, we can determine the minimum feasible value of ν by a second iterative use of Glover's algorithm: the first application determines the optimal caliper, and the second determines the smallest feasible ν among subgraphs of the optimal caliper graph. By combining these two techniques, we can significantly reduce the number of edges in the bipartite graph. As demonstrated in Figure 13.1(v), some treated subjects are connected to fewer than ν = 3 controls; for instance, the treated subject with the largest propensity score has only one neighbor, not ν = 3, because the caliper has sensibly eliminated distant controls as neighbors. As a result, Figure 13.1(v) has 25 edges. In a more realistic example with 30,000 treated units and 60,000 potential controls, the dense graph would have 1.8 × 10⁹ = 30,000 × 60,000 edges or decision variables; with ν = 100, there would be 3 × 10⁶ = 30,000 × 100 edges or decision variables, comparable to a dense bipartite graph with roughly one twentieth as many people.
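Continuing the sketch above, the hypothetical helper below counts how many edges survive when a caliper κ is combined with at most ν nearest neighbors per treated unit, under the same scalar-score assumption.

# Number of edges kept when edge (t, c) survives only if
# |rho(t) - rho(c)| <= min(kappa, nu-th smallest distance from t).
edges_kept <- function(treated, controls, kappa, nu) {
  nu <- min(nu, length(controls))  # cannot ask for more neighbors than exist
  sum(sapply(treated, function(t) {
    dist <- abs(controls - t)
    sum(dist <= min(kappa, sort(dist)[nu]))
  }))
}

# E.g., compare edges_kept(treated, controls, kappa, nu = 3) with the
# dense count length(treated) * length(controls).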
Knowing the minimum feasible ν does not require investigators to use that minimal ν. Rather, it conveys the important message that matching within an optimal caliper will remain feasible if attention is restricted to at most ν nearest neighbors. An intermediate graph with a slightly larger ν offers more candidate pairs for matching, and hence may lead to a smaller Mahalanobis distance on covariates other than the propensity score, or to other desired properties such as covariate balance; thus considerations besides the propensity score caliper can still have a substantial influence on the final match. For instance, a feasible graph satisfying the optimal caliper with at most ν = 4 nearest neighbors is intermediate between Figure 13.1(iv) and Figure 13.1(v), with 5 more edges than Figure 13.1(v). If (T ∪ C, B) were complete with B = T × C, as in Figure 13.1(i), Theorem 9.13 of [10] gives a bound of O(C³) for optimal matching. In contrast, even with growing values of ν, we have a time bound of O(C² log C) provided ν = O(log C), which is much smaller than O(C³).
Second, in a graph like Figure 13.1(v), the bigmatch package offers a suite of standard techniques for optimal matching in observational studies, such as propensity score calipers, near-fine balance, minimizing a robust covariate distance, exact matching, and near-exact matching. For a review of these standard methods, see Part II of Rosenbaum's book [16]. Although there are various nonstandard implementations of these standard methods to avoid computing or storing information in Figure 13.1(i) that plays no role in Figure 13.1(v), the methods work in their usual way from the user's point of view. In this step, given a caliper and nearest-neighbor restriction at least as large as the minimums determined in step one, together with your other matching requirements, the nfmatch function computes the optimal match subject to your specifications.
The minimum feasible caliper and constraint on the number of nearest neighbors yield a sparse
graph and perhaps the fastest computation in step two. However, speed is one consideration among
others. Setting the caliper and nearest neighbor restriction to be higher than their minimum feasible
values gives nfmatch more options when searching for a close, balanced match, perhaps producing a
better match in terms of covariate balance. It is reasonable to construct and compare a few matched
samples and pick the most satisfactory one for the outcome analysis.
FIGURE 13.2
A bipartite graph matching exactly for gender, with near-fine balance of Hispanic or not. The optimal
caliper is now 0.2133, and the minimum number of nearest neighbors is ν = 3 with this caliper.
The duplicate edges connect γ to γ′, with capacity 1 to ensure that each control may be matched at most once. The solid grey edges retain feasibility through a penalized bypass, β, of the fine balance constraints.
The algorithm above finds a uniform caliper and nearest-neighbor constraint for all exact matching categories. A straightforward but valuable variation is to use a different caliper and number of nearest neighbors for each category, to reflect the varied nature of the exact matching groups, e.g., hospitals or disease types. This trick can remove more candidate pairs than a uniform choice, which amounts to the most conservative option that is feasible across all categories.
Another generalization deals with the case in which the nominal variable has a lexical order, e.g., the International Classification of Diseases (ICD-10) code for diagnosis or procedure, or the zip code. The hierarchical relationship helps control the degree of exact matching when it is infeasible to match exactly on the original variable. For instance, if exact matching on the 5-digit zip code is not feasible for a particular area, we could match on the first four digits or fewer in that area. On the other hand, for regions with more controls than treated subjects, we keep using the 5-digit zip code. Generally speaking, investigators need to construct several contingency tables to determine the degree of exact matching for each category (the optimal depth for matching exactly on a variable with lexical order) and then find the corresponding optimal caliper and constraint on the number of nearest neighbors.
Fine balance requires the marginal distribution of a nominal covariate (here, Hispanic or not) to be the same in the treated group and in the matched control group, without restricting which controls are matched to Hispanics or non-Hispanics. Fine balance is not always feasible. If there were fewer than two
Hispanics or eight non-Hispanics in the control group, fine balance would be infeasible. Let Υ be the
number of possible values of the fine balance nominal covariate; so Υ = 2 in this toy example. We
can determine whether fine balance is feasible in Figure 13.1(i) by constructing a 2 × Υ contingency
table recording the treatment by the nominal covariate. However, Figure 13.2 imposes additional
constraints – the exact matching for gender, the caliper on the propensity score, and the limit on
the number of nearest neighbors. Even if the control group did include two Hispanics and eight
non-Hispanics, fine balance might be infeasible with these additional constraints. Therefore, no
simple tabulation can determine the feasibility of fine balance with a reduced number of candidate
pairs.
Near-fine balance means the deviation from fine balance is as small as possible [6]. The formal
definition of near-fine balance requires the total absolute differences in frequencies between the treated
and control groups over the categories to be minimized. For instance, if fine balance for the indicator
of Hispanics is infeasible, the next minimal difference of 2 = |2 − 1| + |8 − 9| = |2 − 3| + |8 − 7|
can be achieved by including either one Hispanic and nine non-Hispanics or three Hispanics and
seven non-Hispanics in the matched control group. The structure on the right in Figure 13.2 imposes
a near-fine balance constraint for Hispanic or not as a soft constraint. Here, a soft constraint is
implemented by altering the objective function – the total covariate distance within matched pairs –
to penalize violations of fine balance through the grey edges. The minimum cost flow algorithm tries
to avoid penalized violations of fine balance, but tolerates the minimum number of violations needed
to ensure feasibility. Utilizing a soft constraint is important in large observational studies because the caliper and nearest-neighbor restrictions produce a reduced bipartite graph, in which a hard fine balance constraint may no longer be feasible.
Pimentel et al. [4] generalized the concept of near-fine balance, introducing a tree-structured
hierarchy of near-fine balance constraints, allowing the user to express a preference for certain kinds
of deviations from fine balance over other deviations. This method is handy when trying to balance
the joint distribution of several nominal variables with different priorities. The proposed idea of
refined balance can be applied in conjunction with the optimal caliper and restriction on nearest
neighbors described in §13.2 by adding multiple layers of near-fine balance nodes in the auxiliary
structure on the right in Figure 13.2.
13.3.3.1 A simple illustration: Symmetric and asymmetric calipers for matching on a single
Normal covariate
In 1973, Cochran and Rubin [12, §2.3] discussed an intuitive yet effective technique of caliper matching to ensure the covariate similarity of matched pairs. For instance, caliper matching for age with a non-negative caliper η ≥ 0 requires a treated subject to be matched to a control whose age differs by no more than η years from the treated subject. With η = 4, a 16-year-old treated unit can only be matched to a control who is 12 to 20 years old. Such a caliper of the form [−η, η] is called a symmetric caliper. However, if treated subjects are typically older than controls, within the symmetric caliper we are more likely to include matched pairs with an older treated unit and a younger control than the opposite. As a result, the residual imbalances after symmetric caliper matching are likely to remain tilted in the original direction. In this situation, an asymmetric caliper of the same length might be preferable, requiring xt − xc ∈ [−η1, η2] where η1 > η2 ≥ 0. Such a caliper of the form [−η1, η2] with η1 ≠ η2 is called an asymmetric caliper.
TABLE 13.2
Bias after caliper matching with a symmetric/asymmetric caliper xt − xc ∈ [−η1, η2] on a single Normally distributed covariate. The initial bias before matching is µ. The smallest absolute bias after caliper matching for each µ is shown in bold. This table extends Table 3 of [13] by considering more choices of caliper and initial bias.
To demonstrate the merits of asymmetric calipers, consider the simple case of matching two
individuals on one Normally distributed covariate, as discussed in [13]. Specifically, suppose the
treated subject has covariate xt ∼ N (µ, 1) and the control has covariate xc ∼ N (0, 1). Then the
bias before matching is E(xt − xc) = µ. How much bias can caliper matching reduce? Suppose that we first sample xt, then independently sample xc conditional on the caliper constraint xt − xc ∈ [−η1, η2], so that xc follows a standard Normal distribution truncated to [xt − η2, xt + η1]. We use numerical integration to calculate the bias after caliper matching, namely

E(xt − xc) = E_{xt} { xt − [φ(xt − η2) − φ(xt + η1)] / [Φ(xt + η1) − Φ(xt − η2)] },

where φ and Φ denote the standard Normal density and distribution functions, and the inner term is the mean of the truncated Normal distribution of xc given xt.
Table 13.2 is an extension of Table 3 of [13] and summarizes the bias after matching for various
choices of initial biases µ and calipers [−η1 , η2 ]. The symmetric caliper [−1, 1] works very well
when the initial bias is zero, achieving the smallest residual bias of almost zero. However, when the
initial bias is non-zero, the asymmetric caliper with the same length can achieve a much smaller bias
than the symmetric one. For instance, from the last column of Table 13.2, we can observe that with a
relatively large initial bias µ = 1, the symmetric caliper reduces the bias to 0.25, but the asymmetric
caliper [−1.3, 0.7] removes 98% of the initial bias and leaves a residual bias of 0.02.
Table 13.2 provides several insights. First, the best asymmetric caliper often reduces bias better than the symmetric caliper of the same length. In addition, the optimal degree of asymmetry among calipers of the same length depends on the direction and magnitude of the initial bias µ: more asymmetry is needed to offset a larger initial bias. Using a caliper with an inappropriate asymmetry may worsen the situation, increasing the bias rather than decreasing it.
With the merits of asymmetric calipers, a natural question arises: can we use an asymmetric caliper
to remove edges in the network and reduce the computational difficulties in large observational
studies? The good news is yes. But how should we proceed? A simple way is to find the optimal symmetric caliper and then extend the caliper in one direction to offset the initial bias. With an extended caliper covering the original optimal caliper, no feasibility issue occurs. For instance, going back to the toy example in Figure 13.1, the optimal symmetric caliper for the propensity score is [−0.0764, 0.0764]. Since the propensity score in the treated group is typically larger than that in the control group, it is natural to use an asymmetric caliper such as [−0.0764 × 2, 0.0764] to achieve a smaller bias in the propensity score. A more formal approach is to apply Glover's algorithm directly to find the optimal asymmetric caliper with a specified asymmetry. For instance, what is the best η > 0 among calipers of the form [−2η, η]? Glover's algorithm is still applicable in this case since a caliper produces a doubly convex bipartite graph regardless of its asymmetry. The investigators may
want to try several choices of asymmetry and select the one that gives the best covariate balance.
The smaller the average comparison d̄k is, the closer the matched treated and control groups are on covariate xk. How can we choose the comparison dktc between a treated subject τt and a matched control γc? In this section, we describe several easy-to-use measures discussed in [13].
The difference dktc = xk,τt − xk,γc is an intuitive and natural comparison of xk,τt and xk,γc. However, the corresponding d̄k – the average difference of xk between the matched treated and control groups – has a limitation: it permits large positive and negative differences to cancel each other and give a small average. To avoid large differences of both signs, a better choice of dktc might be the absolute difference |xk,τt − xk,γc|, which can be decomposed into two hockey-stick measures of the form

d′ktc = max(0, xk,τt − xk,γc) ≥ 0,   d″ktc = max(0, xk,γc − xk,τt) ≥ 0.

Then we have |xk,τt − xk,γc| = d′ktc + d″ktc. We can control the average absolute difference by focusing on the two averages d̄′k and d̄″k, the corresponding averages d̄k obtained by replacing dktc with d′ktc and d″ktc, respectively. More importantly, we can tilt against the direction of bias by adjusting the weights in their weighted sum λ1 d̄′k + λ2 d̄″k.
Another popular choice of dktc is sign(xk,τt − xk,γc), where

sign(u) = 1 if u > 0,   sign(u) = 0 if u = 0,   sign(u) = −1 if u < 0.

Then the corresponding d̄k = 0 if the median of xk,τt − xk,γc over all matched pairs is zero. In addition, we would prefer obtaining a zero difference to canceling out differences with opposite signs. To control the median difference with this preferred property, we apply a similar trick as before and decompose sign(xk,τt − xk,γc) into two parts,

d′ktc = 1{xk,τt − xk,γc > η2},   d″ktc = 1{xk,τt − xk,γc < −η1},

where −η1 ≤ 0 ≤ η2; taking η1 = η2 = 0 recovers the positive and negative parts of the sign. If all matched pairs satisfy the caliper constraint on xk, i.e., xk,τt − xk,γc ∈ [−η1, η2], the corresponding averages d̄′k and d̄″k are both 0.
One common concern about optimal matching is that the matched sample may include a few pairs with large distances δtc, since optimal matching minimizes the total distance ∑_{t=1}^{T} ∑_{c=1}^{C} atc δtc over all matched pairs without directly controlling the individual within-pair distances. To control the maximum distance in the matched set, Rosenbaum [18] discussed an efficient algorithm to find the smallest possible threshold ψ for the maximum paired distance that permits a feasible match. In practice, it is more straightforward and usually adequate to pick a moderate but somewhat arbitrary value for ψ. Suppose ψ > 0 is a moderate threshold for the within-pair distances. By choosing

dktc = 1 if δtc > ψ, and dktc = 0 otherwise,

controlling d̄k can reduce the number of distances exceeding ψ.
After choosing K comparisons dktc, k = 1, . . . , K, suppose that we would like to have d̄k ≤ εk, k = 1, . . . , K, for fixed εk ≥ 0. Recall from §13.2 that the original matching problem without these new balance constraints can be solved by a relatively efficient polynomial-time algorithm [10, Theorem 11.2]. In contrast, the new problem with the additional constraints d̄k ≤ εk is a general integer program and can be much more challenging to solve. Specifically, general integer programming is NP-complete, so no polynomial-time algorithm for it is known [19, Theorem 18.1].
Lagrangian relaxation, proposed by Fisher [17], is one approximation technique for solving the new constrained problem. It involves solving several easy problems of the original form with an adjusted objective function, so it is computationally tractable. In light of the Lagrangian relaxation of the linear imbalance summary requirements, Yu and Rosenbaum [13] proposed the following iterative method:
1. Choose a set of directional penalties λ1 ≥ 0, . . . , λK ≥ 0.
2. Define new distances δ*tc = δtc + ∑_{k=1}^{K} λk (dktc − εk).
3. Solve the conventional matching problem with the revised distances δ*tc.
Each iteration can be accomplished efficiently, in O(C^3) steps, if we consider all possible treated–control pairs. Combining these techniques with the methods discussed in §13.2 to remove some candidate pairs, we can solve the revised problem in O{C^2 log(C)} steps. We can repeat these three steps several times, adjusting the directional penalties (Lagrangian multipliers) λk at each iteration to enforce the new balance constraints d̄k ≤ εk, k = 1, . . . , K. A rule of thumb is to increase λk if the kth balance constraint is not satisfied and to decrease λk otherwise.
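The following R sketch illustrates this iteration, assuming a hypothetical solver solve_match(delta) that returns the optimal pairing, as a two-column matrix of (treated, control) indices, for a given T × C distance matrix; the multiplicative update of λk is only one simple instance of the rule of thumb:

# Directional-penalty iteration (a sketch). 'd' is a list of K matrices
# with d[[k]][t, c] = d_ktc; 'eps' holds the targets epsilon_k.
directional_penalty_match <- function(delta, d, eps, lambda,
                                      solve_match, n_iter = 10, step = 2) {
  for (iter in seq_len(n_iter)) {
    delta_star <- delta
    for (k in seq_along(d))                  # step 2: penalty-adjusted distances
      delta_star <- delta_star + lambda[k] * (d[[k]] - eps[k])
    pairs <- solve_match(delta_star)         # step 3: ordinary optimal matching
    for (k in seq_along(d)) {                # rule of thumb for the penalties
      dbar_k <- mean(d[[k]][pairs])          # average d_ktc over matched pairs
      lambda[k] <- if (dbar_k > eps[k]) lambda[k] * step else lambda[k] / step
    }
  }
  list(pairs = pairs, lambda = lambda)
}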
Using large penalties is common in many matching techniques to improve covariate balance.
Investigators can implement balance requirements as soft constraints to meet the desired conditions
as closely as possible. For instance, a large penalty can be applied to enforce a propensity score
caliper without generating infeasibility issues or to balance the marginal distribution of a nominal
variable with minimal deviations. Directional penalties, on the other hand, work against the direction of bias. An over-adjustment can reverse the direction of the bias without adequately reducing its magnitude. Therefore, investigators must tune directional penalties carefully to achieve satisfactory results. In
practice, a few adjustments to directional penalties can efficiently improve the quality of a matched
sample, as shown in the real data application in [13, §3].
FIGURE 13.3
Full matching for the toy example, shown in two panels; the vertical axes show the propensity score.
For unrestricted full matching, a caliper is feasible provided every treated unit is connected to at least one control node in the bipartite graph analogous to Figure 13.1. Thus, the optimal caliper for unrestricted full matching is the maximum over the treated group of the minimum distances to all potential controls, κf = max_{τt∈T} min_{γc∈C} dtc, as discussed in [24]. Notably, finding this caliper does not impose convexity on the graph structure or other conditions on the distances dtc, so the choice of dtc can be flexible, e.g., the Mahalanobis distance based on all observed covariates or a distance based on a single score. If calculating a distance takes constant time, finding the optimal caliper for full matching takes O(C^2) steps. In the situation where dtc = |ρ(τt) − ρ(γc)| for some real-valued score ρ(·), which is the same scenario considered by Yu et al. [11], Fredrickson et al. [24] proposed an efficient algorithm that determines the optimal caliper κf in O{C log(C)} operations.
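Computed directly, κf is a max–min over the two score vectors; a brief R sketch (names illustrative):

# Optimal caliper for unrestricted full matching on a single score:
# the largest distance from any treated unit to its nearest control.
kappa_full <- function(rho_treated, rho_control) {
  max(vapply(rho_treated,
             function(r) min(abs(r - rho_control)),
             numeric(1)))
}

Sorting both score vectors once and scanning them jointly would yield the faster O{C log(C)} computation mentioned above.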
Applying narrow calipers can reduce the computational burden for full matching in large observa-
tional studies but may create highly imbalanced structures including matched sets with many treated
subjects sharing the same control or matching one treated subject to many controls. To overcome the
drawbacks of highly skewed matching ratios described above, one option is to consider full matching
with symmetric restrictions: matching in ratios of h : 1 up to 1 : h, for some positive integer h. This
is equivalent to pair matching when h = 1 and unrestricted full matching when h = ∞. Note that the
optimal caliper for unrestricted full matching κf provides a lower bound on the optimal caliper for
any restricted match, such as pair matching or restricted full matching, but may not permit a feasible
restricted match. Therefore, to find the optimal caliper for restricted full matching, one can compute
κf and then check if the caliper is also feasible for full matching with the additional structural
constraints. If not, we need to increase the caliper size until a feasible restricted full matching exists.
As illustrated in Figure 13.3, with our toy dataset in §13.2, the optimal caliper for the propensity score with unrestricted full matching is 0.0579. In this example, this caliper is also feasible for restricted
full matching with ratios of 2 : 1 up to 1 : 2. On the other hand, the full matching caliper is too
narrow for pair matching, for which the smallest feasible caliper is 0.0764.
FIGURE 13.4
(i) A simplified version of graph G constructed in step 1 without the self-loops for each node. Solid
arrows are from treated units to controls, and dashed arrows are from controls to treated units. (ii)
The constructed full matching by the approximation algorithm. Gray lines define the matched sets.
Figure 13.4 demonstrates the approximation algorithm with the toy dataset used in the previous
sections. Figure 13.4(i) is a simplified version of graph G constructed in step 1 without the self-
loops for each node. In the second step, nine seeds are identified, which leads to 9 matched sets in
Figure 13.4(ii).
Instead, the objective function can target the discrepancy of distributions in the matched treated and
control groups, e.g., the Kolmogorov-Smirnov statistics of marginal distributions and the difference
in covariate means and quantiles. With an appropriately chosen objective function suitable for both
formulations, Bennett et al. [28] demonstrated the equivalency of these two formulations in terms
of the optimal objective value and strength of formulation (i.e., the gap between the optimal value
of the original integer program and its linear program relaxation). In particular, when there are two
covariates or nested covariates, the problem is equivalent to its linear program relaxation, hence can
be solved in polynomial time. This reduced formulation is computationally intractable in theory
without any special structures, but it usually works well in practice.
This new formulation is implemented in the designmatch package. After matching for balance,
the selected controls can be re-matched to reduce outcome heterogeneity and sensitivity to hidden
biases [29].
13.5 Summary
We have outlined several recently proposed methods for finding different matching designs with large
observational studies, illustrated them with a toy example of 30 individuals from NHANES data,
and mentioned the corresponding software. These methods improve the computational efficiency in
different ways. The first class of techniques focuses on excluding candidate matches that are far apart and do not deserve to be included in the matched data. Alternatively, matched samples can be constructed using approximation algorithms, or in two steps by first deciding whom to include in the match and then pairing within the selected samples. Another type of matching method directly uses
tabulation on the coarsened covariates.
13.6 Acknowledgments
Part of this chapter is adapted from the author’s joint paper with Paul Rosenbaum and Jeffrey Silber,
“Matching methods for observational studies derived from large administrative databases,” which
appeared with discussion in Statistical Science in 2020. The author is grateful to the Institute of Mathematical Statistics for its copyright policy, which permits authors to reuse material from their publications in IMS journals.
References
[1] Rachel R Kelz, Morgan M Sellers, Bijan A Niknam, James E Sharpe, Paul R Rosenbaum,
Alexander S Hill, Hong Zhou, Lauren L Hochman, Karl Y Bilimoria, Kamal Itani, et al. A
national comparison of operative outcomes of new and experienced surgeons. Annals of Surgery,
273(2):280–288, 2021.
[2] Jeffrey H Silber, Paul R Rosenbaum, Wei Wang, Shawna R Calhoun, Joseph G Reiter, Orit Even-
Shoshan, and William J Greeley. Practice style variation in Medicaid and non-Medicaid children with complex chronic conditions undergoing surgery. Annals of Surgery, 267(2):392–400,
2018.
[3] Paul R Rosenbaum and Donald B Rubin. Constructing a control group using multivariate
matched sampling methods that incorporate the propensity score. American Statistician,
39(1):33–38, 1985.
[4] Samuel D Pimentel, Rachel R Kelz, Jeffrey H Silber, and Paul R Rosenbaum. Large, sparse
optimal matching with refined covariate balance in an observational study of the health outcomes
produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527,
2015.
[5] Paul R Rosenbaum, Richard N Ross, and Jeffrey H Silber. Minimum distance matched sampling
with fine balance in an observational study of treatment for ovarian cancer. Journal of the
American Statistical Association, 102(477):75–83, 2007.
[6] Dan Yang, Dylan S Small, Jeffrey H Silber, and Paul R Rosenbaum. Optimal matching with
minimal deviation from fine balance in a study of obesity and surgical outcomes. Biometrics,
68(2):628–636, 2012.
[7] Ben B Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via
network flows. Journal of Computational and Graphical Statistics, 15(3):609–627, 2006.
[8] Paul R Rosenbaum. Optimal matching for observational studies. Journal of the American
Statistical Association, 84(408):1024–1032, 1989.
[9] José R Zubizarreta. Using mixed integer programming for matching in an observational study
of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–
1371, 2012.
[10] Bernhard H Korte and Jens Vygen. Combinatorial Optimization. New York: Springer, 2012.
[11] Ruoqi Yu, Jeffrey H Silber, and Paul R Rosenbaum. Matching methods for observational studies
derived from large administrative databases. Statistical Science, 35(3):338–355, 2020.
[12] William G Cochran and Donald B Rubin. Controlling bias in observational studies: A review.
Sankhyā: The Indian Journal of Statistics, Series A, 35(4):417–446, 1973.
[13] Ruoqi Yu and Paul R Rosenbaum. Directional penalties for optimal matching in observational
studies. Biometrics, 75(4):1380–1390, 2019.
[14] Fred Glover. Maximum matching in a convex bipartite graph. Naval Research Logistics
Quarterly, 14(3):313–316, 1967.
[15] Witold Lipski and Franco P Preparata. Efficient algorithms for finding maximum matchings in
convex bipartite graphs and related problems. Acta Informatica, 15(4):329–346, 1981.
[16] Paul R Rosenbaum. Design of Observational Studies. New York: Springer, 2010.
[17] Marshall L Fisher. The Lagrangian relaxation method for solving integer programming problems.
Management Science, 27(1):1–18, 1981.
[18] Paul R Rosenbaum. Imposing minimax and quantile constraints on optimal matching in
observational studies. Journal of Computational and Graphical Statistics, 26(1):66–78, 2017.
[19] Alexander Schrijver. Theory of Linear and Integer Programming. New York, John Wiley &
Sons, 1998.
[20] Frank B Yoon. New methods for the design and analysis of observational studies. PhD thesis,
University of Pennsylvania, 2009.
[21] Paul R Rosenbaum. A characterization of optimal designs for observational studies. Journal of
the Royal Statistical Society: Series B (Methodological), 53(3):597–610, 1991.
[22] Ben B Hansen. Full matching in an observational study of coaching for the SAT. Journal of the
American Statistical Association, 99(467):609–618, 2004.
[23] Xing Sam Gu and Paul R Rosenbaum. Comparison of multivariate matching methods: Struc-
tures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2(4):405–
420, 1993.
[24] Mark M Fredrickson, Josh Errickson, and Ben B Hansen. Comment: Matching methods
for observational studies derived from large administrative databases. Statistical Science,
35(3):361–366, 2020.
[25] Fredrik Sävje, Michael J Higgins, and Jasjeet S Sekhon. Generalized full matching. Political
Analysis, 29(4):423–447, 2021.
[26] Michael J Higgins, Fredrik Sävje, and Jasjeet S Sekhon. Improving massive experiments with
threshold blocking. Proceedings of the National Academy of Sciences, 113(27):7369–7376,
2016.
[27] Cinar Kilcioglu and José R Zubizarreta. Maximizing the information content of a balanced
matched sample in a study of the economic performance of green buildings. Annals of Applied
Statistics, 10(4):1997–2020, 2016.
[28] Magdalena Bennett, Juan Pablo Vielma, and José R Zubizarreta. Building representative
matched samples with multi-valued treatments in large observational studies. Journal of
Computational and Graphical Statistics, 29(4):744–757, 2020.
[29] José R Zubizarreta, Ricardo D Paredes, and Paul R Rosenbaum. Matching for balance, pairing
for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit
high schools in chile. Annals of Applied Statistics, 8(1):204–231, 2014.
[30] Stefano M Iacus, Gary King, and Giuseppe Porro. Causal inference without balance checking:
Coarsened exact matching. Political Analysis, 20(1):1–24, 2012.
[31] Stefano M Iacus, Gary King, and Giuseppe Porro. cem: Software for coarsened exact matching.
Journal of Statistical Software, 30(9):1–27, 2009.
[32] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational
studies for causal effects. Biometrika, 70(1):41–55, 1983.
[33] Ben B Hansen. The prognostic analogue of the propensity score. Biometrika, 95(2):481–488,
2008.
[34] Ruoqi Yu. How well can fine balance work for covariate balancing. Biometrics, https://doi.org/10.1111/biom.13771, 2022.
Part III
Weighting
14
Overlap Weighting
Fan Li
CONTENTS
14.1 Causal Estimands on a Target Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
14.2 Balancing Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
14.3 Overlap Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
14.4 Extensions of Overlap Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
14.4.1 Multiple treatments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
14.4.2 Time-to-event outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
14.4.3 Covariate adjustment in randomized experiments . . . . . . . . . . . . . . . . . . . . . . . . . 272
14.5 Implementation and Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
14.6 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
14.6.1 A simulated example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
14.6.2 The National Child Development Survey data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
14.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Weighting is one of the fundamental methods in causal inference. The main idea of weighting is
to re-weight each unit to create a weighted population, namely, the target population, where the
treatment and control groups are comparable in baseline characteristics. The dominant weighting
scheme in causal inference has been inverse probability weighting (IPW). IPW is a propensity score
weighting method and targets the average treatment effect (ATE) estimand. However, IPW has a
well-known limitation: it is subject to excessive variance due to extreme weights when covariates
are poorly overlapped between treatment groups. There has been an increasingly large literature on
new weighting methods in causal inference [1–4]. The majority of these methods target the ATE estimand and devise new ways to reduce the large variance in estimating the ATE when covariates are poorly overlapped. In this chapter, we describe the general framework of balancing weights, which
generalizes propensity score weighting beyond IPW. This framework allows analysts to flexibly
specify a target population first and then estimate the corresponding treatment effect. In particular,
we focus on a special case of balancing weights, the overlap weight (OW), which possesses desirable
theoretical and empirical properties, and scientifically meaningful interpretation [5–8].
Consider a study with N units; for unit i, let Zi denote the binary treatment indicator (Zi = 1 for treatment, Zi = 0 for control) and let Xi = (X1i, · · · , Xpi) be a vector of p observed pretreatment covariates. Assuming the Stable Unit Treatment Value Assumption (SUTVA) [9], for each unit i there is a pair of potential outcomes {Y1i, Y0i} mapped to the treatment and control status, of which only the one corresponding to the observed treatment is observed, denoted by Yi = Zi Y1i + (1 − Zi)Y0i; the other potential outcome is counterfactual. The propensity score is the probability of a unit being assigned to the treatment group given the covariates [10]: e(x) = Pr(Zi = 1 | Xi = x).
Causal effects are contrasts of potential outcomes of the same units. We define the conditional
average treatment effect (CATE) at covariate value x as
τ (x) = E(Y1i − Y0i | Xi = x) = µ1 (x) − µ0 (x), (14.1)
where µz (x) = E(Yzi | Xi = x) is the conditional expectation of the potential outcomes under
treatment level z (z = 0, 1) over a population. Here we focus on comparing potential outcomes
averaged over a distribution of the covariates, namely, a target population. One can formulate a causal
estimand on any pre-specified target population as follows. Assume the observed sample is drawn
from a population with probability density of covariates f (x). Let g(x) denote the covariate density
of the target population, which may be different from f (x). We call the ratio h(x) = g(x)/f (x)
the tilting function, which re-weights the distribution of the observed sample to represent the target
population. Then we can represent the average treatment effect on the target population g by a weighted average treatment effect (WATE):

τ^h = Eg(Y1i − Y0i) = E[h(x){µ1(x) − µ0(x)}] / E[h(x)].   (14.2)
In the formulation (14.2), when h(x) = 1, the target population is the same as the population
where the study sample is drawn from, and the WATE reduces to the standard ATE estimand
τ ATE = E(Y1i − Y0i ). ATE has been the focus of a large majority of causal inference literature until
recently. However, the automatic focus on ATE is not always warranted in practice. First, often the
available sample does not represent a natural population of scientific interest. For example, patients
usually have to meet some criteria in order to be included in a clinical trial and thus may not resemble the general patient population. Also, in observational comparative effectiveness studies, the study
sample is sometimes a so-called “convenience” sample, e.g., patients drawn from a few selected
hospitals or clusters, who may be substantially different from the general patient population. In such
cases, scientific interpretation of the ATE estimated from the available sample is opaque. Second,
the ATE corresponds to the effect of hypothetically switching every unit in the study population
from one treatment to the other, a scenario rarely conceivable in medical studies because physicians
are well aware that the treatment might be harmful to patients with certain characteristics. Third,
researchers commonly exclude some units from the final analysis, e.g. unmatched units or units with
extremely large weights. The remaining sample may only represent a subpopulation of the original
targeted population and this subpopulation can vary substantially depending on the specific algorithm.
In summary, the target population is often different from the population that is represented by the
available sample. Therefore, it is important to explore alternative target populations and estimands.
Denote the density of the covariates in treatment group z by fz(x) = f(x | Zi = z); then f1(x) ∝ f(x)e(x) and f0(x) ∝ f(x){1 − e(x)}, where ∝ means proportional to. For a given g(x), or equivalently a tilting function h(x) = g(x)/f(x), to estimate τ^h we can weight fz(x) to equal g(x) using the following weights (up to a normalizing constant):

w1(x) ∝ f(x)h(x) / {f(x)e(x)} = h(x)/e(x),   for Z = 1,
w0(x) ∝ f(x)h(x) / {f(x)(1 − e(x))} = h(x)/(1 − e(x)),   for Z = 0.   (14.3)
The class of weights defined in (14.3) are called balancing weights because they balance the weighted distributions of the covariates between comparison groups, both towards the target population:

fz(x)wz(x) ∝ f(x)h(x) = g(x),   for z = 0, 1.   (14.4)
Balancing weights include several widely used propensity score weighting schemes as special
cases. Choice of the tilting function h determines the target population, estimand, and weights.
Statistical, scientific, and policy considerations all may come into play in choosing h in a specific
application. For example, as discussed before, when h(x) = 1, the corresponding target population
f (x) is the overall (combining treated and control) population that is represented by the study sample,
the weights (w1 , w0 ) are the IPW {1/e(x), 1/(1 − e(x))} [11], and the estimand is the ATE for the
overall population. When h(x) = e(x), the target population is the treated population, the weights
are the ATT weight {1, e(x)/(1 − e(x))} [12], and the estimand is the average treatment effect for
the treated (ATT), τ ATT = E(Y1i − Y0i | Zi = 1). When h(x) = 1 − e(x), the target population is the
control population, the weights are {(1 − e(x))/e(x), 1}, and the estimand is the average treatment
effect for the control (ATC), τ ATC = E(Y1i − Y0i | Zi = 0). When h is an indicator function, one
can define the ATE on subpopulations with specific baseline characteristics, e.g. a subpopulation
with a certain age or gender. Other examples of balancing weights include the matching weight [13], corresponding to h(x) = min{e(x), 1 − e(x)}, and the entropy weight [14], corresponding to h(x) = −{e(x) log e(x) + (1 − e(x)) log(1 − e(x))}.
An important special case of balancing weights is propensity score trimming, which focuses on a target population with adequate covariate overlap. In particular, Crump et al. [1] recommended using h(x) = 1(α < e(x) < 1 − α), where 1(·) is an indicator function, with a pre-specified threshold α ∈ (0, 1/2) to limit the analysis to the subpopulation with adequate covariate overlap. Commonly used trimming thresholds are 0.01, 0.05, and 0.1. A drawback of the trimming method is that the results may be sensitive to the choice of the threshold. The above examples of balancing weights are summarized in Table 14.1. In the next sections, we will focus on the special case of the overlap weight [5].
Once the target population and estimand are decided, the central task is to estimate the corresponding WATE estimand. Following the convention in the literature, we maintain two standard assumptions
throughout this chapter: (A1) unconfoundedness: Pr(Zi | Y1i , Y0i , Xi ) = Pr(Zi | Xi ), implying that
there is no unmeasured confounder; (A2) overlap (or positivity): 0 < Pr(Zi = 1 | Xi ) < 1, implying
that every unit’s probability of being assigned to each treatment condition is bounded away from 0
and 1. Under these two assumptions, a consistent moment estimator for the WATE estimand (14.2)
with any tilting function h is a Hájek estimator:

τ̂^h = ∑i w1(Xi) Zi Yi / ∑i w1(Xi) Zi − ∑i w0(Xi)(1 − Zi) Yi / ∑i w0(Xi)(1 − Zi).   (14.5)
In observational studies, propensity scores are usually unknown and must be first estimated from
the data, e.g. from a logistic regression, and the weights w are calculated from plugging the estimated
propensity scores into formula (14.3). In theory, any type of balancing weight guarantees balance in the overall multivariate distribution of covariates between treatment groups. But this does not imply that using balancing weights with estimated propensity scores would balance the distribution of each individual covariate. In practice, balance of each individual covariate before and after weighting is
routinely checked by the metric of the absolute standardized difference (ASD) [15]:

ASD = | ∑_{i=1}^{N} w1(Xi) Zi Xik / ∑_{i=1}^{N} w1(Xi) Zi − ∑_{i=1}^{N} w0(Xi)(1 − Zi) Xik / ∑_{i=1}^{N} w0(Xi)(1 − Zi) | / √{(s1^2 + s0^2)/2},   (14.6)

where k (= 1, . . . , p) indexes a specific covariate, or by the target population standardized difference (PSD) [16], max{PSD0, PSD1}, where

PSDz = | ∑_{i=1}^{N} wz(Xi) 1{Zi = z} Xik / ∑_{i=1}^{N} wz(Xi) 1{Zi = z} − ∑_{i=1}^{N} h(Xi) Xik / ∑_{i=1}^{N} h(Xi) | / √{(s1^2 + s0^2)/2},   (14.7)

where sz^2 is the unweighted variance of the covariate in group z. Setting w0 = w1 = 1 corresponds to
the unweighted mean differences. As we shall show in a real application in Section 14.6.2, imbalance of individual covariates after propensity score weighting is common.
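To make the weighting recipe concrete, the following R sketch computes balancing weights, the Hájek estimate (14.5), and the ASD (14.6) for one covariate. It assumes a hypothetical data frame dat with treatment Z, outcome Y, and covariates X1 and X2, and it uses the overlap tilting function h(x) = e(x){1 − e(x)} that defines OW:

# Propensity scores from a logistic working model
e <- fitted(glm(Z ~ X1 + X2, data = dat, family = binomial))

# Balancing weights for the overlap tilting function h = e(1 - e):
# w1 = h/e = 1 - e for treated units, w0 = h/(1 - e) = e for controls
w <- ifelse(dat$Z == 1, 1 - e, e)

# Hajek estimator (14.5)
tau_hat <- with(dat, sum(w * Z * Y) / sum(w * Z) -
                     sum(w * (1 - Z) * Y) / sum(w * (1 - Z)))

# ASD (14.6) for covariate X1
asd_X1 <- with(dat, abs(sum(w * Z * X1) / sum(w * Z) -
                        sum(w * (1 - Z) * X1) / sum(w * (1 - Z))) /
                    sqrt((var(X1[Z == 1]) + var(X1[Z == 0])) / 2))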
FIGURE 14.1
Overlap weights for two normally-distributed groups with different means. In the upper panel, the
left and right solid lines, the thin and thick dashed lines represent the density of the covariate in the
control, treated, combined (h(x) = 1), and overlap weighted (h(x) = e(x)(1 − e(x))) populations,
respectively. In the lower panel, the two solid lines represent w0 (x), w1 (x) and the dashed line
represents h(x) = e(x)(1 − e(x)).
As a technical note, the minimal variance property is proved under (i) true propensity scores, and
(ii) a homoscedasticity assumption. Nonetheless, it has been consistently shown in simulations and
real applications that neither is crucial in practice [6, 20], and OW provides the most efficient causal estimates under limited overlap compared to IPW and trimming [21, 22]. Because the analytical expression of the ATO involves the true propensity score, one may question whether OW would be more sensitive to misspecification of the propensity score model than IPW. Recent work has demonstrated the opposite [14]. A possible explanation is that OW targets the subpopulation with the most overlap, whose propensity scores are close to 0.5. As such, the estimated propensity scores, even if from a misspecified model, are unlikely to deviate much from the true scores. This phenomenon exemplifies
the dictum “design trumps analysis” for objective causal inference [23] in the sense that OW attempts
to solve the lack of overlap problem from a design rather than analysis perspective by shifting the
target population.
The second property of OW concerns finite-sample balance of individual covariates.
Property 2 (Exact balance). When the propensity scores are estimated by maximum likelihood under a logistic regression model, logit e(Xi) = β0 + Xi'β, overlap weights lead to exact balance in the means of any included covariate between treatment and control groups. That is,

∑i Xik Zi (1 − ê(Xi)) / ∑i Zi (1 − ê(Xi)) = ∑i Xik (1 − Zi) ê(Xi) / ∑i (1 − Zi) ê(Xi),   for k = 1, . . . , p,   (14.9)

where ê(Xi) = {1 + exp[−(β̂0 + Xi'β̂)]}^{-1} and β̂ = (β̂1, . . . , β̂p) is the MLE of the regression coefficients.
The exact balance property is a finite-sample result, not relying on asymptotic arguments. It holds for any covariate, and for any function of covariates – including higher-order and interaction terms – included as a predictor in the logistic propensity model. Technically, the exact balance property is unique to the
logistic link and binary treatments. However, even if the propensity score is estimated via other
models such as probit or machine learning models [24], or via multinomial models in the case of
multiple treatments, the resulting OW generally leads to superior balance – in terms of both mean
and multivariate measure such as Mahalanobis distance – compared to other balancing weights. This
is as expected because overlap weight focuses on the population with most covariate overlap between
treatment groups.
A direct corollary of the exact balance property is that if the postulated logistic propensity score
model includes any interaction term with a binary covariate, then the resulting OW leads to exact
balance in the means in the subgroups defined by that binary covariate. Building on this result, Yang
et al. [25, 26] proposed to first use LASSO [27] to select important covariate–subgroup interactions
in the propensity score model and then use OW to estimate the causal effects within pre-specified
subgroups. Such an approach achieves covariate balance (and thus reduces bias) compared to IPW or
matching in estimating subgroup causal effects.
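Property 2 is easy to verify numerically. In the R sketch below, with simulated data and logistic-MLE propensity scores, the OW-weighted covariate means of the two groups coincide up to numerical tolerance, as (14.9) asserts; this follows from the score equations of the logistic MLE:

# Numerical check of the exact balance property (14.9) on simulated data
set.seed(1)
n <- 500
X <- rnorm(n)
Z <- rbinom(n, 1, plogis(0.5 + 0.8 * X))
e <- fitted(glm(Z ~ X, family = binomial))
m1 <- sum(X * Z * (1 - e)) / sum(Z * (1 - e))     # OW-weighted treated mean
m0 <- sum(X * (1 - Z) * e) / sum((1 - Z) * e)     # OW-weighted control mean
all.equal(m1, m0)                                 # TRUE, up to solver tolerance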
Variance of the Hájek OW estimator (14.5) can be estimated by an empirical sandwich estimator
[6], which also accounts for the uncertainty of estimating the propensity scores [28]. The variance
can also be estimated using bootstrap, where one re-estimates the propensity scores and re-calculates
the causal estimate in each of the bootstrap samples.
An increasingly popular estimator in the weighting literature is the augmented weighting es-
timator, which combines propensity score weighting and outcome model to improve robustness
and efficiency [29]. A general form of the augmented weighting estimator with an arbitrary tilting
function h and its corresponding balancing weight w is:

τ̂^{h,aug} = ∑i w1(Xi) Zi {Yi − µ̂1(Xi)} / ∑i w1(Xi) Zi − ∑i w0(Xi)(1 − Zi) {Yi − µ̂0(Xi)} / ∑i w0(Xi)(1 − Zi) + ∑i h(Xi) {µ̂1(Xi) − µ̂0(Xi)} / ∑i h(Xi),   (14.10)
where µ̂z(Xi) = Ê(Yi | Xi, Zi = z) is the predicted outcome of unit i from an outcome model.
With IPW, the augmented weighting estimator (14.10) is called the doubly–robust estimator [28, 29]
because it is consistent if either the propensity score model or the outcome model, but not necessarily
both, is correctly specified. When both models are correctly specified, the augmented IPW estimator
τ̂ h,aug is more efficient than the Hájek estimator τ̂ h . With OW, because the definition of the ATO
estimand involves the true propensity score, τ̂ h,aug is not consistent if the propensity model is
misspecified. Nonetheless, τ̂ h,aug is singly–robust in the sense that it is consistent for ATO if the
propensity model is correctly specified, regardless of the outcome model specification. Interestingly,
simulations and empirical applications consistently suggest that with OW, the Hájek estimator
and the augmented estimator are very similar in terms of bias and variance. This is not surprising
because it is well known that most methods lead to similar treatment effect estimates in randomized
experiments. OW mimics the design of a randomized experiment; therefore, similar to a randomized
study, outcome augmentation adds little value to the OW estimate of causal effect in an observational
study.
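For completeness, here is a minimal R sketch of the augmented estimator (14.10) with OW and linear outcome models; dat is a hypothetical data frame with treatment Z, outcome Y, and covariates X1 and X2:

# Augmented OW estimator (14.10): weighting plus outcome regression
e  <- fitted(glm(Z ~ X1 + X2, data = dat, family = binomial))
h  <- e * (1 - e)                                 # overlap tilting function
w  <- ifelse(dat$Z == 1, h / e, h / (1 - e))      # overlap weights
m1 <- predict(lm(Y ~ X1 + X2, data = dat, subset = Z == 1), newdata = dat)
m0 <- predict(lm(Y ~ X1 + X2, data = dat, subset = Z == 0), newdata = dat)
tau_aug <- with(dat,
  sum(w * Z * (Y - m1)) / sum(w * Z) -
  sum(w * (1 - Z) * (Y - m0)) / sum(w * (1 - Z)) +
  sum(h * (m1 - m0)) / sum(h))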
With multiple treatments j = 1, . . . , J, let µj(x) = E(Yj,i | Xi = x) denote the conditional mean of the potential outcome under treatment j. The average potential outcome under treatment j on the target population g is

µj^h = Eg(Yj,i) = E{h(x)µj(x)} / E{h(x)}.   (14.11)

Causal estimands can then be constructed in a general manner as contrasts of the µj^h. For example, the most commonly seen estimands with multiple treatments are the pairwise average treatment effects between groups j and j′: τ^h_{j,j′} = µj^h − µj′^h. This definition can be generalized to arbitrary linear contrasts of (µ1^h, . . . , µJ^h).
It is straightforward to generalize the balancing weights framework in Section 14.2 to the case of multiple treatments. Let fj(x) = f(x | Zi = j) be the density of the covariates in the observed jth group, so that fj(x) ∝ f(x)ej(x), where ej(x) = Pr(Zi = j | Xi = x) is the generalized propensity score. For any pre-specified tilting function h(x), and equivalently the target population g(x) = f(x)h(x), the balancing weight for treatment j (j = 1, . . . , J) is:

wj(x) ∝ f(x)h(x) / {f(x)ej(x)} = h(x)/ej(x).   (14.12)

The above weights balance the weighted distributions of the covariates across the J comparison groups, fj(x)wj(x) = f(x)h(x), for all j.
Similar to the case of binary treatments, to estimate µj^h we need to maintain two standard assumptions: (A3) weak unconfoundedness: Yj,i ⊥ 1{Zi = j} | Xi, for all j, and (A4) overlap: the generalized propensity score is bounded away from 0 and 1, i.e., 0 < ej(x) < 1, for all j and x. Under these assumptions, a consistent estimator for µj^h with any tilting function h(x) is a Hájek estimator:

µ̂j^h = ∑_{i=1}^{N} wj(Xi) Dij Yi / ∑_{i=1}^{N} wj(Xi) Dij,   (14.13)

where Dij = 1{Zi = j} indicates assignment to treatment j. The corresponding target estimand of each weighting scheme is its pairwise – between each pair of treatments – counterpart in binary treatments.
Under homoscedasticity, the function h(x) = {∑_{j=1}^{J} 1/ej(x)}^{-1}, i.e., the harmonic mean of the generalized propensity scores of all groups, minimizes the sum of the asymptotic variances of the Hájek estimators µ̂j^h of all groups among all choices of h [16]. Consequently, one can define the corresponding OW for treatment group j as:

wj(x) ∝ (1/ej(x)) / ∑_{k=1}^{J} (1/ek(x)).   (14.14)
FIGURE 14.2
Ternary plot of the optimal tilting h – the harmonic mean – as a function of the generalized propensity
score vector with J = 3 treatments. Each point in the triangular plane represents a unit with certain
values of the generalized propensity scores. The value of each generalized propensity score is
proportional to the orthogonal distance from that point to each edge. The overlap weighting scheme emphasizes the centroid region with good overlap, e.g., units with e(x) ≈ (1/3, 1/3, 1/3), and smoothly down-weights the edges, e.g., units with e(x) ≈ (0, 1/2, 1/2).
With time-to-event outcomes, each unit has a potential survival time Ti(z) and a potential censoring time Ci(z) under treatment z. Let Ti and Ci denote the observed survival time and censoring time for each unit, respectively. The survival time may be right censored when Ci ≤ Ti, so we observe Vi = min(Ti, Ci) and the indicator of whether the subject failed within the study period, δi = 1{Ti ≤ Ci}. There are several causal estimands with survival outcomes [37]. Below we focus on a class of WATE defined by the counterfactual survival functions, denoted by S^z(t | X) = Pr(T(z) ≥ t | X) for z = 0, 1. As before, we use a tilting function h(x) to represent a target population g(x) = h(x)f(x). The survival probability causal effect (SPCE), or causal risk difference, on the target population g is

τh^{SPCE}(t) = Eg{S^1(t) − S^0(t)} = E[h(x){S^1(t | x) − S^0(t | x)}] / E{h(x)}.   (14.15)
In order to identify the SPCE, besides the standard unconfoundedness and overlap assumptions, we additionally need an assumption on the censoring mechanism: (A5) covariate-dependent censoring: Ti(z) ⊥ Ci(z) | {Xi, Zi}, which states that the censoring time is independent of the survival time given the covariates in each group. The key to drawing causal inference with survival outcomes is to
address the selection bias associated with the right censoring of survival outcomes. The censoring
process can be represented by the survival distribution of the potential censoring time, denoted by
Kcz (t, X) = Pr(C ≥ t | X, Z = z) for z = 0, 1. The censoring score Kcz (t, X) is usually unknown
and must be estimated from the observed data, e.g., from a Cox model or a parametric Weibull model.
Cheng et al. [20] proposed an estimator for the SPCE that combines balancing weights and inverse probability of censoring weights:

τ̂h^{SPCE}(t) = [1 − {∑i w(Xi) Zi 1(Vi ≤ t) δi / K̂c1(Vi, Xi)} / {∑i w(Xi) Zi}] − [1 − {∑i w(Xi)(1 − Zi) 1(Vi ≤ t) δi / K̂c0(Vi, Xi)} / {∑i w(Xi)(1 − Zi)}],   (14.16)

where w(Xi) is the balancing weight corresponding to the tilting function h, and δi / K̂cz(Vi, Xi) is
the estimated inverse probability of censoring weights applied only to the non-censored observa-
tions throughout the study follow-up. When the censoring process does not depend on covariates,
estimator (14.16) reduces to a propensity score weighted Kaplan–Meier estimator [32, 38]. Cheng et al. [20] proved that under Assumptions A1, A2 and A5, estimator (14.16) is point-wise consistent for τh^{SPCE}(t) at any time 0 ≤ t ≤ tmax with any tilting function h. Moreover, under certain homoscedasticity conditions, OW achieves the smallest asymptotic point-wise variance for estimating τh^{SPCE}(t) among the class of balancing weights. Extensive simulations in [20] show that, similar to the case of non-censored outcomes, OW consistently outperforms IPW and trimming methods in terms of bias, variance, and coverage in estimating the SPCE, and the advantage increases as the degree of covariate overlap between groups decreases.
An alternative approach for propensity score weighting with survival outcomes is through the
jackknife pseudo-observations [39, 40], which can handle several causal estimands, including SPCE
and the restricted average causal effect, in a unified fashion. Zeng et al. [8] proved the minimum
variance property of OW also holds with the pseudo-observation-based propensity score weighting
estimator.
In covariate adjustment for randomized experiments, when the tilting function h is a smooth function of the true propensity score, its corresponding balancing weight is semiparametric efficient and asymptotically equivalent to the ANCOVA estimator [44]. This includes OW and IPW as special cases. Moreover, owing to its exact balance property, OW has the unique advantage of completely
removing chance imbalance when the propensity score is estimated by a logistic regression, which
improves the face validity of a randomized experiment as well as efficiency. Through extensive
simulations, Zeng et al. [43] demonstrated that OW consistently outperforms IPW in finite samples
and improves the efficiency over ANCOVA when the outcome model is incorrectly specified.
1. Estimate the propensity scores from a model, e.g., logistic model or machine learning methods.
Using that model, calculate the weight of each unit according to a selected weighting scheme.
2. Check the overall overlap and balance of the covariates between treatment groups, which is usually visualized by overlaid histograms of the estimated propensity scores of each group. Display ASDs or PSDs before and after weighting in a baseline characteristics table (often known as “Table 1” in medical research articles; an example is given in Table 14.3), and visualize them via a Love plot (an example is given in Figure 14.5). A rule of thumb for determining adequate balance is that the ASD or PSD of every covariate is controlled within 0.1 [15]. In the same table, also present the weighted average of each covariate, ∑_{i=1}^{N} h(Xi)Xi / ∑_{i=1}^{N} h(Xi), which characterizes the corresponding target population.
3. Estimate the treatment effect by the Hájek estimator (14.5), or the augmented estimator (14.10),
in which case one needs to specify an additional outcome model.
If a covariate is unrelated to both outcome and treatment assignment, then balancing it does not improve causal inference. In a sense, OW already liberates analysts from the “how to balance” problem owing to the exact balance property, and thus allows analysts to focus on deciding “what to balance.” In practice, we recommend first identifying the set of confounders from substantive knowledge, e.g., based on discussion with domain scientists, and then specifying a propensity score model that includes all these confounders.
OW has been implemented in several R packages, including PSW [47] and WeightIt [48]. In particular, the most comprehensive package for propensity score weighting is the R package PSweight developed by Zhou et al. [49], which supports (i) several balancing weights, including OW, IPW, matching weights, entropy weights, and trimming; (ii) binary and multiple treatments; (iii) Hájek and augmented estimators; (iv) ratio estimands for binary and count outcomes; and (v) time-to-event outcomes. PSweight also provides diagnostic tables and graphs for visualizing the target population and covariate balance.
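A typical PSweight session mirroring the three steps above might look like the following sketch; the data frame dat and variables Z, Y, X1, X2 are hypothetical, and argument names may vary slightly across package versions:

library(PSweight)
ps.form <- Z ~ X1 + X2
bal <- SumStat(ps.formula = ps.form, data = dat,        # steps 1 and 2
               weight = c("IPW", "overlap", "treated"))
plot(bal, type = "balance")                             # Love plot of balance
fit <- PSweight(ps.formula = ps.form, yname = "Y",      # step 3: Hajek estimate
                data = dat, weight = "overlap")
summary(fit)                                            # ATO with standard error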
14.6 Illustration
This section provides two illustrative examples of propensity score weighting analysis and compares
several balancing weights.
FIGURE 14.3
Distribution of propensity scores, and weighted covariate distributions by OW, IPW, and ATT weights,
respectively, within each treatment group, of the simulated example in Section 14.6.1.
TABLE 14.2
Absolute standardized difference (ASD), estimated treatment effect, and corresponding standard
error from OW, IPW and ATT weights in the simulated example in Section 14.6.1. The true treatment
effect is τ = 1.
correctly specifying the outcome model. IPW and ATT weights both yield a relative bias of about 20%, with standard errors over 20 and 5 times that of OW, respectively. This simple example
illustrates that, although in theory IPW and ATT weights lead to covariate balance and easy-to-
interpret target populations, in practice both methods can lead to substantial imbalance and inflated
bias and variance when there is limited covariate overlap between treatments. In contrast, OW
provides optimal covariate balance and unbiased point estimate with low variance regardless of the
degree of overlap or outcome generating process. Though our simulations focus on a simple case of
homogeneous treatment effects, these patterns are consistently observed in several comprehensive
simulation studies with heterogeneous treatment effects [6, 8, 14, 20, 50].
FIGURE 14.4
Histogram of the estimated propensity scores in the NCDS study.
the employment status of the mother (maemp) was also collected; paed_u and maed_u are the years of education of the parents.
We first estimate the propensity scores via a logistic regression model with main effects of all covariates and several interactions suggested by [51]. Figure 14.4 presents the histogram of the estimated propensity scores. The histogram suggests that, although the overall distributions of the covariates are imbalanced between the two groups, the ranges of the propensity scores in the two groups overlap well. Figure 14.5 presents the Love plot of the PSD metric for each covariate before and after three weighting schemes: OW, IPW, and ATT. Both graphs are produced by the plot.SumStat() function in the R PSweight package. Clearly, the unweighted mean differences are substantially larger than the commonly used balance threshold of 0.1, while propensity score weighting in general improves covariate balance. Among the three weighting schemes, OW and IPW control the maximum PSD of each covariate to below 0.1; as guaranteed by theory, OW provides exact balance, with the maximum PSD of each covariate equal to zero. Table 14.3 presents the unadjusted means by treatment status and the weighted overall means under each weighting scheme for a subset of covariates. Different weighting methods result in marked differences for several covariates. We recommend providing such summary tables in real applications because they concretely characterize the target population corresponding to the chosen weighting method.
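A rough sketch of how these diagnostics might be produced follows; the data frame ncds and the short covariate list are placeholders rather than the study's full model specification.

library(PSweight)
ps.form <- treat ~ agepa + agema + maemp + paed_u + maed_u
bal <- SumStat(ps.formula = ps.form, data = ncds,
               weight = c("IPW", "overlap", "treated"))
plot(bal, type = "hist")                      # propensity score histograms, cf. Figure 14.4
plot(bal, type = "balance", metric = "PSD")   # Love plot of PSDs, cf. Figure 14.5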
Based on the estimated propensity scores, we can estimate the treatment effect on different target populations by applying the corresponding weighting scheme in the Hájek estimator.
FIGURE 14.5
Love plot of the maximum PSD metric of each covariate in the NCDS study.
TABLE 14.3
Unadjusted mean by group and weighted overall mean (by IPW, ATT, and OW, respectively) of selected covariates in the NCDS study. Note: school type, math, and reading scores are categorical variables, each with multiple categories; only one category per variable is shown for illustration.
Using OW, we estimated the ATO to be 0.179 with a standard error of 0.016; using IPW, we estimated the ATE for the overall population to be 0.193 with a standard error of 0.021. This suggests that obtaining any academic qualification leads to a statistically significant increase in hourly wage. In this study, the point estimates from the different weighting schemes are similar, though OW yields the smallest variance. This is as expected given that the covariate distributions of the two groups overlap well. When there is a severe lack of overlap and/or heterogeneous treatment effects, different weighting schemes are expected to lead to much more variable treatment effect estimates; for such an example, see [6]. More details of the NCDS example, including R code, can be found in [49].
14.7 Discussion
In this chapter we have discussed the general class of propensity-score-based balancing weights and highlighted a special member of the class, the overlap weight (OW). OW shifts the traditional goalpost from finding an "optimal" estimate of the ATE for the overall population to finding an "optimal" subpopulation for which the ATE can be estimated with the greatest precision and internal validity. The target population of OW is the subpopulation with a substantial chance of being assigned to either treatment condition, namely, the population in "clinical equipoise" in medical research. OW possesses a scientifically meaningful interpretation and desirable theoretical properties, including minimum variance and exact mean balance, and has repeatedly been shown in simulations and real applications to outperform alternative weighting methods. For these reasons, OW has become increasingly popular in real-world applications, including several high-profile COVID-19 studies, e.g., [7, 53].
The development of OW, and more broadly of balancing weights, is motivated by a key yet often overlooked question in causal inference: "for what population is the ATE defined?" In practice, we apply a statistical method to a sample, but usually we want to interpret the results for the target population from which the sample is drawn. If the study sample is representative of a scientifically meaningful population, then the ATE can be interpreted as the treatment effect for that population. But in many real-world situations, the sample does not represent a natural population; consequently, the ATE estimated from such a sample has an opaque interpretation.
A central message from the balancing weights framework is that, instead of fixating on a specific weighting method, analysts should first specify a target population and then estimate the treatment effect on that population from the observed sample using the corresponding weighting scheme. The singular focus on the ATE is neither justified nor necessary. To better understand this point, it is worthwhile to compare causal inference with survey sampling, from which IPW originates. The starting point of survey sampling is the design, comprising an explicitly specified target population and the corresponding survey weights or sampling probabilities. A well-designed survey rarely has extreme weights. In contrast, the starting point of causal inference with observational studies is the data, and one uses the observed data to attempt to reconstruct the design. Because observational data are not sampled according to a pre-specified design, lack of overlap, and consequently extreme weights, are common. An automatic focus on the ATE (and, equivalently, IPW) implicitly assumes that the sample is representative of a well-defined target population, which is often not the case in practice, and thus often obscures the design element in observational studies. Furthermore, in comparative effectiveness studies, there is a trade-off between internal validity and external validity [54], the latter measuring how generalizable the results are to a different population. OW achieves optimal internal validity at the cost of external validity. In contrast, other weighting schemes such as IPW may be more suitable if external validity is the priority.
OW can be extended in several directions. First, continuous treatments [55] are common in practice, where the goal is usually to estimate a dose–response relationship. Ensuring adequate overlap is challenging with continuous treatments because of the small sample size at each level of the treatment. Operationally, we can view a continuous treatment as the limit of multiple ordered treatments, and thus it is straightforward to extend the OW method in Section 14.4.1 to continuous treatments. Conceptually, however, the meaning of overlap is ambiguous with continuous treatments and requires more formal investigation. Second, OW has so far been developed mostly in cross-sectional settings. Many observational studies involve longitudinal treatments, where lack of overlap becomes more severe as time evolves. We could frame longitudinal treatments as multiple treatments in which each treatment path is a treatment level. However, to calculate OW, one needs to know all the time-varying covariates, including the counterfactual intermediate outcomes under each treatment path, which are not observed for each unit. One possibility is to specify an outcome model to impute the counterfactual intermediate outcomes, but this would undercut the simplicity of OW. Wallace and Moodie [19] independently discovered OW in the context of identifying dynamic treatment regimes, but that method does not apply to estimating marginal effects of longitudinal treatments. Overall, the extension of OW to longitudinal treatments remains a desirable but challenging open topic.
Third, a recent trend in biopharmaceutical development is to augment clinical trials with real-world evidence. In particular, in clinical trials of rare or severe diseases, there has been an increasing need to construct external or synthetic controls using historical data [56]. Propensity score matching is often used to match historical and concurrent data, but it sometimes results in discarding a large number of unmatched units, which is undesirable when the concurrent trial has a small sample size. OW can bypass this problem and construct external controls that are most similar to the concurrent trial. Furthermore, one may combine OW with the Bayesian dynamic pooling method [57] to adjust for both measured and unmeasured confounding between historical and concurrent data.
References
[1] Richard Crump, Joseph Hotz, Guido Imbens, and Oscar Mitnik. Dealing with limited overlap
in estimation of average treatment effects. Biometrika, 96(1):187–199, 2009.
[2] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.
[3] José R. Zubizarreta. Stable weights that balance covariates for estimation with incomplete
outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.
[4] Qingyuan Zhao. Covariate balancing propensity score by tailored loss functions. The Annals of
Statistics, 47(2):965–993, 2019.
[5] Fan Li, Kari Lock Morgan, and Alan M. Zaslavsky. Balancing Covariates via Propensity Score
Weighting. Journal of the American Statistical Association, 113(521):390–400, 2018.
[6] Fan Li, Laine E Thomas, and Fan Li. Addressing extreme propensity scores via the overlap
weights. American Journal of Epidemiology, 188(1):250–257, 2019.
[7] Laine Thomas, Fan Li, and Michael Pencina. Overlap weighting: A propensity score method that
mimics attributes of a randomized clinical trial. Journal of the American Medical Association,
323(23):2417–2418, 2020.
[8] Shuxi Zeng, Fan Li, Liangyuan Hu, and Fan Li. Propensity score weighting analysis for survival
outcomes using pseudo observations. Statistica Sinica, 2021. doi:10.5705/ss.202021.0175
[9] Donald B Rubin. Randomization analysis of experimental data: The Fisher randomization test
comment. Journal of the American Statistical Association, 75(371):591–593, 1980.
[10] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observa-
tional studies for causal effects. Biometrika, 70(1):41–55, 1983.
[11] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement
from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
[12] Keisuke Hirano and Guido W Imbens. Estimation of causal effects using propensity score
weighting: An application to data on right heart catheterization. Health Services and Outcomes
Research Methodology, 2:259–278, 2001.
[13] Liang Li and Tom Greene. A weighting analogue to pair matching in propensity score analysis.
International Journal of Biostatistics, 9(2):1–20, 2013.
[14] Yunji Zhou, Roland A Matsouaka, and Laine Thomas. Propensity score weighting under limited overlap and model misspecification. Statistical Methods in Medical Research, 29(12):3721–3756, 2020.
[15] Peter C Austin and Elizabeth A Stuart. Moving towards best practice when using inverse
probability of treatment weighting (iptw) using the propensity score to estimate causal treatment
effects in observational studies. Statistics in Medicine, 34(28):3661–3679, 2015.
[16] Fan Li and Fan Li. Propensity score weighting for causal inference with multiple treatments.
The Annals of Applied Statistics, 13(4):2389–2415, 2019.
[17] Eric C Schneider, Paul D Cleary, Alan M Zaslavsky, and Arnold M Epstein. Racial disparity
in influenza vaccination: Does managed care narrow the gap between African Americans and
whites? JAMA, 286(12):1455–1460, 2001.
[18] Richard Crump, Joseph Hotz, Guido Imbens, and Oscar Mitnik. Moving the goalposts: Ad-
dressing limited overlap in the estimation of average treatment effects by changing the estimand.
Technical Report 330, National Bureau of Economic Research, Cambridge, MA, September
2006.
[19] Michael P Wallace and Erica EM Moodie. Doubly-robust dynamic treatment regimen estimation
via weighted least squares. Biometrics, 71(3):636–644, 2015.
[20] Cao Cheng, Fan Li, Laine E Thomas, and Fan Li. Addressing extreme propensity scores in
estimating counterfactual survival functions via the overlap weights. American Journal of
Epidemiology, 191(6), 1140–1151, 2022.
[21] Huzhang Mao, Liang Li, and Tom Greene. Propensity score weighting analysis and treatment
effect discovery. Statistical Methods in Medical Research, 28(8):2439–2454, 2019.
[22] Kazuki Yoshida, Sonia Hernández-Díaz, Daniel H. Solomon, John W. Jackson, Joshua J. Gagne, Robert J. Glynn, and Jessica M. Franklin. Matching weights to simultaneously compare three treatment groups: comparison to three-way matching. Epidemiology, 28(3):387–395, 2017.
[23] Donald B Rubin. For objective causal inference, design trumps analysis. The Annals of Applied
Statistics, 2(3):808–840, 2008.
[24] Brian K Lee, Justin Lessler, and Elizabeth A Stuart. Improving propensity score weighting
using machine learning. Statistics in Medicine, 29(3):337–346, 2010.
[25] Siyun Yang, Elizabeth Lorenzi, Georgia Papadogeorgou, Daniel M Wojdyla, Fan Li, and
Laine E Thomas. Propensity score weighting for causal subgroup analysis. Statistics in
Medicine, 40:4294–4309, 2021.
[26] Siyun Yang, Fan Li, Laine E Thomas, and Fan Li. Covariate adjustment in subgroup analyses of randomized clinical trials: A propensity score approach. Clinical Trials, 2021.
[27] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[28] Jared K Lunceford and Marie Davidian. Stratification and weighting via the propensity score in
estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19):2937–
2960, 2004.
[29] Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal
inference models. Biometrics, 61(4):962–973, 2005.
[30] Laine Thomas, Fan Li, and Michael Pencina. Using propensity score methods to create target
populations in observational clinical research. Journal of the American Medical Association,
323(5):466–467, 2020.
[31] Guido W Imbens. The role of the propensity score in estimating dose-response functions.
Biometrika, 87(3):706–710, 2000.
[32] James M Robins and Dianne M Finkelstein. Correcting for noncompliance and dependent
censoring in an aids clinical trial with inverse probability of censoring weighted (ipcw) log-rank
tests. Biometrics, 56(3):779–788, 2000.
[33] Alan E Hubbard, Mark J Van Der Laan, and James M Robins. Nonparametric locally efficient
estimation of the treatment specific survival distribution with right censored data and covariates
in observational studies. In Statistical Models in Epidemiology, the Environment, and Clinical
Trials, pages 135–177. Springer, 2000.
[34] David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society:
Series B (Methodological), 34(2):187–202, 1972.
[35] Peter C Austin and Elizabeth A Stuart. The performance of inverse probability of treatment
weighting and full matching on the propensity score in the presence of model misspecification
when estimating the effect of treatment on survival outcomes. Statistical Methods in Medical
Research, 26(4):1654–1670, 2017.
[36] Stephen R Cole and Miguel A Hernán. Adjusted survival curves with inverse probability
weights. Computer Methods and Programs in Biomedicine, 75(1):45–49, 2004.
[37] Huzhang Mao, Liang Li, Wei Yang, and Yu Shen. On the propensity score weighting anal-
ysis with survival outcome: Estimands, estimation, and inference. Statistics in Medicine,
37(26):3745–3763, 2018.
[38] Glen A Satten and Somnath Datta. The Kaplan–Meier estimator as an inverse-probability-of-
censoring weighted average. The American Statistician, 55(3):207–210, 2001.
[39] Per K Andersen and Maja Pohar Perme. Pseudo-observations in survival analysis. Statistical Methods in Medical Research, 19(1):71–99, 2010.
[40] Per K Andersen, Elisavet Syriopoulou, and Erik T Parner. Causal inference in survival analysis
using pseudo-observations. Statistics in Medicine, 36(17):2669–2681, 2017.
[41] Elizabeth J Williamson, Andrew Forbes, and Ian R White. Variance reduction in randomised
trials by inverse probability weighting using the propensity score. Statistics in Medicine,
33(5):721–737, 2014.
[42] Changyu Shen, Xiaochun Li, and Lingling Li. Inverse probability weighting for covariate
adjustment in randomized studies. Statistics in Medicine, 33(4):555–568, 2014.
[43] Shuxi Zeng, Fan Li, Rui Wang, and Fan Li. Propensity score weighting for covariate adjustment
in randomized clinical trials. Statistics in Medicine, 40(4):842–858, 2021.
[44] Anastasios A Tsiatis, Marie Davidian, Min Zhang, and Xiaomin Lu. Covariate adjustment
for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible
approach. Statistics in Medicine, 27(23):4658–4677, 2008.
[45] Winston Lin. Agnostic notes on regression adjustments to experimental data: Reexamining
Freedman’s critique. The Annals of Applied Statistics, 7(1):295–318, 2013.
[46] Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical
Sciences. Cambridge University Press, 2015.
[47] Huzhang Mao and Liang Li. PSW: Propensity Score Weighting Methods for Dichotomous
Treatments, 2018. R package version 1.1-3.
[48] Noah Greifer. WeightIt: Weighting for Covariate Balance in Observational Studies, 2020. R
package version 0.10.2.
[49] Tianhui Zhou, Guangyu Tong, Fan Li, Laine Thomas, and Fan Li. PSweight: Propensity Score
Weighting for Causal Inference, 2020. R package version 1.1.8.
[50] Serge Assaad, Shuxi Zeng, Chenyang Tao, Shounak Datta, Nikhil Mehta, Ricardo Henao, Fan
Li, and Lawrence Carin Duke. Counterfactual representation learning with balancing weights.
In International Conference on Artificial Intelligence and Statistics, pages 1972–1980. PMLR,
2021.
[51] Erich Battistin and Barbara Sianesi. Misclassified treatment status and treatment effects: an
application to returns to education in the united kingdom. Review of Economics and Statistics,
93(2):495–509, 2011.
[52] Stef Van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation by chained
equations in R. Journal of Statistical Software, 45(1):1–67, 2011.
[53] Neil Mehta, Ankur Kalra, Amy S Nowacki, Scott Anjewierden, Zheyi Han, Pavan Bhat,
Andres E Carmona-Rubio, Miriam Jacob, Gary W Procop, Susan Harrington, et al. Association
of use of angiotensin-converting enzyme inhibitors and angiotensin ii receptor blockers with
testing positive for coronavirus disease 2019 (COVID-19). JAMA Cardiology, 5(9):1020–1026,
2020.
[54] Elizabeth A Stuart, Stephen R Cole, Catherine P Bradshaw, and Philip J Leaf. The use of
propensity scores to assess the generalizability of results from randomized trials. Journal of the
Royal Statistical Society: Series A (Statistics in Society), 174(2):369–386, 2011.
[55] Keisuke Hirano and Guido W Imbens. The propensity score with continuous treatments. Applied
Bayesian Modeling and Causal Inference from Incomplete-data Perspectives, 226164:73–84,
2004.
[56] Jessica Lim, Rosalind Walley, Jiacheng Yuan, Jeen Liu, Abhishek Dabral, Nicky Best, Andrew
Grieve, Lisa Hampson, Josephine Wolfram, Phil Woodward, et al. Minimizing patient burden
through the use of historical subject-level data in innovative confirmatory clinical trials: review
of methods and opportunities. Therapeutic Innovation & Regulatory Science, 52(5):546–559,
2018.
[57] Chenguang Wang, Heng Li, Wei-Chen Chen, Nelson Lu, Ram Tiwari, Yunling Xu, and Lilly Q
Yue. Propensity score-integrated power prior approach for incorporating real-world evidence in
single-arm clinical studies. Journal of Biopharmaceutical Statistics, 29(5):731–748, 2019.
15
Covariate Balancing Propensity Score
CONTENTS
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
15.2 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
15.2.1 A continuous treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
15.2.2 A dynamic treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
15.3 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
15.4 High-Dimensional CBPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
15.5 Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
15.6 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
15.1 Introduction
Covariate balancing propensity score (CBPS) was originally proposed by [1] as a general methodology to improve the estimation of the propensity score for causal inference. The propensity score, defined as the conditional probability of treatment assignment given observed covariates [2], plays an essential role in covariate adjustment methods including matching, weighting, and regression modeling. If the propensity score is correctly estimated, these covariate adjustment methods make the distribution of covariates equal between the treatment and control groups, allowing researchers to draw valid causal inferences under the standard unconfoundedness assumption. Yet, in practice, the propensity score is unknown and must be estimated from data. This can lead to misspecification of the propensity score model and, ultimately, bias in causal effect estimation.
The basic idea of CBPS is simple. It estimates the propensity score such that covariates are balanced between the treatment and control groups after being inversely weighted by the estimated propensity score. In applied research, analysts often check covariate balance after making matching or weighting adjustments based on the estimated propensity score, in an attempt to diagnose potential misspecification of the propensity score model. CBPS exploits this idea but directly optimizes covariate balance when estimating the propensity score. This allows researchers to avoid iterating between estimating the propensity score and checking the resulting covariate balance.
Suppose that we have a random sample of n observations from a population P. Let Ti represent
the binary (for now) treatment variable, which is equal to 1 if unit i receives the treatment and is equal
to 0 if the unit belongs to the control group. We use Xi to denote the set of observed pre-treatment
covariates. Then, the propensity score is defined as,
$$\pi(x) = \Pr(T_i = 1 \mid X_i = x), \tag{15.1}$$
for $x \in \mathcal{X}$, where $\mathcal{X}$ is the support of the covariates $X_i$. Following [2], we assume that treatment assignment is strongly ignorable,
$$\{Y_i(1), Y_i(0)\} \perp\!\!\!\perp T_i \mid X_i, \qquad 0 < \pi(X_i) < 1. \tag{15.2}$$
Now, consider the maximum likelihood estimation of the propensity score based on a parametric model $\pi_\beta(x)$, where $\beta$ is a finite-dimensional parameter; a commonly used model is logistic regression. We can show that maximum likelihood estimation implicitly balances the first derivative of the propensity score with respect to the model parameters. Formally, the score condition implies,
$$\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{T_i\,\pi'_\beta(X_i)}{\pi_\beta(X_i)} - \frac{(1-T_i)\,\pi'_\beta(X_i)}{1-\pi_\beta(X_i)}\right\} = 0, \tag{15.3}$$
where $\pi'_\beta(X_i) = \partial \pi_\beta(X_i)/\partial\beta$. Equation (15.3) is the sample analogue of a population constraint that balances, by weighting each observation according to the inverse of its propensity score, the covariates between the treatment and control groups through the derivative $\pi'_\beta(X_i)$.
If the propensity score is correctly estimated, however, we should be able to balance any function $f(\cdot)$ of the observed covariates. This observation leads to the following general definition of the covariate balancing condition,
$$E\left[\left\{\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\right\} f(X_i)\right] = 0, \tag{15.4}$$
where a typical choice of $f(\cdot)$ is a lower-order polynomial. [1] suggests using the generalized method of moments estimator [3] to incorporate this additional covariate balancing condition along with the original score condition.
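For reference, this GMM estimation is implemented in the CBPS package [26]; the following minimal sketch assumes a hypothetical data frame df with treatment treat and covariates age and educ.

library(CBPS)
# method = "over" combines the covariate balancing conditions with the
# score condition in an over-identified GMM; method = "exact" uses the
# balancing conditions alone
fit <- CBPS(treat ~ age + educ, data = df, ATT = 0, method = "over")
head(fit$weights)   # implied inverse propensity score weights
balance(fit)        # covariate balance before and after weighting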
There are several alternative formulations of covariate balancing conditions. First, Equation (15.4) weights the treatment and control groups such that their covariate distributions each match that of the target population. This means that we can use two separate covariate balancing conditions,
$$E\left[\left\{\frac{T_i}{\pi_\beta(X_i)} - 1\right\} f(X_i)\right] = 0, \tag{15.5}$$
$$E\left[\left\{1 - \frac{1-T_i}{1-\pi_\beta(X_i)}\right\} f(X_i)\right] = 0. \tag{15.6}$$
In addition, we may be interested in adjusting the covariate distribution of the control group so that it matches that of the treatment group. Such an adjustment is useful when estimating the average treatment effect on the treated (ATT). In this case the covariate balancing condition becomes,
$$E\left[\left\{T_i - \frac{(1-T_i)\,\pi_\beta(X_i)}{1-\pi_\beta(X_i)}\right\} f(X_i)\right] = 0. \tag{15.7}$$
15.2 Extensions
15.2.1 A continuous treatment
It is possible to extend the idea of covariate balancing conditions to other settings. [4] considers a continuous treatment. They first standardize both the treatment and the covariates using the transformation,
$$\tilde{T}_i = s_T^{-1/2}(T_i - \bar{T}), \qquad \tilde{X}_i = S_X^{-1/2}(X_i - \bar{X}), \tag{15.8}$$
where $\bar{T}$ and $s_T$ are the sample mean and variance of the treatment, respectively, and $\bar{X}$ and $S_X$ are the sample mean and covariance of the covariates, respectively. Then, the covariate balancing condition is given by,
$$E\left[\frac{p(\tilde{T}_i)}{p(\tilde{T}_i \mid \tilde{X}_i)}\, \tilde{T}_i \tilde{X}_i\right] = 0, \tag{15.9}$$
where $p(\tilde{T}_i \mid \tilde{X}_i)$ is the generalized propensity score [5, 6] and $p(\tilde{T}_i)$ is the normalizing weight used in marginal structural models [7]. [4] develops both parametric and nonparametric estimation approaches (generalized method of moments and empirical likelihood estimation, respectively) and terms the resulting estimator CBGPS (covariate balancing generalized propensity score). We refer readers to the original article for the details of these estimation strategies.
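The CBPS package also covers the continuous-treatment case; the sketch below assumes its interface, with a hypothetical dose variable and data frame df.

library(CBPS)
fit.gmm <- CBPS(dose ~ x1 + x2, data = df)     # parametric (GMM) CBGPS
fit.el  <- npCBPS(dose ~ x1 + x2, data = df)   # nonparametric (empirical likelihood) version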
15.2.2 A dynamic treatment
[8] extends the covariate balancing conditions to longitudinal settings with time-varying treatments. With two time periods, for example, the second-period covariates are balanced via
$$E\{X_{i2}(t_1)\} = E\left[\mathbf{1}\{T_{i1} = t_1, T_{i2} = t_2\}\, w_i(\bar{t}_2, \bar{X}_{i2}(t_1))\, X_{i2}(t_1)\right] \tag{15.10}$$
for $(t_1, t_2) \in \mathcal{T}$, where $X_{i2}(t_1)$ represents the potential values of the second-period covariates when the first-period treatment assignment is $t_1$, $\bar{t}_2 = (t_1, t_2)$ is the treatment history, and $\bar{X}_{i2}(t_1) = (X_{i1}, X_{i2}(t_1))$ is the covariate history. The marginal structural model weight is given as the product of propensity scores in the denominator with the normalizing factor in the numerator,
$$w_i(\bar{t}_2, \bar{X}_{i2}(t_1)) = \frac{\Pr(T_{i1} = t_1)}{\Pr(T_{i1} = t_1 \mid X_{i1})} \times \frac{\Pr(T_{i2} = t_2 \mid T_{i1} = t_1)}{\Pr(T_{i2} = t_2 \mid T_{i1} = t_1, X_{i1}, X_{i2}(t_1))}. \tag{15.11}$$
[8] shows that these covariate balancing conditions can be written compactly as orthogonal moment conditions in the observed-data notation, where $w_i = w_i(\bar{T}_{i2}, \bar{X}_{i2})$ is the observed weight with $\bar{T}_{i2} = (T_{i1}, T_{i2})$ and $\bar{X}_{i2} = (X_{i1}, X_{i2})$. For the baseline covariates, we simply balance them across all four treatment combinations.
15.3 Theory
In the above formulation, the choice of covariate balancing conditions in Equation (15.4), i.e., f (·),
is left unspecified. Although in practice researchers often choose simple functions such as f (x) = x
and f (x) = x2 , it is important to study the optimal choice. Here, we briefly review the theoretical
results of [9]. We first derive the optimal choice of the covariate balancing conditions under a locally
misspecified propensity score model. From this result, we then construct an optimal CBPS estimator
(oCBPS) for the average treatment effect (ATE) and establish the double robustness property of this
estimator.
We begin by introducing the notation,
$$K(X_i) = E(Y_i(0) \mid X_i), \qquad L(X_i) = E(Y_i(1) - Y_i(0) \mid X_i),$$
where $K(\cdot)$ and $L(\cdot)$ represent the conditional mean of the potential outcome under the control condition and the conditional average treatment effect (CATE), respectively. Then, the ATE can be written as,
$$\mu = E(Y_i(1) - Y_i(0)) = E(L(X_i)). \tag{15.16}$$
Let $\hat\beta$ denote the estimator of $\beta$ obtained by solving the covariate balancing condition (15.4) as an estimating equation. The Horvitz–Thompson estimator for the ATE is given by,
$$\hat\mu_{\hat\beta} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{T_i Y_i}{\pi_{\hat\beta}(X_i)} - \frac{(1-T_i)Y_i}{1-\pi_{\hat\beta}(X_i)}\right). \tag{15.17}$$
To study how the bias and variance of $\hat\mu_{\hat\beta}$ depend on the function $f(\cdot)$, we focus on local misspecification of the propensity score model using the general framework of [10]. Specifically, we assume that the true propensity score $\pi(X_i)$ is related to the working model $\pi_\beta(X_i)$ through an exponential tilt for some $\beta^*$. Under this local misspecification, the asymptotic mean squared error of $\hat\mu_{\hat\beta}$ decomposes into a squared asymptotic bias $B^2(f)$ and an asymptotic variance $\sigma^2(f)$, both of which depend on the choice of the function $f(\cdot)$; the explicit forms of $B(f)$ and $\sigma^2(f)$ are given in [9].
Since the ultimate goal of the CBPS methodology is to estimate treatment effects (the ATE in this section), we define the optimal choice of the function $f(\cdot)$ as the one that minimizes the asymptotic mean squared error (AMSE) of $\hat\mu_{\hat\beta}$, i.e., $B^2(f) + \sigma^2(f)$. A key theoretical result of [9] is that such an optimal choice of $f(\cdot)$ exists and is characterized by the following condition: $f(\cdot)$ is optimal if there exists a vector $\alpha$ such that
$$\alpha^\top f(X_i) = \pi_{\beta^*}(X_i)\,E(Y_i(0) \mid X_i) + \{1 - \pi_{\beta^*}(X_i)\}\,E(Y_i(1) \mid X_i). \tag{15.20}$$
This result implies that the optimal choice of $f(\cdot)$ is not unique. Recall that $f = (f_1, \ldots, f_m)$, where $m$ is the number of functions used to balance covariates. If we choose $f_1(X_i) = \pi_{\beta^*}(X_i)E(Y_i(0) \mid X_i) + \{1 - \pi_{\beta^*}(X_i)\}E(Y_i(1) \mid X_i)$, for example, then Equation (15.20) is always satisfied regardless of how the other $m-1$ functions are specified. The lack of uniqueness means that we may need an initial estimator of $\beta$ based on, for example, maximum likelihood. In addition, we also need to estimate the conditional mean functions $E(Y_i(0) \mid X_i)$ and $E(Y_i(1) \mid X_i)$ using parametric or nonparametric methods.
To overcome this problem, [9] introduced the optimal CBPS estimator (oCBPS), which does not require any initial estimator. Assume that we have classes of pre-specified functions $h(\cdot) \in \mathbb{R}^{m_1}$ and $g(\cdot) \in \mathbb{R}^{m_2}$. The oCBPS estimator $\hat\beta_O$ is defined as the solution to the estimating equations,
$$\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\right\} h(X_i) = 0, \tag{15.21}$$
and
$$\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{T_i}{\pi_\beta(X_i)} - 1\right\} g(X_i) = 0. \tag{15.22}$$
For simplicity, we focus on the exactly identified case with $m_1 + m_2 = p$, where $p$ is the dimension of $\beta$. To see how Equations (15.21) and (15.22) relate to the optimality condition in Equation (15.20), we rewrite Equation (15.22) as,
$$\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{T_i}{\pi_\beta(X_i)} - \frac{1-T_i}{1-\pi_\beta(X_i)}\right\} g(X_i)\{1-\pi_\beta(X_i)\} = 0. \tag{15.23}$$
Thus, the oCBPS estimator is a special case of the general CBPS estimator given in Equation (15.4) with $f(\cdot) = [h(\cdot), g(\cdot)\{1-\pi_\beta\}]$. If the functions $K(\cdot)$ and $L(\cdot)$ lie in the linear spaces spanned by $h(\cdot)$ and $g(\cdot)$, respectively (i.e., $K(\cdot) = \alpha_1^\top h(\cdot)$ and $L(\cdot) = \alpha_2^\top g(\cdot)$ for some vectors $\alpha_1$ and $\alpha_2$), then the optimality condition in Equation (15.20) holds. To see this, note that
$$\alpha_1^\top h(X_i) + \alpha_2^\top g(X_i)\{1 - \pi_\beta(X_i)\} = K(X_i) + L(X_i)\{1 - \pi_\beta(X_i)\} = \pi_{\beta^*}(X_i)E(Y_i(0) \mid X_i) + \{1 - \pi_{\beta^*}(X_i)\}E(Y_i(1) \mid X_i).$$
With the oCBPS estimator $\hat\beta_O$, [9] proved the following theorem, which establishes the double robustness and local semiparametric efficiency of the ATE estimator $\hat\mu_{\hat\beta_O}$. We reproduce the result here.

Theorem 15.1. Under mild regularity conditions given in [9], the estimator $\hat\mu_{\hat\beta_O}$ is doubly robust in the sense that $\hat\mu_{\hat\beta_O} \stackrel{p}{\longrightarrow} \mu$ if either of the following conditions holds:
1. The propensity score model is correctly specified: $P(T_i = 1 \mid X_i) = \pi_\beta(X_i)$.
2. The outcome model is correctly specified: $K(\cdot)$ and $L(\cdot)$ lie in the linear spaces spanned by $h(\cdot)$ and $g(\cdot)$, respectively.
Moreover, if both conditions hold, then
$$n^{1/2}(\hat\mu_{\hat\beta_O} - \mu) \stackrel{d}{\longrightarrow} N(0, V_{\mathrm{opt}}),$$
where
$$V_{\mathrm{opt}} = E\left[\frac{V(Y_i(1) \mid X_i)}{\pi(X_i)} + \frac{V(Y_i(0) \mid X_i)}{1 - \pi(X_i)} + \{L(X_i) - \mu\}^2\right]$$
is the semiparametric variance bound. Thus, $\hat\mu_{\hat\beta_O}$ is a locally semiparametric efficient estimator [11].
In addition, the estimator $\hat\mu_{\hat\beta_O}$ enjoys better higher-order asymptotic properties than standard doubly robust estimators such as the augmented inverse probability weighting (AIPW) estimator. The oCBPS method can also be extended to the nonparametric regression setting. We refer interested readers to [9] for additional theoretical results.
15.4 High-Dimensional CBPS
When the dimension $p$ of the covariates exceeds the sample size $n$, the covariate balancing equation (15.24) does not have a unique solution, and therefore the CBPS is not well defined. The main idea in [14] is to estimate the propensity score $\pi(X_i^\top\beta)$ such that the following covariate balancing condition is satisfied,
$$\sum_{i=1}^{n}\left\{\frac{T_i}{\pi(X_i^\top\beta)} - 1\right\}\alpha^{*\top} X_i = 0, \tag{15.25}$$
where $\alpha^*$ is the true value of $\alpha$ in the outcome regression when it is correctly specified; otherwise, $\alpha^*$ corresponds to a least false value of $\alpha$ under model misspecification (see [14] for the precise definition). We refer to Equation (15.25) as the weak covariate balancing equation.
It is easy to see that a propensity score satisfying Equation (15.24) also satisfies the weak covariate balancing equation, but not vice versa. It turns out that finding an estimate of the propensity score that approximately satisfies the weak covariate balancing equation is sufficient to remove the bias arising from estimation of the propensity score model. The algorithm proposed by [14] is summarized in Algorithm 1.
where the lasso penalty can be replaced with a non-convex penalty [13].
Step 2: Estimate the outcome model by the weighted regression
$$\tilde\alpha = \arg\min_{\alpha \in \mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^{n} \frac{T_i\,\pi'(\hat\beta^\top X_i)}{\pi^2(\hat\beta^\top X_i)}\,\big(Y_i - \alpha^\top X_i\big)^2 + \lambda' \|\alpha\|_1,$$
Algorithm 1 has three steps. In Step 1, we obtain an initial estimate of the propensity score by maximizing a penalized quasi-likelihood function, which is obtained by integrating Equation (15.24). However, this initial estimator may have a large bias because many covariates are not balanced. To remove this bias, we select a set of covariates that are predictive of the outcome variable via the weighted regression in Step 2. The weight may depend on the initial estimator $\hat\beta$ and is chosen to reduce the bias of the estimator under model misspecification. Finally, in Step 3, we calibrate the propensity score by balancing the selected covariates $X_{\tilde{S}}$.
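The following is an illustrative sketch of the three steps in R, our own approximation rather than the authors' implementation: cross-validated lasso fits via glmnet stand in for the penalized quasi-likelihood of Step 1 and for the theoretical tuning rates, and Step 3 solves the balance equations numerically on the selected covariates.

library(glmnet)

hd_cbps_mu1 <- function(X, T, Y) {
  # Step 1: initial sparse estimate of the propensity score model
  f1 <- cv.glmnet(X, T, family = "binomial")
  lp <- as.numeric(predict(f1, X, s = "lambda.min"))   # linear predictor
  pihat <- plogis(lp)

  # Step 2: weighted lasso for the outcome model among the treated; for
  # logistic pi, pi'(u)/pi(u)^2 = (1 - pi(u))/pi(u)
  w <- (1 - pihat) / pihat
  f2 <- cv.glmnet(X[T == 1, , drop = FALSE], Y[T == 1], weights = w[T == 1])
  alpha <- as.numeric(coef(f2, s = "lambda.min"))[-1]
  S <- which(alpha != 0)                               # selected covariates

  # Step 3: calibrate the propensity score so the selected covariates
  # (plus an intercept) satisfy sum_i (T_i / pi_i - 1) X_iS = 0
  Xs <- cbind(1, X[, S, drop = FALSE])
  bal <- function(delta) {
    pic <- plogis(lp + as.numeric(Xs %*% delta))
    sum((crossprod(Xs, T / pic - 1) / length(T))^2)
  }
  delta <- optim(rep(0, ncol(Xs)), bal, method = "BFGS")$par
  pitilde <- plogis(lp + as.numeric(Xs %*% delta))

  mean(T * Y / pitilde)   # Horvitz-Thompson estimate of E{Y(1)}
}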
While only the selected covariates are balanced in the last step, the resulting estimate approximately satisfies the weak covariate balancing equation. To see this, consider the following heuristic argument,
$$\sum_{i=1}^{n}\left(\frac{T_i}{\tilde\pi_i} - 1\right)\alpha^{*\top} X_i \;\approx\; \sum_{i=1}^{n}\left(\frac{T_i}{\tilde\pi_i} - 1\right)\tilde\alpha^\top X_i \;=\; \sum_{i=1}^{n}\left(\frac{T_i}{\tilde\pi_i} - 1\right)\tilde\alpha_{\tilde{S}}^\top X_{i\tilde{S}} \;=\; 0,$$
where the approximation holds if the estimator $\tilde\alpha$ is close to $\alpha^*$, the first equality follows from $\tilde\alpha_{\tilde{S}^c} = 0$, and the second equality holds due to the covariate balancing condition in Equation (15.26). Finally, we estimate $E\{Y(1)\}$ by the Horvitz–Thompson estimator.
The asymptotic properties of the estimator $\hat\mu_1$ have been studied thoroughly in [14]. In the following, we reproduce only part of their main results. Recall that $\alpha^*$ denotes the true value of $\alpha$ in the outcome regression when it is correctly specified, or the least false value under model misspecification; we define $\beta^*$ for the propensity score model in a similar manner. Let $s_1 = \|\alpha^*\|_0$ and $s_2 = \|\beta^*\|_0$ denote the numbers of nonzero entries of $\alpha^*$ and $\beta^*$. Then, we have the following theorem.

Theorem 15.2. Under the regularity conditions given in [14], by choosing tuning parameters $\lambda \asymp \lambda' \asymp \{\log(p \vee n)/n\}^{1/2}$, if $(s_1 + s_2)\log p = o(n)$ as $s_1, s_2, p, n \to \infty$, then the estimator $\hat\mu_1$ has the following asymptotic linear expansion,
$$\hat\mu_1 - \mu_1^* = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_i}{\pi(X_i^\top \beta^*)}\big\{Y_i(1) - \alpha^{*\top} X_i\big\} + \alpha^{*\top} X_i - \mu_1^*\right] + o_p(n^{-1/2}).$$
References
[1] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal
Statistical Society, Series B (Statistical Methodology), 76(1):243–263, January 2014.
[2] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observa-
tional studies for causal effects. Biometrika, 70(1):41–55, 1983.
[3] Lars Peter Hansen. Large sample properties of generalized method of moments estimators.
Econometrica, 50(4):1029–1054, July 1982.
[4] Christian Fong, Chad Hazlett, and Kosuke Imai. Covariate balancing propensity score for a
continuous treatment: Application to the efficacy of political advertisements. Annals of Applied
Statistics, 12(1):156–177, 2018.
[5] Keisuke Hirano and Guido W. Imbens. The propensity score with continuous treatments. In
Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives: An
Essential Journey with Donald Rubin’s Statistical Family, chapter 7. Wiley, 2004.
[6] Kosuke Imai and David A. van Dyk. Causal inference with general treatment regimes: General-
izing the propensity score. Journal of the American Statistical Association, 99(467):854–866,
September 2004.
[7] James M. Robins, Miguel Ángel Hernán, and Babette Brumback. Marginal structural models
and causal inference in epidemiology. Epidemiology, 11(5):550–560, September 2000.
[8] Kosuke Imai and Marc Ratkovic. Robust estimation of inverse probability weights for marginal
structural models. Journal of the American Statistical Association, 110(511):1013–1023,
September 2015.
[9] Jianqing Fan, Kosuke Imai, Inbeom Lee, Han Liu, Yang Ning, and Xiaolin Yang. Optimal Co-
variate Balancing Conditions in Propensity Score Estimation. Journal of Business & Economic
Statistics, 41(1): 97–110, 2023.
[10] John Copas and Shinto Eguchi. Local model uncertainty and incomplete-data bias. Journal of
the Royal Statistical Society, Series B (Methodological), 67(4):459–513, 2005.
[11] James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients
when some regressors are not always observed. Journal of the American Statistical Association,
89(427):846–866, 1994.
[12] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[13] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
[14] Yang Ning, Sida Peng, and Kosuke Imai. Robust estimation of causal effects via a high-dimensional covariate balancing propensity score. Biometrika, 107(3):533–554, 2020.
[15] Max H Farrell. Robust inference on average treatment effects with possibly more covariates
than observations. Journal of Econometrics, 189(1):1–23, 2015.
[16] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and
Whitney Newey. Double machine learning for treatment and causal parameters. arXiv preprint
arXiv:1608.00060, 2016.
[17] Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: debiased
inference of average treatment effects in high dimensions. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 80(4):597–623, 2018.
[18] Zhiqiang Tan. Model-assisted inference for treatment effects using regularized calibrated
estimation with high-dimensional data. The Annals of Statistics, 48(2):811–837, 2020.
[19] Oliver Dukes, Vahe Avagyan, and Stijn Vansteelandt. Doubly robust tests of exposure effects
under high-dimensional confounding. Biometrics, 76(4):1190–1200, 2020.
[20] Jelena Bradic, Stefan Wager, and Yinchu Zhu. Sparsity double robust inference of average
treatment effects. arXiv preprint arXiv:1905.00744, 2019.
[21] Ezequiel Smucler, Andrea Rotnitzky, and James M Robins. A unifying approach for doubly-
robust `1 regularized estimation of causal contrasts. arXiv preprint arXiv:1904.03737, 2019.
[22] Yang Ning and Han Liu. A general theory of hypothesis tests and confidence regions for sparse
high dimensional models. The Annals of Statistics, 45(1):158–195, 2017.
[23] Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional parameters
in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 76(1):217–242, 2014.
[24] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for
high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909,
2014.
[25] Matey Neykov, Yang Ning, Jun S Liu, and Han Liu. A unified theory of confidence regions and
testing for high-dimensional estimating equations. Statistical Science, 33(3):427–443, 2018.
[26] Christian Fong, Marc Ratkovic, and Kosuke Imai. CBPS: R package for covariate balancing propensity score. Available at the Comprehensive R Archive Network (CRAN), 2021. https://CRAN.R-project.org/package=CBPS.
16
Balancing Weights for Causal Inference
CONTENTS
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
16.2 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
16.2.1 Notations, estimand, and assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
16.2.2 The central role of inverse propensity score weights . . . . . . . . . . . . . . . . . . . . . . . 295
16.2.3 Bounding the error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
16.3 Two Approaches to Estimating the Inverse Propensity Score Weights . . . . . . . . . . . . . . . 296
16.3.1 The modeling approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
16.3.2 The balancing approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
16.3.3 Balance and model classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
16.3.4 The primal-dual connection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
16.3.5 Connections to regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
16.4 Computing the Weights in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
16.4.1 Choosing what to balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
16.4.2 Maximizing the effective sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
16.4.3 Choosing the objective functions: the balance-dispersion trade-off . . . . . . . . . 303
16.4.4 Extrapolating and interpolating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
16.4.5 Additional options for balancing in practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
16.5 Estimating Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
16.5.1 The role of augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
16.5.2 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
16.5.3 Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
16.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
16.1 Introduction
Covariate balance is central to both randomized experiments and observational studies. In randomized
experiments, covariate balance is expected by design: when treatment is randomized, covariates are
balanced in expectation, and observed differences in outcomes between the treatment groups can
be granted a causal interpretation. In observational studies, where the treatment assignment is not
controlled by the investigator, such balance is not guaranteed, and, subject to certain assumptions,
one can adjust the data to achieve balance and obtain valid causal inferences. In fact, if investigators
believe that treatment assignment is determined only by observed covariates, then balance is sufficient
to remove confounding and establish causation from association.
Weighting is a popular method for adjusting observational data to achieve covariate balance
that has found use in a range of disciplines such as economics (e.g., [1]), education (e.g., [2]),
epidemiology (e.g., [3]), medicine (e.g., [4]), and public policy (e.g., [5]). In many cases the weights
are a function of the propensity score, the conditional probability of treatment given the observed
covariates [6]. Weighting by the true inverse propensity score guarantees covariate balance in
expectation. In practice, however, the propensity score must be estimated. In the modeling approach
to weighting, researchers estimate the propensity score directly, such as via logistic regression, and
then plug in these estimates to obtain unit-level weights (see, e.g., [7]). An important risk of this
approach is that, if the propensity score model is wrong, the balancing property of the resulting
weights no longer holds, and treatment effect estimates may be biased [8]. Additionally, even if the
model is correct, the resulting weights can fail to balance covariates in a given sample. This is often
addressed by re-estimating the weights and checking balance anew in an iterative, sometimes ad hoc, fashion [9].
Alternative weighting methods, which we term balancing weights, directly target the balancing
property of the inverse propensity score weights by estimating weights that balance covariates in
the sample at hand [10]. Such weights are found by solving an optimization problem that minimizes
covariate imbalance across treatment groups while directly incorporating other concerns related to the
ad hoc checks described earlier for modeling approaches (e.g., minimizing dispersion, constraining
the weights to be non-negative, etc.). A variety of balancing approaches have appeared in the literature
(e.g., [11–15]).
In this chapter, we introduce the balancing approach to weighting for covariate balance and causal
inference. In Section 16.2 we begin by providing a framework for causal inference in observational
studies, including typical assumptions necessary for the identification of average treatment effects. In
Section 16.3 we then motivate the task of finding weights that balance covariates and unify a variety
of methods from the literature. In Section 16.4 we discuss several implementation and design choices
for finding balancing weights in practice and discuss the trade-offs of these choices using an example
from the canonical LaLonde data [16, 17]. In Section 16.5 we discuss how to estimate effects after
weighting, also using this applied example. In Section 16.6 we conclude with a discussion and future
directions.
16.2 Framework
16.2.1 Notations, estimand, and assumptions
Our setting is an observational study with $n$ independent and identically distributed triplets $(X_i, W_i, Y_i)$ for $i = 1, \ldots, n$, where $X_i \in \mathbb{R}^d$ are observed covariates, $W_i \in \{0, 1\}$ is a binary treatment indicator, and $Y_i \in \mathbb{R}$ is an outcome of interest. There are $n_1 = \sum_{i=1}^n W_i$ treated units and $n_0 = n - n_1$ control units. We operate under the potential outcomes framework for causal inference [18, 19] and impose the Stable Unit Treatment Value Assumption (SUTVA), which requires only one version of treatment and no interference between units [20]. Under this setup, each unit has two potential outcomes $Y_i(0)$ and $Y_i(1)$, but only one outcome is observed: $Y_i = (1 - W_i)Y_i(0) + W_i Y_i(1)$. Since the study units are a random sample from the distribution of $(X, W, Y)$, and we are generally interested in population-level estimands here, we often drop the subscript $i$.
In this chapter we focus on estimating the Average Treatment Effect on the Treated (ATT), defined as
$$\tau_{\mathrm{ATT}} = E[\,Y(1) - Y(0) \mid W = 1\,].$$
Other causal estimands may be of interest to investigators, and the key ideas here extend readily to those quantities; see, for example, [21].
We directly observe $Y(1)$ for units assigned to treatment; therefore $E[Y(1) \mid W = 1] = E[Y \mid W = 1]$, and $n_1^{-1}\sum_{i=1}^n W_i Y_i$ is an unbiased estimator of this potential outcome mean. The challenge then is to estimate $\mu_0 := E[Y(0) \mid W = 1]$. In this chapter we focus on identifying this quantity assuming strong ignorability.
Assumption 5 (Strong ignorability).
1. Ignorability: $W \perp\!\!\!\perp (Y(0), Y(1)) \mid X$.
2. Overlap: $e(x) < 1$, where $e(x) := E[W \mid X = x]$ is the propensity score.
for all bounded functions $f(\cdot)$. That is, the weights balance all bounded functions of the covariates between the control group and the target treated group. In fact, by the Law of Large Numbers, these weights can be expected (with probability tending to 1) to balance such functions in a given sample: that is, $n^{-1}\sum_{i=1}^n (1 - W_i)\gamma^{\mathrm{IPW}}(X_i) f(X_i) \approx n^{-1}\sum_{i=1}^n W_i f(X_i)$. Targeting this property of the weights, rather than the propensity score itself, motivates the balancing approach to weighting that we consider below.
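This sample balance property is easy to verify numerically. The toy simulation below is our own construction (not from the chapter): it weights controls by the true treatment odds and compares both sides of the display above for f(x) = x^2.

# Toy check of the balance property of the true odds weights
# gamma_IPW(x) = e(x) / (1 - e(x)); the simulation design is illustrative.
set.seed(42)
n <- 1e5
x <- rnorm(n)
e <- plogis(0.5 * x)                 # true propensity score
w <- rbinom(n, 1, e)                 # treatment indicator
gamma <- e / (1 - e)                 # odds weights, applied to controls
f <- function(x) x^2
c(weighted_controls = mean((1 - w) * gamma * f(x)),
  treated           = mean(w * f(x)))   # approximately equal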
estimator $\hat\mu_0$ of $\mu_0$ is driven solely by the first term: that is, the imbalance in the conditional mean function $m(X, 0)$ between the treated and control groups.
The question becomes, then, how to achieve balance on this typically unknown function $m(X, 0)$. One way of doing so is to consider a model class $\mathcal{M}$ for $m(X, 0)$ (e.g., the class of models linear in $X$) and the maximal imbalance over functions in that class, $\mathrm{imbalance}_{\mathcal{M}}(\hat\gamma)$:
$$\text{design-conditional bias} := E\left[\hat\mu_0 - \mu_0 \mid X_1, \ldots, X_n, W_1, \ldots, W_n\right] \le \mathrm{imbalance}_{\mathcal{M}}(\hat\gamma) := \max_{m \in \mathcal{M}}\left|\frac{1}{n}\sum_{i=1}^n (1 - W_i)\hat\gamma_i\, m(X_i, 0) - \frac{1}{n}\sum_{i=1}^n W_i\, m(X_i, 0)\right|.$$
This reformulates the task of minimizing imbalance in $m(X, 0)$ in terms of imbalance over the candidate functions $m \in \mathcal{M}$. Balance over all $m \in \mathcal{M}$ is in turn sufficient (though not necessary) for controlling the bias in Equation (16.4). This objective therefore allows investigators to avoid the difficult task of specifying the particular form of the outcome model $m$ and instead to consider the potentially much broader class to which $m$ belongs. Furthermore, as discussed in the previous section, because the IPW weights $\gamma^{\mathrm{IPW}}$ satisfy the sample balance property (in probability), weighting by the IPW should (with probability tending to 1) lead to an unbiased estimate of $\mu_0$. Thus, seeking weights that balance a particular class of functions of $X$ is motivated both by the balancing property of the inverse propensity score weights and by the error bound above.
16.3 Two Approaches to Estimating the Inverse Propensity Score Weights
16.3.1 The modeling approach
A variety of methods exist for estimating the propensity score $e(X)$. While parametric models such as logistic regression remain the most common, flexible machine learning approaches, such as boosting [23], have become increasingly popular [24]. Modeling approaches are useful for their well-established statistical properties [25], which allow for the calculation of confidence intervals using, e.g., an estimator of the asymptotic variance or resampling methods. Additionally, if the model for the propensity score is correct (that is, produces consistent estimates of $e(X)$), then the estimated weights are consistent for the true weights, yielding an asymptotically unbiased estimator $\hat\mu_0$ of $\mu_0$.
In practice, however, the propensity score model is unknown and must be posited based on subject-matter knowledge or determined in a data-driven manner. Errors in the propensity score model can be magnified in the estimated weights through inversion and multiplication. Further, the estimated weights, even if correctly specified, may not ensure balance in a given sample, since balance is only guaranteed in expectation. To assess the specification of the propensity score model, a variety of best-practice post hoc checks are available [26], including assessing covariate balance after weighting and examining the distribution of the weights (e.g., mean, median, minimum, and maximum). Many heuristics and statistical tests exist to aid these assessments in practice. If the modeling approach underperforms on one of these diagnostics, the investigator may specify a new model [27], truncate extreme weights [28], modify the estimand [29], or make other adjustments.
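A bare-bones sketch of this modeling pipeline follows (the data frame df and its variables are hypothetical): fit a logistic propensity model, form the treatment-odds weights for the controls, and run a post hoc balance check.

ps <- glm(W ~ x1 + x2, data = df, family = binomial)$fitted.values
gamma <- ps / (1 - ps)               # estimated treatment odds e(x)/(1 - e(x))

# post hoc diagnostic: weighted control mean vs. treated mean of x1
ctrl <- df$W == 0
c(weighted_control = weighted.mean(df$x1[ctrl], gamma[ctrl]),
  treated          = mean(df$x1[!ctrl]))
summary(gamma[ctrl])                 # inspect the weight distribution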
16.3.2 The balancing approach
Balancing weights are instead found directly, as the solution to an optimization problem of the generic form
$$\min_{\gamma}\; h_\lambda\!\big(\mathrm{imbalance}_{\mathcal{M}}(\gamma)\big) + \sum_{i:\,W_i = 0} f(\gamma_i) \qquad \text{subject to [constraints]}, \tag{16.5}$$
where $\mathrm{imbalance}_{\mathcal{M}}(\gamma)$ is the "imbalance metric," e.g., the maximal imbalance over $\mathcal{M}$; $h_\lambda(\cdot)$ is the "imbalance penalty," e.g., $h_\lambda(x) = \lambda^{-1} x^2$; and $f(\cdot)$ is the "dispersion penalty," e.g., $f(x) = x^2$. The additional optional constraints target other desirable qualities of the resulting weighted estimates. For example, an average-to-one constraint, i.e., $n^{-1}\sum_{i=1}^n (1 - W_i)\gamma_i = 1$, forces the weighted estimate $\hat\mu_0$ to be translation invariant; that is, if we estimate $\mu_0$ when $m = f(x)$, then we estimate $\mu_0 + t$ when $m = f(x) + t$ for $t \in \mathbb{R}$. Coupling this with a nonnegativity constraint, i.e., $\gamma_i \ge 0$ for all $i : W_i = 0$, restricts any weighted estimate to be an interpolation (rather than an extrapolation) of the observed data. Together these constraints restrict the weights to the $n_0$-simplex, and so they are sometimes called a simplex constraint. We discuss the role of these constraints further in Section 16.4.4.
Many existing approaches can be written in the form of (16.5) for particular choices of $h_\lambda$ and $f$, including stable balancing weights [12], the lasso minimum distance estimator [13], regularized calibrated propensity score estimation [14], and entropy balancing [11]. Even the linear regression of $Y$ on $X$ among the control units can be written in this form, and the weights implied by this regression are a particular form of exact balancing weights in which the weights can be negative [see 30].
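For instance, entropy balancing [11], one member of this family, is available through the WeightIt package; the sketch below uses hypothetical variable names.

library(WeightIt)
wt <- weightit(W ~ x1 + x2, data = df, method = "ebal", estimand = "ATT")
summary(wt)          # weight ranges and effective sample sizes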
16.3.3 Balance and model classes
For the class of outcome models that are linear in the covariates, exact mean balance may be infeasible once additional constraints are included (e.g., the simplex constraint). This can be avoided by constraining the norm of the coefficients $\beta$, such as $\|\beta\|_1 \le 1$ or $\|\beta\|_2 \le 1$. These constraints correspond to particular imbalance metrics in Equation (16.5). Specifically, for $\mathcal{M} = \{\beta^\top X : \|\beta\|_p \le 1\}$, the imbalance metric is equal to:
$$\mathrm{imbalance}_{\mathcal{M}}(\gamma) = \left\|\frac{1}{n}\sum_{i=1}^n (1 - W_i)\gamma_i X_i - \frac{1}{n}\sum_{i=1}^n W_i X_i\right\|_q, \qquad 1/p + 1/q = 1. \tag{16.6}$$
Thus, the one-norm constraint corresponds to an infinity-norm imbalance metric, i.e., Equation (16.5) controls the maximal imbalance over the covariates; and the two-norm constraint corresponds to a two-norm imbalance metric, i.e., Equation (16.5) controls the Euclidean distance between the covariate mean vectors.
Linear models are often too restrictive in practice. In this case investigators might posit that the outcome model is instead linear in a finite collection of transformations of the covariates $\phi_1(X), \ldots, \phi_K(X)$, such as the covariates, their squares, and their two-way interactions. In this case, $\mathcal{M} = \{\beta^\top \phi(X) : \|\beta\|_p \le 1\}$ for $\beta \in \mathbb{R}^K$, and the imbalance metric appears as in (16.6), albeit with $X_i = (X_{1i}, \ldots, X_{di})^\top$ replaced by $\phi(X_i) = (\phi_1(X_i), \ldots, \phi_K(X_i))^\top$.
When all covariates are binary, the outcome model m always lies in a model class that is linear in
a finite number of covariate functions: namely, all possible interactions of covariates. This is because
the implied model class M includes all possible functions of the covariates. However, this approach
can become untenable when the number of covariates increases, and is often not necessary in practice
if the investigator believes that only interactions up to a certain depth (e.g., only up to two-way
interactions) determine the outcome model (see, e.g., [31]). In other cases, the researcher may
attempt to balance all interactions but place less weight (or none at all) on higher-order interactions,
for example, by finding weights that exactly balance main terms and lower-order interactions and
approximately balance higher-order interactions [32].
A nonparametric approach when some covariates are continuous is more difficult, as functions of
continuous covariates can be nonlinear or otherwise highly complex. Thus, nonparametric models
involving even one continuous covariate require an infinite-dimensional basis (though a finite basis
can often be a good enough approximation). Common bases are the Hermite polynomials 1, x1 , x2 ,
x21 − 1, x22 − 1, x1 x2 , ... (written here for two covariates) or sines and cosines of different frequencies.
In general the model class M may be expressed in terms of basis functions φ1 (x), φ2 (x), ... and
corresponding λ1 , λ2 , ...:
M = { m(x) = Σj βj φj(x) : Σj βj²/λj² ≤ 1 }   (16.7)
With an infinite number of basis functions to balance, we cannot expect to find weights that can
balance all of them equally well. However, the relative importance of balancing each basis function
can be measured by λj . This is reflected in the corresponding imbalance function induced by this
model class,
imbalanceM(γ)² = Σj λj² ( (1/n) Σi (1 − Wi) γi φj(Xi) − (1/n) Σi Wi φj(Xi) )².   (16.8)
This measure weights the imbalance in basis function j by the value of λj, corresponding to, e.g.,
placing higher priority on lower-order terms in a polynomial expansion or on lower-frequency terms in a
Fourier expansion.
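A minimal sketch of the λ-weighted imbalance in Equation 16.8 for a finite truncation of the expansion; phi (a function returning the basis evaluations) and lam (the vector of λj) are illustrative placeholders.

    import numpy as np

    def basis_imbalance_sq(X, w, gamma, phi, lam):
        """Squared imbalance of Equation 16.8 for a truncated basis expansion.
        phi maps an (n, p) covariate array to an (n, J) array of basis values;
        lam is a length-J vector of importance weights lambda_j."""
        n = len(X)
        Phi = phi(X)
        diff = (gamma[:, None] * Phi[w == 0]).sum(0) / n - Phi[w == 1].sum(0) / n
        return float((lam**2 * diff**2).sum())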
Taking another view, we note that Equation 16.7 is the unit ball of a reproducing kernel Hilbert
space (RKHS), which suggests examining the balancing problem from the perspective of RKHS
theory and focusing on the properties of the functions rather than their representations under a
particular basis expansion. Balancing weights for RKHS models are helpful because they allow
for optimization problems involving infinite components (e.g., in Equation 16.7) to be reduced
to finite-dimensional problems via the “kernel trick,” in which the inner products from (16.5) are
replaced with kernel evaluations. For a kernel k(·, ·), this leads to weights that minimize

(1/n²) Σi Σj { Wi Wj k(Xi, Xj) − 2 (1 − Wi) Wj k(Xi, Xj) γi + (1 − Wi)(1 − Wj) k(Xi, Xj) γi γj },   (16.9)

with both sums over 1, ..., n.
This objective upweights control units that are similar to treated units (second term) while recognizing
that control units for which there are many others that are similar need not have too large a weight
(third term). Different choices of kernels correspond to different assumptions on the RKHS M.
See [21] for technical details and [33], [34], [35], [36], and [37] for more complete discussions of
RKHS models.
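The objective in Equation 16.9 is straightforward to evaluate once the kernel matrix is computed. Below is a minimal sketch for the RBF kernel; the names (X_t, X_c, gamma) are illustrative, and the γ-free treated-treated term of (16.9) is retained only for completeness.

    import numpy as np

    def rbf(A, B, sigma):
        """RBF kernel matrix k(a, b) = exp(-sigma * ||a - b||^2)."""
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sigma * sq)

    def kernel_objective(X_t, X_c, gamma, sigma=0.01):
        """Equation 16.9: treated-treated term, minus twice the weighted
        control-treated similarity, plus the weighted control-control term."""
        n = len(X_t) + len(X_c)
        tt = rbf(X_t, X_t, sigma).sum()
        ct = gamma @ rbf(X_c, X_t, sigma).sum(axis=1)
        cc = gamma @ rbf(X_c, X_c, sigma) @ gamma
        return (tt - 2 * ct + cc) / n**2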
Another approach to defining the model class M begins with the Kolmogorov-Smirnov (KS)
statistic, which quantifies the distance between two cumulative distribution functions (cdfs). Weights
that minimize the maximal KS statistic (i.e., that control imbalance over the entire distribution of
each continuous covariate) correspond to an additive model over the covariates with the summed
componentwise total variation as the imbalance metric: M = { f1(x1) + · · · + fp(xp) :
||f1||TV + · · · + ||fp||TV ≤ 1 }, where ||fj||TV = ∫ |fj′(x)| dx.
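To make the KS criterion concrete, here is a sketch of a weighted two-sample KS statistic comparing the γ-weighted control empirical cdf to the treated empirical cdf; the array names are illustrative assumptions.

    import numpy as np

    def weighted_ks(y_t, y_c, gamma):
        """Max gap between the treated ecdf and the gamma-weighted control ecdf."""
        grid = np.sort(np.concatenate([y_t, y_c]))
        F_t = np.searchsorted(np.sort(y_t), grid, side="right") / len(y_t)
        order = np.argsort(y_c)
        # cumulative normalized weight just below/at each grid point
        cum_w = np.concatenate([[0.0], np.cumsum(gamma[order]) / gamma.sum()])
        F_c = cum_w[np.searchsorted(y_c[order], grid, side="right")]
        return float(np.abs(F_t - F_c).max())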
The dual form in (16.10) can be thought of as a regularized regression for the treatment odds γ IPW .
In expectation, it finds weights that minimize the mean square error for the true treatment odds,
among the control population, with a mean zero “noise” term in finite samples. The model class
M governs the type of regularization on this model through its gauge norm ||γ||M .3 For example,
if we choose the class of models linear in transformed covariates φ(x) with a p norm constraint in
Equation (16.6), then the weights are also linear with γ(x) = η · φ(x), and the model is regularized
via the p norm: kγkM = kηkp . In this way we can understand the balancing problem for different
model classes as estimating γ IPW under different forms of regularization, including, e.g., sparsity
penalties [14, 38] and kernel ridge regression [34].
3 The gauge of the model M is ||γ||M = inf{α > 0 : γ ∈ αM}.
TABLE 16.2
Continuous covariate means for the LaLonde data, by subgroup.

Subgroup                 Covariate                                  Control Mean   Treated Mean
Black                    Age (years)                                34.2           26.0
                         Reported earnings, 1974 (thousands of $)   14.6            2.2
                         Reported earnings, 1975 (thousands of $)   14.0            1.5
Non-Black                Age (years)                                35.1           24.9
                         Reported earnings, 1974 (thousands of $)   21.1            1.8
                         Reported earnings, 1975 (thousands of $)   20.8            1.8
High school degree       Age (years)                                33.1           26.9
                         Reported earnings, 1974 (thousands of $)   21.5            3.4
                         Reported earnings, 1975 (thousands of $)   21.3            1.4
No high school degree    Age (years)                                38.8           25.4
                         Reported earnings, 1974 (thousands of $)   14.5            1.5
                         Reported earnings, 1975 (thousands of $)   14.0            1.6
In this section, we consider the implications of different choices for the outcome model class M,
which determines what functions of the covariates to balance and/or the balance metric. We use the form
of the problem in Equation 16.5, with the imbalance penalty hλ(x) = λ⁻¹x² (λ is defined below)
and the dispersion penalty f(x) = x², and we include the simplex constraint to rule out extrapolation. We
standardize all covariates to have mean 0 and standard deviation 1. Under this setup, we consider
the performance of the balancing approach for three designs corresponding to three model classes
M: (1) linear in the covariates with a one-norm constraint on the coefficients ("marginal design"),
(2) linear in the covariates and up to their three-way interactions with a one-norm constraint on
the coefficients ("interaction design"), and (3) an RKHS equipped with the RBF kernel4 ("RKHS
design"). The first two correspond to balancing (1) main terms and (2) main terms and up
to three-way interactions (using Hermite polynomials), and (3) corresponds to the use of the RBF
kernel in the kernel trick (Equation 16.9). For the marginal and interaction designs, λ = 1 × 10⁻³,
and for the RKHS design, λ = 1 × 10⁻¹.
4 Specifically, we used the kernel k(x, y) = exp(−σ||x − y||²) with σ = 0.01. This corresponds to the assumption that
M is an RKHS with kernel k given above. In order for the weights that minimize (16.9), with k an RBF kernel, to reduce bias,
the functions in M must be continuous, square integrable, and satisfy a technical regularity condition (see [40]). However,
these restrictions are relatively weak; the associated feature space of this RKHS is infinite-dimensional, and the RBF kernel is
exceedingly popular among kernel regression methods.
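Designs of this kind can be computed with an off-the-shelf solver. The following is a minimal sketch of the marginal design under the setup just described, using a squared Euclidean imbalance penalty for simplicity and assuming standardized covariates and a sum-to-one normalization of the control weights; it is an illustration under these assumptions, not the exact implementation behind the results below.

    import numpy as np
    from scipy.optimize import minimize

    def balancing_weights(X_c, xbar_t, lam=1e-3):
        """Solve min_gamma ||X_c' gamma - xbar_t||^2 / lam + ||gamma||^2
        subject to gamma >= 0 and sum(gamma) = 1 (the simplex constraint)."""
        n0 = X_c.shape[0]

        def objective(gamma):
            imb = X_c.T @ gamma - xbar_t  # weighted control mean minus treated mean
            return imb @ imb / lam + gamma @ gamma

        res = minimize(objective, np.full(n0, 1.0 / n0), method="SLSQP",
                       bounds=[(0.0, None)] * n0,
                       constraints=[{"type": "eq", "fun": lambda g: g.sum() - 1.0}])
        return res.x

With the weights in hand, the weighted estimate of µ0 is simply the γ-weighted mean of the control outcomes.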
FIGURE 16.1
ASAMDs after weighting to balance various functions of the covariates. [Figure: three panels report, for the covariates age, Black, Hispanic, married, no high school degree, earnings 1974, and earnings 1975, the ASAMD for main terms, the ASAMD for up to three-way interactions, and the imbalance for a kernel basis, under the unweighted data (K = −23.6), the modeling approach (K = −49.1), the marginal design (K = −49.0), the interaction design (K = −48.4), and the RKHS design (K = −4204.3), where K is the value of the objective in (16.9).]
In Figure 16.1, we evaluate how the different balancing designs (including a traditional modeling
approach for the weights, where we model the propensity score using a logistic regression of
the treatment on main terms) produce covariate balance in the LaLonde data using the average
standardized absolute (weighted) mean differences (ASAMDs) for several functions of the covariates
of interest (e.g., means). The leftmost plot shows the balance for main terms, where we see that all
designs achieve relatively good balance on main terms compared to before weighting, particularly
the marginal and interaction designs, which directly target minimizing imbalance on these functions.
In the middle plot, the distributions of ASAMDs for up to three-way interactions are plotted before
and after weighting, showing improvements for all designs over the unweighted data. Again, the
method that directly targets these covariate functions for balance (i.e., the interaction design) shows
the best performance, though the marginal design does almost as well, leaving a few higher-order
terms relatively unbalanced. The rightmost plot shows the distributions of ASAMDs for 5000 random
features generated as in [41] from the RBF kernel, and the RKHS design shows the best performance,
as expected. We also include the value of the objective function in (16.9) for each design, where, as
expected, RKHS has the smallest value.
While some heuristics exist (e.g., ASAMDs < 0.1) for assessing covariate balance, it is important
to evaluate balance in conjunction with considering the sample size, as there is a trade-off between
the two. We discuss this in Section 16.4.3.
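A minimal sketch of the ASAMD diagnostic used in Figure 16.1, assuming unit weights for the treated group and weights gamma for the controls; the names are illustrative.

    import numpy as np

    def asamd(X_t, X_c, gamma):
        """Absolute standardized (weighted) mean difference for each covariate."""
        pooled_sd = np.sqrt((X_t.var(axis=0, ddof=1) + X_c.var(axis=0, ddof=1)) / 2)
        diff = X_t.mean(axis=0) - gamma @ X_c / gamma.sum()
        return np.abs(diff) / pooled_sd

    # A common heuristic flags covariates with ASAMD above 0.1, though balance
    # should be weighed against the effective sample size (Section 16.4.3).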
For any covariate adjustment method, there is a balance-dispersion trade-off, where better balance
on more covariate functions typically leads to more highly dispersed weights. We discuss this more
thoroughly in the next section.
FIGURE 16.2
Balance-dispersion trade-off and plots of the weights for the LaLonde data, with and without the
non-negativity constraint. [Figure: the lefthand panel plots the balance-dispersion trade-off; the righthand panels plot the unit-specific weights by rank with the non-negativity constraint, without it, and as the joint distribution of the two.]
FIGURE 16.3
Distance from the simplex versus effective sample size, without the non-negativity constraint.
computing weighted estimators. Considering this suggests a trade-off between extrapolation and
balance, where limiting extrapolation comes at the expense of balance.
We can also assess extrapolation by examining the extent to which non-negativity is violated when
the simplex constraint is relaxed. The righthand panels of Figure 16.2 show how the unit-specific
weights change before and after removing the non-negativity constraint, where negative weights are
highlighted in red. The top-right and middle-right panels of Figure 16.2 plot the weights with and
without the non-negativity constraint, by order of their rank within each set. Although the negative
weights are generally smaller in magnitude than the non-negative weights, some units receive extreme
negative weights, suggesting that a high degree of extrapolation is admitted under this design. The
bottom-right panel shows the joint distribution of the unit-specific weights with and without the
non-negativity constraint. This panel shows that, for the most part, the units that receive negative
weights receive very small weights (nearly zero in most cases) when non-negativity is enforced:
that is, there is a subset of units that contribute non-trivially to weighted estimates when negative
weights are allowed, but contribute almost nothing when they are disallowed.

FIGURE 16.4
Empirical cdfs of reported earnings covariates for various balancing approaches. [Figure: for 1974 and 1975 earnings, empirical cdfs of the unweighted treated group, the unweighted control group, the control group weighted for marginal balance, and the control group weighted for quantile balance; the KS statistics relative to the treated group are 0.729, 0.271, and 0.120 for 1974 earnings and 0.774, 0.053, and 0.025 for 1975 earnings.]
16.5.3 Estimates
We return to our goal of estimating the control potential outcome mean for the treated µ0 using the
observational control units from the LaLonde data. Now that the design is complete (insofar as the
weights have been constructed), we may proceed to the outcome analysis. For the sake of illustration,
we use the weights that balance up to three-way interactions. We construct estimates with and without
augmentation using the constructions in (16.3) and (16.12), respectively. In this example we have
access to the control group from the randomized trial, which should provide an unbiased estimate
of µ0, and we use it to benchmark the observational results. Figure 16.5 provides
these results, along with 95% confidence intervals constructed via the bootstrap.
Figure 16.5 shows that, before weighting, the observational estimate is dramatically different
from the trial estimate (i.e., the ground truth). Both the weighted and augmented estimates are much
closer to the truth, and their confidence intervals contain it. The augmented estimate is slightly closer
to the truth than the non-augmented estimate, and its confidence interval is shorter.
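As a sketch of this outcome analysis, under the usual weighted and augmented (bias-corrected) constructions rather than the exact forms of (16.3) and (16.12), with gamma the control weights and m0hat an outcome model fitted on the controls; all names are illustrative.

    import numpy as np

    def weighted_estimate(y_c, gamma):
        """Weighted estimate of mu_0: gamma-weighted mean of control outcomes."""
        return gamma @ y_c / gamma.sum()

    def augmented_estimate(y_c, X_c, X_t, gamma, m0hat):
        """Augmented estimate: outcome-model prediction averaged over the
        treated, corrected by the gamma-weighted control residuals."""
        return m0hat(X_t).mean() + gamma @ (y_c - m0hat(X_c)) / gamma.sum()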
FIGURE 16.5
Estimates of E[Y(0) | W = 1] for 1978 earnings: the experimental estimate and the unweighted, weighted, and augmented observational estimates.
References
[1] Stefan Tübbicke. Entropy balancing for continuous treatments. Journal of Econometric Methods,
11(1):71–89, 2021.
[2] C. A. Stone and Y. Tang. Comparing propensity score methods in balancing covariates and
recovering impact in small sample educational program evaluations. Practical Assessment,
Research and Evaluation, 18, 2013.
[3] Mark Lunt, Daniel Solomon, Kenneth Rothman, Robert Glynn, Kimme Hyrich, Deborah P. M
Symmons, and Til Stürmer. Different methods of balancing covariates leading to different
effect estimates in the presence of effect modification. American Journal of Epidemiology,
169(7):909–917, 2009.
[4] Jessica M Franklin, Jeremy A Rassen, Diana Ackermann, Dorothee B Bartels, and Sebastian
Schneeweiss. Metrics for covariate balance in cohort studies of causal effects. Statistics in
Medicine, 33(10):1685–1699, 2014.
[5] Beth Ann Griffin, Greg Ridgeway, Andrew R Morral, Lane F Burgette, Craig Martin, Daniel
Almirall, Rajeev Ramchand, Lisa H Jaycox, and Daniel F McCaffrey. Toolkit for weighting
and analysis of nonequivalent groups (twang), 2014.
[6] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observa-
tional studies for causal effects. Biometrika, 70(1):41–55, 1983.
[7] Daniel Westreich, Justin Lessler, and Michele Jonsson Funk. Propensity score estimation:
machine learning and classification methods as alternatives to logistic regression. Journal of
Clinical Epidemiology, 63(8):826–833, 2010.
[8] Ambarish Chattopadhyay, Christopher H Hase, and José R Zubizarreta. Balancing vs modeling
approaches to weighting in practice. Statistics in Medicine, 39(24):3227–3254, 2020.
[9] Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward.
Statistical science, 25(1):1–21, 2010.
[10] David A Hirshberg and José R Zubizarreta. On two approaches to weighting in causal inference.
Epidemiology (Cambridge, Mass.), 28(6):812–816, 2017.
[11] Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to
produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
[12] José R Zubizarreta. Stable weights that balance covariates for estimation with incomplete
outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.
[13] Victor Chernozhukov, Whitney K Newey, and Rahul Singh. Automatic debiased machine
learning of causal and structural effects. arXiv preprint, 2018.
[14] Z Tan. Regularized calibrated estimation of propensity scores with model misspecification and
high-dimensional data. Biometrika, 107(1):137–158, 2020.
[15] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal
Statistical Society. Series B, Statistical Methodology, 76(1):243–263, 2014.
[16] Robert J LaLonde. Evaluating the econometric evaluations of training programs with experi-
mental data. The American Economic Review, 76(4):604–620, 1986.
[17] Rajeev H Dehejia and Sadek Wahba. Causal effects in nonexperimental studies: Reevaluating the
evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–
1062, 1999.
[18] J. Neyman. On the application of probability theory to agricultural experiments. Statistical
Science, 5(5):463–480, 1923.
[19] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66(5):688, 1974.
[20] Donald B Rubin. Randomization analysis of experimental data: the fisher randomization test
comment. Journal of the American Statistical Association, 75(371):591–593, 1980.
[21] Eli Ben-Michael, Avi Feller, David A Hirshberg, and Jose R Zubizarreta. The balancing act for
causal inference. arXiv preprint, 2021.
[22] James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients
when some regressors are not always observed. Journal of the American Statistical Association,
89(427):846–866, 1994.
[23] Daniel F McCaffrey, Greg Ridgeway, and Andrew R Morral. Propensity score estimation
with boosted regression for evaluating causal effects in observational studies. Psychological
Methods, 9(4):403, 2004.
[24] Brian K Lee, Justin Lessler, and Elizabeth A Stuart. Improving propensity score weighting
using machine learning. Statistics in Medicine, 29(3):337–346, 2010.
[25] Keisuke Hirano, Guido W Imbens, and Geert Ridder. Efficient estimation of average treatment
effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.
[26] Peter C Austin and Elizabeth A Stuart. Moving towards best practice when using inverse
probability of treatment weighting (iptw) using the propensity score to estimate causal treatment
effects in observational studies. Statistics in Medicine, 34(28):3661–3679, 2015.
[27] Til Stürmer, Michael Webster-Clark, Jennifer L Lund, Richard Wyss, Alan R Ellis, Mark Lunt,
Kenneth J Rothman, and Robert J Glynn. Propensity score weighting and trimming strategies
for reducing variance and bias of treatment effect estimates: A simulation study. American
Journal of Epidemiology, 190(8):1659–1670, 2021.
[28] Valerie S Harder, Elizabeth A Stuart, and James C Anthony. Propensity score techniques and the
assessment of measured covariate balance to test causal associations in psychological research.
Psychological Methods, 15(3):234–249, 2010.
[29] Fan Li and Laine E Thomas. Addressing extreme propensity scores via the overlap weights.
American Journal of Epidemiology, 188(1):250–257, 2019.
[30] Ambarish Chattopadhyay and José R Zubizarreta. On the implied weights of linear regression
for causal inference. Biometrika, forthcoming, 2022.
[31] Yair Ghitza and Andrew Gelman. Deep interactions with MRP: Election turnout and voting
patterns among small electoral subgroups. American Journal of Political Science, 57(3):762–
776, 2013.
[32] Eli Ben-Michael, Avi Feller, and Erin Hartman. Multilevel calibration weighting for survey
data, 2021.
[33] Chad Hazlett. Kernel balancing: A flexible non-parametric weighting procedure for estimating
causal effects. Statistica Sinica, 30(3):1155–1189, 2020.
[34] David A Hirshberg, Arian Maleki, and Jose R Zubizarreta. Minimax linear estimation of the
retargeted mean. arXiv preprint, 2019.
[35] Nathan Kallus. Generalized optimal matching methods for causal inference. Journal of Machine
Learning Research, 21, 2020.
[36] Rahul Singh, Liyuan Xu, and Arthur Gretton. Generalized kernel ridge regression for nonpara-
metric structural functions and semiparametric treatment effects, 2021.
[37] Raymond K W Wong and Kwun Chuen Gary Chan. Kernel-based covariate functional balancing
for observational studies. Biometrika, 105(1):199–213, 2018.
[38] Yixin Wang and Jose R Zubizarreta. Minimal dispersion approximately balancing weights:
Asymptotic properties and practical considerations. Biometrika, 107(1):93–105, 2020.
[39] Rajeev H Dehejia and Sadek M Wahba. Propensity score matching methods for non-
experimental causal studies. NBER Working Paper No. 6829, National Bureau of Economic
Research, Cambridge, MA, 1998.
[40] Ha Quang Minh. Some Properties of Gaussian reproducing Kernel Hilbert spaces and their
implications for function approximation and learning theory. Constructive Approximation,
32(2):307–338, 2009.
[41] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing
Systems, volume 20. Curran Associates, Inc., 2007.
[42] Leslie Kish. Survey Sampling. John Wiley and Sons, New York, 1965.
[43] Gary King, Christopher Lucas, and Richard A Nielsen. The balance-sample size frontier in
matching methods for causal inference. American Journal of Political Science, 61(2):473–489,
2017.
[44] James Robins, Mariela Sued, Quanhong Lei-Gomez, and Andrea Rotnitzky. Comment: Per-
formance of double-robust estimators when “inverse probability” weights are highly variable.
Statistical Science, 22(4), Nov 2007.
[45] Paul R Rosenbaum. Modern algorithms for matching in observational studies. Annual Review
of Statistics and Its Application, 7(1):143–176, 2020.
[46] James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients
when some regressors are not always observed. Journal of the American Statistical Association,
89(427):846–866, 1994.
[47] Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: debiased
inference of average treatment effects in high dimensions. Journal of the Royal Statistical
Society. Series B, Statistical Methodology, 80(4):597–623, 2018.
[48] David A. Hirshberg and Stefan Wager. Augmented minimax linear estimation, 2020.
[49] Victor Chernozhukov, Whitney K Newey, and Rahul Singh. Automatic debiased machine
learning of causal and structural effects, 2021.
[50] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen,
Whitney Newey, and James Robins. Double/debiased machine learning for treatment and
structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
[51] A Schick. On asymptotically efficient estimation in semiparametric models. The Annals of
statistics, 14(3):1139–1151, 1986.
[52] Wenjing Zheng and Mark J van der Laan. Cross-validated targeted minimum-loss-based
estimation. In Targeted Learning, Springer Series in Statistics, pages 459–474. Springer New
York, New York, NY, 2011.
[53] P Hall and C C Heyde. Martingale Limit Theory and Its Application. Academic Press, 2014.
17
Assessing Principal Causal Effects Using Principal
Score Methods
CONTENTS
17.1 Introduction 313
17.2 Potential Outcomes and Principal Stratification 317
17.3 Structural Assumptions 320
    17.3.1 Ignorable treatment assignment mechanism 320
    17.3.2 Monotonicity assumptions 321
    17.3.3 Exclusion restrictions 322
    17.3.4 Principal ignorability assumptions 323
17.4 Principal Scores 326
    17.4.1 The principal score: definition and properties 327
    17.4.2 Estimating the principal score 328
    17.4.3 Choosing the specification of the principal score model 329
17.5 Estimating Principal Causal Effects Using Principal Scores 330
17.6 Assessing Principal Score Fit 335
17.7 Sensitivity Analysis 337
    17.7.1 Sensitivity analysis for principal ignorability 337
    17.7.2 Sensitivity analysis for monotonicity 340
17.8 Discussion 341
References 343
17.1 Introduction
Decisions in many branches of science, including the social sciences, medical care, health policy, and
economics, depend critically on appropriate comparisons of competing treatments, interventions and
policies. Causal inference is used to extract information about such comparisons. A main statistical
framework for causal inference is the potential outcomes approach [1–3, 5, 6, 10], also known as the
Rubin Causal Model [7], where a causal effect is defined as the contrast of potential outcomes under
different treatment conditions for a common set of units. See Imbens and Rubin [18] for a textbook
discussion.
In many studies, inference on causal effects is made particularly challenging by the presence of
post-treatment variables (e.g., mediators, treatment compliance, intermediate endpoints), potentially
affected by the treatment and also associated with the response. Adjusting treatment comparisons for
post-treatment variables requires special care in that post-treatment variables are often confounded
by the unit’s characteristics that are also related to the response. Therefore, the observed values of an
intermediate variable generally encode both the treatment condition as well as the characteristics of
the unit, and thus, naive methods, conditioning on the observed values of that intermediate variable,
do not lead to properly defined causal effects, unless strong assumptions are imposed.
Within the potential outcomes approach to causal inference, principal stratification [9] is a
principled approach for addressing causal inference problems, where causal estimands are defined in
terms of post-treatment variables. A principal stratification with respect to a post-treatment variable
is a partition of units into latent classes, named principal strata, defined by the joint potential values
of that post-treatment variable under each of the treatments being compared. Because principal
strata are not affected by treatment assignment, they can be viewed as a pre-treatment covariate, and
thus, conditioning on principal strata leads to well-defined causal effects, named principal causal
effects. Principal causal effects are local causal effects in the sense that they are causal effects for
sub-populations defined by specific (union of) principal strata.
In the causal inference literature there is a widespread agreement on the important role of principal
stratification to deal with post-treatment complications in both experimental and observational
studies. Principal stratification has been successfully applied to deal with the following issues: non-
compliance [11, 50]; censoring by death, where the primary outcome is not only unobserved but also
undefined for subjects who die [12–14]; loss to follow-up or missing outcome data [15]; surrogate
endpoints [9, 16]; and mediation [17–22], although its role in these settings is still controversial [67].
Early contributions on principal stratification mainly concentrate on studies with a binary treat-
ment and a binary post-treatment variable, and we mainly focus on this setting throughout the chapter.
Nevertheless, the literature has rapidly evolved offering a growing number of studies extending
principal stratification to deal with the presence of intermediate variables that are multivariate, such as
in studies suffering from combinations of post-treatment complications [12, 13, 24], categorical [25],
continuous [26–28] or time-dependent [29], and to studies with longitudinal treatments [30] and
multilevel treatments [31].
Principal stratification allows dealing with confounded post-treatment intermediate variables by
providing meaningful definitions of causal effects conditional on principal strata. However, it also
raises serious inferential challenges due to the latent nature of the principal strata. In general, principal
stratum membership is not observed and the observed values of the treatment and intermediate
variable cannot uniquely identify principal strata. Instead, groups of units sharing the same observed
values of the treatment and post-treatment variable generally comprise mixtures of principal strata. For
instance, in randomized studies with non-compliance, after randomization, some subjects assigned
to the active treatment will not take it, but effectively take the control/standard treatment. Similarly,
some subjects assigned to the control condition will actually take the active treatment. The actual
treatment received is a confounded intermediate variable, in the sense that subjects receiving and
subjects not receiving the active treatment may systematically differ in characteristics associated
with the potential outcomes. The principal stratification with respect to the compliance behavior
partitions subjects into four types defined by the joint potential values of the treatment that would be
received under assignment to the control and the active treatment. Without additional assumptions,
above and beyond randomization, the compliance status is not observed for any subject, e.g., the
group of subjects assigned to the control condition who are observed to effectively not take treatment
comprises a mixture of subjects who would always comply with their assignment, the compliers, by
taking the active treatment if assigned to treatment and taking the control if assigned control; and
subjects who would never take the active treatment, the never-takers, irrespective of their assignment.
Some common assumptions for disentangling these mixtures, which also relate to the econometric
instrumental variables (IV) framework, are the monotonicity assumption, which allows the effect of
the treatment on the intermediate variable to be only in one direction, essentially limiting the number
of principal strata [11,17,50], and exclusion restrictions, which rule out any effect of treatment on the
outcome for units in principal strata where the intermediate variable is unaffected by the treatment
[50]. Monotonicity and exclusion restrictions have been often invoked in randomized studies with
non-compliance to non-parametrically identify and estimate the average causal effect for compliers
(CACE) [50]. More generally, under monotonicity and exclusion restrictions we can uniquely
disentangle the observed distribution of outcomes into a mixture of conditional outcome distributions
associated to each of the latent strata [32]. Depending on the empirical setting, monotonicity and
exclusion restrictions may be questionable. For instance, in randomized experiments with non-
compliance, exclusion restrictions appear plausible in blind or double-blind experiments, but less so
in open-label ones [11, 24]. Outside the non-compliance/instrumental variables setting, assumptions
such as the exclusion restrictions cannot be invoked, because they would rule out a priori the effects
of interest, that is, causal effects for subpopulations of units where the intermediate variable is not
affected by the treatment. For example, in studies with outcomes censored by death, such as Quality
of Life, which is censored by death for subjects who die before the end of the follow-up, focus is on
causal effects for always survivors, subjects who would live no matter how treated, that is, subjects
for which the post-treatment survival indicator is unaffected by the treatment. In mediation analysis
causal effects for units for which the mediator is unaffected by the treatment provide information on
the direct effect of the treatment. Exclusion restrictions would imply that causal effects for always
survivors in studies with outcomes censored by death and direct effects in mediation studies are zero.
In the literature various sets of assumptions have been introduced and investigated, including
alternative structural assumptions, such as assumptions of ranked average score/stochastic dominance
[24, 33], distributional assumptions, such as the assumption of normal outcome distributions within
principal strata [13, 14] and assumptions that either involve the existence of an instrument for
the intermediate variable [15, 19] or rely on additional covariates or secondary outcomes [34,
35, 37–39, 68]. A fully model-based likelihood or Bayesian approach to principal stratification
[11–14, 17, 24, 26, 28, 40], can be used, carefully thinking about the plausibility of parametric
assumptions [41]. In addition, the problem of unbounded likelihood function may arise and jeopardize
inference [34, 42] and Bayesian analyses may be sensitive to priors.
Without exclusion restrictions and/or parametric assumptions, principal causal effects can be
generally only partially identified [19, 33, 43, 44], but large sample bounds are typically too wide for
practical use, although there exist several strategies for sharpening them [33, 37, 38, 45–48].
A line of the literature proposes an alternative approach to deal with identification issues in
principal stratification analysis dealing with confounded post-treatment variables: it invokes principal
ignorability assumptions, which are conditional independence assumptions between (some) principal
strata and (some) potential outcomes for the primary outcome given covariates. Principal ignorability
assumptions imply that, given the observed covariates, there is no unmeasured confounding between
certain principal strata and the outcome under specific treatment conditions. As a consequence, the
distribution of potential outcomes is exchangeable across different principal strata and valid inference
on principal causal effects can be conducted by comparing treated and control units belonging to
different principal strata but with the same distributions of background characteristics.
As opposed to the general principal stratification framework, where outcome information is
required to disentangle mixtures of principal strata, under principal ignorability assumptions studies
with intermediate variables should be designed and analyzed without using information on the
primary outcome to create subgroups of units belonging to different principal strata with similar
distributions of the background variables. The idea is not new in causal inference; it borrows from
the literature on the design and analysis of regular observational studies, i.e., non-experimental
studies where the assignment mechanism is strongly ignorable [52]. Indeed, it is widely recognized
that for objective causal inference, regular observational studies should be carefully designed to
approximate randomized experiments using only background information (without any access
to outcome data). This is done by creating subgroups of treated and control units with similar
distributions of background variables [38, 39]. Among others, the covariate-adjustment method based
on the propensity score, defined as the treatment probability given covariates, is a popular method to
identify and adjust for these groups of individuals [52].
In a similar spirit, the literature on principal ignorability has introduced the principal score [52],
defined as the conditional probability of belonging to a latent principal stratum given covariates.
In the same way the propensity score is a key tool in the design and analysis of observational
causal studies with regular treatment assignment mechanisms, the principal score plays a central
role in principal stratification analysis under principal ignorability assumptions. Principal scores are
generally unknown and their identification may be challenging: even in randomized experiments with
two treatments and a single binary intermediate variable, each observed group defined by the assigned
treatment and the observed value of the intermediate variable comprises a mixture of two principal
strata, and thus, principal scores are not identified without additional assumptions. Monotonicity
assumptions are usually invoked, ruling out the existence of some principal strata and unmasking
principal stratum membership for some units. Therefore, together with the ignorability of treatment
assignment, which implies that principal strata have the same distribution in both treatment arms
(at least conditional on covariates), monotonicity assumptions allow one to identify and estimate
principal scores for each unit.
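For intuition, in the simplest one-sided non-compliance case discussed below, strong monotonicity implies that, among units assigned to treatment, the observed intermediate variable equals 1 exactly for compliers, so the principal score can be fit in the treated arm and applied to all units. A minimal sketch under these assumptions, with illustrative names:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def estimate_principal_score(X, w, s):
        """P(U_i = complier | X_i), fit by logistic regression of the observed
        intermediate variable on covariates among units assigned to treatment."""
        model = LogisticRegression(max_iter=1000).fit(X[w == 1], s[w == 1])
        return model.predict_proba(X)[:, 1]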
Early principal stratification analyses based on principal ignorability assumptions and principal
scores date back to Follmann [53], who refers to the conditional principal stratum membership
probability given the covariates as a compliance score, and to Hill et al. [52], who introduced
the term principal score. Both papers focus on the analysis of randomized experiments with one-
sided non-compliance, where a strong monotonicity assumption, implying that there exist only two
latent principal strata, holds by design, and thus, the principal score is identified and can be easily
estimated. In subsequent years various researchers have contributed to the literature on principal
stratification analysis based on principal scores, under the assumptions of principal ignorability and
monotonicity. Applied contributions providing prominent examples of how principal score methods
work in practice include Crépon et al. [54], Schochet et al. [55] and Zhai et al. [56]. Key theoretical
and conceptual contributions include Jo and Stuart [57], Jo et al. [58] and Stuart and Jo [59], who
further investigate the performance of principal score methods in randomized experiments with
one-sided non-compliance (see also [60] for a simulation study), and Ding and Lu [61], Feller et
al. [62], Joffe et al. [63], and Jiang et al. [64], who consider the use of principal scores in a more
general setup, where there may exist more than two principal strata. Specifically, Joffe et al. [63]
suggest the use of principal scores to identify and estimate general principal causal effects; Ding and
Lu [61] provide theoretical justification for principal stratification analysis based on principal scores;
Feller et al. [62] offer an excellent review of methods based on propensity scores and introduce new
sets of assumptions under which principal causal effects can be identified and estimated mixing and
matching principal ignorability assumptions and exclusion restrictions; and Jiang et al. [64] develop
multiply robust estimators for principal causal effects, showing that they are consistent if two of
the treatment, intermediate variable, and outcome models are correctly specified, and that they are
locally efficient if all three models are correctly specified.
The chapter will draw from this literature, offering a comprehensive overview of principal score
methods based on principal ignorability and monotonicity assumptions. We will also discuss the
importance of conducting sensitivity analysis to assess the robustness of the results with respect to
violations of principal ignorability and monotonicity assumptions in principal stratification analysis
based on principal scores. Indeed, principal ignorability and monotonicity assumptions are funda-
mentally untestable and, in some cases, they are not easily justifiable according to subject-matter
knowledge. We will review the literature on sensitivity analysis with respect to principal ignorability
and monotonicity assumptions [61, 62, 64].
Throughout the chapter we will illustrate and discuss the key concepts using two public health
randomized experiments. The first study is a randomized experiment aiming to assess causal effects
of Breast Self-Examination (BSE) teaching methods [24, 65]. The study was conducted between
January 1988 and December 1990 at the Oncologic Center of the Faenza Health District in Italy. We
will refer to this study as the Faenza Study. In the Faenza study, a sample of women who responded
to a pretest BSE questionnaire was randomly assigned to either a standard leaflet-based information
program (control group) or to a new in-person teaching program (intervention group). Women in
the standard treatment group were mailed an information leaflet describing how to perform BSE
correctly. In addition to receiving the mailed information, women assigned to the new intervention
group were invited to the Faenza Oncologic Center to attend a training course on BSE techniques
held by specialized medical staff. Unfortunately, some of the women assigned to the treatment
group did not attend the course, but only received the standard information leaflet. On the contrary,
women assigned to the control group could not access the in-person course, implying a one-sided
non-compliance setting with only two compliance types (never-takers and compliers). The outcome
of interest is post-treatment BSE practice a year after the beginning of the study as reported in a
self-administered questionnaire. The post-treatment questionnaire also collected information on
quality of self-exam execution. BSE quality is a secondary outcome censored by “death,” with the
censoring event defined by BSE practice: BSE quality is not defined for women who do not practice
BSE. We will use the Faenza study to discuss principal stratification analysis based on principal
scores in settings of one-sided non-compliance and in settings with outcomes truncated by death.
We will consider the two complications one at a time, defining separate principal scores for each
post-treatment variable. In practice we should deal with them simultaneously [24, 65].
The second study is the flu shot encouragement experiment conducted between 1978 and 1980
at a general medicine clinic in Indiana (USA). In this study, physicians were randomly selected to
receive a letter encouraging them to inoculate patients at risk for flu under U.S. Public Health Service
Criteria. The treatment of interest is the actual flu shot, and the outcome is an indicator for flu-related
hospital visits during the subsequent winter. The study suffers from two-sided non-compliance: some
of the study participants, whose physician received the encouragement letter, did not actually receive
the vaccine; and some of the study participants, whose physician did not receive the encouragement
letter, actually received the flu shot. The study was previously analyzed by Hirano et al. [11] using a
model-based Bayesian approach to inference and by Ding and Lu [61] using principal score methods.
The chapter is organized as follows. In Section 17.2, we first introduce the notation and the basic
framework of principal stratification. Then, we review some identifying (structural) assumptions –
ignorability of the treatment assignment mechanism, monotonicity, exclusion restrictions and prin-
cipal ignorability assumptions – and discuss their plausibility in different settings. In Section 17.4
we define principal scores, discuss their properties, and provide some details on their estimation under
monotonicity. In Section 17.5 we review identification and estimation strategies for principal causal
effects based on weighting on the principal score under monotonicity and principal ignorability assump-
tions. In Section 17.6 we discuss how to assess the fit of the principal score model. In Section 17.7 we
review and discuss methods for conducting sensitivity analysis with respect to principal ignorability
and monotonicity. We offer some discussion of possible extensions of principal stratification analysis
based on principal scores under principal ignorability in Section 17.8, together with some concluding
remarks.
17.2 Potential Outcomes and Principal Stratification
We consider studies comparing two treatments, where each unit i can be assigned to the active
treatment, Wi = 1, or to the control treatment, Wi = 0. In addition we will mainly focus on studies with a single binary post-treatment
variable.
We define causal effects using the potential outcomes framework under the Stable Unit Treatment
Value Assumption (SUTVA) [5]. SUTVA rules out hidden versions of treatment and interference
between units, that is, a unit’s potential outcome does not depend on the treatment and intermediate
variables of others, and a unit’s potential value of the intermediate variable does not depend on
the treatment of others. SUTVA is a critical assumption, which may be questionable, especially
in studies where the outcomes are transmissible (e.g., infectious diseases or behaviors) and where
units are grouped in clusters (e.g., clustered encouragement designs) or interact on a network. In the
principal stratification literature multi-level and hierarchical models have been proposed to account
for clustering and relax SUTVA in clustered encouragement designs [31, 68, 69, 70]. Here, for
ease of presentation, we maintain SUTVA throughout. Under SUTVA, for each unit i, there are
two associated potential outcomes for the intermediate and the primary outcome: let S0i and Y0i be
the potential values of the intermediate variable and outcome if unit i were assigned to the control
treatment, Wi = 0, and let S1i and Y1i be the potential values of the intermediate variable and
outcome if unit i were assigned to the active treatment, Wi = 1.
A causal effect of treatment assignment, W , on the outcome Y , is defined to be a comparison
between treatment and control potential outcomes, Y1i versus Y0i , on a common set of units, G,
that is, a comparison of the ordered sets {Y0i , i ∈ G} and {Y1i , i ∈ G}. For instance, we can be
interested in the difference between the means of Y1i and Y0i for all units, ACE = E[Y1i − Y0i ], or
for subgroups of units defined by values of covariates, ACEx = E[Y1i − Y0i | Xi = x].
Principal stratification uses the joint potential values of the intermediate variable to define a
stratification of the population into principal strata. Formally, the basic principal stratification with
respect to a post-treatment variable S is the partition of subjects into sets such that all subjects in the
same set have the same vector Ui = (S0i , S1i ).
In studies with two treatments, the basic principal stratification with respect to a binary inter-
mediate variable, S, partitions subjects into four latent groups: P0 = {(0, 0), (0, 1), (1, 0), (1, 1)},
which we label as {00, 01, 10, 11}, respectively. A principal stratification P with respect to the
post-treatment variable S is then a partition of units into sets being the union of sets in the
basic principal stratification. A common principal stratification is the basic principal stratifica-
tion itself: P = P0 = {{00}, {01}, {10}, {11}}. Other examples of principal stratification are
P = {{00, 11}, {01, 10}} and P = {{00, 10, 11}, {01}}. We denote by Ui an indicator of the
principal stratum membership, with Ui ∈ P. A principal causal effect with respect to a principal
stratification P is defined as a comparison of potential outcomes under standard versus new treatment
within a principal stratum of P: Y0i versus Y1i, for all i with Ui = u,
for u ∈ P [9]. Because the principal stratum variable, Ui , is unaffected by the treatment, any
principal causal effect is a well-defined local causal effect. Here we mainly focus on average principal
causal effects with respect to the basic principal stratification:
ACEu = E[Y1i − Y0i | Ui = u],   u ∈ {00, 01, 10, 11}.   (17.1)
By definition, principal causal effects provide information on treatment effect heterogeneity. Depend-
ing on the context, principal strata may have a specific scientific meaning and principal causal effects
may be of intrinsic interest per se. Furthermore, we could be interested in further investigating the
heterogeneity of treatment effect with respect to observed characteristics. To this end, let ACEu|x
denote the conditional average principal causal effect for the subpopulation of units of type Ui = u
with the same values of the covariates, x: ACEu|x = E [Y1i − Y0i | Ui = u, Xi = x].
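Because Ui is latent in real data, a small simulation in which both potential values of the intermediate variable are generated makes the estimands in (17.1) concrete; everything below is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    S0 = rng.binomial(1, 0.2, n)                 # potential intermediate under control
    S1 = rng.binomial(1, 0.7, n)                 # potential intermediate under treatment
    Y0 = rng.normal(size=n)
    Y1 = Y0 + 1.0 + 0.5 * S1                     # treatment effect varies with S1
    U = np.char.add(S0.astype(str), S1.astype(str))  # strata "00", "01", "10", "11"
    for u in ["00", "01", "10", "11"]:
        print(u, round(float((Y1 - Y0)[U == u].mean()), 3))  # ACE_u in (17.1)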
In randomized experiments with non-compliance, as in our running examples, S is the actual
treatment received and principal stratification classifies units into four groups according to their
compliance status: Ui = 00 for subjects who would never take the treatment, irrespective of their
assignment; Ui = 01 for subjects who always comply with their assignment; Ui = 10 for subjects
who would always do the opposite of their assignment; and Ui = 11 for subjects who would
always take the treatment, irrespective of their assignment. In the literature these principal strata are
usually referred to as never-takers, compliers, defiers and always-takers, respectively [50]. In these
settings, defiers are often ruled out invoking monotonicity of the compliance status. In randomized
experiments with non-compliance, in addition to the intention-to-treat effect of the assignment, which
is identified thanks to randomization, the interest is usually on the effect of the actual treatment
receipt. Because the actual treatment is not randomized, we must rely on alternative identifying
assumptions. If we could assume ignorability of the treatment conditional on covariates, we would
be able to identify the treatment effect by comparing treated and untreated units within strata defined
by the covariates. However, in non-compliance settings, the treatment receipt is usually confounded.
Principal stratification focuses on well-defined causal effects within principal strata, ACEu , and
provides alternative identifying assumptions. The estimand of interest is usually the average causal
effect for compliers, ACE01 , also named complier average causal effect (CACE). Because under
monotonicity compliers are the only units that can be observed under both the active and control
treatment according to the protocol, the ACE01 can provide information on the effect of the actual
treatment. Furthermore, regardless of the identifying assumptions, principal causal effects can provide
useful information on treatment effect heterogeneity.
In studies where the primary outcome is censored by death, in the sense that it is not only
unobserved but also undefined for subjects who die, S is the survival status with S = 1 for survival
and S = 0 for death. For instance, in a clinical study where focus is on assessing causal effects of
a new versus standard drug treatment on quality of life one year after assignment, quality of life is
censored by death for patients with S = 0, that is, quality of life is not defined for patients who die
before reaching the end-point of one-year post-randomization survival. As opposed to the case where
outcomes are missing, i.e., not observed, for patients who experience the censoring event, i.e., death,
quality of life is not defined, without any hidden value masked by death. In these settings, the basic
principal stratification classifies units into the following four groups: always survivors – subjects who
would live no matter how treated, Ui = 11; never survivors – subjects who would die no matter how
treated, Ui = 00; compliant survivors – subjects who would live if treated but would die if not treated,
Ui = 01; and defiant survivors – subjects who would die if treated but would live otherwise, Ui = 10.
Because a well-defined real value for the outcome under both the active treatment and the control
treatment exists only for the latent stratum of always survivors, ACE11 is the only well-defined
and scientifically meaningful principal causal effect in this context [71]. This causal estimand of
interest, ACE11 , is often referred to as the survivor average causal effect (SACE). As opposed to the
non-compliance setting where the primary causal estimand of interest might not be a principal causal
effect, in settings affected by censoring by death the SACE is the only well-defined causal estimand
of intrinsic interest. The problem of censoring by death arises not only in studies where the censoring
event is death and the primary outcome (e.g., quality of life) is not defined for those who die. For
instance, in the Faenza study, when focus is on causal effects on BSE quality, S is the indicator for
BSE practice and the survivor average causal effect on BSE quality, ACE11 , is the average causal
effect of the assignment on BSE quality for women who would practice BSE irrespective of their
assignment to receive the information leaflet only or to also attend the in-person training [24, 65].
Censoring events can also be found in other fields. For instance, in labor economics, where focus is
on the causal effect of a job-training program on wages, wages are censored by unemployment, in
the sense that they are not defined for people who are unemployed [12–14]. In educational studies
where interest is in assessing the effect of a special educational intervention in high school on final
test scores, test score is censored by death for students who do not finish high school [33].
In studies with loss at follow-up or missing outcome data, the intermediate variable S is the
response indicator and units can be classified into principal strata according to their response behavior,
defined by the joint values of the potential response indicator under each treatment condition [15].
In studies where S is a surrogate or a mediator, valuable information can be obtained by separately
looking at causal effects for principal strata where the intermediate variable is unaffected by the
treatment, ACE00 and ACE11, and causal effects for principal strata where the intermediate variable
is affected by the treatment, ACE10 and ACE01. In this literature, ACE00 and ACE11 are named
dissociative effects, and ACE10 and ACE01 associative effects, because they measure effects on
the outcome that are, respectively, dissociative and associative with effects on the intermediate variable
[9]. In problems of surrogate endpoints, a good surrogate should satisfy the property of causal
necessity, which implies that ACE00 = 0 and ACE11 = 0 [9], and the property of causal sufficiency,
which implies that ACE10 ≠ 0 and ACE01 ≠ 0 [16]. In mediation studies, evidence on the direct
effect of the treatment on the primary outcome is provided by dissociative effects, ACE00 and
ACE11; in principal strata where S0i ≠ S1i, causal effects are associative, and thus, they combine
direct effects and indirect effects through the mediator [20, 67].
Assumption 6 amounts to stating that there is no unmeasured confounding between the treatment
assignment and intermediate variable and between the treatment assignment and outcome.
Assumption 6 holds by design in randomized experiments where covariates do not enter the
assignment mechanism, such as completely randomized experiments. Both the Faenza study and the
flu shot study are randomized experiments where randomization is performed without taking into
account the values of the pretreatment variables, and thus, Assumption 6 holds by design.
Assumption 6 implies that the distribution of the principal stratification variable, Ui , is the same
in both treatment arms, both unconditionally and conditionally on covariates.
In stratified randomized experiments the treatment assignment mechanism is ignorable by design
conditionally on the design variables: Wi ⊥⊥ (S0i , S1i , Y0i , Y1i ) | Xi . In observational studies
ignorability of the treatment assignment mechanism conditionally on the observed covariates is a
critical assumption, and it strongly relies on subject matter knowledge, which allows one to rule out
the existence of relevant unmeasured confounders that affect the treatment as well as the outcome
and intermediate variable.
When the treatment assignment is ignorable only conditional on covariates, it is the conditional
distribution of the principal stratification variable, Ui , given covariates that is the same in both
treatment arms.
TABLE 17.1
Observed data pattern and principal strata for the observed groups.

Strong Monotonicity
Observed Group   Actual Treatment   Observed Intermediate   Possible Latent
Ow,s             Assigned Wi        Variable Si             Principal Strata Ui
O0,0             0                  0                       00, 01
O1,0             1                  0                       00
O1,1             1                  1                       01

Monotonicity
Observed Group   Actual Treatment   Observed Intermediate   Possible Latent
Ow,s             Assigned Wi        Variable Si             Principal Strata Ui
O0,0             0                  0                       00, 01
O0,1             0                  1                       11
O1,0             1                  0                       00
O1,1             1                  1                       01, 11
Under strong monotonicity (Assumption 7), the observed groups O1,0 = {i : Wi = 1, Si = 0} and
O1,1 = {i : Wi = 1, Si = 1} comprise only units belonging to the principal strata 00 and 01, respectively.
The control group, that is, the observed group O0,0 = {i : Wi = 0, Si = 0}, still comprises a
mixture of the two latent principal strata, Ui = 00 and Ui = 01. Under monotonicity (Assumption 8),
the observed group O0,1 = {i : Wi = 0, Si = 1} comprises only units belonging to the principal
stratum 11, and the observed group O1,0 = {i : Wi = 1, Si = 0} comprises only units belonging to
the principal stratum 00. The observed groups O0,0 = {i : Wi = 0, Si = 0} and O1,1 = {i : Wi =
1, Si = 1} comprise a mixture of the two types of units: units of types Ui = 00 and Ui = 01, and
units of type Ui = 01 and Ui = 11, respectively.
Proposition 17.1. Suppose that ignorability of treatment assignment (Assumption 6), (strong) mono-
tonicity (Assumption 7 or Assumption 8), and exclusion restrictions (Assumptions 9 and 10) hold.
Then

ACE01 = E[Y1i − Y0i ] / E[S1i − S0i ].
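Under these assumptions, ACE01 reduces to the familiar instrumental-variable (Wald) ratio, which can be estimated by plugging in sample means. The following is a minimal sketch in Python; the function name ace_01_iv and the simulated arrays are hypothetical illustrations, not data from the Faenza or flu shot studies.

```python
import numpy as np

def ace_01_iv(w, s, y):
    """Moment estimator of ACE_01 under the assumptions of
    Proposition 17.1: the arm difference in mean outcomes divided
    by the arm difference in mean intermediate values."""
    w, s, y = map(np.asarray, (w, s, y))
    itt_y = y[w == 1].mean() - y[w == 0].mean()   # estimates E[Y1 - Y0]
    itt_s = s[w == 1].mean() - s[w == 0].mean()   # estimates E[S1 - S0]
    return itt_y / itt_s

# toy usage with simulated data (hypothetical)
rng = np.random.default_rng(0)
n = 1000
w = rng.integers(0, 2, n)            # randomized assignment
u = rng.binomial(1, 0.6, n)          # 1 = complier (stratum 01)
s = w * u                            # one-sided non-compliance
y = 0.5 * s + rng.normal(size=n)     # effect operates only through S
print(ace_01_iv(w, s, y))            # approximately 0.5
```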
Exclusion restriction assumptions may be questionable in open-label randomized experiments, such
as the Faenza study and the flu shot encouragement study, where treatment assignment may have a direct
effect on the outcome. In the Faenza study the exclusion restriction for never-takers, women who
would not participate in the BSE training course irrespective of their treatment assignment, might be
violated if the decision of some women not to attend the BSE training course induces them not to
perform BSE and to take other screening actions instead (e.g., mammography, breast ultrasonography).
In the flu shot study the exclusion restriction for always-takers, patients who would take the flu
shot regardless of their physician's receipt of the encouragement letter, may be questionable if always-
takers comprise weaker patients, so that the letter prompted physicians to take other medical actions
for these patients, e.g., by advising them about ways to avoid exposure or by providing them with
other medical treatment. The exclusion restriction for never-takers, which implies that never-takers
are completely unaffected by their physicians' receipt of the letter, may be a reasonable assumption if
never-takers and their physicians do not regard the risk of flu as high enough to warrant inoculation,
and thus they are not subject to other medical actions either [11]. On the other hand, the exclusion
restriction for never-takers may also be questionable if never-takers' physicians were particularly
meticulous, so that the letter encouraged them to take other precautions for never-takers, given that
these patients will not get the flu shot.
Outside the non-compliance setting, ACE00 and ACE11 may be effects of primary interest. For
instance, in problems of censoring by death, surrogacy, and mediation, assessing whether ACE00
and/or ACE11 is zero is the actual scientific question of interest. Therefore, in these settings,
exclusion restrictions would rule out a priori the effects of interest, and thus cannot be invoked.
Assumption 11. (Strong principal ignorability).

Y0i ⊥⊥ Ui | Xi (17.2)

and

Y1i ⊥⊥ Ui | Xi (17.3)
Strong principal ignorability requires that each potential outcome, Y0i and Y1i , is independent
of the bivariate variable Ui = (S0i , S1i ) conditionally on the covariates. Specifically, it assumes
conditional independence of Y0i and Ui (Equation (17.2)) and of Y1i and Ui (Equation (17.3)) given
the covariates, Xi . Therefore, conditionally on Xi the marginal distributions of the control and
treatment potential outcomes are the same across principal strata, and thus, ACEx = ACE00|x =
ACE10|x = ACE01|x = ACE11|x .
In non-compliance and mediation settings, strong principal ignorability implies that the causal
effect of the assignment, conditional on covariates, is the same for all principal strata, regardless
of whether the intermediate variable is affected by the assignment. This could either preclude any
impact of the intermediate variable, or it could mean that the direct impact of the assignment is
heterogeneous and can compensate for the lack of an effect of the intermediate variable for never-takers
(and always-takers). For instance, in a randomized evaluation of a job training program with non-
compliance, never-takers could receive a program equivalent to the one offered to compliers [62].
However, in general, this assumption is quite difficult to justify in practice, especially in
non-compliance and mediation settings.
Assumption 12. (Weak principal ignorability).

Y0i ⊥⊥ S1i | S0i , Xi (17.4)

and

Y1i ⊥⊥ S0i | S1i , Xi (17.5)
Weak principal ignorability requires that the potential outcome Ywi is independent of the marginal
potential outcome for the intermediate variable under the opposite treatment, i.e., S(1−w)i given
Swi and covariates, w = 0, 1. Specifically, weak principal ignorability assumes that conditional on
the covariates, Xi , Y0i is independent of S1i given S0i (Equation (17.4)), and Y1i is independent of
S0i given S1i (Equation (17.5)). Equation (17.4) implies that conditionally on Xi , the distributions
of the control potential outcomes, Y0i , are the same across strata Ui = 00 and Ui = 01 (with
S0i = 0) and across strata Ui = 10 and Ui = 11 (with S0i = 1). Similarly, Equation (17.5)
implies that, conditionally on Xi , the distributions of the treatment potential outcomes, Y1i , are the
same across strata Ui = 00 and Ui = 10 (with S1i = 0) and across strata Ui = 01 and Ui = 11
(with S1i = 1). Assumption 12 (weak principal ignorability) is strictly weaker than Assumption
11 (strong principal ignorability), because it applies to a subset of units in each treatment arm.
Principal ignorability assumptions, implying that treatment and/or control potential outcomes have
the same distribution across some principal strata, are types of homogeneity assumptions, similar to
homogeneity assumptions recently introduced in causal mediation analysis [27, 69].
Principal ignorability assumptions require that the set of the observed pre-treatment variables,
Xi , includes all the confounders between the intermediate variable, Si , and the outcome, Yi . 1
Under strong monotonicity, Ui = (S0i , S1i ) = (0, S1i ), and thus, principal stratum membership
is defined by S1i only. In this setting, strong and weak principal ignorability (Assumptions 11 and
12) are equivalent: they both require that conditionally on the covariates, the distributions of the
control and treatment potential outcomes for units of type Ui = 01 and units of type Ui = 00 are
the same: Y0i | Ui = 00, Xi ∼ Y0i | Ui = 01, Xi and Y1i | Ui = 00, Xi ∼ Y1i | Ui = 01, Xi . In the
Faenza study where strong monotonicity holds by design, Equations (17.2) and (17.4) imply that for
women assigned to control, that is, women who only receive mailed information on how to perform
BSE correctly, the distribution of their outcome, BSE practice, given the covariates, does not depend
on whether they would have attended the BSE training course if offered. Although this assumption
is very strong, it may be plausible in the Faenza study, given that women assigned to the control
treatment have no access to the training course, and thus, the observed training course participation
Si takes on the same value, zero, for all of them, irrespective of whether they are compliers or never-takers.
Equations (17.3) and (17.5) imply that for women invited to attend the BSE training course, their
likelihood of practicing BSE is unrelated to whether they actually participate in the course, given the
covariates. This assumption is difficult to justify.
It is worth noting that, under strong monotonicity, Equations (17.3) and (17.5) are superfluous for
identification: Equation (17.5) trivially holds because S0i = 0 for all units i = 1, . . . , n under strong
monotonicity; and Equation (17.3) is not required because we directly observe the principal stratum
membership of units assigned to treatment, so we can directly estimate the outcome distributions under
treatment for units of type Ui = 00 and Ui = 01.
1 Ignorability of the intermediate variable in the form of sequential ignorability also rules out unmeasured confounders
between the intermediate variable and the outcome. For an in-depth comparison between sequential and principal ignorability
assumptions, see [18].
Under monotonicity, there exist three principal strata: 00, 01, 11, and under strong principal
ignorability they all have the same conditional distributions of the control and treatment potential
outcomes given the covariates: Y0i | Ui = 01, Xi ∼ Y0i | Ui = 00, Xi ∼ Y0i | Ui = 11, Xi and
Y1i | Ui = 01, Xi ∼ Y1i | Ui = 00, Xi ∼ Y1i | Ui = 11, Xi . In the flu shot study Assumption 11
states that, given covariates, whether a patient actually receives the influenza vaccine is unrelated
to that patient’s potential hospitalization status under treatment and under control. Equivalently,
given covariates, whether a patient actually receives the influenza vaccine is unrelated to the effect
of the receipt of the encouragement letter by his/her physician on hospitalization. Strong principal
ignorability is a quite strong assumption and may be implausible in practice. However, under
monotonicity, it has testable implications because we observe the outcome under treatment for
units of type Ui = 00 (never-takers) and the outcome under control for units of type Ui = 11
(always-takers).
Under monotonicity, the key implications of weak principal ignorability are that, conditionally
on Xi , (i) the distributions of the control potential outcomes, Y0i , are the same across strata Ui = 00
and Ui = 01 (with S0i = 0): Y0i | Ui = 01, Xi ∼ Y0i | Ui = 00, Xi ; and (ii) the distributions of
the treatment potential outcomes are the same across strata Ui = 01 and Ui = 11 (with S1i = 1):
Y1i | Ui = 01, Xi ∼ Y1i | Ui = 11, Xi .
In the flu shot study, Equation (17.4) states that, for patients whose physician does not receive
the encouragement letter (patients assigned to control) and who are not inoculated (S0i = 0), their
hospitalization status (Y0i ) is unrelated to whether they would have been inoculated if their physician
had received the encouragement letter (S1i ), given covariates. Similarly, Equation (17.5) states that,
for patients whose physician receives the encouragement letter (patients assigned to treatment) and
who receive a flu shot (S1i = 1), their hospitalization status (Y1i ) is unrelated to whether they would
have received a flu shot if their physician had not received the encouragement letter (S0i ), given
covariates.
It is worth noting that if the focus is on average principal causal effects, we can use weaker ver-
sions of the principal ignorability assumptions, requiring mean independence rather than statistical
independence:
Assumption 13. (Strong principal ignorability - Mean independence).

E[Y0i | Ui , Xi ] = E[Y0i | Xi ] (17.6)

and

E[Y1i | Ui , Xi ] = E[Y1i | Xi ] (17.7)

Assumption 14. (Weak principal ignorability - Mean independence).

E[Y0i | S0i , S1i , Xi ] = E[Y0i | S0i , Xi ] (17.8)

and

E[Y1i | S1i , S0i , Xi ] = E[Y1i | S1i , Xi ] (17.9)
Although most of the existing results on principal stratification analysis based on principal
scores hold under these weaker versions of strong and weak principal ignorability (Assumptions
13 and 14) [61, 62, 64], we prefer to define principal ignorability assumptions in terms of statistical
independence (Assumptions 11 and 12).
In some studies the exclusion restriction may be plausible for at least one of the two types of units
Ui = 00 and Ui = 11 for which treatment assignment has no effect on the intermediate variable. In
these settings we can combine exclusion restriction and principal ignorability assumptions [62]. For
instance, in the flu shot study, the exclusion restriction for always-takers (patients of type Ui = 11)
appears arguable, but the exclusion restriction for never-takers (patients of type Ui = 00) is relatively
uncontroversial. Therefore, we can assume that (i) never-takers are completely unaffected by their
physicians’ receipt of the letter (Assumption 9 – exclusion restriction for never-takers) and that (ii)
the conditional distributions given covariates of the treatment outcomes are the same for always-
takers and compliers (Assumption 12 – weak principal ignorability – Equation (17.5) for patients
with S1i = 1):
Assumption 15. (Exclusion restriction for units of type Ui = 00 and weak principal ignorability).
(i) Exclusion restriction for units of type 00: Y0i = Y1i for all units i with Ui = 00; and
(ii) weak principal ignorability for units with S1i = 1: Y1i ⊥⊥ S0i | S1i = 1, Xi (Equation (17.5)).

Principal ignorability assumptions allow us to identify conditional principal causal effects non-
parametrically. Suppose that ignorability of treatment assignment (Assumption 6) and strong principal
ignorability (Assumption 11) hold. Then, for each principal stratum u,

ACEu|x = E[Y1i − Y0i | Ui = u, Xi = x]
       = E[Y1i | Ui = u, Xi = x] − E[Y0i | Ui = u, Xi = x]
       = E[Y1i | Xi = x] − E[Y0i | Xi = x]
       = E[Yi | Wi = 1, Xi = x] − E[Yi | Wi = 0, Xi = x],

where the second equality follows from strong principal ignorability and the third equality follows
from ignorability of treatment assignment.
Similarly, suppose that ignorability of treatment assignment (Assumption 6) and weak principal
ignorability (Assumption 12) hold. Let u = s0 s1 , s0 , s1 ∈ {0, 1}. Then,

ACEu|x = E[Y1i − Y0i | Ui = u, Xi = x]
       = E[Y1i | Ui = u, Xi = x] − E[Y0i | Ui = u, Xi = x]
       = E[Y1i | S0i = s0 , S1i = s1 , Xi = x] − E[Y0i | S0i = s0 , S1i = s1 , Xi = x]
       = E[Y1i | S1i = s1 , Xi = x] − E[Y0i | S0i = s0 , Xi = x]
       = E[Yi | Wi = 1, Si = s1 , Xi = x] − E[Yi | Wi = 0, Si = s0 , Xi = x],

where the third equality follows from weak principal ignorability and the fourth equality follows
from ignorability of treatment assignment. Therefore, under ignorability of treatment assignment and
strong/weak principal ignorability, we can non-parametrically identify conditional principal causal
effects, ACEu|x .
To further identify principal causal effects, ACEu , an additional assumption, such as monotonic-
ity, is needed to identify the conditional distribution of the covariates given the principal stratum
variable. Thus, under ignorability of treatment assignment, strong/weak principal ignorability, and
monotonicity, the principal causal effects ACEu are non-parametrically identified by

ACEu = E[ACEu|Xi | Ui = u],

where the outer expectation is over the conditional distribution of the covariates given the principal
stratum variable.
The non-parametric identification of principal causal effects under strong principal ignorability
implies that we can estimate them by estimating the mean difference of observed outcomes between
treated and untreated units adjusting for observed covariates. Under weak principal ignorability we
also need to adjust for the observed value of the intermediate variable. Working within cells defined
by the covariates, although feasible in principle, may be difficult in practice with a large number of
pre-treatment variables, possibly including continuous covariates. Borrowing from the propensity
score literature, we can address the curse of dimensionality using the principal score.
The principal score has two key properties [57, 61, 62]:
1. Balancing of pre-treatment variables across principal strata given the principal score.
I {Ui = u} ⊥⊥ Xi | eu (Xi ),
where I {·} is the indicator function.
The balancing property states that the principal score is a balancing score in the sense that, within
groups of units with the same value of the principal score for a principal stratum u, eu (x), the
probability that Ui = u does not depend on the value of the covariates, Xi :

Pr (Ui = u | Xi , eu (Xi )) = Pr (Ui = u | eu (Xi )) .

In other words, within cells with the same value of eu (x), the distribution of covariates is the
same for those belonging to the principal stratum u and for those belonging to other principal
strata.
2. Principal ignorability given the principal score. If principal ignorability holds conditional on the
covariates, it holds conditional on the principal score. Formally,
(a) Strong principal ignorability given the principal score. If

Y0i ⊥⊥ I {Ui = u} | Xi and Y1i ⊥⊥ I {Ui = u} | Xi

for all u, then

Y0i ⊥⊥ I {Ui = u} | eu (Xi ) and Y1i ⊥⊥ I {Ui = u} | eu (Xi )

for all u.
(b) Weak principal ignorability given the principal score. If

Y0i ⊥⊥ I {S1i = s1 } | S0i = s0 , Xi

and

Y1i ⊥⊥ I {S0i = s0 } | S1i = s1 , Xi
for all s0 , s1 , then
Y0i ⊥⊥ I {S1i = s1 } | S0i = s0 , eu (Xi )
and
Y1i ⊥⊥ I {S0i = s0 } | S1i = s1 , eu (Xi )
for u = s0 s1 , for all s0 , s1 .
These properties of the principal score imply that, under ignorability of treatment assignment (As-
sumption 6) and strong principal ignorability (Assumption 11),

Y0i ⊥⊥ I {Ui = u} | eu (Xi ) and Y1i ⊥⊥ I {Ui = u} | eu (Xi ),

and, under ignorability of treatment assignment (Assumption 6) and weak principal ignorability
(Assumption 12), for u = s0 s1 , s0 , s1 ∈ {0, 1},

Y0i ⊥⊥ I {S1i = s1 } | S0i = s0 , eu (Xi ) and Y1i ⊥⊥ I {S0i = s0 } | S1i = s1 , eu (Xi ).

Moreover, ignorability of treatment assignment (Assumption 6) implies that
eu (x) = Pr (Ui = u | Xi = x) = Pr (Ui = u | Xi = x, Wi = 0) = Pr (Ui = u | Xi = x, Wi = 1) .
This result implies that the principal stratum variable, Ui , has the same conditional distribution
given the covariates in both treatment arms. Therefore, if we could observe the principal stratum
membership for units in a treatment arm, we could easily infer the conditional distribution of Ui for
all units. Let pw (x) = Pr(Si = 1 | Wi = w, Xi = x), w = 0, 1, denote the conditional probability
of Si = 1 under treatment status w given covariates, X. Under strong monotonicity (Assumption
7), Ui = 00 or Ui = 01. Because within the treatment group, Si = 1 if and only if Ui = 01 and
Si = 0 if and only if Ui = 00, and ignorability of treatment assignment guarantees that Ui has the
same distribution in both treatment arms conditional on covariates, we have that e01 (x) = Pr(Ui =
01 | Xi = x) = p1 (x) and e00 (x) = 1 − p1 (x). Similarly, under monotonicity (Assumption 8), the
principal scores are e00 (x) = 1 − p1 (x), e11 (x) = p0 (x), and e01 (x) = p1 (x) − p0 (x).
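These identities suggest a simple plug-in strategy for estimating the principal scores: fit a model for pw (x) within each treatment arm and combine the fitted probabilities. The sketch below, assuming binary arrays w and s, a covariate matrix X, and the availability of scikit-learn, uses logistic regressions; it is one simple option, and the EM-based joint method discussed next is an alternative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def plug_in_principal_scores(w, s, X):
    """Plug-in principal scores under monotonicity (Assumption 8):
    e00(x) = 1 - p1(x), e11(x) = p0(x), e01(x) = p1(x) - p0(x).
    Under strong monotonicity (Assumption 7), p0(x) = 0 and
    e01(x) = p1(x), so only the treated-arm fit is needed."""
    p1 = LogisticRegression().fit(X[w == 1], s[w == 1]).predict_proba(X)[:, 1]
    p0 = LogisticRegression().fit(X[w == 0], s[w == 0]).predict_proba(X)[:, 1]
    e01 = np.clip(p1 - p0, 1e-8, None)   # guard against small negative estimates
    return {"00": 1 - p1, "11": p0, "01": e01}
```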
The likelihood function has a complex form involving mixtures of distributions, which challenge
inference. However, we can obtain an estimate of the model parameter vector, α, and thus, of the
principal scores, eu (Xi ; α), by using missing data methods, such as the EM algorithm [72] (see [61]
for details on EM estimation of principal scores) or the data augmentation (DA) algorithm [73].
It is worth noting that the joint method allows us to estimate the principal scores even without
monotonicity. Alternative smoothing techniques can be used, including nonparametric regression
methods and machine learning methods.
In practice, the goal is to obtain estimates of the principal score that best balance the covariates between treated and
control units within each principal stratum. This criterion is somewhat vague, and we elaborate on
its implementation later (see Section 17.6). From this perspective, we can use a two-part procedure
for specifying the principal score model. First we specify an initial model, motivated by substantive
knowledge. Second, we assess the statistical adequacy of an estimate of that initial model, by checking
whether the covariates are balanced for treated and control units within each principal stratum. In
principle, we can iterate back and forth between these two stages, specification of the model and
assessment of that model, each time refining the specification of the model, by, e.g., adding higher
polynomial and interaction terms of the covariates. It is worth noting that estimation of the principal
scores only requires information on the covariates, Xi , the treatment assignment variable, Wi , and
the intermediate variable, Si : it does not use the outcome data. As in the design of observational
studies with a regular assignment mechanism [18,38,39], this outcome-free strategy guards against
the temptation to search for favorable outcome models and effect estimates.
The weighting approach to estimating principal causal effects rests on the following identity:

E [Ywi | Ui = u] = E [E [Ywi | Xi = x, Ui = u] | Ui = u]
                 = ∫ E [Ywi | Xi = x, Ui = u] fXi|Ui (x | u) dx
                 = ∫ E [Ywi | Xi = x, Ui = u] (fXi (x) eu (x) / eu ) dx
                 = E [E [Ywi | Xi = x, Ui = u] · eu (x)/eu ] ,

where the first and second equalities follow from the law of iterated expectations, the third equality
follows from Bayes' rule, and the outer expectation in the last equality is over the distribution of the
covariates.
The following propositions summarize the main results on weighting under Assumption 6,
which holds by design in randomized studies where randomization does not depend on covariates.
Proposition 17.2 shows identification of average principal causal effects under strong monotonicity.
For instance, it may be used to draw inference on principal causal effects in randomized experiments
with one-sided non-compliance, such as the Faenza study. Recall that under strong monotonicity,
strong principal ignorability (Assumption 11) and weak principal ignorability (Assumption 12) are
equivalent.
Proposition 17.2. Suppose that ignorability of treatment assignment (Assumption 6) and strong
monotonicity (Assumption 7) hold.
If strong/weak principal ignorability (Assumption 11/12) holds, then

ACEu = E [ωu (Xi ) · Yi | Wi = 1] − E [ωu (Xi ) · Yi | Wi = 0] , u = 00, 01. (17.11)

If only Equation (17.2)/(17.4) of strong/weak principal ignorability (Assumption 11/12) holds, then

ACE01 = E [Yi | Wi = 1, Si = 1] − E [ω01 (Xi ) · Yi | Wi = 0]
ACE00 = E [Yi | Wi = 1, Si = 0] − E [ω00 (Xi ) · Yi | Wi = 0]

where

ω01 (x) = e01 (x)/e01 and ω00 (x) = e00 (x)/e00 = (1 − e01 (x))/(1 − e01 ).
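A direct sample analogue of the second pair of formulas is easy to code. The sketch below, with hypothetical arrays w, s, y and per-unit principal score estimates e01_x (for example from the plug-in fit sketched earlier), is a minimal illustration rather than a full implementation.

```python
import numpy as np

def ace_weighting_strong_mono(w, s, y, e01_x):
    """Weighted estimators of ACE_01 and ACE_00 in the spirit of
    Proposition 17.2 (strong monotonicity)."""
    e01 = e01_x.mean()                  # estimate of e_01 = Pr(U = 01)
    w01 = e01_x / e01                   # omega_01(X_i)
    w00 = (1 - e01_x) / (1 - e01)       # omega_00(X_i)
    ace01 = y[(w == 1) & (s == 1)].mean() - (w01 * y)[w == 0].mean()
    ace00 = y[(w == 1) & (s == 0)].mean() - (w00 * y)[w == 0].mean()
    return ace01, ace00
```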
Proposition 17.3 shows identification of average principal causal effects under monotonicity,
which implies that there exist three latent principal strata. We may use the results in Proposition 17.3 to
analyze the flu shot data [61].
Proposition 17.3. Suppose that ignorability of treatment assignment (Assumption 6) and monotonic-
ity (Assumption 8) hold.
If strong principal ignorability (Assumption 11) holds, then

ACEu = E [wu (Xi ) · Yi | Wi = 1] − E [wu (Xi ) · Yi | Wi = 0] ,

where wu (x) = eu (x)/eu , for u = 00, 01, 11.
If weak principal ignorability (Assumption 12) holds, then

ACE01 = E [ω1,01 (Xi ) · Yi | Wi = 1, Si = 1] − E [ω0,01 (Xi ) · Yi | Wi = 0, Si = 0]
ACE00 = E [Yi | Wi = 1, Si = 0] − E [ω0,00 (Xi ) · Yi | Wi = 0, Si = 0]
ACE11 = E [ω1,11 (Xi ) · Yi | Wi = 1, Si = 1] − E [Yi | Wi = 0, Si = 1] (17.14)

where

ω0,01 (x) = [e01 (x)/(e00 (x) + e01 (x))] / [e01 /(e00 + e01 )]
ω1,01 (x) = [e01 (x)/(e01 (x) + e11 (x))] / [e01 /(e01 + e11 )]
ω0,00 (x) = [e00 (x)/(e00 (x) + e01 (x))] / [e00 /(e00 + e01 )]
ω1,11 (x) = [e11 (x)/(e01 (x) + e11 (x))] / [e11 /(e01 + e11 )] .
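The ratio structure of these weights translates directly into code. A minimal sketch, assuming per-unit principal score arrays such as those returned by the plug-in estimator sketched earlier:

```python
import numpy as np

def omega_weights_mono(e00_x, e01_x, e11_x):
    """Weights of Proposition 17.3 under weak principal ignorability:
    each is a conditional stratum probability within an observed
    (W, S) cell, normalized by its marginal counterpart."""
    e00, e01, e11 = e00_x.mean(), e01_x.mean(), e11_x.mean()
    w0_01 = (e01_x / (e00_x + e01_x)) / (e01 / (e00 + e01))
    w1_01 = (e01_x / (e01_x + e11_x)) / (e01 / (e01 + e11))
    w0_00 = (e00_x / (e00_x + e01_x)) / (e00 / (e00 + e01))
    w1_11 = (e11_x / (e01_x + e11_x)) / (e11 / (e01 + e11))
    return w0_01, w1_01, w0_00, w1_11
```

For instance, ACE01 is then estimated by averaging w1_01 * y over units with Wi = 1, Si = 1 and subtracting the average of w0_01 * y over units with Wi = 0, Si = 0.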
It is worth noting that, under strong principal ignorability, Equations (17.11) and (17.14) also
identify principal causal effects under strong monotonicity and monotonicity, respectively. Therefore,
under strong principal ignorability, there are two possible approaches to the identification of the
principal causal effects, which could be compared to see if they provide consistent estimates.
Proposition 17.4 shows identification of average principal causal effects under monotonicity and
a combination of exclusion restrictions and principal ignorability assumptions. Proposition 17.4
offers an alternative approach to the analysis of the flu shot encouragement study, where exclusion
restriction for always-takers (Ui = 11) is controversial but exclusion restriction for never-takers
(Ui = 00) is deemed to be reasonable.
Proposition 17.4. Suppose that ignorability of treatment assignment (Assumption 6) and monotonic-
ity (Assumption 8) hold.
If the exclusion restriction for units of type Ui = 00 and weak principal ignorability (Equation
(17.5)) hold (Assumption 15), then ACE00 = 0,

ACE01 = E [ω1,01 (Xi ) · Yi | Wi = 1, Si = 1]
        − [(e00 + e01 )/e01 ] · E [Yi | Wi = 0, Si = 0] + (e00 /e01 ) · E [Yi | Wi = 1, Si = 0]

and

ACE11 = E [ω1,11 (Xi ) · Yi | Wi = 1, Si = 1] − E [Yi | Wi = 0, Si = 1] .

If instead the exclusion restriction for units of type Ui = 11 and weak principal ignorability (Equation
(17.4)) hold, then ACE11 = 0,

ACE01 = [(e11 + e01 )/e01 ] · E [Yi | Wi = 1, Si = 1] − (e11 /e01 ) · E [Yi | Wi = 0, Si = 1]
        − E [ω0,01 (Xi ) · Yi | Wi = 0, Si = 0]

and

ACE00 = E [Yi | Wi = 1, Si = 0] − E [ω0,00 (Xi ) · Yi | Wi = 0, Si = 0] .
The practical use of the above propositions requires the derivation of appropriate estimators for
the principal scores, eu (x), the principal strata proportions, eu , and the expectations of the outcome
over the subpopulations involved. We can use simple moment-based estimators of the quantities in
Propositions 17.2, 17.3 and 17.4. In particular, we replace the principal scores eu (x) and eu with
their consistent estimators, êu (Xi ) and êu = Σ_{i=1}^{n} êu (Xi )/n.
Define nw = Σ_{i=1}^{n} I{Wi = w} and nws = Σ_{i=1}^{n} I{Wi = w} · I{Si = s}, w = 0, 1, s = 0, 1.
Then, moment-based estimators of the expectations of the outcome over the subpopulations involved in
Propositions 17.2, 17.3 and 17.4 are

Ê [Yi | Wi = w, Si = s] = (1/nws ) Σ_{i=1}^{n} Yi · I{Wi = w} · I{Si = s}

and

Ê [wu (Xi ) · Yi | Wi = w] = (1/nw ) Σ_{i=1}^{n} ŵu (Xi ) · Yi · I{Wi = w}

Ê [w1,u (Xi ) · Yi | Wi = 1, Si = 1] = (1/n11 ) Σ_{i=1}^{n} ŵ1,u (Xi ) · Yi · I{Wi = 1} · I{Si = 1} , u = 01, 11

Ê [w0,u (Xi ) · Yi | Wi = 0, Si = 0] = (1/n00 ) Σ_{i=1}^{n} ŵ0,u (Xi ) · Yi · I{Wi = 0} · I{Si = 0} , u = 00, 01.
We could also normalize the weights to unity within groups, using the following estimators of the
weighted averages of the outcomes:

Ê [wu (Xi ) · Yi | Wi = w] = Σ_{i=1}^{n} ŵu (Xi ) · Yi · I{Wi = w} / Σ_{i=1}^{n} ŵu (Xi ) · I{Wi = w}
                           = Σ_{i=1}^{n} êu (Xi ) · Yi · I{Wi = w} / Σ_{i=1}^{n} êu (Xi ) · I{Wi = w}

Ê [w1,u (Xi ) · Yi | Wi = 1, Si = 1] = Σ_{i=1}^{n} ŵ1,u (Xi ) · Yi · I{Wi = 1} · I{Si = 1} / Σ_{i=1}^{n} ŵ1,u (Xi ) · I{Wi = 1} · I{Si = 1}

Ê [w0,u (Xi ) · Yi | Wi = 0, Si = 0] = Σ_{i=1}^{n} ŵ0,u (Xi ) · Yi · I{Wi = 0} · I{Si = 0} / Σ_{i=1}^{n} ŵ0,u (Xi ) · I{Wi = 0} · I{Si = 0} .
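These normalized (ratio) estimators are often more stable than their unnormalized counterparts. A one-function sketch, with hypothetical arguments y (outcomes), wts (estimated weights), and mask (a boolean selector for the relevant (W, S) group):

```python
import numpy as np

def normalized_weighted_mean(y, wts, mask):
    """Ratio estimator of a weighted outcome average within the
    group selected by `mask`, with weights normalized to unity."""
    return np.sum(wts[mask] * y[mask]) / np.sum(wts[mask])

# e.g., the first term of ACE_01 under monotonicity:
# normalized_weighted_mean(y, w1_01, (w == 1) & (s == 1))
```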
For inference, we can use nonparametric bootstrap, which allows us to naturally account for both
uncertainty from principal score estimation and sampling uncertainty. Bootstrap inference is based
on the following procedure:
1. Fit the principal score model and estimate the principal causal effects on the original data set;
2. Generate B bootstrap data sets by sampling n observations with replacement from the original
data set;
3. Fit the principal score model and estimate the principal causal effects on each of the B bootstrap
data sets.
Bootstrap standard errors and (1 − α) confidence intervals for the principal causal effects can be
obtained by calculating the standard deviation and the (α/2)th and (1 − α/2)th percentiles of the
bootstrap distribution.
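A sketch of this bootstrap loop, assuming a user-supplied function estimator that refits the principal score model and returns the effect of interest (both names are placeholders):

```python
import numpy as np

def bootstrap_ci(estimator, data, n_boot=1000, alpha=0.05, seed=0):
    """Nonparametric bootstrap standard error and percentile interval.
    `data` is a tuple of equal-length arrays, e.g. (w, s, y, X);
    `estimator` recomputes the principal causal effect from scratch,
    including re-estimation of the principal scores (steps 1-3 above)."""
    rng = np.random.default_rng(seed)
    n = len(data[0])
    point = estimator(*data)
    boots = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample with replacement
        boots[b] = estimator(*(d[idx] for d in data))
    se = boots.std(ddof=1)
    ci = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, se, tuple(ci)
```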
It is worth noting that inference depends on the identifying assumptions. Generally, stronger iden-
tifying assumptions (e.g., strong principal ignorability versus weak principal ignorability) increase
precision, leading to estimators with smaller variance. Nevertheless, estimators based on stronger
assumptions may be biased if those assumptions do not hold (see [62] for some theoretical
discussion of this issue).
In the above estimation procedure for principal causal effects, pre-treatment variables are only
used to predict principal stratum membership and construct the weights. Pre-treatment variables may
contain valuable information about both the principal strata and the outcome distributions, and thus,
adjusting for covariates may help improve statistical efficiency. To this end, weighted regression
models can be used to estimate principal causal effects [57, 59]. Alternative covariate-adjusted
estimators can be derived using the following equality:

ACEu = E[Y1i − β′1,u Xi | Ui = u] − E[Y0i − β′0,u Xi | Ui = u] + (β1,u − β0,u )′ E[Xi | Ui = u],
which holds for all u and all fixed vectors βw,u , u = 00, 10, 01, 11; w = 0, 1. Specifically, we can
apply Propositions 17.2 and 17.3 treating the residuals Ywi − β′w,u Xi as new potential outcomes [61]:
Corollary 1. Suppose that ignorability of treatment assignment (Assumption 6) and strong mono-
tonicity (Assumption 7) hold.
If strong/weak principal ignorability (Assumption 11/12) holds, then

E[Ywi − β′w,u Xi | Ui = u] = E [ωu (Xi ) · (Yi − β′w,u Xi ) | Wi = w] , w = 0, 1
E[Xi | Ui = u] = E [ωu (Xi ) · Xi | Wi = 0] = E [ωu (Xi ) · Xi | Wi = 1]

If only Equation (17.2)/(17.4) of strong/weak principal ignorability (Assumption 11/12) holds, then

E[Y1i − β′1,01 Xi | Ui = 01] = E [Yi − β′1,01 Xi | Wi = 1, Si = 1]
E[Y0i − β′0,01 Xi | Ui = 01] = E [ω01 (Xi ) · (Yi − β′0,01 Xi ) | Wi = 0]
E[Xi | Ui = 01] = E [Xi | Wi = 1, Si = 1] = E [ω01 (Xi ) · Xi | Wi = 0]

E[Y1i − β′1,00 Xi | Ui = 00] = E [Yi − β′1,00 Xi | Wi = 1, Si = 0]
E[Y0i − β′0,00 Xi | Ui = 00] = E [ω00 (Xi ) · (Yi − β′0,00 Xi ) | Wi = 0]
E[Xi | Ui = 00] = E [Xi | Wi = 1, Si = 0] = E [ω00 (Xi ) · Xi | Wi = 0]
For any fixed vectors βw,u , covariate-adjusted estimators for principal strata effects can be
derived using the empirical analogs of the expectations in Corollaries 1 and 2, with eu (x) and eu
again replaced by their consistent estimators, êu (Xi ) and êu = Σ_{i=1}^{n} êu (Xi )/n. For instance, under
Assumptions 6, 8 and 12, a possible covariate-adjusted estimator for ACE01 is

ÂCE01^adj = (1/n11 ) Σ_{i=1}^{n} ω̂1,01 (Xi ) · (Yi − β′1,01 Xi ) · I{Wi = 1} · I{Si = 1}
          − (1/n00 ) Σ_{i=1}^{n} ω̂0,01 (Xi ) · (Yi − β′0,01 Xi ) · I{Wi = 0} · I{Si = 0}
          + (β1,01 − β0,01 )′ [1/(n00 + n11 )] Σ_{i=1}^{n} [ω̂1,01 (Xi ) · Xi · I{Wi = 1} · I{Si = 1} + ω̂0,01 (Xi ) · Xi · I{Wi = 0} · I{Si = 0}]
The vectors βw,u can be chosen as linear regression coefficients of the potential outcomes on the space
spanned by the covariates for units of type Ui = u:

βw,u = (E [X′X | Ui = u])^{−1} E [XYw | Ui = u] ,

where X is a matrix with ith row equal to Xi and Yw is an n-dimensional vector with ith element
equal to Ywi . Each component of the above least squares formula is identifiable under ignorability of
treatment assignment, (strong) monotonicity, and strong/weak principal ignorability [61]. Simulation
studies suggest that this type of covariate-adjusted estimator performs better than the regression
method proposed by Jo and Stuart [57]: it is robust to misspecification of the outcome models and
has smaller variance [61].
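In practice, βw,u can be estimated by weighted least squares, using the same principal score weights that identify the stratum-specific expectations; the residuals Yi − β̂′w,u Xi then replace Yi in the weighting estimators. A minimal sketch (the function name and arguments are illustrative):

```python
import numpy as np

def beta_wls(X, y, wts):
    """Weighted least-squares coefficients, an empirical analogue of
    beta_{w,u} = E[X'X | U = u]^{-1} E[X Y_w | U = u], with the
    conditioning on U = u carried by the principal score weights."""
    Xw = X * wts[:, None]                      # weight each row of X
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)
```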
Recently, Jiang et al. [64] have generalized Proposition 17.3 assuming ignorability of treatment
assignment conditionally on covariates: Wi ⊥⊥ (S0i , S1i , Y0i , Y1i ) | Xi . Under ignorability of
treatment assignment given covariates, monotonicity and weak principal ignorability (Assumptions
6, 8, 12), the identification formulas in Equation (17.14) still hold with an additional weighting
term based on the inverse of the treatment probability (see Theorem 1 in [64]). Theorem 1 in [64]
also introduces two additional sets of identification formulas, providing overall three nonparametric
identification formulas for each principal causal effect. These three identification formulas also lead
to three alternative estimators. Moreover, Jiang et al. [64] show that, by appropriately combining
these estimators through either efficient influence functions from semi-parametric theory or
model-assisted estimators from survey sampling theory, we can obtain new estimators
for the principal causal effects that are triply robust, in the sense that they are consistent if any two
of the treatment, intermediate variable, and outcome models are correctly specified, and they are
locally efficient if all three models are correctly specified. Simplified versions of these results hold
under ignorability of treatment assignment as defined in Assumption 6 and/or strong monotonicity
(Assumption 7). See [64] for details.
The principal score also allows us to estimate the expectation of any stratum-specific function of the covariates, h(x),
via a principal score weighted average. Specifically, Ding and Lu [61] show that if the treatment
assignment mechanism satisfies Assumption 6, so that, for all u,

E[h(Xi ) | Ui = u] = E[h(Xi ) | Ui = u, Wi = 1] = E[h(Xi ) | Ui = u, Wi = 0],
then, under strong monotonicity (Assumption 7), we have the following balancing conditions:

E[h(Xi ) | Ui = 01, Wi = 1] = E[h(Xi ) | Wi = 1, Si = 1] = E[ω01 (Xi ) · h(Xi ) | Wi = 0] = E[h(Xi ) | Ui = 01, Wi = 0]
E[h(Xi ) | Ui = 00, Wi = 1] = E[h(Xi ) | Wi = 1, Si = 0] = E[ω00 (Xi ) · h(Xi ) | Wi = 0] = E[h(Xi ) | Ui = 00, Wi = 0]

Similarly, under monotonicity (Assumption 8), we have

E[h(Xi ) | Ui = 01, Wi = 1] = E[ω1,01 (Xi ) · h(Xi ) | Wi = 1, Si = 1] = E[ω0,01 (Xi ) · h(Xi ) | Wi = 0, Si = 0] = E[h(Xi ) | Ui = 01, Wi = 0]
E[h(Xi ) | Ui = 00, Wi = 1] = E[h(Xi ) | Wi = 1, Si = 0] = E[ω0,00 (Xi ) · h(Xi ) | Wi = 0, Si = 0] = E[h(Xi ) | Ui = 00, Wi = 0]
E[h(Xi ) | Ui = 11, Wi = 1] = E[ω1,11 (Xi ) · h(Xi ) | Wi = 1, Si = 1] = E[h(Xi ) | Wi = 0, Si = 1] = E[h(Xi ) | Ui = 11, Wi = 0]
In practice, we replace the true principal scores with the estimated principal scores and investigate
whether, at least approximately, the previous equalities hold. If the above balancing conditions are
violated, we need to specify more flexible models for the principal scores to account for the residual
dependence of the principal stratum variable, Ui , on the covariates, Xi .
Jiang et al. [64] generalize these balancing conditions to studies where the treatment assignment
mechanism is ignorable conditional on the observed covariates and introduce additional sets of
balancing conditions using different sets of weights.
We can assess covariate balance using various summary statistics of the covariate distributions
by treatment status within principal strata. For instance, borrowing from the literature on the role
of the propensity score in the design of a regular observational study [18], we can measure the
difference between the covariate distributions in terms of location and dispersion. As a measure of the
difference in location, we can use the normalized differences within principal strata:

∆u = (E[Xi | Ui = u, Wi = 1] − E[Xi | Ui = u, Wi = 0]) / √((V [Xi | Ui = u, Wi = 1] + V [Xi | Ui = u, Wi = 0])/2) ,

and, as a measure of the difference in dispersion, the logarithm of the ratio of standard deviations
within each principal stratum:

Γu = log (√(V [Xi | Ui = u, Wi = 1]) / √(V [Xi | Ui = u, Wi = 0])) .
These quantities can be easily estimated either directly from the observed data or via weighting
using the estimated principal score according to the above balancing conditions (or to the version
in [64]). For instance, under monotonicity, the mean and the variance of each covariate for units of
type Ui = 01 can be estimated as follows (noting that units of type 01 have Si = Wi ):

Ê[Xi | Ui = 01, Wi = w] = Σ_{i=1}^{n} ω̂w,01 (Xi ) · Xi · I{Wi = w} · I{Si = w} / Σ_{i=1}^{n} ω̂w,01 (Xi ) · I{Wi = w} · I{Si = w}

and

V̂ [Xi | Ui = 01, Wi = w] = Σ_{i=1}^{n} ω̂w,01 (Xi ) · (Xi − Ê[Xi | Ui = 01, Wi = w])² · I{Wi = w} · I{Si = w} / (Σ_{i=1}^{n} ω̂w,01 (Xi ) · I{Wi = w} · I{Si = w} − 1) .
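These diagnostics are straightforward to compute. A sketch for one covariate in stratum Ui = 01 under monotonicity, assuming arrays x, w, s and the estimated weights w1_01 and w0_01 from the earlier sketches:

```python
import numpy as np

def weighted_moments(x, wts, mask):
    """Weighted mean and variance over the units selected by `mask`,
    mirroring the estimators displayed above."""
    xx, ww = x[mask], wts[mask]
    m = np.sum(ww * xx) / np.sum(ww)
    v = np.sum(ww * (xx - m) ** 2) / (np.sum(ww) - 1)
    return m, v

def balance_stats_01(x, w, s, w1_01, w0_01):
    """Normalized difference and log ratio of standard deviations
    for stratum U = 01 under monotonicity."""
    m1, v1 = weighted_moments(x, w1_01, (w == 1) & (s == 1))
    m0, v0 = weighted_moments(x, w0_01, (w == 0) & (s == 0))
    delta = (m1 - m0) / np.sqrt((v1 + v0) / 2)
    gamma = np.log(np.sqrt(v1) / np.sqrt(v0))
    return delta, gamma
```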
In the Faenza study, pre-treatment variables like knowledge of screening tests and breast cancer risk perceptions and attitudes can be
viewed as important confounders, associated with both the compliance behavior and BSE practice.
Therefore, principal ignorability assumptions may be untenable whenever the observed variables do
not properly account for them. Although the Faenza study includes information on pre-treatment
BSE practice and knowledge of breast pathophysiology, which may be considered as proxies of
breast cancer risk perceptions and attitudes and knowledge of screening tests, a sensitivity analysis
with respect to principal ignorability assumptions is still valuable, making inference based on those
assumptions more defensible. In the flu shot encouragement study, Ding and Lu [61] find that results
based on principal ignorability assumptions are consistent with those obtained by Hirano et al. [11]
using a model-based Bayesian approach to inference, suggesting that the exclusion restriction may
be plausible for never-takers, but does not hold for always-takers. The coherence between the two
analyses supports the plausibility of principal ignorability, but does not prove it, and thus, performing
a sensitivity analysis for principal ignorability is advisable.
A sensitivity analysis relaxes the principal ignorability assumptions without replacing them with
additional assumptions, and thus, it leads to ranges of plausible values for principal causal effects,
with the width of these ranges corresponding to the extent to which we allow the principal ignorability
assumptions to be violated.
Feller et al. [62] propose a partial identification-based approach to sensitivity analysis, where
estimates of principal causal effects under principal ignorability assumptions are compared with their
corresponding nonparametric bounds. We can interpret estimates of principal causal effects based on
principal ignorability assumptions as possible guesses of the true principal causal effects within the
bounds, so that estimates of principal causal effects based on principal ignorability outside of the
corresponding bounds provide strong evidence against principal ignorability. Unfortunately, Feller
et al. [62] do not provide any technical detail on the implementation of the partial identification
approach, describing it as a valuable direction for future work.
A formal framework to assess sensitivity to deviations from the principal ignorability
assumptions has been developed by Ding and Lu [61]. Specifically, they focus on violations of the
weaker version of the (weak) principal ignorability assumption (Assumption 14). Assumption 14
implies that (i) under strong monotonicity, the conditional means of the control potential outcomes
are the same for principal strata 01 and 00 given covariates, E[Y0i | Ui = 01, Xi ] = E[Y0i |
Ui = 00, Xi ]; and (ii) under monotonicity, the conditional means of the control potential outcomes
are the same for principal strata 01 and 00 given covariates, and the conditional means of the
treatment potential outcomes are the same for principal strata 01 and 11 given covariates, E[Y0i |
Ui = 01, Xi ] = E[Y0i | Ui = 00, Xi ] and E[Y1i | Ui = 01, Xi ] = E[Y1i | Ui = 11, Xi ].
Given that Assumption 14 is weaker than Assumption 12, the results derived under Assumption 12 in
Propositions 17.2 and 17.3 still hold under this weaker version based on conditional means; and,
clearly, if Assumption 14 does not hold, Assumption 12 does not hold either.
The sensitivity analysis proposed by Ding and Lu [61] is based on the following proposition:
Proposition 17.5. Suppose that ignorability of treatment assignment (Assumption 6) holds.
Under strong monotonicity (Assumption 7), define

ε = E [Y0i | Ui = 01, Xi ] / E [Y0i | Ui = 00, Xi ] .

Then, for a fixed value of ε, we have

ACE01 = E [Yi | Wi = 1, Si = 1] − E [ω01 (Xi ) · Yi | Wi = 0]
ACE00 = E [Yi | Wi = 1, Si = 0] − E [ω00 (Xi ) · Yi | Wi = 0] (17.15)

where

ω01 (x) = ε · e01 (x) / {[ε · e01 (x) + e00 (x)] · e01 } and ω00 (x) = e00 (x) / {[ε · e01 (x) + e00 (x)] · e00 } .
Under monotonicity (Assumption 8), define

ε0 = E [Y0i | Ui = 01, Xi ] / E [Y0i | Ui = 00, Xi ] and ε1 = E [Y1i | Ui = 01, Xi ] / E [Y1i | Ui = 11, Xi ] .

Then, for fixed values of (ε0 , ε1 ), the identification formulas in Equation (17.14) hold with weights

ω0,01 (x) = [ε0 · e01 (x)/(e00 (x) + ε0 · e01 (x))] / [e01 /(e00 + e01 )]
ω1,01 (x) = [ε1 · e01 (x)/(ε1 · e01 (x) + e11 (x))] / [e01 /(e01 + e11 )]
ω0,00 (x) = [e00 (x)/(e00 (x) + ε0 · e01 (x))] / [e00 /(e00 + e01 )]
ω1,11 (x) = [e11 (x)/(ε1 · e01 (x) + e11 (x))] / [e11 /(e01 + e11 )] .
The sensitivity parameters ε and (ε0 , ε1 ) capture deviations from the principal ignorability
assumption (Assumption 14). The principal score for units of type Ui = 01, e01 (x), is over-weighted by the
sensitivity parameter ε under strong monotonicity, and by the sensitivity parameters ε0 and ε1 in
the control group and treatment group, respectively, under monotonicity. For ε = 1 and
(ε0 , ε1 ) = (1, 1), no extra weight is applied and the same identification results hold as those under
(weak) principal ignorability (Assumption 12) shown in Propositions 17.2 and 17.3, respectively.
The further the values of the sensitivity parameters are from 1, the stronger the deviation from
principal ignorability. Given a set of reasonable values for the sensitivity parameters, ε or (ε0 , ε1 ),
we can calculate a lower and an upper bound for the average principal causal effects over that set,
and assess whether inferences based on principal ignorability assumptions are defensible. Note that
Proposition 17.5 implicitly assumes that the sensitivity parameters ε and (ε0 , ε1 ) do not depend on
the covariates.
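Under strong monotonicity, the sensitivity analysis amounts to recomputing the weighted estimator over a grid of ε values. A minimal sketch reusing the hypothetical arrays from the earlier examples:

```python
import numpy as np

def ace_sens_strong_mono(w, s, y, e01_x, eps):
    """ACE_01 and ACE_00 of Proposition 17.5 for one value of the
    sensitivity parameter eps; eps = 1 recovers the estimates
    obtained under principal ignorability."""
    e00_x = 1 - e01_x
    e01 = e01_x.mean()
    denom = eps * e01_x + e00_x
    w01 = eps * e01_x / (denom * e01)
    w00 = e00_x / (denom * (1 - e01))
    ace01 = y[(w == 1) & (s == 1)].mean() - (w01 * y)[w == 0].mean()
    ace00 = y[(w == 1) & (s == 0)].mean() - (w00 * y)[w == 0].mean()
    return ace01, ace00

# range of estimates over a grid of sensitivity parameters
# effects = [ace_sens_strong_mono(w, s, y, e01_x, e)
#            for e in np.linspace(0.5, 2.0, 16)]
```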
The choice of the sensitivity parameters deserves some discussion. Suppose that strong mono-
tonicity holds. The sensitivity parameter ε compares the average outcomes under control for units of
type Ui = 01 and units of type Ui = 00 with the same values of the covariates. If the outcome Y is
binary, ε is the relative risk of Ui on the control potential outcome, Y0i , given covariates, Xi . This
interpretation suggests that we can select the range of ε according to subject-matter knowledge of the
relationship between the control potential outcome and the principal stratum variable. For instance, in
the Faenza study, where strong monotonicity holds by design, it might be reasonable to believe that
on average the never-takers (Ui = 00) are women who feel that performing BSE correctly requires
some experience, and thus we can assume a deviation from homogeneous outcomes under principal
ignorability in the direction corresponding to ε < 1. Similarly, under monotonicity, the sensitivity
parameters ε0 and ε1 can be selected on the basis of background knowledge. For instance, in the
flu shot study, background knowledge suggests that the never-takers may comprise the healthiest
patients and the always-takers may comprise the weakest patients, and thus a plausible choice of the
sensitivity parameters is within the range ε0 > 1 and ε1 < 1 [11, 61]. Ding and Lu [61] perform a
sensitivity analysis with respect to principal ignorability for the flu shot study, finding that the point
and interval estimates vary with the sensitivity parameters (ε0 , ε1 ), but the final conclusions do not
change substantially, and thus, inferences based on principal ignorability are credible.
Interestingly, Proposition 17.5 implies testable conditions for principal ignorability and exclusion
restrictions. For instance, under strong monotonicity, if ACE00 = 0 and ε = 1, then

E [Yi | Wi = 1, Si = 0] = E [ω00 (Xi ) · Yi | Wi = 0] .

If the observed data show evidence against this condition, then we must reject either ACE00 = 0
or ε = 1. Therefore, we can test ACE00 = 0 assuming ε = 1, and we can test ε = 1 assuming
ACE00 = 0. It is worth noting that Proposition 17.5 implies testable conditions for the compatibility
of principal ignorability and exclusion restrictions. Suppose that we are willing to impose principal
ignorability; then we can test ACE00 = 0. If the test leads us to reject the null hypothesis that
ACE00 = 0, then we may reject the exclusion restriction assumption. Nevertheless, we may also
have doubts about principal ignorability.
Similarly, Proposition 17.5 implies testable conditions for the compatibility of principal ignorability
and exclusion restrictions under monotonicity: if ACE00 = 0 and ε0 = 1, then E [Yi | Wi = 1, Si =
0] = E [ω0,00 (Xi ) · Yi | Wi = 0, Si = 0]; and if ACE11 = 0 and ε1 = 1, then E [ω1,11 (Xi ) · Yi |
Wi = 1, Si = 1] = E [Yi | Wi = 0, Si = 1].
Jiang et al. [64] extend Proposition 17.5 to studies where the assignment mechanism is (or is
assumed) ignorable conditionally on covariates, by also allowing the sensitivity parameters to depend
on the covariates, and derive the semi-parametric efficiency theory for sensitivity analysis.
Monotonicity itself can also be subjected to a sensitivity analysis, based on the sensitivity parameter

ξ = Pr(Ui = 10 | Xi ) / Pr(Ui = 01 | Xi ) .

Note that ξ is also the ratio between the marginal probabilities of strata Ui = 10 and Ui = 01:
ξ = Pr(Ui = 10)/Pr(Ui = 01).
The sensitivity parameter ξ takes on values in [0, +∞): when ξ = 0, monotonicity holds; when
ξ > 0, we have a deviation from monotonicity, with zero average causal effect on the intermediate
variable S for ξ = 1, and positive and negative average causal effects on the intermediate variable S
for 0 < ξ < 1 and ξ > 1, respectively.
For a fixed value of ξ, the principal strata proportions are identified by

e01 = (p1 − p0 )/(1 − ξ) , e10 = ξ · (p1 − p0 )/(1 − ξ) , e11 = p1 − (p1 − p0 )/(1 − ξ) ,

where pw = Pr(Si = 1 | Wi = w), w = 0, 1.
These identifying equations for the principal strata proportions imply bounds on the sensitivity parameter
ξ. Formally, we have that

0 ≤ ξ ≤ 1 − (p1 − p0 )/ min{p1 , 1 − p0 } ≤ 1.
Therefore, the observed data provide an upper bound for the sensitivity parameter ξ when the average
causal effect on the intermediate variable, S, is non-negative, simplifying the non-trival task of
selecting values for the sensitivity parameter, ξ: we need to conduct sensitivity analysis varying ξ
within the empirical version of the above bounds.
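The empirical version of this upper bound is immediate to compute from the observed arm-specific proportions of Si = 1; a small sketch:

```python
import numpy as np

def xi_upper_bound(w, s):
    """Empirical upper bound on the monotonicity sensitivity parameter
    xi, valid when the average causal effect on S is non-negative."""
    p1 = s[w == 1].mean()
    p0 = s[w == 0].mean()
    return 1 - (p1 - p0) / min(p1, 1 - p0)
```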
Moreover, under weak principal ignorability (Assumption 12), for a fixed value of ξ, principal
causal effects are identified using the weights

ω0,00 (x) = [e00 (x)/(e00 (x) + e01 (x))] / [e00 /(e00 + e01 )]    ω1,00 (x) = [e00 (x)/(e00 (x) + e10 (x))] / [e00 /(e00 + e10 )]
ω0,11 (x) = [e11 (x)/(e10 (x) + e11 (x))] / [e11 /(e10 + e11 )]    ω1,11 (x) = [e11 (x)/(e01 (x) + e11 (x))] / [e11 /(e01 + e11 )]
ω0,10 (x) = [e10 (x)/(e10 (x) + e11 (x))] / [e10 /(e10 + e11 )]    ω1,10 (x) = [e10 (x)/(e00 (x) + e10 (x))] / [e10 /(e00 + e10 )] .
17.8 Discussion
The chapter has introduced and discussed principal stratification analysis under different versions of
principal ignorability using principal scores.
The principal score plays a role in settings with post-assignment variables similar to the role of the
propensity score in observational studies, in that it may be used to design a principal stratification analysis and
to develop estimation strategies for principal causal effects. Through several examples, we have also tried
to convey the idea that, despite this similarity, principal stratification under principal ignorability
is more delicate for several reasons.
Principal ignorability is an assumption about conditional independence of the outcome of interest
and the principal strata, which we do not completely observe. This makes it more difficult to
identify covariates that allow one to break the potential dependence. The design plays a crucial role
here: complications or analyses with intermediate variables should be anticipated in protocols, and
covariates that are predictive of both the intermediate variable and the outcome should be collected in order
to make principal ignorability plausible (see also Griffin et al. [74] for some discussion of the role
of covariates in principal stratification analysis).
The fact that principal strata are only partially observed renders the estimation of the principal
score in general more complicated than estimating a propensity score and requires some form of
monotonicity. This is one reason why the extension of principal score analysis to more complex
settings such as multivalued intermediate variables, sequential treatment, and so on may not be so
straightforward, although conceptually feasible. In general we would need to extend monotonicity or
find other plausible assumptions to reduce the number of principal strata (see for example [30]).
In addition, when there are more than two principal strata, as with a binary treatment, a binary
intermediate variable, and monotonicity, the estimation procedures need to be modified accord-
ingly. In particular, the principal score of one principal stratum must be rescaled by the probability
of one of the potential values of the intermediate variable, which is given by the sum of multi-
ple principal scores [61]. For instance, compare the weights involved in Proposition 17.2 and in
Proposition 17.3. In particular, consider, e.g., the expected control potential outcome for units
of type Ui = 01, E[Y0i | Ui = 01]. When there exist only two principal strata, 00 and 01,
E[Y0i | Ui = 01] is obtained as a weighted average of the outcomes for units with Wi = 0
and Si = 0 with weights given by ω01 (x) = e01 (x)/e01 = P (Ui = 01 | Xi = x)/P (Ui = 01)
(see Proposition 17.2). When there exist three principal strata, 00, 01, and 11, we need to take
into account that units assigned to the control treatment (with Wi = 0) for which Si = 0
are units of type Ui = 00 or Ui = 01; they cannot be units of type Ui = 11. Therefore
their outcomes need to be weighted using the weights ω0,01 (x), defined by rescaling the ratio
e01 (x)/e01 = P (Ui = 01 | Xi = x)/P (Ui = 01) by (e00 (x) + e01 (x)) / (e00 + e01 ) =
(P (Ui = 00 | Xi = x) + P (Ui = 01 | Xi = x)) / (P (Ui = 00) + P (Ui = 01)) (see Proposition
17.3).
Likewise, with multivalued intermediate variables in principle we can derive theoretical results
under principal ignorability by rescaling the weights in a similar way. However, a continuous
intermediate variable results in infinitely many principal strata, and, thus, requires more structural
assumptions and more advanced statistical inferential tools to estimate the principal scores and
outcome distributions conditional on continuous variables.
Furthermore, theoretical results can also be derived for the case of multivalued treatment when
principal strata are defined by the potential values of the intermediate variable under each treatment
level and principal causal effects are defined as pairwise comparisons within principal strata or
combinations of principal strata.
We discussed primarily point estimators using the principal score and provided some guidelines
to assess some crucial underlying assumptions. However, principal stratification analysis can be very
naturally conducted under a Bayesian framework [10, 17, 27, 31, 68, 69]. Bayesian inference does not
necessarily require either principal ignorability or monotonicity, because it does not require full
identification. From a Bayesian perspective, the posterior distribution of the parameters of interest is
derived by updating a prior distribution to a posterior distribution via a likelihood, irrespective of
whether the parameters are fully or partially identified [17, 24, 68, 69, 75]. This is achieved at the cost
Discussion 343
of having to specify parametric models for the principal strata and the outcomes, although Bayesian
nonparametric tools can be developed and research is still active in this area [28].
The Bayesian approach also offers an easy way to conduct sensitivity analyses for deviations from
both principal ignorability and monotonicity, by checking how the posterior distribution of the causal
parameters changes under specific deviations from the assumptions.
Bayesian inference can indeed also be conducted under principal ignorability, and it would offer
a way of multiply imputing the missing intermediate outcomes. Under principal ignorability, the
posterior variability of the parameters of interest will be smaller, but the posterior distributions might
lead to misleading conclusions if the underlying principal ignorability assumption does not hold.
When principal ignorability assumptions appear untenable, as an alternative to the Bayesian
approach, a partial identification approach can be used, deriving nonparametric bounds on principal
causal effects. Unfortunately, unadjusted bounds are often too wide to be informative. The causal
inference literature has dealt with this issue, developing various strategies for shrinking bounds on
principal causal effects, based on pre-treatment covariates and/or secondary outcomes [33, 37, 38, 45–
48].
Until very recently, principal stratification analysis under principal ignorability assumptions has
mostly been conducted under the ignorability assumption that the treatment assignment mechanism
is independent of both potential outcomes and covariates, together with some type of monotonicity assumption.
There is now ongoing work aimed at extending principal stratification analysis under principal
ignorability assumptions: Jiang et al. [64] develop multiply robust estimators for principal causal
effects under principal ignorability, which can also be used to analyze block-randomized experiments
and observational studies, where the assignment mechanism is assumed to be ignorable conditional
on pre-treatment covariates; Han et al. [76] propose a new estimation technique based on a stochastic
monotonicity assumption, which is weaker than the deterministic monotonicity usually invoked.
We believe that further extensions of principal stratification analysis under principal ignorability
assumptions might be worthwhile, providing a fertile ground for future work.
References
[1] Jerzy S Neyman. On the application of probability theory to agricultural experiments. Essay on
principles. Section 9. (Translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science
(1990), 5, 465–480.) Annals of Agricultural Sciences, 10:1–51, 1923.
[2] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66(5):688, 1974.
[3] Donald B Rubin. Assignment to treatment group on the basis of a covariate. Journal of
Educational Statistics, 2(1):1–26, 1977.
[4] Donald B Rubin. Bayesian inference for causal effects: The role of randomization. The Annals
of Statistics, 6(1):34–58, 1978.
[5] Donald B Rubin. Discussion of “Randomization analysis of experimental data in the Fisher
randomization test” by D. Basu. Journal of the American Statistical Association, 75:591–593,
1980.
[6] Donald B Rubin. Comment: Neyman (1923) and causal inference in experiments and observa-
tional studies. Statistical Science, 5(4):472–480, 1990.
[7] Paul W Holland. Statistics and causal inference. Journal of the American Statistical Association,
81(396):945–960, 1986.
[8] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical
sciences. Cambridge University Press, 2015.
[9] Constantine E Frangakis and Donald B Rubin. Principal stratification in causal inference.
Biometrics, 58(1):21–29, 2002.
[10] Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using
instrumental variables. Journal of the American statistical Association, 91(434):444–455, 1996.
[11] Keisuke Hirano, Guido W Imbens, Donald B Rubin, and Xiao-Hua Zhou. Assessing the effect
of an influenza vaccine in an encouragement design. Biostatistics, 1(1):69–88, 2000.
[12] Michela Bia, Alessandra Mattei, and Andrea Mercatanti. Assessing causal effects in a longitu-
dinal observational study with “truncated” outcomes due to unemployment and nonignorable
missing data. Journal of Business & Economic Statistics, pages 1–12, 2021.
[13] Paolo Frumento, Fabrizia Mealli, Barbara Pacini, and Donald B Rubin. Evaluating the effect of
training on wages in the presence of noncompliance, nonemployment, and missing outcome
data. Journal of the American Statistical Association, 107(498):450–466, 2012.
[14] Junni L Zhang, Donald B Rubin, and Fabrizia Mealli. Likelihood-based analysis of causal effects via principal
stratification: new approach to evaluating job training programs. Journal of the American
Statistical Association, 104:166–176, 2009.
[15] Alessandra Mattei, Fabrizia Mealli, and Barbara Pacini. Identification of causal effects in the
presence of nonignorable missing outcome values. Biometrics, 70(2):278–288, 2014.
[16] Peter B Gilbert and Michael G Hudgens. Evaluating candidate principal surrogate endpoints.
Biometrics, 64(4):1146–1154, 2008.
[17] Michela Baccini, Alessandra Mattei, and Fabrizia Mealli. Bayesian inference for causal
mechanisms with application to a randomized study for postoperative pain control. Biostatistics,
18(4):605–617, 2017.
[18] Laura Forastiere, Alessandra Mattei, and Peng Ding. Principal ignorability in mediation
analysis: through and beyond sequential ignorability. Biometrika, 105(4):979–986, 2018.
[19] Alessandra Mattei and Fabrizia Mealli. Augmented designs to assess principal strata direct
effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):729–
752, 2011.
[20] Fabrizia Mealli and Donald B Rubin. Assumptions allowing the estimation of direct causal
effects. Journal of Econometrics, 112(1):79–87, 2003.
[21] Donald B Rubin. Direct and indirect causal effects via potential outcomes. Scandinavian
Journal of Statistics, 31(2):161–170, 2004.
[22] Thomas R Ten Have and Marshall M Joffe. A review of causal estimation of effects in mediation
analyses. Statistical Methods in Medical Research, 21(1):77–107, 2012.
[23] Fabrizia Mealli and Alessandra Mattei. A refreshing account of principal stratification. The Inter-
national Journal of Biostatistics, 8(1), 2012. doi: 10.1515/1557-4679.1380. PMID: 22611592.
[24] Alessandra Mattei and Fabrizia Mealli. Application of the principal stratification approach
to the Faenza randomized experiment on breast self-examination. Biometrics, 63(2):437–446,
2007.
[25] Avi Feller, Todd Grindal, Luke Miratrix, and Lindsay C Page. Compared to what? variation
in the impacts of early childhood education by alternative care type. The Annals of Applied
Statistics, 10(3):1245–1285, 2016.
[26] Hui Jin and Donald B Rubin. Principal stratification for causal inference with extended partial
compliance. Journal of the American Statistical Association, 103(481):101–111, 2008.
[27] Chanmin Kim, Michael J Daniels, Joseph W Hogan, Christine Choirat, and Corwin M Zigler.
Bayesian methods for multiple mediators: Relating principal stratification and causal mediation
in the analysis of power plant emission controls. The Annals of Applied Statistics, 13(3):1927,
2019.
[28] Scott L Schwartz, Fan Li, and Fabrizia Mealli. A Bayesian semiparametric approach to
intermediate variables in causal inference. Journal of the American Statistical Association,
106(496):1331–1344, 2011.
[29] Alessandra Mattei, Fabrizia Mealli, and Peng Ding. Assessing causal effects in the presence of
treatment switching through principal stratification. arXiv preprint arXiv:2002.11989, 2020.
[30] Constantine E Frangakis, Ronald S Brookmeyer, Ravi Varadhan, Mahboobeh Safaeian, David
Vlahov, and Steffanie A Strathdee. Methodology for evaluating a partially controlled longitu-
dinal treatment using principal stratification, with application to a needle exchange program.
Journal of the American Statistical Association, 99(465):239–249, 2004.
[31] Laura Forastiere, Patrizia Lattarulo, Marco Mariani, Fabrizia Mealli, and Laura Razzolini.
Exploring encouragement, treatment, and spillover effects using principal stratification, with
application to a field experiment on teens’ museum attendance. Journal of Business & Economic
Statistics, 39(1):244–258, 2021.
[32] Guido W Imbens and Donald B Rubin. Estimating outcome distributions for compliers in
instrumental variables models. The Review of Economic Studies, 64(4):555–574, 1997.
[33] Junni L Zhang and Donald B Rubin. Estimation of causal effects via principal stratification
when some outcomes are truncated by “death”. Journal of Educational and Behavioral Statistics,
28(4):353–368, 2003.
[34] Peng Ding, Zhi Geng, Wei Yan, and Xiao-Hua Zhou. Identifiability and estimation of causal
effects by principal stratification with outcomes truncated by death. Journal of the American
Statistical Association, 106(496):1578–1591, 2011.
[35] Zhichao Jiang, Peng Ding, and Zhi Geng. Principal causal effect identification and surrogate
end point evaluation by multiple trials. Journal of the Royal Statistical Society: Series B:
Statistical Methodology, pages 829–848, 2016.
[36] Alessandra Mattei, Fan Li, and Fabrizia Mealli. Exploiting multiple outcomes in Bayesian
principal stratification analysis with application to the evaluation of a job training program. The
Annals of Applied Statistics, 7(4):2336–2360, 2013.
[37] Fabrizia Mealli and Barbara Pacini. Using secondary outcomes to sharpen inference in ran-
domized experiments with noncompliance. Journal of the American Statistical Association,
108(503):1120–1131, 2013.
[38] Fabrizia Mealli, Barbara Pacini, and Elena Stanghellini. Identification of principal causal effects
using additional outcomes in concentration graphs. Journal of Educational and Behavioral
Statistics, 41(5):463–480, 2016.
[39] Fan Yang and Dylan S Small. Using post-outcome measurement information in censoring-by-
death problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
78(1):299–318, 2016.
[40] Corwin M Zigler and Thomas R Belin. A bayesian approach to improved estimation of causal
effect predictiveness for a principal surrogate endpoint. Biometrics, 68(3):922–932, 2012.
[41] Avi Feller, Evan Greif, Luke Miratrix, and Natesh Pillai. Principal stratification in the twilight
zone: Weakly separated components in finite mixture models. arXiv preprint arXiv:1602.06595,
2016.
[42] Paolo Frumento, Fabrizia Mealli, Barbara Pacini, and Donald B Rubin. The fragility of standard
inferential approaches in principal stratification models relative to direct likelihood approaches.
Statistical Analysis and Data Mining: The ASA Data Science Journal, 9(1):58–70, 2016.
[43] Kosuke Imai. Sharp bounds on the causal effects in randomized experiments with “truncation-
by-death”. Statistics & Probability Letters, 78(2):144–149, 2008.
[44] Junni L Zhang, Donald B Rubin, and Fabrizia Mealli. Evaluating the effects of job training
programs on wages through principal stratification. In Modelling and Evaluating Treatment
Effects in Econometrics. Emerald Group Publishing Limited, 2008.
[45] Jing Cheng and Dylan S Small. Bounds on causal effects in three-arm trials with non-
compliance. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
68(5):815–836, 2006.
[46] Leonardo Grilli and Fabrizia Mealli. Nonparametric bounds on the causal effect of university
studies on job opportunities using principal stratification. Journal of Educational and Behavioral
Statistics, 33(1):111–130, 2008.
[47] David S Lee. Training, wages, and sample selection: Estimating sharp bounds on treatment
effects. The Review of Economic Studies, 76(3):1071–1102, 2009.
[48] Dustin M Long and Michael G Hudgens. Sharpening bounds on principal effects with covariates.
Biometrics, 69(4):812–819, 2013.
[49] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observa-
tional studies for causal effects. Biometrika, 70(1):41–55, 1983.
[50] Donald B Rubin. The design versus the analysis of observational studies for causal effects:
parallels with the design of randomized trials. Statistics in Medicine, 26(1):20–36, 2007.
[51] Donald B Rubin. For objective causal inference, design trumps analysis. The Annals of Applied
Statistics, 2(3):808–840, 2008.
[52] Jennifer Hill, Jane Waldfogel, and Jeanne Brooks-Gunn. Differential effects of high-quality
child care. Journal of Policy Analysis and Management: The Journal of the Association for
Public Policy Analysis and Management, 21(4):601–627, 2002.
[53] Dean A Follmann. On the effect of treatment among would-be treatment compliers: An analysis
of the multiple risk factor intervention trial. Journal of the American Statistical Association,
95(452):1101–1109, 2000.
[54] Bruno Crépon, Florencia Devoto, Esther Duflo, and William Parienté. Estimating the impact
of microcredit on those who take it up: Evidence from a randomized experiment in Morocco.
American Economic Journal: Applied Economics, 7(1):123–50, 2015.
[55] Peter Z Schochet and John Burghardt. Using propensity scoring to estimate program-related
subgroup impacts in experimental program evaluations. Evaluation Review, 31(2):95–120,
2007.
[56] Fuhua Zhai, Jeanne Brooks-Gunn, and Jane Waldfogel. Head Start’s impact is contingent on
alternative type of care in comparison group. Developmental Psychology, 50(12):2572, 2014.
[57] Booil Jo and Elizabeth A Stuart. On the use of propensity scores in principal causal effect
estimation. Statistics in Medicine, 28(23):2857–2875, 2009.
[58] Booil Jo, Elizabeth A Stuart, David P MacKinnon, and Amiram D Vinokur. The use of
propensity scores in mediation analysis. Multivariate Behavioral Research, 46(3):425–452,
2011.
[59] Elizabeth A Stuart and Booil Jo. Assessing the sensitivity of methods for estimating principal
causal effects. Statistical Methods in Medical Research, 24(6):657–674, 2015.
[60] Raphaël Porcher, Clémence Leyrat, Gabriel Baron, Bruno Giraudeau, and Isabelle Boutron. Per-
formance of principal scores to estimate the marginal compliers causal effect of an intervention.
Statistics in Medicine, 35(5):752–767, 2016.
[61] Peng Ding and Jiannan Lu. Principal stratification analysis using principal scores. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 79(3):757–777, 2017.
[62] Avi Feller, Fabrizia Mealli, and Luke Miratrix. Principal score methods: Assumptions, ex-
tensions, and practical considerations. Journal of Educational and Behavioral Statistics,
42(6):726–758, 2017.
[63] Marshall M Joffe, Dylan S Small, and Chi-Yuan Hsu. Defining and estimating intervention
effects for groups that will develop an auxiliary outcome. Statistical Science, 22(1):74–97,
2007.
[64] Zhichao Jiang, Shu Yang, and Peng Ding. Multiply robust estimation of causal effects under
principal ignorability. arXiv preprint arXiv:2012.01615, 2020.
[65] Fabrizia Mealli, Guido W Imbens, Salvatore Ferro, and Annibale Biggeri. Analyzing a random-
ized trial on breast self-examination with noncompliance and missing outcomes. Biostatistics,
5(2):207–222, 2004.
[66] Leah Comment, Fabrizia Mealli, Sebastien Haneuse, and Corwin Zigler. Survivor average
causal effects for continuous time: A principal stratification approach to causal inference with
semicompeting risks. arXiv preprint arXiv:1902.09304, 2019.
[67] Laura Forastiere, Fabrizia Mealli, and Tyler J VanderWeele. Identification and estimation of
causal mechanisms in clustered encouragement designs: Disentangling bed nets using Bayesian
principal stratification. Journal of the American Statistical Association, 111(514):510–525,
2016.
[68] Constantine E Frangakis, Donald B Rubin, and Xiao-Hua Zhou. Clustered encouragement
designs with individual noncompliance: Bayesian inference with randomization, and application
to advance directive forms. Biostatistics, 3(2):147–164, 2002.
[69] Booil Jo, Tihomir Asparouhov, and Bengt O Muthén. Intention-to-treat analysis in cluster
randomized trials with noncompliance. Statistics in Medicine, 27(27):5565–5577, 2008.
[70] Booil Jo, Tihomir Asparouhov, Bengt O Muthén, Nicholas S Ialongo, and C Hendricks Brown.
Cluster randomized trials with treatment noncompliance. Psychological Methods, 13(1):1,
2008.
[71] Donald B Rubin. Causal inference through potential outcomes and principal stratification:
Application to studies with “censoring” due to death. Statistical Science, 21(3):299–309, 2006.
[72] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological),
39(1):1–22, 1977.
[73] Martin A Tanner and Wing Hung Wong. The calculation of posterior distributions by data
augmentation. Journal of the American Statistical Association, 82(398):528–540, 1987.
[74] Beth Ann Griffin, Daniel F McCaffrey, and Andrew R Morral. An application of principal
stratification to control for institutionalization at follow-up in studies of substance abuse
treatment programs. The Annals of Applied Statistics, 2(3):1034, 2008.
[75] Paul Gustafson. Bayesian inference for partially identified models. The International Journal
of Biostatistics, 6(2), Article 17, 2010. doi: 10.2202/1557-4679.1206. PMID: 21972432.
[76] Shasha Han, Larry Han, and Jose R. Zubizarreta. Principal resampling for causal inference.
Mimeo, 2021.
18
Incremental Causal Effects: An Introduction and Review
CONTENTS
18.1 Introduction 350
18.2 Preliminaries: Potential Outcomes, the Average Treatment Effect, and Types of Interventions 351
18.2.1 Average treatment effects 351
18.2.2 When positivity is violated 352
18.2.3 Dynamic interventions 353
18.2.4 Stochastic interventions 354
18.3 Incremental Propensity Score Interventions 355
18.3.1 Identification 356
18.3.2 Estimation 357
18.3.3 Properties of the estimator 359
18.3.3.1 Pointwise inference 359
18.3.3.2 Uniform inference 361
18.4 Time-varying Treatments 362
18.4.1 Notation 362
18.4.2 Marginal structural models 362
18.4.3 Time-varying incremental effects 364
18.4.4 Identification 365
18.4.5 Estimation 365
18.4.6 Inference 366
18.5 Example Analysis 367
18.6 Extensions & Future Directions 369
18.6.1 Censoring & dropout 369
18.6.2 Future directions 369
18.7 Discussion 369
References 370
In this chapter, we review the class of causal effects based on incremental propensity score inter-
ventions proposed by [1]. The aim of incremental propensity score interventions is to estimate the
effect of increasing or decreasing subjects’ odds of receiving treatment; this differs from the average
treatment effect, where the aim is to estimate the effect of everyone deterministically receiving versus
not receiving treatment. We first present incremental causal effects in the case of a single binary
treatment, so that they can be compared with average treatment effects and shed light on
key concepts. In particular, a benefit of incremental effects is that positivity – a common assumption
in causal inference – is not needed to identify causal effects. Then we discuss the more general
case where treatment is measured at multiple time points, where positivity is more likely to be
violated and thus incremental effects can be especially useful. Throughout, we motivate incremental
effects with real-world applications, present nonparametric estimators for these effects, and discuss
their efficiency properties, while also briefly reviewing the role of influence functions in functional
estimation. Finally, we show how to interpret and analyze results using these estimators in practice
and discuss extensions and future directions.
18.1 Introduction
Understanding the effect of a variable A, the treatment, on another variable Y , the outcome, involves
measuring how the distribution of Y changes when the distribution of A is manipulated in some
way. By “manipulating a distribution,” we mean that we imagine a world where we can change the
distribution of A to some other distribution of our choice, which we refer to as the “intervention
distribution.” The intervention distribution defines the causal effect of A on Y , i.e., the causal
estimand. For instance, suppose $A$ is binary and is completely randomized with some probability $c_0$; one may then ask how the average outcome $Y$ would change if the randomization probability were set to some other constant $c_1$.
In order to answer questions about causal effects, we adopt the potential outcomes framework
( [2]). We suppose that each subject in the population is linked to a number of “potential outcomes,” denoted by $Y^a$, that equal the outcome $Y$ that would have been observed if the subject had received treatment $A = a$. In practice, only one potential outcome is observed for each subject, because one cannot go back in time and assign a different treatment value to the same subject. As a result, we
must make several assumptions to identify and estimate causal effects. One common assumption is
“positivity,” which says that each subject in the population must have a non-zero chance of receiving
each treatment level. Positivity is necessary to identify the average treatment effect, which is the most
common causal estimand in the literature; however, positivity may be very unrealistic in practice,
particularly in many time-point analyses where the number of possible treatment regimes scales
exponentially with the number of time-points.
In this chapter we introduce and review the class of “incremental causal effects” proposed in [1].
Incremental causal effects are based on an intervention distribution that shifts the odds of receiving a
binary treatment by a user-specified amount. Crucially, incremental causal effects are well-defined
and can be efficiently estimated even when positivity is violated. Furthermore, they represent an
intuitive way to summarize the effect of the treatment on the outcome, even in high-dimensional,
time-varying settings. In Section 18.2 we first review how to identify and estimate the average
treatment effect, as well as review the notion of static interventions, dynamic interventions, and
stochastic interventions. A limitation of the average treatment effect is that it only considers a static
and deterministic intervention, where all subjects are either assigned to treatment or control, which
may be improbable if positivity is violated or nearly violated. On the other hand, dynamic and
stochastic interventions allow the treatment to vary across subjects, thereby lessening dependence
on positivity. Incremental causal effects assume a particular stochastic intervention, which we
discuss in depth for binary treatments in Section 18.3 before discussing time-varying treatments in
Section 18.4. We then demonstrate how to use incremental effects in practice in Section 18.5. We
conclude with a discussion of extensions and future directions for incremental causal effects in
Section 18.6.
18.2 Preliminaries: Potential Outcomes, the Average Treatment Effect, and Types of Interventions

18.2.1 Average treatment effects

The most common causal estimand is the average treatment effect,
$$\text{ATE} = E(Y^1 - Y^0) = E(Y^1) - E(Y^0). \qquad (18.1)$$
The ATE compares two quantities: $E(Y^1)$, the mean outcome when all subjects receive treatment, and $E(Y^0)$, the mean outcome when all subjects receive control. As a running example, consider the effect of attending behavioral health services on recidivism among probationers: these quantities are the recidivism rate when everyone attends behavioral health services and the recidivism rate when no one attends behavioral health services, respectively.
The fundamental problem of causal inference ( [7]) is that each subject receives either treatment or control – never both – and thus we only observe one of the two potential outcomes for each subject. In other words, the difference $Y^1 - Y^0$ is only partially observed for each subject. Thus, assumptions must be made in order to identify and thereby estimate the ATE. It is common to make the following three assumptions to estimate the ATE:
Assumption 17. (Consistency). $Y = Y^a$ if $A = a$.

Assumption 18. (Exchangeability). $Y^a \perp A \mid X$ for $a \in \{0, 1\}$.

Assumption 19. (Positivity). There exists $\epsilon > 0$ such that $P\{\epsilon \le \pi(X) \le 1 - \epsilon\} = 1$, where $\pi(X) = P(A = 1 \mid X)$ denotes the propensity score.
Consistency says that if an individual takes treatment $a$, we observe their potential outcome under that treatment; consistency allows us to write the observed outcome as $Y = AY^1 + (1 - A)Y^0$ and would be violated if, for example, there were interference between subjects, such that one subject’s treatment affected another subject’s outcome. Meanwhile, exchangeability says that treatment is effectively randomized within covariate strata, in the sense that treatment is independent of subjects’ potential outcomes – as in a randomized experiment – after conditioning on covariates. This assumption is also called “no unmeasured confounding,” which means that there are no variables beyond $X$ that induce dependence between treatment and the potential outcomes. Finally, positivity says that the propensity score is bounded away from zero and one for all subjects. With these three assumptions, we have
assumptions, we have
Y 1(A = a)
E(Y a ) = E{E(Y a | X)} = E{E(Y a | X, A = a)} = E{E(Y | X, A = a)} = E ,
π(X)
i.e., the mean outcome if all were assigned A = a reduces to the regression function E(Y | X, A =
a), averaged over the distribution of X. The average treatment effect is thus ATE = E{E(Y |
X, A = 1) − E(Y | X, A = 0)}.
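The chain of equalities above can be checked numerically. Below is a minimal R sketch on simulated data (the data-generating process and all variable names are ours, for illustration only): the regression form and the weighting form of $E(Y^1)$ agree, and both recover the true value.

# Minimal R sketch (simulated data; names are ours): the regression form
# E{E(Y | X, A = 1)} and the weighting form E{1(A = 1) Y / P(A = 1 | X)}
# of E(Y^1) agree when the nuisance models are correct.
set.seed(1)
n  <- 1e5
X  <- rnorm(n)
ps <- plogis(0.5 * X)            # true propensity score P(A = 1 | X)
A  <- rbinom(n, 1, ps)
Y  <- 1 + X + 2 * A + rnorm(n)   # so E(Y^1) = 3 and the true ATE is 2

# Regression (g-computation) form for E(Y^1):
mu1 <- predict(lm(Y ~ X, subset = A == 1), newdata = data.frame(X = X))
mean(mu1)         # approximately 3

# Weighting (IPW) form for E(Y^1), using the true propensity score:
mean(A * Y / ps)  # approximately 3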
No regular estimator of the ATE can have asymptotic variance smaller than
$$E\left[\frac{\mathrm{Var}(Y \mid X, A = 1)}{\pi(X)} + \frac{\mathrm{Var}(Y \mid X, A = 0)}{1 - \pi(X)} + \{\tau(X) - \text{ATE}\}^2\right],$$
where $\tau(X) = E(Y \mid X, A = 1) - E(Y \mid X, A = 0)$ ( [12]). This is called the semiparametric efficiency bound; it is the analog, for this problem, of the Cramér-Rao lower bound in parametric models. See [13] for a review of this topic and the precise definition of regular estimators. Because the efficiency bound involves the reciprocal of the propensity score, near-positivity violations clearly degrade the precision with which the ATE can be estimated.
18.2.2 When positivity is violated

In the presence of positivity violations, where there are substantial limits on how well one can estimate the ATE in (18.1), an alternative option is to target different estimands. For example, matching methods restrict analyses to a subsample that exhibits covariate balance between treatment and control, so that positivity is more plausible for that subsample ( [14, 15]). In this case, the targeted estimand is the ATE for the matched subsample instead of the ATE in the entire population. Similarly, propensity score trimming removes subjects whose propensity scores fall outside the interval $[\epsilon, 1 - \epsilon]$, for some user-specified $\epsilon > 0$ ( [16]). If the true propensity scores were known, this trimming would ensure positivity holds by design, and the targeted estimand is the ATE for the trimmed subsample instead of the ATE. A benefit of these approaches is that standard ATE estimators can still be used, simply within a subsample. However, complications arise because the estimand then depends on the sample: the subsample ATE is only defined after a matching algorithm is chosen or propensity scores are estimated. Thus, interpretability may be a concern, because the causal effect is only defined for the sample at hand instead of the broader population of interest. A way to overcome this shortcoming is to define the estimand based on the trimmed true propensity scores, i.e., $E\{Y^1 - Y^0 \mid \pi(X) \in [\epsilon, 1 - \epsilon]\}$. However, estimating the indicator $\mathbf{1}\{\pi(X) \in [\epsilon, 1 - \epsilon]\}$ is challenging in flexible, nonparametric models because it is a non-smooth transformation of the data-generating distribution; as such, $\sqrt{n}$-consistent estimators do not generally exist without imposing further assumptions.
18.2.3 Dynamic interventions

The estimands discussed so far – the ATE in (18.1) and the subsample ATEs implied by matching or propensity score trimming – all represent static interventions, where the treatment $A$ is set to fixed values across a population of interest. In what follows, we consider alternative estimands that researchers can target in the face of positivity violations. These estimands concern effects of dynamic (non-static) interventions, where, in counterfactual worlds, the treatment is allowed to vary across subjects instead of being set to fixed values. Indeed, the static interventions discussed so far – where every subject receives treatment or every subject receives control – may be impossible to implement in many applications, whereas non-static interventions may be closer to what we would expect to be possible in practice.
As an example, suppose we want to measure the effect of providing behavioral health services only for probationers for whom positivity holds, i.e., for probationers for whom $P(A = 1 \mid X) \in [\epsilon, 1 - \epsilon]$. In other words, we would like to answer the question, “What is the causal effect of providing behavioral health services to probationers who can plausibly choose whether or not to attend services?” We could address this question by considering the following dynamic intervention:
$$d_a(X) = \begin{cases} a & \text{if } \epsilon \le \pi(X) \le 1 - \epsilon, \\ 1 & \text{if } \pi(X) > 1 - \epsilon, \\ 0 & \text{if } \pi(X) < \epsilon. \end{cases}$$
A similar dynamic intervention was discussed in [17]. Under this intervention, we set subjects to treatment $a$ when positivity holds; otherwise, treatment is fixed at $a = 1$ for subjects who will almost certainly receive treatment and at $a = 0$ for subjects who will almost certainly receive control. In this case the causal estimand is $E\{Y^{d_1(X)} - Y^{d_0(X)}\}$, where $E\{Y^{d_a(X)}\}$ denotes the average outcome when $A$ is set according to $d_a(X)$ across subjects. Note that $E\{Y^{d_1(X)}\}$ differs from $E(Y^1)$ in the ATE (18.1) to the extent that there are subjects for whom $P(A = 1 \mid X) < \epsilon$, in which case some subjects receive control under $d_1(X)$ but not under the static intervention defining $E(Y^1)$. In this case, $E\{Y^{d_1(X)} - Y^{d_0(X)}\}$ is equivalent to the ATE within a propensity score trimmed subsample, but it remains well-defined for the entire population, because treatment is simply held fixed for subjects outside the overlap region.
18.2.4 Stochastic interventions

Let $Q(a \mid x)$ denote the probability distribution of treatment under a stochastic intervention, and let $E(Y^Q)$ denote the average outcome under this intervention. The quantity $Q(a \mid x)$ is also known as the intervention distribution, because it is a distribution specified by an intervention. This intervention can depend on the covariates $x$, in which case it is a dynamic stochastic intervention. Or, we can have $Q(a \mid x) = Q(a)$ for all $x$, which would be a static (but potentially stochastic) intervention. The quantity $E(Y^Q)$ can then be written as a weighted average of the potential outcomes $Y^a$, with weights given by $Q(a \mid x)$:
$$E(Y^Q) = \int_{\mathcal{X}} \int_{\mathcal{A}} E(Y^a \mid X = x) \, dQ(a \mid x) \, dP(x). \qquad (18.2)$$
Causal effects under stochastic interventions are often framed as contrasts of $E(Y^Q)$ for different intervention distributions $Q$. For example, as stated earlier, the ATE in (18.1) can be viewed as contrasting $E(Y^Q)$ for two different point-mass distributions, where every subject receives treatment or every subject receives control.
Under consistency and exchangeability (Assumptions 17 and 18), we can identify (18.2) as
$$E(Y^Q) = \int_{\mathcal{X}} \int_{\mathcal{A}} E(Y \mid X = x, A = a) \, dQ(a \mid x) \, dP(x), \qquad (18.3)$$
where $\mathcal{A}$ denotes the set of all possible treatment values and $\mathcal{X}$ the set of all possible covariate values. Note that the notation $dQ(a \mid x)$ acknowledges that the intervention distribution may depend on covariates. In the binary treatment case, this reduces to
$$E(Y^Q) = \int_{\mathcal{X}} \big\{ E(Y \mid X = x, A = 0)\, Q(A = 0 \mid x) + E(Y \mid X = x, A = 1)\, Q(A = 1 \mid x) \big\} \, dP(x).$$
Notably, positivity (Assumption 19) may not be needed to identify (18.2), depending on the definition of $Q$. We now turn to incremental propensity score interventions, which are stochastic interventions defining an intuitive causal estimand that does not rely on positivity for identification.

18.3 Incremental Propensity Score Interventions

An incremental propensity score intervention replaces the propensity score $\pi(x)$ with the shifted propensity score
$$Q(A = 1 \mid x) \equiv q(x; \delta, \pi) = \frac{\delta \pi(x)}{\delta \pi(x) + 1 - \pi(x)}, \qquad (18.4)$$
where $\pi(x) = P(A = 1 \mid X = x)$. In our recidivism example, this intervention would shift each probationer’s probability of attending behavioral health services. The incremental parameter $\delta$ is user-specified and controls the direction and magnitude of the propensity score shift: it tells us how much the intervention changes the odds of receiving treatment.
For example, if $\delta = 1.5$, then the intervention increases the odds of treatment by 50% for everyone. If $\delta = 1$, then we are left with the observational propensity scores, and $q(x; \delta, \pi) = \pi(x)$. As $\delta$ increases from 1 towards $\infty$, the shifted propensity scores approach 1, and as $\delta$ decreases from 1 towards 0, the shifted propensity scores approach 0. There are other interventions one might consider, but shifting the odds of treatment is a simple intervention that gives an intuitive interpretation for the parameter $\delta$.
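These properties are easy to see in code. The small R helper below is our own illustration, not part of [1]; it simply evaluates the shift in equation (18.4):

# Shift the odds of treatment by a factor delta, as in equation (18.4).
q_delta <- function(ps, delta) delta * ps / (delta * ps + 1 - ps)

q_delta(0.5, 1.5)    # odds 1 -> 1.5, i.e., shifted probability 0.6
q_delta(0.5, 1)      # delta = 1 leaves the propensity score unchanged
q_delta(c(0, 1), 5)  # propensity scores of 0 and 1 are left at 0 and 1

The last line previews the point made below: subjects with $\pi(x) = 0$ or $\pi(x) = 1$ are left untouched by the intervention.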
Remark 18.1. It is possible to let δ depend on X, thereby allowing the intervention distribution Q
to modify the treatment process differently based on subjects’ covariates. This would lead to more
nuanced definitions of treatment effects, potentially at the cost of losing straightforward interpretation
of the estimated effects. In fact, taking δ to be constant is not an assumption; it just defines the
particular causal estimand that is targeted for inference.
Incremental propensity score interventions allow us to avoid the tricky issues with positivity that
were discussed in Section 18.2. There are two groups of people for whom the positivity assumption
is violated: people who never attend treatment (π = 0) and people who always attend treatment
(π = 1). Incremental interventions avoid assuming positivity for these groups because the intervention
leaves their propensity scores unchanged: it has the useful property that $\pi = 0 \implies q = 0$ and $\pi = 1 \implies q = 1$.
We cannot know a priori whether positivity is violated, so this intervention allows us to compute
effects on our data without worrying whether positivity holds. Thus, this intervention differs from
the dynamic intervention in Section 18.2, because we do not make our intervention depend on the
propensity score of each individual in our sample. Practically, this means we could still include in our
sample people who always attend or never attend treatment; e.g., in our running recidivism example,
we could still include individuals who must attend treatment as part of their probation, and the causal
effect is still well-defined.
Remark 18.2. The reader may wonder why we do not consider simpler interventions, such as
q(x; δ, π) = π + δ or q(x; δ, π) = π · δ. One reason is that these interventions require the range of
δ to depend on the distribution P, because otherwise q(x; δ, π) may fall outside the unit interval.
In contrast, for any δ, the incremental propensity score intervention constrains Q so that 0 ≤
q(x; δ, π) ≤ 1.
Remark 18.3. If positivity holds, incremental interventions contain the ATE as a special case. If $\pi(x)$ is bounded away from zero and one almost surely, then $E(Y^Q)$ tends to $E(Y^1)$ as $\delta \to \infty$ and to $E(Y^0)$ as $\delta \to 0$. Thus, both $E(Y^1)$ and $E(Y^0)$ can be approximated by taking $\delta$ very large or very small, and the ATE can be approximated by taking their contrast.
18.3.1 Identification
Under consistency and exchangeability (Assumptions 17 and 18), we can plug the incremental intervention distribution $Q(a \mid x) = q(x; \delta, \pi)^a \{1 - q(x; \delta, \pi)\}^{1-a}$ into equation (18.3) to derive an identification expression for $\psi(\delta) = E\{Y^{Q(\delta)}\}$, the expected outcome if the treatment distribution is intervened upon and set to $Q(a \mid x)$.

Theorem 18.1. Under Assumptions 17 and 18, and if $\delta \in (0, \infty)$, the incremental effect $\psi(\delta) = E\{Y^{Q(\delta)}\}$ for the propensity score intervention defined in equation (18.4) satisfies
$$\psi(\delta) = E\left[\frac{\delta \pi(X) \mu(X, 1) + \{1 - \pi(X)\} \mu(X, 0)}{\delta \pi(X) + \{1 - \pi(X)\}}\right] = E\left[\frac{Y(\delta A + 1 - A)}{\delta \pi(X) + \{1 - \pi(X)\}}\right], \qquad (18.5)$$
where $\mu(x, a) = E(Y \mid X = x, A = a)$.
Theorem 18.1 offers two ways to link the incremental effect ψ(δ) to the data generating distribu-
tion P. The first involves both the outcome regressions µ(x, a) and the propensity score π(x), and
the second just the propensity score. The former is a weighted average of the regression functions
µ(x, 1) and µ(x, 0), where the weight on µ(x, a) is given by the fluctuated intervention propensity
score Q(A = 1 | x), while the latter is a weighted average of the observed outcomes, where the
weights are related to the intervention propensity score and depend on the observed treatment.
Remark 18.4. From the identification expression in equation (18.5), one can see that even if there are subjects for whom $P(A = a \mid X) = 0$, so that $\mu(X, a) = E(Y \mid X, A = a)$ is not defined because of conditioning on a zero-probability event, $\psi(\delta)$ is still well-defined, because those subjects receive zero weight when the expectation over the covariate distribution is computed.
18.3.2 Estimation
Theorem 18.1 provides two formulas linking the causal effect $\psi(\delta)$ to the data-generating distribution $P$. The next step is to estimate $\psi(\delta)$ relying on these identification results. The first estimator, which we call the “outcome-based estimator,” includes estimates of both $\mu$ and $\pi$:
$$\hat\psi(\delta) = \frac{1}{n} \sum_{i=1}^n \frac{\delta \hat\pi(X_i) \hat\mu(X_i, 1) + \{1 - \hat\pi(X_i)\} \hat\mu(X_i, 0)}{\delta \hat\pi(X_i) + \{1 - \hat\pi(X_i)\}}.$$
The second estimator motivated by the identification result is the inverse-probability-weighted (IPW) estimator:
$$\hat\psi(\delta) = \frac{1}{n} \sum_{i=1}^n \frac{Y_i (\delta A_i + 1 - A_i)}{\delta \hat\pi(X_i) + \{1 - \hat\pi(X_i)\}}.$$
Both estimators are generally referred to as “plug-in” estimators, since they take estimates $\hat\mu$ or $\hat\pi$ and plug them directly into the identification results. If we assume we can estimate $\mu$ and $\pi$ with correctly specified parametric models, then both estimators inherit parametric rates of convergence and can be used to construct valid confidence intervals.
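As a concrete illustration, the R sketch below computes both plug-in estimators, reusing the simulated data from the sketch in Section 18.2.1; the working models and variable names are ours:

# Plug-in estimators of psi(delta), with glm/lm as simple working models.
pi.hat  <- predict(glm(A ~ X, family = binomial), type = "response")
mu1.hat <- predict(lm(Y ~ X, subset = A == 1), newdata = data.frame(X = X))
mu0.hat <- predict(lm(Y ~ X, subset = A == 0), newdata = data.frame(X = X))

delta <- 1.5
W <- delta * pi.hat + (1 - pi.hat)

# Outcome-based plug-in estimator:
mean((delta * pi.hat * mu1.hat + (1 - pi.hat) * mu0.hat) / W)

# Inverse-probability-weighted plug-in estimator:
mean(Y * (delta * A + 1 - A) / W)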
However, in practice, parametric models are likely to be misspecified. Thus, we may prefer to estimate the nuisance regression functions $\mu$ and $\pi$ with nonparametric models. However, if nonparametric models are used for either of the plug-in estimators without any correction, the estimators are generally suboptimal, because they inherit the large bias incurred in estimating the regression functions over large classes. For example, the best possible convergence rate in mean squared error of an estimator of a regression function that belongs to a Hölder class of order $\alpha$, essentially a function that is $\alpha$-times differentiable, scales as $n^{-2\alpha/(2\alpha + d)}$, where $n$ is the sample size and $d \ge 1$ is the dimension of the covariates ( [18]). This rate is slower than $n^{-1}$ for any $\alpha$ and $d$. Because plug-in estimators with no further corrections that use agnostic nuisance function estimators typically inherit this “slow rate,” they lose $n^{-1}$ efficiency if the nuisance functions are not estimated parametrically.
Remark 18.5. We remind readers that the issues regarding plug-in estimators outlined in the previous
paragraph are not isolated to incremental interventions; they apply generally to plug-in estimators
of functionals. So, for example, these problems also apply to the outcome-based and IPW estimators
for the ATE.
Semiparametric efficiency theory provides a principled way to construct estimators that make
more efficient use of flexibly estimated nuisance functions ( [19–23]).1 Such estimators are
1 For a gentle introduction to the use of influence functions in functional estimation, we refer to [24] and the tutorial at
http://www.ehkennedy.com/uploads/5/8/4/5/58450265/unc_2019_cirg.pdf
based on influence functions and are designed for parameters that are “smooth” transformations of
the distribution P. The parameter ψ(δ) is an example of a smooth parameter. The precise definition
of smoothness in this context can be found in Chapter 25 of [20]. Informally, however, we can note
that ψ(δ) only involves differentiable functions of π and µ, thereby suggesting that it is a smooth
parameter.
If we let $\Phi(P) \in \mathbb{R}$ denote some smooth target parameter, we can view $\Phi(P)$ as a functional acting on the space of distribution functions. One key feature of smooth functionals is that they satisfy a functional analog of a Taylor expansion, sometimes referred to as the von Mises expansion. Given two distributions $P_1$ and $P_2$, the von Mises expansion of $\Phi(P)$ is
$$\Phi(P_1) - \Phi(P_2) = \int \phi(z; P_1) \{dP_1(z) - dP_2(z)\} + R(P_1, P_2),$$
where $\phi(z; P)$ is the (mean-zero) influence function and $R$ is a second-order remainder term. For the purpose of estimating $\Phi(P)$, the above expansion is useful when $P_2 = P$ denotes the true data-generating distribution and $P_1 = \hat{P}$ is its empirical estimate. Thus, the key step in constructing estimators based on influence functions is to express the first-order bias of plug-in estimators as an expectation with respect to $P$ of a particular, fixed function of the observations and the nuisance parameters, referred to as the influence function. Because the first-order bias is expressed as an expectation with respect to the data-generating $P$, it can be estimated with error of order $n^{-1}$ simply by replacing $P$ with the empirical distribution. This estimate of the first-order bias can be subtracted from the plug-in estimator, so that the resulting estimator will have second-order bias without any increase in variance. This provides an explicit recipe for constructing “one-step” bias-corrected estimators of the form
$$\hat\Phi(P) = \Phi(\hat{P}) + \frac{1}{n} \sum_{i=1}^n \phi(Z_i; \hat{P}).$$
If $\phi(Z; P)$ takes the form $\varphi(Z; P) - \Phi(P)$, then the estimator simplifies to $\hat\Phi(P) = n^{-1} \sum_{i=1}^n \varphi(Z_i; \hat{P})$. We remark that there are other ways of doing asymptotically equivalent bias corrections, such as targeted maximum likelihood ( [25]).
The functional $\psi(\delta)$ satisfies the von Mises expansion with an influence function that can be written in terms of
$$\phi_a(Z) = \frac{\mathbf{1}(A = a)}{P(A = a \mid X)} \{Y - \mu(X, a)\} + \mu(X, a).$$
To highlight that $\varphi(Z; P)$ depends on $P$ through $\pi$ and $\mu$ and that it also depends on $\delta$, we will write $\varphi(Z; P) \equiv \varphi(Z; \delta, \pi, \mu)$. The influence-function-based one-step estimator of $\psi(\delta)$ is therefore
$$\hat\psi(\delta) = \frac{1}{n} \sum_{i=1}^n \varphi(Z_i; \delta, \hat\pi, \hat\mu). \qquad (18.6)$$
Remark 18.6. The function $\phi_a(Z)$ is the un-centered influence function for the parameter $E\{\mu(X, a)\}$, and is thus part of the influence function for the ATE under exchangeability (Assumption 18), which can be expressed as $E\{\mu(X, 1)\} - E\{\mu(X, 0)\}$. Thus, one may view the first part of the influence function for $\psi(\delta)$ as a weighted analog of the influence function for the ATE.
Algorithm 1 Split the data into $K$ folds (e.g., 5), where fold $k \in \{1, \ldots, K\}$ has $n_k$ observations. For each fold $k$:

1. Build models $\hat\pi_{-k}(X)$, $\hat\mu_{1,-k}(X)$, and $\hat\mu_{0,-k}(X)$ using the observations not contained in fold $k$.

2. For each observation $j$ in fold $k$, calculate its un-centered influence function $\varphi\{Z_j; \delta, \hat\pi_{-k}(X_j), \hat\mu_{-k}(X_j)\}$ using the models $\hat\pi_{-k}$ and $\hat\mu_{-k}$ calculated in Step 1.

3. Calculate an estimate of $\psi(\delta)$ within fold $k$ by averaging the estimates of the un-centered influence function:
$$\hat\psi_k(\delta) = \frac{1}{n_k} \sum_{j \in k} \varphi\{Z_j; \delta, \hat\pi_{-k}(X_j), \hat\mu_{-k}(X_j)\}.$$

Finally, average the fold-specific estimates:
$$\hat\psi(\delta) = \frac{1}{K} \sum_{k=1}^K \hat\psi_k(\delta).$$
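A compact R sketch of Algorithm 1 is below. For the un-centered influence function we specialize the time-varying expression of Theorem 18.3 (presented in Section 18.4) to $T = 1$; the glm/lm fits are simple stand-ins for the flexible learners one would typically use, and all names are ours:

# Un-centered influence function for psi(delta): Theorem 18.3 (Section 18.4)
# specialized to T = 1, where m_1 reduces to the outcome regression mu.
phi <- function(y, a, ps, mu1, mu0, delta) {
  W <- delta * ps + 1 - ps
  g <- (delta * ps * mu1 + (1 - ps) * mu0) / W    # shifted-mean regression term
  corr <- (a * (1 - ps) - (1 - a) * delta * ps) * ((1 - delta) / delta) *
    g * (delta * a + 1 - a) / W                   # mean-zero correction term
  corr + y * (delta * a + 1 - a) / W              # plus the weighted outcome
}

# Algorithm 1: K-fold cross-fitting of the one-step estimator.
cross_fit_ipsi <- function(Y, A, X, delta, K = 5) {
  fold  <- sample(rep(1:K, length.out = length(Y)))
  dat   <- data.frame(Y = Y, A = A, X = X)
  psi_k <- numeric(K)
  for (k in 1:K) {
    train <- fold != k
    test  <- fold == k
    ps  <- predict(glm(A ~ X, binomial, data = dat[train, ]),
                   newdata = dat[test, ], type = "response")
    mu1 <- predict(lm(Y ~ X, data = dat[train & dat$A == 1, ]), newdata = dat[test, ])
    mu0 <- predict(lm(Y ~ X, data = dat[train & dat$A == 0, ]), newdata = dat[test, ])
    psi_k[k] <- mean(phi(dat$Y[test], dat$A[test], ps, mu1, mu0, delta))
  }
  mean(psi_k)  # average the K fold-specific estimates
}

cross_fit_ipsi(Y, A, X, delta = 1.5)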
18.3.3 Properties of the estimator

18.3.3.1 Pointwise inference

Conditional on the training sample used to estimate the nuisance functions, the bias of $\hat\psi(\delta)$ is
$$B = P\{\varphi(Z; \delta, \hat\pi, \hat\mu)\} - \psi(\delta). \qquad (18.7)$$
This bias term is second-order. Thus, the bias can be $o_P(n^{-1/2})$, and of smaller order than the standard error, even if $\pi$ and $\mu$ are estimated at slower rates. If the bias is asymptotically negligible, $\hat\psi(\delta)$ behaves like the sample average of the (un-centered) influence function, in the sense that
$$\sqrt{n}\{\hat\psi(\delta) - \psi(\delta)\} = \sqrt{n}\{\tilde\psi(\delta) - \psi(\delta)\} + o_P(1) \rightsquigarrow N\{0, \sigma^2(\delta)\},$$
where $\sigma^2(\delta) = \mathrm{Var}\{\varphi(Z; \delta, \pi, \mu)\}$ and $\tilde\psi(\delta) = n^{-1} \sum_i \varphi(Z_i; \delta, \pi, \mu)$. Given a consistent estimator $\hat\sigma^2(\delta)$ of the variance, by Slutsky’s theorem,
$$\frac{\sqrt{n}\{\hat\psi(\delta) - \psi(\delta)\}}{\hat\sigma(\delta)} \rightsquigarrow N(0, 1). \qquad (18.8)$$
The bias term in (18.7) can be $o_P(n^{-1/2})$ even when the nuisance functions are estimated nonparametrically with flexible machine learning methods. By the Cauchy-Schwarz inequality,
$$|B| \le \|\pi - \hat\pi\|^2 + \|\pi - \hat\pi\| \max_a \|\mu_a - \hat\mu_a\|,$$
where $\|f\|^2 = \int f^2(z) \, dP(z)$. Therefore, it is sufficient that the product of the integrated MSEs for estimating $\pi$ and $\mu_a$ converges to zero faster than $n^{-1/2}$ to ensure that the bias is asymptotically negligible. This can happen, for instance, if both $\pi$ and $\mu_a$ are estimated at faster than $n^{-1/4}$ rates, which is possible in nonparametric models under structural conditions such as sparsity or smoothness.
The convergence statement in (18.8) provides an asymptotic Wald-type $(1 - \alpha)$ confidence interval for $\psi(\delta)$,
$$\hat\psi(\delta) \pm z_{1 - \alpha/2} \cdot \frac{\hat\sigma(\delta)}{\sqrt{n}}, \qquad (18.9)$$
where $z_\tau$ is the $\tau$-quantile of a standard normal. This confidence interval enables us to conduct valid inference for $\psi(\delta)$ for a particular value of $\delta$.
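Continuing the sketch above, the interval (18.9) takes a few lines of R once the influence function values are in hand (for brevity we use the full-sample nuisance fits here; in practice these would be the cross-fit values):

# Pointwise 95% Wald interval (18.9) at a single delta.
phi.hat <- phi(Y, A, pi.hat, mu1.hat, mu0.hat, delta)
psi.hat <- mean(phi.hat)
se.hat  <- sd(phi.hat) / sqrt(length(Y))
psi.hat + c(-1, 1) * qnorm(0.975) * se.hat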
Remark 18.7. If $\pi$ and $\mu_a$ both lie in a Hölder class of order $\alpha$, then the best estimators in the class have MSEs of order $n^{-2\alpha/(2\alpha + d)}$, where $d$ is the dimension of the covariates ( [18]). Therefore, the bias term would be of order $o_P(n^{-1/2})$, and asymptotically negligible, whenever $d < 2\alpha$. Similarly, if $\mu_a$ follows an $s$-sparse linear model and $\pi$ an $s$-sparse logistic model, [31] shows that $s^2 (\log d)^{3 + 2\delta} = o(n)$, for some $\delta > 0$, is sufficient for the bias to be asymptotically negligible.
Remark 18.8. To see why estimating the nuisance functions on an independent sample may help, consider the following expansion of $\hat\psi(\delta)$ defined in (18.6):
$$\hat\psi(\delta) - \psi(\delta) = (P_n - P)\{\varphi(Z; \delta, \hat\pi, \hat\mu) - \varphi(Z; \delta, \pi, \mu)\} + (P_n - P)\{\varphi(Z; \delta, \pi, \mu)\} + P\{\varphi(Z; \delta, \hat\pi, \hat\mu) - \varphi(Z; \delta, \pi, \mu)\},$$
where we used the shorthand notation $P_n\{g(Z)\} = n^{-1} \sum_{i=1}^n g(Z_i)$ and $P\{g(Z)\} = \int g(z) \, dP(z)$. The second term will be of order $O_P(n^{-1/2})$ by the central limit theorem, whereas the third term can be shown to be second-order (in fact, upper bounded by a multiple of $B$ in (18.7)) by virtue of $\varphi(Z; \delta, \pi, \mu)$ being the first-order influence function. Cross-fitting helps with controlling the first term. If $g(Z_i) \perp g(Z_j)$ for $i \neq j$, then $(P_n - P)\{g(Z)\} = O_P(\|g(Z)\| / \sqrt{n})$, where $\|g(Z)\|^2 = \int g^2(z) \, dP(z)$. If $\hat\pi$ and $\hat\mu$ are computed on a separate training sample then, given that separate sample, the first term is
$$(P_n - P)\{\varphi(Z; \delta, \hat\pi, \hat\mu) - \varphi(Z; \delta, \pi, \mu)\} = O_P\left(\frac{\|\varphi(Z; \delta, \hat\pi, \hat\mu) - \varphi(Z; \delta, \pi, \mu)\|}{\sqrt{n}}\right).$$
18.3.3.2 Uniform inference

Under essentially the same conditions that guarantee asymptotic normality at a fixed value of $\delta$, any finite collection $\hat\psi(\delta_1), \hat\psi(\delta_2), \ldots, \hat\psi(\delta_m)$ is asymptotically distributed as a multivariate Gaussian. Establishing sufficient conditions for convergence to a Gaussian process allows us to conduct uniform inference across many $\delta$’s; i.e., we can conduct inference for $\psi(\delta)$ for many $\delta$’s without issues of multiple testing. In particular, we can construct confidence bands around the curve $\hat\psi(\delta)$ that cover the true curve with a desired probability across all $\delta$. The bands can be constructed as $\hat\psi(\delta) \pm \hat{c}_\alpha\, \hat\sigma(\delta) / \sqrt{n}$, where $\hat{c}_\alpha$ is an estimate of the $(1 - \alpha)$-quantile of $\sup_{\delta \in \mathcal{D}} \sqrt{n}\, |\hat\psi(\delta) - \psi(\delta)| / \hat\sigma(\delta)$. We can estimate this supremum of a Gaussian process using the multiplier bootstrap, which is computationally efficient. We refer the reader to [1] for full details.
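The following hedged R sketch shows one way to implement the multiplier bootstrap (our own illustration; [1] gives the formal construction): perturb the standardized influence function values with i.i.d. N(0, 1) multipliers and take the 0.95 quantile of the resulting sup-statistic over a grid of $\delta$ values.

# Multiplier bootstrap for the uniform band, continuing the sketch above.
deltas  <- seq(0.2, 5, length.out = 50)
Phi     <- sapply(deltas, function(d) phi(Y, A, pi.hat, mu1.hat, mu0.hat, d))
psi.hat <- colMeans(Phi)
sig.hat <- apply(Phi, 2, sd)
std     <- scale(Phi, center = psi.hat, scale = sig.hat)  # standardized values

sup.stat <- replicate(1000, {
  xi <- rnorm(nrow(Phi))                      # i.i.d. N(0, 1) multipliers
  max(abs(sqrt(nrow(Phi)) * colMeans(xi * std)))
})
c.alpha <- quantile(sup.stat, 0.95)
# Uniform 95% band: psi.hat +/- c.alpha * sig.hat / sqrt(nrow(Phi))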
With the uniform confidence band, we can also conduct a test of no treatment effect. If the treatment has no effect on the outcome, then $Y \perp A \mid X$ and $\psi(\delta) = E(Y)$ under exchangeability (Assumption 18). Given the uniform confidence band, the null hypothesis of no incremental intervention effect,
$$H_0: \psi(\delta) = E(Y) \text{ for all } \delta \in \mathcal{D},$$
can be tested by checking whether a $(1 - \alpha)$ band contains a straight horizontal line over $\mathcal{D}$. That is, we reject $H_0$ at level $\alpha$ if we cannot run a straight horizontal line through the confidence band. We can also invert this procedure to obtain a p-value.
Geometrically, the p-value corresponds to the $\alpha$ at which we can no longer run a straight horizontal line through our confidence band. At $\alpha = 0$ we have an infinitely wide confidence band and we always fail to reject $H_0$; increasing $\alpha$ corresponds to a tightening confidence band. In Section 18.5 we give a visual example of how to conduct this test. But first, we discuss incremental propensity score interventions when treatment is time-varying.
18.4 Time-varying Treatments

18.4.1 Notation

Suppose now that treatment is measured at times $t = 1, \ldots, T$. We write $\bar{A}_T = (A_1, \ldots, A_T)$ for the full treatment sequence, $H_t$ for all of the observed past (covariates and treatments) up to the treatment at time $t$, and $Y^{\bar{a}_T}$ for the potential outcome under treatment sequence $\bar{a}_T$. With binary treatments there are $2^T$ possible regimes $\bar{a}_T$, and hence $2^T$ counterfactual means $E(Y^{\bar{a}_T})$. The time-varying analogs of consistency, exchangeability, and positivity are denoted Assumptions 20, 21, and 22, respectively.

18.4.2 Marginal structural models

A popular approach to reduce the number of parameters is to assume that the expected potential outcomes under different interventions vary smoothly. For instance, one can specify a marginal structural model (MSM) of the form $E(Y^{\bar{a}_T}) = m(\bar{a}_T; \beta)$, whose parameter $\beta$ satisfies the moment condition
$$E\left[ h(\bar{A}_T)\, W(Z)\, \{Y - m(\bar{A}_T; \beta)\} \right] = 0, \qquad (18.10)$$
where
$$W(Z) = \frac{1}{\prod_{t=1}^T P(A_t \mid H_t)}$$
and $h$ is some arbitrary function of the treatment. This moment condition suggests an inverse-propensity-weighted estimator, where we estimate $\hat\beta$ by solving the empirical analog of (18.10):
$$\frac{1}{n} \sum_{i=1}^n h(\bar{A}_{T,i})\, \hat{W}(Z_i)\, \{Y_i - m(\bar{A}_{T,i}; \hat\beta)\} = 0.$$
We can also estimate $\beta$ using a doubly robust version of the moment condition ( [34]).
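To make the MSM machinery concrete, below is a hedged two-period R sketch (simulated data; all names are ours). Taking $h$ to be the gradient of a linear working MSM $m(a_1, a_2; \beta) = \beta_0 + \beta_1(a_1 + a_2)$, the empirical moment condition reduces to inverse-probability-weighted least squares; under the working-model view discussed below, the fit is a weighted projection of the counterfactual means onto $m$.

# Two-period MSM fit by inverse-probability-weighted least squares.
set.seed(2)
n <- 5000
d <- data.frame(X1 = rnorm(n))
d$A1 <- rbinom(n, 1, plogis(d$X1))
d$X2 <- rnorm(n, mean = d$A1)
d$A2 <- rbinom(n, 1, plogis(d$X1 + d$A1 + d$X2))
d$Y  <- rnorm(n, mean = d$A1 + d$A2 + d$X1 + d$X2)

p1 <- predict(glm(A1 ~ X1, binomial, data = d), type = "response")
p2 <- predict(glm(A2 ~ X1 + A1 + X2, binomial, data = d), type = "response")
w  <- 1 / (ifelse(d$A1 == 1, p1, 1 - p1) * ifelse(d$A2 == 1, p2, 1 - p2))

fit <- lm(Y ~ I(A1 + A2), data = d, weights = w)  # h = gradient of m
coef(fit)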
MSMs are a major advance toward performing sound causal inference in time-varying settings. However, there are two important issues they cannot easily resolve. First, while specifying a model for $E(Y^{\bar{a}_T})$ is a less stringent requirement than, say, specifying a model for the outcome given treatment and covariates, it can still lead to biased estimates if $m(\bar{a}_T; \beta)$ is not correctly specified. Second, identifying and estimating $\beta$ still relies on positivity, which is unlikely to be satisfied when there are many time points or treatment values. In fact, even if positivity holds by design, as it would in an experiment, we are unlikely to observe all treatment regimes in a given experiment, simply because the number of possible treatment regimes grows exponentially with the number of time points. This poses a computational challenge even when positivity holds, because the product of densities in the denominator of (18.10) may be very small, resulting in an unstable estimate of $\beta$.

There are ways to mitigate the two drawbacks outlined above, but neither issue can be completely fixed. First, one does not necessarily need to assume the MSM is correctly specified. Instead, one can use the “working model” approach and estimate a projection of $E(Y^{\bar{a}_T})$ onto the MSM $m(\bar{a}_T; \beta)$. In this approach one estimates $\beta$ as the parameter that yields the best approximation of the causal effect within the function class described by $m(\bar{a}_T; \beta)$ ( [35]). This approach is beneficial because it allows for model-free inference: we can construct valid confidence intervals for the projection parameter $\beta$ regardless of whether the MSM is correctly specified. However, we are still only estimating a projection; so, this approach can be of limited practical relevance if the model is grossly misspecified and the projection onto the model has little bearing on reality.
18.4.3 Time-varying incremental effects

Incremental interventions extend naturally to the time-varying setting by shifting the propensity score at every time point:
$$Q_t(A_t = 1 \mid H_t) \equiv q_t(H_t; \delta, \pi_t) = \frac{\delta \pi_t(H_t)}{\delta \pi_t(H_t) + 1 - \pi_t(H_t)} \quad \text{for } 0 < \delta < \infty, \qquad (18.11)$$
where $\pi_t(H_t) = P(A_t = 1 \mid H_t)$. This is the same intervention as in equation (18.4), just with time subscripts added and conditioning on all of the past up to treatment at time $t$ (i.e., $H_t$). There are two main differences between the intervention (18.11) and the incremental intervention described in Section 18.3:
1. This intervention happens over every time period from t = 1 to T , the end of the study.
2. This intervention requires the time-varying analogs of consistency and exchangeability, Assumptions 20 and 21.
Despite these differences, most of the machinery developed in Section 18.3 applies here. The intuition
about what happens when we shift δ → 0 or δ → ∞ is the same. Unfortunately, the results and
proofs look much more imposing at first glance, but that is due to the time-varying nature of the data,
not any change in the incremental intervention approach.
The incremental approach is quite different from the MSM approach. The incremental intervention is a stochastic dynamic intervention that shifts propensity scores in each time period, whereas MSMs describe a static intervention for what would happen if everyone took treatment sequence $\bar{a}_T$. Consequently, the incremental intervention framework does not require us to assume positivity (Assumption 22) or a parametric model $m(\bar{A}_T; \beta)$, whereas MSMs require both.
Remark 18.11. The time-varying incremental intervention can actually allow δ to vary over t, but
we omit this generalization for ease of exposition and interpretability. In other words, in equation
(18.11) we could use δt instead of δ and allow δt to vary with t. Whether allowing δ to vary with time
is useful largely depends on the context. In some applications it may be enough to study interventions
that keep δ constant across time. On the other hand, one can imagine interventions whose impact
varies over time; for example, some encouragement mechanism might have an effect that decays
toward zero with time. Either way, the theory and methodology presented here would remain valid.
18.4.4 Identification

The following theorem is a generalization of Theorem 18.1; it shows that the mean counterfactual outcome under the incremental intervention $\{q_t(H_t; \delta, \pi_t)\}_{t=1}^T$ is identified under Assumptions 20 and 21.

Theorem 18.2. (Theorem 1, [1]) Under Assumptions 20 and 21, for $\delta \in (0, \infty)$, the incremental propensity score effect $\psi(\delta)$ satisfies
$$\psi(\delta) = \sum_{\bar{a}_T \in \bar{\mathcal{A}}_T} \int_{\bar{\mathcal{X}}} \mu(\bar{h}_T, \bar{a}_T) \prod_{t=1}^T \frac{a_t \delta \pi_t(h_t) + (1 - a_t)\{1 - \pi_t(h_t)\}}{\delta \pi_t(h_t) + 1 - \pi_t(h_t)} \, dP(x_t \mid h_{t-1}, a_{t-1}) \qquad (18.12)$$
and
$$\psi(\delta) = E\left\{ Y \prod_{t=1}^T \frac{\delta A_t + 1 - A_t}{\delta \pi_t(H_t) + 1 - \pi_t(H_t)} \right\}, \qquad (18.13)$$
which follows by Robins’ g-formula ( [38]) after replacing the general treatment process with a generic stochastic intervention $dQ(a_t \mid h_t)$. Then, we can replace the generic $dQ(a_t \mid h_t)$ with $a_t q_t(H_t; \delta, \pi_t) + (1 - a_t)\{1 - q_t(H_t; \delta, \pi_t)\}$ and $\int_{\mathcal{A}_1 \times \cdots \times \mathcal{A}_T}$ with $\sum_{\bar{a}_T \in \bar{\mathcal{A}}_T}$. The full proof is given in the appendix of [1]. Just as in the $T = 1$ case, $\psi(\delta)$ is well-defined even if $\pi_t(h_t) = 0$ or $1$ for some $h_t$.
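The representation (18.13) is straightforward to compute. Below is a minimal sketch reusing the simulated two-period data frame d and fitted propensity scores p1, p2 from the MSM sketch above (again, a hedged illustration rather than [1]'s implementation):

# IPW representation (18.13) for T = 2: cumulative product of per-period
# incremental weights.
delta <- 2
w1 <- (delta * d$A1 + 1 - d$A1) / (delta * p1 + 1 - p1)
w2 <- (delta * d$A2 + 1 - d$A2) / (delta * p2 + 1 - p2)
mean(d$Y * w1 * w2)  # plug-in IPW estimate of psi(delta)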
18.4.5 Estimation

In Section 18.3, we presented two “plug-in” estimators, briefly reviewed the theory of influence functions, and presented an influence-function-based estimator that allows for nonparametric nuisance function estimation. In the multiple time point case, the parameter $\psi(\delta)$ is still smooth enough to possess an influence function. Thus, the discussion in Section 18.3 also applies here, although the notation becomes more involved.

Again, there are two plug-in estimators. The outcome-based, g-computation style estimator is motivated by equation (18.12) in Theorem 18.2 and can be constructed as
$$\hat\psi(\delta) = \sum_{\bar{a}_T \in \bar{\mathcal{A}}_T} \int_{\bar{\mathcal{X}}} \hat\mu(\bar{h}_T, \bar{a}_T) \prod_{t=1}^T \frac{a_t \delta \hat\pi_t(h_t) + (1 - a_t)\{1 - \hat\pi_t(h_t)\}}{\delta \hat\pi_t(h_t) + 1 - \hat\pi_t(h_t)} \, d\hat{P}(x_t \mid h_{t-1}, a_{t-1}),$$
while the IPW estimator is motivated by equation (18.13):
$$\hat\psi(\delta) = \frac{1}{n} \sum_{i=1}^n Y_i \prod_{t=1}^T \frac{\delta A_{t,i} + 1 - A_{t,i}}{\delta \hat\pi_t(H_{t,i}) + 1 - \hat\pi_t(H_{t,i})}.$$
As before, both estimators inherit the convergence rates of the nuisance function estimators $\hat\pi_t$ and $\hat\mu$, and typically attain parametric rates of convergence only with correctly specified restrictive models for every nuisance function. As in the single time-point case, we may be skeptical as to whether specifying correct parametric models is possible, and this motivates an influence-function-based approach. Theorem 2 in [1] derives the influence function for $\psi(\delta)$.
Theorem 18.3. (Theorem 2, [1]) The un-centered efficient influence function for $\psi(\delta)$ in a nonparametric model with unknown propensity scores is given by
$$\varphi(Z; \eta, \delta) \equiv \sum_{t=1}^T \left[ \frac{A_t\{1 - \pi_t(H_t)\} - (1 - A_t)\delta \pi_t(H_t)}{\delta / (1 - \delta)} \times \frac{\delta \pi_t(H_t)\, m_t(H_t, 1) + \{1 - \pi_t(H_t)\}\, m_t(H_t, 0)}{\delta \pi_t(H_t) + 1 - \pi_t(H_t)} \times \left\{ \prod_{s=1}^t \frac{\delta A_s + 1 - A_s}{\delta \pi_s(H_s) + 1 - \pi_s(H_s)} \right\} \right] + \prod_{t=1}^T \frac{(\delta A_t + 1 - A_t)\, Y}{\delta \pi_t(H_t) + 1 - \pi_t(H_t)},$$
where $m_t(H_t, a_t)$ denotes a sequential regression of the outcome on the observed past, with future covariates and treatments integrated over $\mathcal{R}_t = (\bar{\mathcal{H}}_T \times \bar{\mathcal{A}}_T) \setminus \mathcal{H}_t$; see [1] for its precise definition.
The following influence-function-based estimator is a natural consequence of Theorem 18.3:
$$\hat\psi(\delta) = \frac{1}{n} \sum_{i=1}^n \varphi(Z_i; \hat\eta, \delta).$$
Similarly to the $T = 1$ case, this estimator is optimal in nonparametric models under certain smoothness or sparsity constraints. Again, it is advantageous to construct this estimator using cross-fitting, since it allows fast convergence rates of $\hat\psi(\delta)$ without imposing Donsker-type conditions on the estimators of $\eta$. The detailed algorithm is provided as Algorithm 1 of [1]. For intuition, the reader can imagine the algorithm as an extension of Algorithm 1 in this chapter, where in Step 1 we estimate all the nuisance functions in Theorem 18.3. However, unlike Algorithm 1, we can estimate $m_t$ recursively backwards through time, as outlined in Remark 2 and Algorithm 1 of [1]. The ipsi function in the R package npcausal can be used to calculate incremental effects in time-varying settings.
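A hedged usage sketch is below; the argument names reflect our reading of the package documentation at the time of writing and may change across versions, so consult the package itself for the authoritative interface.

# npcausal is available from GitHub (ehkennedy/npcausal):
# devtools::install_github("ehkennedy/npcausal")
library(npcausal)
# Long-format inputs (one row per subject-time): outcome y, binary treatment a,
# covariate matrices x.trt and x.out, a time index, and a subject id.
fit <- ipsi(y = y, a = a, x.trt = x, x.out = x, time = time, id = id,
            delta.seq = seq(0.2, 5, length.out = 10), nsplits = 5)
fit$res  # estimates with pointwise and uniform confidence bands per delta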
18.4.6 Inference
As in the $T = 1$ case, we provide both pointwise and uniform inference results. The theory is essentially the same as in the one time-point case, but the conditions for convergence to a Gaussian distribution or a Gaussian process need to be adjusted to handle multiple time points. Under sample splitting and a small-bias condition analogous to that of Section 18.3.3,
$$\frac{\sqrt{n}\{\hat\psi(\delta) - \psi(\delta)\}}{\hat\sigma(\delta)} \rightsquigarrow \mathbb{G}(\delta),$$
where $\mathbb{G}(\delta)$ is a mean-zero Gaussian process.
The main assumption underlying this result is that the product of the $L_2$ errors for estimating $m_t$ and $\pi_t$ is of smaller order than $n^{-1/2}$. This is essentially the same small-bias condition discussed in Section 18.3.3.1 for the $T = 1$ case, adjusted to handle multiple time points. This requirement is rather mild, because $m_t$ and $\pi_t$ can be estimated at slower-than-$\sqrt{n}$ rates, say $n^{-1/4}$ rates, without affecting the efficiency of the estimator. As discussed in Remark 18.7, $n^{-1/4}$ rates are attainable in nonparametric models under smoothness or sparsity constraints.
Finally, the convergence statements above allow for straightforward pointwise and uniform inference as discussed in Sections 18.3.3.1 and 18.3.3.2. In particular, we can construct a Wald-type confidence interval
$$\hat\psi(\delta) \pm z_{1 - \alpha/2} \frac{\hat\sigma(\delta)}{\sqrt{n}}$$
at each $\delta$ using the algorithm outlined above. We can also create uniformly valid confidence bands covering $\delta \mapsto \psi(\delta)$ via the multiplier bootstrap and use the bands to test for no treatment effect, as in Section 18.3.3.2. We refer to Sections 4.3 and 4.4 in [1] for additional technical details.
18.5 Example Analysis

FIGURE 18.1
Estimated marriage prevalence 10 years post-baseline, if the incarceration odds were multiplied by factor $\delta$, with pointwise and uniform 95% confidence bands.

In [1], these methods were applied to estimate the effect of incarceration on subsequent marriage. Figure 18.1 shows the estimated marriage prevalence 10 years post-baseline across a range of $\delta$ values as a smooth line. The darker blue confidence band is a pointwise 95% confidence interval that would give us valid inference at a single $\delta$ value. The lighter blue confidence band is the uniform 95% confidence interval that allows us to perform inference across the whole range of $\delta$ values.
In this example we see that incarceration has a strong effect on marriage rates: Estimated marriage
rates decrease with higher odds of incarceration (δ > 1) but only increase slightly with lower odds
of incarceration (δ < 1). To be more specific, at observational levels of incarceration (i.e., leaving
the odds of incarceration unchanged) they estimated a marriage rate of ψ(1) = 0.294, or 29.4%. If
the odds of treatment were doubled, they estimated marriage rates would decrease to 0.281, and if
they were quadrupled they would decrease even further to 0.236. Conversely, if the odds of treatment
were halved, they estimated marriage rates would only increase to 0.297; the estimated marriage
rates are the same if the odds are quartered.
Finally, we can use the uniform confidence interval in Figure 18.1 to test for no effect across the range $\delta \in [0.2, 5]$; we reject the null hypothesis at the $\alpha = 0.05$ level with a p-value of 0.049. By eye, we can roughly see that it is just about impossible to run a horizontal line through the uniform 95% confidence interval, but only barely so, consistent with the p-value falling just below 0.05.
18.7 Discussion
This chapter serves as an introduction to and review of [1], which describes a class of causal effects based on incremental propensity score interventions. Such effects can be identified even if positivity is violated, and thus can be useful in time-varying settings, where the number of possible treatment sequences is large and positivity is less likely to hold. We have compared this class of effects to other effects commonly estimated in practice, such as average treatment effects and coefficients in MSMs, and highlighted the scenarios where it can be most useful. Along the way, we have reviewed the estimation and inference procedures proposed in [1] and briefly reviewed the underlying semiparametric efficiency theory.
References
[1] Edward H Kennedy. Nonparametric causal effects based on incremental propensity score
interventions. Journal of the American Statistical Association, 114(526):645–656, 2019.
[2] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66(5):688, 1974.
[3] BJS. Bureau of Justice Statistics web page. https://www.bjs.gov/index.cfm?ty=kfdetail&iid=487, 2021.
[4] BJS. Bureau of Justice Statistics 2018 update on prisoner recidivism: A 9-year follow-up period
(2005–2014). https://www.bjs.gov/content/pub/pdf/18upr9yfup0514_sum.pdf, 2018.
[5] H. J. Steadman, F. C. Osher, P. C. Robbins, B. Case, and S. Samuels. Prevalence of serious mental
illness among jail inmates. Psychiatric Services, 60(6):761–765, 2009.
[6] Jennifer L. Skeem, Sarah Manchak, and Jillian K. Peterson. Correctional policy for offenders
with mental illness: Creating a new paradigm for recidivism reduction. Law and Human
Behavior, 35(2):110–126, 2011.
[7] Paul W Holland. Statistics and causal inference. Journal of the American Statistical Association,
81(396):945–960, 1986.
[8] Daniel Westreich and Stephen R Cole. Invited commentary: Positivity in practice. American
Journal of Epidemiology, 171(6):674–677, 2010.
[9] Maya L Petersen, Kristin E Porter, Susan Gruber, Yue Wang, and Mark J Van Der Laan.
Diagnosing and responding to violations in the positivity assumption. Statistical Methods in
Medical Research, 21(1):31–54, 2012.
[10] Leslie Kish. Weighting for unequal Pi. Journal of Official Statistics, pages 183–200, 1992.
[11] Matias Busso, John DiNardo, and Justin McCrary. New evidence on the finite sample properties
of propensity score reweighting and matching estimators. Review of Economics and Statistics,
96(5):885–897, 2014.
[12] Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of
average treatment effects. Econometrica, 66(2):315–331, 1998.
[13] Whitney K Newey. Semiparametric efficiency bounds. Journal of Applied Econometrics,
5(2):99–135, 1990.
[14] Daniel E Ho, Kosuke Imai, Gary King, and Elizabeth A Stuart. Matching as nonparametric
preprocessing for reducing model dependence in parametric causal inference. Political Analysis,
15(3):199–236, 2007.
[15] Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward.
Statistical science: a review journal of the Institute of Mathematical Statistics, 25(1):1, 2010.
[16] Richard K Crump, V Joseph Hotz, Guido W Imbens, and Oscar A Mitnik. Dealing with limited
overlap in estimation of average treatment effects. Biometrika, 96(1):187–199, 2009.
Discussion 371
[17] Kelly L Moore, Romain Neugebauer, Mark J van der Laan, and Ira B Tager. Causal inference
in epidemiological studies with strong confounding. Statistics in Medicine, 31(13):1380–1404,
2012.
[18] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. New York: Springer, 2009.
[19] Peter J Bickel, Chris AJ Klaassen, Ya’acov Ritov, and Jon A Wellner. Efficient and Adaptive
Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press, 1993.
[20] Aad W van der Vaart. Asymptotic Statistics. Cambridge: Cambridge University Press, 2000.
[21] Edward H Kennedy. Semiparametric theory and empirical processes in causal inference. In:
Statistical Causal Inferences and Their Applications in Public Health Research, pages 141–167,
Springer, 2016.
[22] James M Robins, Lingling Li, Eric Tchetgen Tchetgen, and Aad W van der Vaart. Quadratic
semiparametric von Mises calculus. Metrika, 69(2-3):227–247, 2009.
[23] Anastasios A Tsiatis. Semiparametric Theory and Missing Data. New York: Springer, 2006.
[24] Aaron Fisher and Edward H Kennedy. Visually communicating and teaching intuition for
influence functions. The American Statistician, 75(2):162–172, 2021.
[25] Mark J van der Laan and Daniel B Rubin. Targeted maximum likelihood learning. UC Berkeley
Division of Biostatistics Working Paper Series, 212, 2006.
[26] Chris AJ Klaassen. Consistent estimation of the influence function of locally asymptotically
linear estimators. The Annals of Statistics, 15(4):1548–1562, 1987.
[27] Peter J Bickel and Yaacov Ritov. Estimating integrated squared density derivatives: Sharp best
order of convergence estimates. Sankhyā, pages 381–393, 1988.
[28] James M Robins, Lingling Li, Eric J Tchetgen Tchetgen, and Aad W van der Vaart. Higher
order influence functions and minimax estimation of nonlinear functionals. Probability and
Statistics: Essays in Honor of David A. Freedman, pages 335–421, 2008.
[29] Wenjing Zheng and Mark J van der Laan. Asymptotic theory for cross-validated targeted
maximum likelihood estimation. UC Berkeley Division of Biostatistics Working Paper Series,
Paper 273:1–58, 2010.
[30] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen,
Whitney Newey, and James M Robins. Double/debiased machine learning for treatment and
structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
[31] Max H Farrell. Robust inference on average treatment effects with possibly more covariates
than observations. Journal of Econometrics, 189(1):1–23, 2015.
[32] Edward H Kennedy, S Balakrishnan, and M G’Sell. Sharp instruments for classifying compliers
and generalizing causal effects. The Annals of Statistics, 48(4):2008–2030, 2020.
[33] Aad W van der Vaart and Jon A Wellner. Weak Convergence and Empirical Processes. Springer,
1996.
[34] Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal
inference models. Biometrics, 61(4):962–973, 2005.
[35] Romain Neugebauer and Mark J van der Laan. Nonparametric causal effects based on marginal
structural models. Journal of Statistical Planning and Inference, 137(2):419–434, 2007.
372 Incremental Causal Effects: An Introduction and Review
[36] Stephen R Cole and Miguel A Hernán. Constructing inverse probability weights for marginal
structural models. American Journal of Epidemiology, 168(6):656–664, 2008.
[37] Denis Talbot, Juli Atherton, Amanda M Rossi, Simon L Bacon, and Genevieve Lefebvre. A
cautionary note concerning the use of stabilized weights in marginal structural models. Statistics
in Medicine, 34(5):812–823, 2015.
[38] James M Robins. A new approach to causal inference in mortality studies with a sustained
exposure period - application to control of the healthy worker survivor effect. Mathematical
Modelling, 7(9-12):1393–1512, 1986.
[39] Kwangho Kim, Edward H Kennedy, and Ashley I Naimi. Incremental intervention ef-
fects in studies with many timepoints, repeated outcomes, and dropout. arXiv preprint
arXiv:1907.04004, 2019.
[40] James M Robins, Miguel Angel Hernan, and Babette Brumback. Marginal structural models
and causal inference in epidemiology. Epidemiology, 11(5):550–560, 2000.
19
Weighting Estimators for Causal Mediation
Donna L. Coffman
University of South Carolina
Megan S. Schuler
RAND Corporation
Trang Q. Nguyen
Johns Hopkins University
Daniel F. McCaffrey
ETS
CONTENTS
19.1 Introduction 374
    19.1.1 Overview of causal mediation 374
    19.1.2 Introduction to case study 374
    19.1.3 Potential outcomes notation 375
19.2 Natural (In)direct Effects 376
    19.2.1 Definitions 376
    19.2.2 Identifying assumptions 378
19.3 Estimation of Natural (In)direct Effects for a Single Mediator 379
    19.3.1 Ratio of mediator probability weighting (RMPW) 380
        19.3.1.1 Applied example 381
    19.3.2 Huber (2014) 381
        19.3.2.1 Applied example 382
    19.3.3 Nguyen et al. (2022) 383
        19.3.3.1 Applied example 384
    19.3.4 Albert (2012) 384
        19.3.4.1 Applied example 385
    19.3.5 Inverse odds ratio weighting (IORW) 385
        19.3.5.1 Applied example 386
19.4 Natural (In)direct Effects for Multiple Mediators 387
    19.4.1 Notation and assumptions 387
    19.4.2 Definitions 388
    19.4.3 Estimation 388
        19.4.3.1 RMPW 388
        19.4.3.2 Huber (2014) 388
        19.4.3.3 Nguyen et al. (2022) 389
        19.4.3.4 Albert (2012) 390
        19.4.3.5 IORW 391
19.5 Interventional (In)direct Effects 391
19.1 Introduction
The focus of this chapter is weighting estimators for causal mediation analysis – specifically, we focus
on natural direct and indirect effects and interventional direct and indirect effects. Using potential
outcomes notation, we define specific estimands and the needed identifying assumptions. We then
describe estimation methods for the estimands. We detail relevant R packages that implement these
weighting methods and provide sample code, applied to our case study example. More detail about
the R packages is given at the end of the chapter in Section 19.7.
FIGURE 19.1
A simple mediation model (exposure A, mediator M, outcome Y, baseline covariates C).
FIGURE 19.2
Mediation model for case study.
individuals may identify as a broader range of sexual identities), in that it gives rise to experiences
of “minority stress,” namely, excess social stressors experienced by individuals in a marginalized
social group (e.g., LGB individuals). Manifestations of minority stress may include experiences of
stigma, discrimination, bullying, and family rejection, among others. Substance use among LGB
individuals has been theorized to reflect, in part, a coping strategy to minority stress experiences.
In our example, the particular outcome of interest is current smoking among LGB women, which
we know to be disproportionately higher than among heterosexual women [1]. We apply mediation
analysis to elucidate potential causal pathways that may give rise to these elevated rates of smoking.
Our hypothesized mediator is early smoking initiation (i.e., prior to age 15); that is, we hypothesize
that LGB women are more likely to begin smoking at an earlier age than heterosexual women,
potentially in response to minority stressors. In turn, early smoking initiation, which is a strong
risk factor for developing nicotine dependence, may contribute to higher rates of smoking among
LGB women. In summary, the exposure is defined as sexual minority status (1 = LGB women, 0
= heterosexual women), the mediator is early smoking initiation (1 = early initiation, 0 = no early
initiation), and the outcome is current smoking in adulthood (1 = yes, 0 = no). Baseline covariates
include age, race/ethnicity, education level, household income, employment status, marital status,
and urban vs. rural residence. Figure 19.2 depicts the mediation model for our motivating example.
potential outcomes, $Y_1$ and $Y_0$, exist for all individuals in the population regardless of whether the individual received the exposure or comparison condition. However, we can observe only one of these outcomes for each participant, depending on which exposure condition the individual actually receives.
The mediator is an "intermediate" outcome of the exposure and itself has potential values. For each exposure level $a$, there is a corresponding potential mediator value, denoted $M_a$. There is also a corresponding potential outcome reflecting the outcome value that would arise under the specific exposure level $a$ and the specific potential mediator value $M_a$; this potential outcome is denoted $Y_{(a,M_a)}$. Causal definitions of direct and indirect effects require extending the potential outcomes framework such that there is a potential outcome for each treatment and mediator pair. For a binary exposure $A$, there are four potential outcomes for an individual, formed by crossing both exposure values with both potential mediator values: $Y_{(1,M_1)}$, $Y_{(0,M_0)}$, $Y_{(1,M_0)}$, and $Y_{(0,M_1)}$. Only $Y_{(1,M_1)}$ and $Y_{(0,M_0)}$, which correspond to the individual receiving $A = 1$ or $A = 0$, respectively, can be observed in practice. The other two potential outcomes are hypothetical quantities (i.e., the mediator value is manipulated to take on the value it would have under the other exposure condition); these are necessary to define the causal estimands of interest, as we detail later.
Our use of the above notation implicitly makes several assumptions, often collectively referred
to as the stable unit treatment value assumption [2]. First, we assume that an individual’s potential
outcomes are not influenced by any other individual’s treatment status and that there are not multiple
versions of the treatment. These assumptions ensure that the potential outcomes are well-defined. Ad-
ditionally, we invoke the consistency and composition assumptions [2]. The consistency assumption states that the outcome observed for an individual is identical to (i.e., consistent with) the potential outcome that corresponds to their observed exposure value; similarly, their observed mediator value is the potential mediator value that corresponds to their observed exposure value, that is, $M = M_a$ and $Y = Y_a$ if $A = a$. For example, if an individual's sexual identity is LGB ($A = 1$), then their observed mediator value $M$ equals $M_1$ and their observed outcome $Y$ equals $Y_1$. Similarly, if an individual's sexual identity is heterosexual ($A = 0$), then their observed mediator value $M$ equals $M_0$ and their observed outcome $Y$ equals $Y_0$. The consistency assumption, extended to the joint exposure-mediator, is that if the observed exposure is $A = a$ and the observed mediator is $M = m$, then the observed outcome $Y$ equals $Y_{(a,m)}$. Finally, the composition assumption pertains to the nested counterfactuals and states that $Y_a = Y_{(a,M_a)}$: the potential outcome $Y_a$ obtained by intervening to set $A = a$ equals the nested potential outcome $Y_{(a,M_a)}$ obtained by intervening to set $A = a$ and $M = M_a$, the value the mediator would take if $A$ had been $a$. For example, using both the consistency and composition assumptions, if an individual's sexual identity is LGB ($A = 1$), then their observed outcome $Y$ equals $Y_1$, which equals $Y_{(1,M_1)}$; similarly, if an individual's sexual identity is heterosexual ($A = 0$), then their observed outcome $Y$ equals $Y_0$, which equals $Y_{(0,M_0)}$.
19.2.1 Definitions
When we cross the possible exposure values and potential mediator values, there are four potential outcome values:
• $Y_{(1,M_1)}$, the potential outcome under the exposed condition and the mediator value corresponding to the exposed condition.
• $Y_{(0,M_0)}$, the potential outcome under the unexposed condition and the mediator value corresponding to the unexposed condition.
• $Y_{(1,M_0)}$, the potential outcome under the exposed condition and the mediator value corresponding to the unexposed condition.
• $Y_{(0,M_1)}$, the potential outcome under the unexposed condition and the mediator value corresponding to the exposed condition.
As discussed previously, the latter two potential outcomes, $Y_{(1,M_0)}$ and $Y_{(0,M_1)}$, are often referred to as cross-world counterfactuals or cross-world potential outcomes. They are never observed for any individual, yet they allow us to more precisely define causal estimands for direct and indirect effects.
We begin by defining the total effect ($TE$) of $A$ on $Y$:
$$TE = Y_a - Y_{a'} = Y_{(a,M_a)} - Y_{(a',M_{a'})},$$
where $a$ and $a'$ are two different levels of the exposure (e.g., 1 = LGB and 0 = heterosexual).
The natural direct effect ($NDE$) and natural indirect effect ($NIE$), which sum to produce the $TE$, are defined as follows:
$$NDE_{a'} = Y_{(a,M_{a'})} - Y_{(a',M_{a'})}, \qquad NIE_a = Y_{(a,M_a)} - Y_{(a,M_{a'})}.$$
Note that the $NDE$ and $NIE$ definitions rely on the hypothetical (unobservable) potential outcomes.
Consider the following decomposition of $TE$ in the case of a binary exposure ($a = 1$ and $a' = 0$):
$$TE = Y_1 - Y_0 = \overbrace{Y_{(1,M_1)} - Y_{(0,M_0)}}^{\text{total effect}} = \overbrace{Y_{(1,M_1)} - Y_{(1,M_0)}}^{\text{natural indirect effect}} + \overbrace{Y_{(1,M_0)} - Y_{(0,M_0)}}^{\text{natural direct effect}} = NIE_1 + NDE_0 \tag{19.4}$$
This decomposition is obtained by adding and subtracting $Y_{(1,M_0)}$, the potential outcome we would observe in a world where the exposure $A = 1$ and $M$ is artificially manipulated to take the value it would naturally have under the condition $A = 0$. We can similarly define an alternative decomposition in terms of $NDE_1$ and $NIE_0$, by adding and subtracting $Y_{(0,M_1)}$ as follows:
$$TE = Y_1 - Y_0 = \overbrace{Y_{(1,M_1)} - Y_{(0,M_0)}}^{\text{total effect}} = \overbrace{Y_{(1,M_1)} - Y_{(0,M_1)}}^{\text{natural direct effect}} + \overbrace{Y_{(0,M_1)} - Y_{(0,M_0)}}^{\text{natural indirect effect}} = NDE_1 + NIE_0 \tag{19.5}$$
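As a quick worked illustration of the decomposition in Equation 19.4 (with hypothetical round numbers, not the chapter's estimates): suppose
$$E[Y_{(1,M_1)}] = 0.30, \quad E[Y_{(1,M_0)}] = 0.27, \quad E[Y_{(0,M_0)}] = 0.18.$$
Then $NIE_1 = 0.30 - 0.27 = 0.03$, $NDE_0 = 0.27 - 0.18 = 0.09$, and $TE = NIE_1 + NDE_0 = 0.12$.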
The subscripts for $NDE$ denote the condition at which the mediator is held constant, whereas the subscripts for $NIE$ denote the condition at which the exposure is held constant. Each decomposition includes an $NIE$ and an $NDE$ with opposite subscripts. $NDE_0$ is sometimes called the "pure" direct effect and $NIE_1$ is sometimes referred to as the "total" indirect effect; $NDE_1$ is sometimes referred to as the "total" direct effect and $NIE_0$ as the "pure" indirect effect [3]. Although we do not use them here, these labels are sometimes used in software output.
In the context of our motivating example, the $NDE_0$ term, $Y_{(1,M_0)} - Y_{(0,M_0)}$, compares adult smoking status corresponding to LGB versus heterosexual status, holding early smoking initiation status at the value that would be obtained if heterosexual. The individual $NDE_0$ will be non-null only if LGB status has an effect on adult smoking status when early smoking initiation status is held fixed, namely, if LGB status has a direct effect on the outcome not through the mediator. The population version of this effect is $NDE_0 = E\big[Y_{(1,M_0)} - Y_{(0,M_0)}\big]$.
The $NIE_1$ term, $Y_{(1,M_1)} - Y_{(1,M_0)}$, compares adult smoking status under the early smoking initiation status that would arise with and without the exposure condition (i.e., LGB status), for those in the exposure group (i.e., LGB women). The individual $NIE_1$ will be non-null only if LGB status has an indirect effect on adult smoking status via early smoking initiation among LGB women. The population version of this effect is $NIE_1 = E\big[Y_{(1,M_1)} - Y_{(1,M_0)}\big]$. Throughout the remainder of the chapter, $NDE_{a'}$, $NIE_a$, and $TE$ all refer to the population versions of these effects.
The above effect definitions, given as marginal mean differences, can also be defined on other scales (e.g., odds ratio, risk ratio) and as conditional on covariates. In particular, our case study involves a binary outcome, so effect definitions on the ratio scale are relevant. On the risk-ratio scale, the $NDE_{a'}$ and $NIE_a$ are defined as follows:
$$NDE_{a'} = \frac{E\big[Y_{(a,M_{a'})}\big]}{E\big[Y_{(a',M_{a'})}\big]}, \qquad NIE_a = \frac{E\big[Y_{(a,M_a)}\big]}{E\big[Y_{(a,M_{a'})}\big]},$$
with odds, $E[Y]/(1 - E[Y])$, in place of means for the odds-ratio versions. In addition, the effects may be defined conditionally on covariates (e.g., the ratio-scale effects may be defined within levels of the covariates).
FIGURE 19.3
Mediation model in which Assumption 4 is violated. $L$ denotes post-exposure confounders.
$$w_{i(a,M_{a'})} = \frac{p(M_{a'} = M_i \mid C = C_i)}{p(M_a = M_i \mid C = C_i)} \cdot \frac{1}{p(A = a \mid C = C_i)}, \tag{19.6}$$
The first term in this weight is a ratio of the probabilities that the potential mediator $M_{a'}$ (in the numerator) and the potential mediator $M_a$ (in the denominator) take on the person's observed mediator value $M_i$, given their covariate values $C_i$. The point of a probability-ratio weight is to morph one distribution so that it resembles another. Consider a subpopulation in which individuals share the same $C$ values. Within this subpopulation we are using data from individuals with actual exposure $a$ to infer the mean of $Y_{(a,M_{a'})}$, but their mediator values are $M_a$, not $M_{a'}$. Probability-ratio weighting shifts the mediator distribution so it resembles the desired distribution, that of $M_{a'}$, effectively correcting the mismatch so that the weighted outcome distribution resembles that of the cross-world potential outcome (within levels of $C$).
The second term is the usual inverse probability weight (IPW) that equates the exposed and unexposed groups on the distribution of $C$ and shifts that distribution to mimic the distribution of $C$ in the full sample. The stabilized version of the inverse probability weight, $p(A = a)/p(A = a \mid C = C_i)$, can also be used.
When estimating $E\big[Y_{(1,M_0)}\big]$ (for the decomposition in Equation 19.4), the weight is:
$$w_{i(1,M_0)} = \frac{p(M = M_i \mid A = 0, C = C_i)}{p(M = M_i \mid A = 1, C = C_i)} \cdot \frac{1}{p(A = 1 \mid C = C_i)}. \tag{19.8}$$
Similarly, for estimating $E\big[Y_{(0,M_1)}\big]$ (for the decomposition in Equation 19.5), the weight is:
$$w_{i(0,M_1)} = \frac{p(M = M_i \mid A = 1, C = C_i)}{p(M = M_i \mid A = 0, C = C_i)} \cdot \frac{1}{p(A = 0 \mid C = C_i)}. \tag{19.9}$$
In practice, these inverse probability weights can be estimated using logistic regression for a
binary exposure. For a binary mediator variable, each of the mediator probabilities can be estimated
using logistic regression. For a non-binary mediator, estimation of the first term (now a ratio of
probability densities) becomes more complicated.
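For concreteness, the following is a minimal sketch of the Equation 19.8 weights for a binary exposure and binary mediator, computed via logistic regressions on simulated toy data; the data set dat and the variables a, m, and c1 are hypothetical placeholders, not the chapter's case study.

set.seed(1)
### toy data: one covariate c1, binary exposure a, binary mediator m
dat <- data.frame(c1 = rnorm(500))
dat$a <- rbinom(500, 1, plogis(0.5 * dat$c1))
dat$m <- rbinom(500, 1, plogis(0.8 * dat$a + 0.4 * dat$c1))

### fit p(A = 1 | C) and the mediator model within each exposure group
ps.mod   <- glm(a ~ c1, family = binomial, data = dat)
med.mod0 <- glm(m ~ c1, family = binomial, data = subset(dat, a == 0))
med.mod1 <- glm(m ~ c1, family = binomial, data = subset(dat, a == 1))

p.a1 <- predict(ps.mod,   newdata = dat, type = "response")
p.m0 <- predict(med.mod0, newdata = dat, type = "response")  # p(M = 1 | A = 0, C)
p.m1 <- predict(med.mod1, newdata = dat, type = "response")  # p(M = 1 | A = 1, C)

### probability each mediator model assigns to the observed mediator value
pm0.obs <- ifelse(dat$m == 1, p.m0, 1 - p.m0)
pm1.obs <- ifelse(dat$m == 1, p.m1, 1 - p.m1)

### Equation 19.8 weight, defined for the exposed (A = 1) individuals
w.1M0 <- ((pm0.obs / pm1.obs) / p.a1)[dat$a == 1]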
The weighted outcome model is fit by duplicating the exposed individuals when estimating $E\big[Y_{(1,M_0)}\big]$ for the decomposition in Equation 19.4. Let $D$ be an indicator variable that equals 1 for the duplicates and 0 otherwise. Thus, the analytic data set is composed of the original exposed and unexposed individuals, for whom $D = 0$, and the duplicated exposed individuals, for whom $D = 1$. Weights are assigned as follows: if $A = 0$ and $D = 0$, the weight for individual $i$, $w_i$, is the inverse probability weight $1/p(A = 0 \mid C = C_i)$; if $A = 1$ and $D = 0$, it is the cross-world weight $w_{i(1,M_0)}$ of Equation 19.8; and if $A = 1$ and $D = 1$, it is $1/p(A = 1 \mid C = C_i)$, so that the duplicate rows recover $E\big[Y_{(1,M_1)}\big]$. The weighted, duplicated data are then used to fit the outcome model
$$E(Y) = \beta_0 + \beta_1 A + \beta_2 D,$$
where $\beta_1$ represents an estimate of the $NDE_0$ and $\beta_2$ represents an estimate of the $NIE_1$. This outcome model was later termed the "natural effect model" [8].

TABLE 19.1
Estimates from rmpw.
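Continuing the toy-data sketch above (again a hedged illustration, not the rmpw package's internals), the duplication scheme and natural effect model can be assembled as follows; the outcome y is simulated only so that the model runs.

dat$y    <- rbinom(500, 1, plogis(-1 + 0.3 * dat$a + 0.5 * dat$m))  # toy outcome
dat$p.a1 <- p.a1                      # p(A = 1 | C) from above
dat$pr   <- pm0.obs / pm1.obs         # mediator probability-ratio term

### original rows (D = 0): controls get IPW, exposed get the Equation 19.8 weight
orig <- transform(dat, D = 0, w = ifelse(a == 1, pr / p.a1, 1 / (1 - p.a1)))
### duplicated exposed rows (D = 1): plain IPW, recovering E[Y(1,M1)]
dup  <- transform(subset(dat, a == 1), D = 1, w = 1 / p.a1)
stacked <- rbind(orig, dup)

### natural effect model: coefficient on a estimates NDE0, on D estimates NIE1
fit <- lm(y ~ a + D, data = stacked, weights = w)
coef(fit)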
to obtain an easier set of weights to estimate. The formula below (which is equivalent to Hong's weights [7]) shows the weight for individuals $i$ in the group that experienced exposure level $a$:
$$w_{i(a,M_{a'})} = \underbrace{\frac{p(A = a' \mid M_i, C_i)}{p(A = a \mid M_i, C_i)}}_{\text{odds weight}} \;\; \underbrace{\frac{1}{p(A = a' \mid C_i)}}_{\text{IPW (for } A = a')} \tag{19.10}$$
TABLE 19.2
Estimates from causalweight.
It is the combination of an odds weight that morphs the $A = a$ group to mimic the joint $(C, M)$ distribution of the $A = a'$ group, and an inverse probability weight that shifts the $C$ distribution to that of the full sample. Note that the denominator of the IPW term here differs from that in Hong's formula; it is the conditional probability of being in the $A = a'$ (not $A = a$) group.
Equation 19.10 implies that, for the decomposition in Equation 19.4, the cross-world weights for treated units are
$$w_{i(1,M_0)} = \frac{p(A = 0 \mid M_i, C_i)}{p(A = 1 \mid M_i, C_i)} \cdot \frac{1}{p(A = 0 \mid C_i)},$$
and for the other decomposition, the cross-world weights for control units are
$$w_{i(0,M_1)} = \frac{p(A = 1 \mid M_i, C_i)}{p(A = 0 \mid M_i, C_i)} \cdot \frac{1}{p(A = 1 \mid C_i)}.$$
Huber [11] used logistic or probit regression to estimate these weights. More recently, Generalized Boosted Modeling (GBM) has been implemented in the R package twangMediation for estimating the weights [12]. GBM is a nonparametric machine learning algorithm that has been shown to outperform logistic regression [13]. Once the weights have been estimated, they are used to compute a weighted average of the observed outcomes. Specifically, to estimate $E\big[Y_{(a,M_a)}\big]$ and $E\big[Y_{(a',M_{a'})}\big]$, we use the usual inverse propensity weights and compute the average of the observed outcomes in the exposure group weighted by $\frac{p(A=1)}{p(A=1 \mid C)}$ and in the unexposed group weighted by $\frac{1-p(A=1)}{1-p(A=1 \mid C)}$, where $p(A = 1 \mid C)$ is estimated from a model specified for the exposure. For estimating $E\big[Y_{(a,M_{a'})}\big]$ and $E\big[Y_{(a',M_a)}\big]$, we use the weights in Equation 19.10. Using these estimates of the counterfactuals, the natural (in)direct effects can be computed using the definitions in Section 19.2.1.
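As a hedged sketch of Equation 19.10 (continuing the toy data above, not the package's implementation), the cross-world weights for $E\big[Y_{(1,M_0)}\big]$ require only two logistic regressions:

### p(A = 1 | M, C)
ps.cm   <- glm(a ~ c1 + m, family = binomial, data = dat)
p.a1.cm <- predict(ps.cm, type = "response")

### odds weight times IPW (Equation 19.10 with a = 1, a' = 0), for A = 1 units
w.1M0.huber <- (((1 - p.a1.cm) / p.a1.cm) / (1 - p.a1))[dat$a == 1]
### weighted average of observed outcomes among the exposed estimates E[Y(1,M0)]
weighted.mean(dat$y[dat$a == 1], w.1M0.huber)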
in magnitude in adult smoking rates). Additionally, both NDE estimates are significant, indicating
that LGB identity also has a significant effect on adult smoking that is not attributed to the early
smoking initiation pathway.
Table 19.3 presents results estimated with GBM using the twangMediation package. The TE estimate of 0.123 represents a 12.3 percentage-point difference in adult smoking rates between LGB and heterosexual women. Both NIE estimates are statistically significant, indicating that for both LGB and heterosexual women, early smoking initiation represents a significant pathway to adult smoking status (with early initiation accounting for approximately a 2-3 percentage-point increase in adult smoking rates). Additionally, both NDE estimates are significant, indicating that LGB identity also has a significant effect on adult smoking that is not attributed to the early smoking initiation pathway. Differences in estimates between the twangMediation and causalweight packages can be attributed to differences between GBM and logistic estimation of the weights.
Here the first term can be ignored because it is a constant (dropping it yields stabilized weights). The second term can be interpreted as the ratio of two densities. The denominator is the joint density of $(C, M)$ for units in the $A = a$ condition. The numerator is the joint density of $(C, M)$ for units in the $A = a'$ condition, where these units have been weighted to mimic the covariate distribution of the population. This means the cross-world weights essentially morph the $A = a$ subsample to mimic the $(C, M)$ distribution in the $A = a'$ subsample that has been weighted to mimic the population $C$ distribution. We do not use Equation 19.11 as a formula, but rely on this insight to develop a simple method for estimating the weights. It involves two steps:
1. Estimate weights for the $A = a'$ subsample to mimic the full-sample $C$ distribution.
2. Estimate weights for the $A = a$ subsample to mimic the $(C, M)$ distribution of the weighted $A = a'$ sample from the previous step. These weights are the cross-world weights.
In practice these steps can be implemented via probability models (using IPW weights from the propensity score model in the first step and odds weights from a weighted model for $A$ given $(C, M)$ in the second step, as sketched below) or via direct estimation of weights for optimal balance.
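A minimal sketch of those two steps on the same toy data (a hedged illustration; glm() warns about non-integer weights in weighted binomial fits, which is expected here):

### Step 1: IPW so the A = 0 subsample mimics the full-sample C distribution
w.step1 <- ifelse(dat$a == 0, 1 / (1 - p.a1), 1)
### Step 2: weighted model for A given (C, M); its fitted odds give the
### cross-world weights for the A = 1 subsample
fit.acm <- glm(a ~ c1 + m, family = binomial, data = dat, weights = w.step1)
p1 <- predict(fit.acm, type = "response")
w.cross <- ((1 - p1) / p1)[dat$a == 1]   # cross-world weights for E[Y(1,M0)]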
Conceptually, the connection among the three weighting estimation methods above is that the objective of the cross-world weights (for estimating $E\big[Y_{(a,M_{a'})}\big]$) is to weight the $A = a$ subsample so that it mimics the full-sample (population) $C$ distribution and the $M$-given-$C$ distribution among individuals with $A = a'$. The latter identifies the distribution of $M_{a'}$, so intuitively this achieves a swapping out of the $M_a$ distribution for the $M_{a'}$ distribution. The three weighting estimation methods all achieve this, but in different ways. Hong's (2010) [17] method achieves it directly, by combining an inverse probability weight that achieves the target $C$ distribution and a density-ratio weight that achieves the target $M$-given-$C$ distribution. Huber's (2014) [11] formula achieves it in a zig-zag manner: the $A = a$ subsample is first weighted by an odds weight to mimic the $(C, M)$ distribution of the $A = a'$ subsample, and then weighted by an inverse probability weight to morph the $C$ distribution to that of the full sample. The third method, identified by Nguyen et al. (2022) [16], is, like the first, direct weighting; its first step simply sets up the right target for use in the weighting of the $A = a$ subsample. This works because the weighted $A = a'$ subsample obtained in the first step has both elements of the target: the same $C$ distribution as the full sample (due to the inverse probability weighting) and the right $M$-given-$C$ distribution (as these are individuals with $A = a'$).

TABLE 19.4
Estimates obtained using the Nguyen et al. (2022) [16] weights.
In terms of ease of use, the three methods require fitting different models (i.e., estimating different functions), so each may be more convenient in different settings. The last two methods do not require modeling mediator densities, so they are generally easier to use (because densities are hard to estimate), especially when the mediator is non-binary or there are multiple mediators. The third method is the only one that weights the source ($A = a$ subsample) to mimic a target in one step rather than relying on multiplying two weights. Therefore, methods that directly estimate weights (i.e., methods that do not necessarily require fitting probability models) could be used to morph one sample to mimic another.
$p(A = 1 \mid C)$ is estimated based on a model specified for the exposure. $E\big[Y_{(a,M_{a'})}\big]$ is estimated by obtaining predicted values from the outcome model, $E(Y \mid A = a, M = m, C = c)$. Specifically, for each individual with $A = a'$, obtain a predicted value for the outcome had the individual been in exposure $A = a$ instead of $A = a'$ (using their observed values of the mediator and covariates). A weighted average of these predicted values for individuals with $A = a'$ is computed using the weights $\frac{1-p(A=1)}{1-p(A=1 \mid C)}$ and provides an estimate of $E\big[Y_{(a,M_{a'})}\big]$.
On the ratio scale, the proportion mediated can be computed as
$$\%\text{ Mediated} = \frac{NDE_0\,(NIE_1 - 1)}{TE - 1}.$$

19.3.5 Inverse odds ratio weighting (IORW)
This approach takes advantage of the invariance of the OR (i.e., the OR for two variables will
be the same regardless of which of the two variables is the independent variable and which is the
dependent variable). The weighting deactivates the indirect effects through the mediator by rendering
the exposure and mediator independent. Because this approach does not rely on a model for the
mediator, it is very flexible. For example, it easily accommodates mediators of any variable type (i.e.,
continuous, categorical).
If the exposure variable is binary, then a logistic regression model can be specified in which the mediator(s) and pre-exposure covariates are predictors of the exposure. The weight is then computed as the inverse of the predicted OR from this model for the individuals in the exposure group, as follows:
$$w_i = \frac{p(A = 0 \mid M_i, C_i)}{p(A = 1 \mid M_i, C_i)}.$$
Individuals in the control group are given a weight of one. Alternatively, more stable weights for the
exposure individuals may be obtained by using the inverse of the predicted odds from this model,
which is referred to as inverse odds weighting [19].
This approach is flexible in that the exposure variable need not be binary. For example, if
there are three levels for the exposure variable, then a multinomial logistic regression, ordinal
logistic regression, or other appropriate model could be used. Again, individuals in the control (i.e.,
unexposed) group are given a weight of one. Individuals in the other groups are given weights of the
form:
$$w_i = \frac{p(A = 0 \mid M_i, C_i)}{p(A = a \mid M_i, C_i)}.$$
In addition, the model for the exposure variable can be flexible (e.g., it could include polynomial terms for the mediator(s)). Similarly, the outcome model is flexible: it may be any generalized linear model with a nonlinear link function, a Cox proportional hazards model in the case of a survival outcome, or a quantile regression.
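A hedged sketch of this recipe on the earlier toy data (our illustration, not a package implementation): weight the exposed by the inverse of the predicted odds, then contrast an unweighted total-effect model with the IORW-weighted direct-effect model, obtaining the indirect effect by subtraction on the log-odds scale.

### exposure given mediator and covariates
io.mod <- glm(a ~ m + c1, family = binomial, data = dat)
p.hat  <- predict(io.mod, type = "response")
w.iorw <- ifelse(dat$a == 1, (1 - p.hat) / p.hat, 1)   # controls keep weight 1

te.fit  <- glm(y ~ a + c1, family = binomial, data = dat)                    # total effect
nde.fit <- glm(y ~ a + c1, family = binomial, data = dat, weights = w.iorw)  # direct effect
### log-OR scale: NIE estimated by subtraction (non-integer-weights warning expected)
coef(te.fit)["a"] - coef(nde.fit)["a"]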
FIGURE 19.4
Multiple mediation model in which mediators are independent (exposure A, mediators M1 and M2, outcome Y, baseline covariates C).
19.4.2 Definitions
The joint natural (in)direct effects of $M1, M2$ are defined as [20]
$$NDE_{a'} = E\big[Y_{(a,\,M1_{a'},\,M2_{a'})} - Y_{(a',\,M1_{a'},\,M2_{a'})}\big], \qquad NIE_a = E\big[Y_{(a,\,M1_a,\,M2_a)} - Y_{(a,\,M1_{a'},\,M2_{a'})}\big].$$
As before,
$$TE = NDE_1 + NIE_0 = NDE_0 + NIE_1.$$
The joint $NIE$ is the effect of $A$ on $Y$ that is mediated through $M1$ and/or $M2$, and the joint $NDE$ is the effect that is not mediated through $M1$ or $M2$.
19.4.3 Estimation
Below, we detail multiple estimation approaches that can be used with multiple (independent)
mediators.
19.4.3.1 RMPW
The RMPW method may be extended to multiple mediators [22] for the special case where all the
mediators are independent of one another conditional on baseline covariates and exposure, as in
Figure 19.4. This condition can and should be checked using data. For example, with continuous
mediators, one can model the mediators given exposure and covariates, and check within each
exposure condition whether the residuals are uncorrelated. In this special case, to obtain an estimate
of $E\big(Y_{(a,\,M1_{a'},\,M2_{a''})}\big)$, the following weights are applied to individuals $i$ in the $A = a$ group, and the outcome is averaged using these weights in this group:
$$\frac{p(M1_{a'} = M1_i \mid C_i)}{p(M1_a = M1_i \mid C_i)} \cdot \frac{p(M2_{a''} = M2_i \mid C_i)}{p(M2_a = M2_i \mid C_i)} \cdot \frac{1}{p(A = a \mid C_i)}.$$
This approach assumes correct models for the exposure and mediators but not the outcome. Because these weights rely on models for the mediators, they may become unstable if any of the mediators are continuous. Multiple mediators using RMPW are not currently implemented in the rmpw or medflex R packages.
19.4.3.2 Huber (2014)
For multiple mediators, the Huber-style cross-world weight for individuals $i$ in the $A = a$ group becomes
$$\frac{p(A = a' \mid M1_i, M2_i, C_i)}{p(A = a \mid M1_i, M2_i, C_i)} \cdot \frac{1}{p(A = a' \mid C_i)}. \tag{19.12}$$
The causalweight package implements Huber’s weighting approach, estimated using logistic
or probit regression, the mediationClarity package implements it using logistic regression, and
the twangMediation package implements it using GBM or logistic regression. We demonstrate
this approach with the causalweight and twangMediation packages, using an extension of
our prior LGB disparities analyses examining the effect of a single mediator, early smoking initiation,
on adult smoking status. We now consider another mediator, early alcohol initiation, alc15, in
addition to early smoking initiation, cig15. The outcome is an indicator for whether an individual
meets criteria for either alcohol or nicotine dependence, alc cig depend, in adulthood.
Recall that causalweight provides marginal risk difference estimates. As shown in Table 19.7, the TE estimate of 0.110 represents an 11.0 percentage-point difference in the prevalence of alcohol/nicotine dependence between LGB and heterosexual women (standardized to the full-population covariate distribution), indicating a significant disparity. Both NIE estimates are significant, indicating that early initiation of alcohol and smoking jointly represent a significant mediating pathway to adult dependence status (with early initiation accounting for approximately a 3 percentage-point increase in adult dependence rates). Examining the ratio of NIE to TE, we conclude that early initiation accounts for approximately 30% of the adult disparity in alcohol/nicotine dependence. Additionally, both NDE estimates are significant, indicating that LGB identity also has a significant effect on adult alcohol/nicotine dependence that is not attributed to early initiation of alcohol or smoking.
Table 19.8 presents the estimates from using twangMediation to implement Huber's approach. Recall that twangMediation provides marginal risk difference estimates. As shown in Table 19.8, the TE estimate of 0.084 represents an 8.4 percentage-point difference in the rates of alcohol/nicotine dependence between LGB and heterosexual women, indicating a significant disparity. Both NIE estimates are significant, indicating that early initiation of alcohol and smoking jointly represent a significant mediating pathway to adult dependence status (with early initiation accounting for approximately a 3 percentage-point increase in adult dependence rates). Examining the ratio of NIE to TE, we conclude that early initiation accounts for approximately 30% of the adult disparity in alcohol/nicotine dependence. Additionally, both NDE estimates are significant, indicating that LGB identity also has a significant effect on adult alcohol/nicotine dependence that is not attributed to early initiation of alcohol or smoking.
TABLE 19.9
Multiple mediator setting: Estimates using the Nguyen et al. [16] weights.
scale. Summing across the NDE and NIE estimates, we obtain a TE estimate of 0.087, representing an 8.7 percentage-point difference in the rates of alcohol/nicotine dependence between LGB and heterosexual women. The NIE estimate is significant, indicating that early initiation of alcohol and smoking jointly represent a significant mediating pathway to adult dependence status (with early initiation accounting for approximately a 2.7 percentage-point increase in adult dependence rates). Additionally, both NDE estimates are significant, indicating that LGB identity also has a significant effect on adult alcohol/nicotine dependence that is not attributed to early initiation of alcohol or smoking.
$$\frac{p(A = a')}{p(A = a' \mid C_i)}$$
are used to obtain a weighted average of $Y$ as an estimate of $E\big(Y_{(a',\,M1_{a'},\,M2_{a'})}\big)$. Finally, an estimate of $E\big(Y_{(a,\,M1_{a'},\,M2_{a'})}\big)$ is obtained using an outcome model, $E(Y \mid A = a, M1, M2, C)$: for each individual with $A = a'$, obtain a predicted value for the outcome had the individual been in exposure $A = a$ instead of $A = a'$. Each individual's observed values of the mediators and covariates are used in obtaining the predicted values. A weighted average of these predicted values for individuals with $A = a'$ is computed using the weights displayed above and provides an estimate of $E\big(Y_{(a,\,M1_{a'},\,M2_{a'})}\big)$. As is the case for Albert's (2012) [23] approach, this extension to multiple mediators does not require a model for the mediators. Thus, it is advantageous when the analyst is more comfortable specifying models for the exposure and outcome.
VanderWeele & Vansteelandt’s [20] weighting approach for multiple mediators is implemented in
the CMAverse R package. Table 19.10 presents the estimates from our case study. On the OR scale,
the TE estimate is 1.826, indicating that LGB women have approximately 1.8 times the odds of adult
TABLE 19.10
Multiple mediator setting: Estimates using VanderWeele & Vansteelandt’s [20] weighting approach.
TABLE 19.11
Multiple mediator setting: IORW estimates.
alcohol/nicotine dependence as heterosexual women. Both NIE estimates are statistically significant, indicating that LGB status increases the odds of adult dependence by a factor of approximately 1.2 through the pathways of early alcohol and smoking initiation. Results indicate that early alcohol/smoking initiation accounts for approximately 43% of the total disparity in adult alcohol/nicotine dependence. Additionally, the NDE estimates indicate that LGB status increases the odds of adult dependence by approximately 1.5 times through alternative pathways excluding early initiation.
19.4.3.5 IORW
IORW is easily extended to multiple mediators [24]. As in the case of a single mediator, this approach
does not specify a model for the mediators. It can accommodate multiple mediators of mixed variable
types, and interactions between the mediators and exposure without the need to specify them.
The IORW approach for multiple mediators is implemented in the CMAverse package in R.
Table 19.11 presents the estimates from our case study. The TE estimate indicates that LGB women
have approximately 1.8 times the odds of adult alcohol/nicotine dependence as heterosexual women.
The N IE1 estimate is statistically significant, indicating that LGB status increases the odds of
adult dependence by 1.13 times through the pathways of early alcohol and smoking initiation,
which accounts for approximately 26% of the total disparities in adult alcohol/nicotine dependence.
The statistically significant NDE estimate indicates that LGB status increases the odds of adult
alcohol/nicotine dependence by 1.57 times through alternative pathways excluding early alcohol and
smoking initiation.
of interventions at the population level rather than the individual level and have also been called
stochastic interventions or randomized intervention analogs. The advantage is that they are identified
without invoking cross-world independence assumptions but the disadvantage is that the direct and
indirect effects do not necessarily sum to the total effect.
19.5.1 Definitions
We use the same potential outcomes notation defined in Section 19.1.3. In addition, we let $G_{a|C}$ denote a value of the mediator randomly drawn from the population distribution of those with exposure level $a$, conditional on baseline confounders $C$. This value is used in place of $M_0$ or $M_1$ in the definitions of the natural direct and indirect effects given previously. The interventional effects differ from the natural effects in that the mediator is fixed to a value randomly drawn from the mediator distribution of the exposed or the unexposed, rather than to the individual's particular mediator value in the presence or absence of exposure. Specifically, the interventional direct and indirect effects, defined by VanderWeele et al. (2014) [25], are
$$IDE = E\big[Y_{(1,\,G_{0|C})} - Y_{(0,\,G_{0|C})}\big], \qquad IIE = E\big[Y_{(1,\,G_{1|C})} - Y_{(1,\,G_{0|C})}\big].$$
Their sum is termed the overall effect, referred to as such because it may or may not equal the total effect [25].
Beyond the interventional (in)direct effects just defined, other definitions are possible. One of us has argued that the definition of interventional effects should be informed by the specific research question in each application; see a thorough and accessible discussion of this topic in Nguyen et al. (2021) [26]. For simplicity, we restrict attention here to the interventional (in)direct effects defined above.
The weights that target $(a', G_{a'|C})$ are defined for units in the $A = a'$ condition using
$$w_{a'a'} = \frac{1}{p(A = a' \mid C_i)} \cdot \frac{p(M = M_i \mid A = a', C_i)}{p(M = M_i \mid A = a', C_i, L_i)}.$$
The weights that target $(a, G_{a'|C})$ are defined for units in the $A = a$ condition using the formula
$$w_{aa'} = \frac{1}{p(A = a \mid C_i)} \cdot \frac{p(M = M_i \mid A = a', C_i)}{p(M = M_i \mid A = a, C_i, L_i)}.$$
The weights that target $(a, G_{a|C})$ are defined, also for units in the $A = a$ condition, using the formula
$$w_{aa} = \frac{1}{p(A = a \mid C_i)} \cdot \frac{p(M = M_i \mid A = a, C_i)}{p(M = M_i \mid A = a, C_i, L_i)}.$$
Each of these three formulas includes two terms. The first is an inverse-propensity weight that helps mimic the $C$ distribution of the full sample. The second term is a ratio of mediator densities (or probabilities), where the denominator reflects the actual mediator distribution, which depends on $L$, and the numerator reflects the target mediator distribution ($G_{a'|C}$, $G_{a'|C}$, and $G_{a|C}$, respectively), which does not condition on $L$. In the absence of $L$, these weights reduce to the same weights used for the natural (in)direct effects: $w_{00}$ and $w_{11}$ become the inverse-propensity weights, and $w_{aa'}$ becomes the $w_{(a,M_{a'})}$ defined in (19.7).
For R code to implement these weights in the simple case where the mediator is binary, see
section 19.7.7.
on observed covariates, those comparisons are unconfounded and can yield consistent estimates of
the various causal estimands described earlier in this chapter. These ignorability assumptions cannot
be tested. However, regardless of whether these assumptions hold, if there are differences across
groups in the observed covariates that are related to either the outcome or the mediator, then they can
introduce bias into estimates unless adjusted for through modeling or weighting. Thus, at a minimum, for estimation methods that use weighting, the weighted distributions of the covariates should match across comparison groups, and methods that use modeling need to check the adequacy of the model for the covariates.
Various methods estimate the total, direct, and indirect effects as differences of weighted means, $\sum_i w_{(a,M_{a'}),i} Y_i \big/ \sum_i w_{(a,M_{a'}),i}$, for $a = 0, 1$ and $a' = 0, 1$. For instance, Huber's method [11] estimates $NIE_1$ as $\sum_i w_{(1,M_1),i} Y_i \big/ \sum_i w_{(1,M_1),i} - \sum_i w_{(1,M_0),i} Y_i \big/ \sum_i w_{(1,M_0),i}$, where the sample is restricted to the treatment group, with corresponding differences for the other effect estimates. Any differences in the distributions of the covariates when weighted by each of these sets of weights could result in bias in the $TE$, $NIE$, and $NDE$ estimates. Moreover, if the weights were known and not estimated, then $E\big[w_{(a,M_{a'})} C_j \mid A = a\big] = E[C_j]$ for all covariates $C_j$. Thus, if the estimated weights are adequate, then the weighted means of all of the covariates should be approximately equal.
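A hedged sketch of such a check on the earlier toy data (packages such as twangMediation automate far richer diagnostics):

wm <- function(x, w) sum(w * x) / sum(w)   # weighted mean helper
### the c1 means under the w11 (IPW) and w10 (cross-world) weights should
### both be close to the full-sample mean of c1
c(full = mean(dat$c1),
  w11  = wm(dat$c1[dat$a == 1], 1 / p.a1[dat$a == 1]),
  w10  = wm(dat$c1[dat$a == 1], w.1M0.huber))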
Checking that the weighted means and distributions of the covariates are approximately the same provides a check on the potential for bias and on the adequacy of the estimates of the conditional probability functions used in estimating the weights (e.g., $P(A = a \mid C)$ or $P(A = a \mid C, M)$ for $a = 0, 1$). The twangMediation package, for instance, compares the weighted means of all covariates used in the models for each of the pairwise differences used to estimate the five effects: $TE$, $NDE_0$, $NIE_1$, $NDE_1$, and $NIE_0$. For each pair of weights, the Kolmogorov-Smirnov (K-S) statistic is used to compare the weighted distributions of each covariate.
Balance of the covariates is not sufficient for removing possible bias in the estimated effects. The cross-world weighted distribution of the mediator for the treatment group must match the mediator distribution of the control group weighted to match the population, i.e., the IPW distribution for the control group. Otherwise, the estimate of the cross-world mean $E\big(Y_{(1,M_0)}\big)$, and therefore the $NIE_1$ and $NDE_0$, could be biased. Similarly, for estimating $E\big(Y_{(0,M_1)}\big)$ and $NIE_0$ and $NDE_1$, the cross-world weighted distribution of the mediator for the control group must match the IPW distribution of the mediator in the treatment group. Hence, the comparability of these weighted distributions should also be checked, as sketched below. To our knowledge, twangMediation is the only package that provides these checks. For example, the twangMediation package provides a graphical comparison of the density of the cross-world weighted mediator distribution for the treatment (control) group with the density of the IPW distribution of the mediator in the control (treatment) group. It also provides the standardized mean difference between the two weighted distributions and the K-S statistic as a measure of the similarity of the distributions. If the densities are visually similar and the standardized mean difference and K-S statistics are small, then the distributions are similar, which supports the unbiasedness of the effect estimates.
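On the toy data, a hedged one-line version of this mediator check compares two weighted means:

### cross-world weighted mediator mean (treated) vs. IPW-weighted mediator
### mean (control): these should be close
wm(dat$m[dat$a == 1], w.1M0.huber)
wm(dat$m[dat$a == 0], 1 / (1 - p.a1[dat$a == 0]))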
Tables 19.13, 19.14, and 19.15 present the output from twangMediation for the checks
described above. Table 19.13 is the unweighted balance table corresponding to the TE estimand.
The tx.mn and ct.mn columns represent the mean covariate values for the treatment and control
groups, respectively. The std.eff.sz column shows the standardized mean differences between
the treatment and control groups; values near 0 indicate very similar covariate values across groups.
As shown in Table 19.13, the treatment and control groups exhibit sizeable differences on several
covariates at baseline, prior to weighting. Table 19.14 is the weighted balance table corresponding to
the TE estimand. We can see that weighting has significantly improved balance, with standardized
mean differences uniformly near 0.
Table 19.15 is the weighted balance table corresponding to the N IE1 estimand. Note that the
tx.mn and ct.mn columns represent the mean covariate values for the treatment group weighted
by w11 weights and the treatment group weighted by w10 weights, respectively. Examining the
TABLE 19.13
Balance Table for Unweighted Covariate Differences for the T E estimand.
standardized mean differences, we conclude that weighting has achieved very good covariate balance. Similar tables are also provided for the $NIE_0$, $NDE_1$, and $NDE_0$ estimands.
rmpw: Summary
library(rmpw)
results <- rmpw(data = NSDUH_female,
                treatment = "lgb",
                mediator = "cig15",
                outcome = "smoke",
                ## the remaining arguments are an assumed completion (check
                ## ?rmpw): covariates for the propensity model and the
                ## requested decomposition
                propensity_x = c("age", "race", "educ", "income", "employ"),
                decomposition = 1)
TABLE 19.15
Balance table for N IE1 estimand.
Note that the user can specify which decomposition is estimated. When decomposition = 0, a single estimate each for the NIE and NDE is generated (thereby assuming $NDE_1 = NDE_0$ and $NIE_1 = NIE_0$). The package also includes decompositions based on VanderWeele's three-way decomposition [27]. When decomposition = 1, the total effect is decomposed into the pure direct effect ($NDE_0$), the total and pure indirect effects ($NIE_1$ and $NIE_0$), and the corresponding natural treatment-by-mediator interaction effect ($NIE_1 - NIE_0$). When decomposition = 2, the total effect is decomposed into the pure indirect effect ($NIE_0$), the total and pure direct effects ($NDE_1$ and $NDE_0$), and the corresponding natural treatment-by-mediator interaction effect ($NDE_1 - NDE_0$).
Missing values are not allowed on the outcome or covariates. Standard errors are obtained using
the two-step estimation method described in Bein et al. [28].
19.7.2 The medflex package
The medflex package [10] implements methods proposed by Lange et al. [29] and Vansteelandt
et al. [8], using both a weighting-based approach (our focus) and an imputation-based approach.
Weighting is implemented using Hong’s RMPW method [17].
medflex: Summary
Estimation of mediation effects via medflex requires specification of two models: (1) a regres-
sion model for the mediator and (2) a natural effects model for the counterfactual outcome. Using the
neWeight() function, the user specifies the mediator model, which can be fitted using the glm()
function (default) or the vglm() function from the VGAM package. When specifying the mediator
model, the exposure variable should be listed as the first variable on the right-hand side followed by
potential treatment-mediator confounders. Binary exposure variables need to be specified as factor
variables. The neWeight() function returns an expanded version of the original dataset which
includes observations corresponding to counterfactual values of the exposure. This expanded dataset
includes two new counterfactual variables, named in the form of exposure0 and exposure1. Next,
using the neModel() function, the user specifies the natural effects model. This model regresses
the outcome on both counterfactual exposure variables created via neWeight(); potential con-
founders can also be included in the outcome model. The neModel() function estimates models
using glm(). The specified dataset should be the expanded dataset created as the returned object
from the neWeight() function. Use of neEffdecomp() returns estimates for the TE, NDE,
and NIE. Standard errors can be obtained using the summary() function. By default, standard
error estimation is performed based on the i.i.d. bootstrap. Note that medflex additionally allows
the option of robust standard errors based on the sandwich estimator when models are estimated
using glm() (specify se = ‘‘robust"). The sample syntax provided below assumes a binary
exposure. See the medflex documentation for the specific syntax to implement these models using
a multinomial or continuous exposure.
library(medflex)
### Fit mediator model using neWeight() on original data;
### the returned object is an expanded dataset
expData <- neWeight(cig15 ~ factor(lgb) + age + race +
                        educ + income + employ,
                    family = binomial,
                    data = NSDUH_female)
### Then fit the natural effect model on the expanded dataset
results <- neModel(smoke ~ lgb0 + lgb1,
                   expData = expData)
### summary() displays estimates from the natural effect model with SEs
summary(results)
### neEffdecomp() lists the TE, NDE, and NIE
neEffdecomp(results)
When estimating marginal risk differences, the estimates obtained from medflex are compara-
ble to those from rmpw.
Conditional odds ratio code
library(medflex)
### Fit mediator model using neWeight() on original data;
### the returned object is an expanded dataset
expData <- neWeight(cig15 ~ factor(lgb) + age +
                        race + educ + income + employ,
                    family = binomial,
                    data = NSDUH_female)
### Then fit the natural effect model on the expanded dataset
results <- neModel(smoke ~ lgb0 + lgb1 + age +
                       race + educ + income + employ,
                   family = binomial,
                   expData = expData)
### summary() displays estimates from the natural effect model with SEs
summary(results)
### neEffdecomp() lists the TE, NDE, and NIE
neEffdecomp(results)
causalweight: Summary
• Exposure: binary
• Mediator: continuous, binary, count, nominal, ordinal
• Outcome: any type
Multiple mediators? Yes
Causal mediation analysis is implemented in the causalweight package using the function
medweight(), as shown below:
library(causalweight)
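### A hedged sketch of a medweight() call; the argument names (y, d, m, x,
### trim, logit, bootstrap) are assumptions based on our reading of the
### causalweight documentation; check ?medweight before use
results <- medweight(y = NSDUH_female$smoke,
                     d = NSDUH_female$lgb,
                     m = NSDUH_female$cig15,
                     x = NSDUH_female[, c("age", "race", "educ",
                                          "income", "employ")],
                     trim = 0.05,
                     logit = TRUE,
                     bootstrap = 1999)
results$results   # assumed output slot: effects with bootstrap SEs and p-values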
twangMediation: Summary
Effects estimated: T E, N DE0 , N IE1 , N DE1 , N IE0
Effect scale: marginal risk difference
Models: Exposure and outcome
Variable types allowed:
• Exposure: binary
• Mediator: continuous, binary, count, nominal, ordinal
• Outcome: any type
Multiple mediators? Yes
Estimation of mediation effects via twangMediation requires two steps: (1) estimation of
propensity score weights for estimating the TE and (2) estimation of cross-world weights using
Huber’s approach. The first analytic step is to estimate propensity score weights used for weighting
the counterfactual means, E(Y(1,M1 ) ) and E(Y(0,M0 ) ), (i.e., TE weights). While these weights can
be estimated in any manner, we recommend estimating these weights with GBM using the ps()
function in the twang package (as shown below) such that all weights are consistently estimated
using GBM. Next, we use the wgtmed() function to estimate the cross-world weights and obtain
the mediation estimates of interest. By default, wgtmed() estimates the cross-world weights using
GBM (although logistic regression may also be specified).
library(twang)
### Estimate propensity score weights for the exposure
TEps <- ps(formula = lgb ~ age + race + educ + income
                     + employ,
           data = NSDUH_female,
           n.trees = 7500,
           estimand = "ATE")
### Mediation estimands obtained using the wgtmed() function
library(twangMediation)
results <- wgtmed(formula.med = cig15 ~ age + race + educ
                                + income + employ,
                  a_treatment = "lgb",
                  y_outcome = "smoke",
                  data = NSDUH_female,
                  method = "ps",
                  total_effect_ps = TEps,
                  total_effect_stop_rule = "ks.mean",
                  ps_version = "gbm",
                  ps_n.trees = 7500,
                  ps_stop.method = "ks.mean")
The twangMediation package provides estimates of both direct effects, N DE0 and N DE1 ,
as well as both indirect effects, N IE0 and N IE1 . This package also provides robust balance
diagnostics (as detailed in Section 19.6).
Multiple mediator code
The twangMediation package also estimates the joint effects of multiple mediators. The first
step of estimating the propensity score weights for the exposure (lgb) is the same as in the prior
example with a single mediator. However, in the second step using wgtmed(), both mediators are
included on the left-hand side of the formula notation, separated by “+”.
### In the wgtmed() call, specify multiple mediators on the left-hand side,
### separated by "+"
results <- wgtmed(cig15 + alc15 ~ age + race + educ +
                                  income + employ,
                  a_treatment = "lgb",
                  y_outcome = "alc_cig_depend",
                  data = NSDUH_female,
                  method = "ps",
                  total_effect_ps = TEps,
                  total_effect_stop_rule = "ks.mean",
                  ps_version = "gbm",
                  ps_n.trees = 6000,
                  ps_stop.method = "ks.mean")
IORW approach
### Specify inverse odds ratio weighting approach with method = "iorw"
install.packages("devtools")
devtools::install_github("BS1125/CMAverse")
library(CMAverse)
results <- cmest(data = NSDUH_female,
model = "iorw",
outcome = "smoke",
exposure = "lgb",
mediator = "cig15",
basec = c("age", "race", "educ",
"income", "employ"),
yreg = "logistic",
ereg = "logistic",
                 inference = "bootstrap")
summary(results)
The outcome, exposure, and mediator are defined, and the pre-exposure confounders are spec-
ified in the basec= argument. The yreg= argument specifies the link function for the outcome
regression model, and the ereg= argument specifies the link function for the exposure regression
model. When the outcome is binary, the CMAverse package returns the following estimands: total
effect odds ratio (labeled RTE), pure natural direct effect, N DE0 , odds ratio (labeled RPNDE), total
natural indirect effect, N IE1 , odds ratio (labeled RTNIE), and proportion mediated (labeled PM).
We note that this package does not provide the alternative decomposition of N DE1 and N IE0 . By
default, standard errors are calculated by bootstrapping.
### Specify inverse odds ratio weighting approach with method = "iorw"
results <- cmest(data = NSDUH_female,
model = "iorw",
outcome = "alc_cig_depend",
exposure = "lgb",
mediator = c("cig15", "alc15"),
basec = c("age", "race", "educ",
"income", "employ"),
yreg = "logistic",
ereg = "logistic",
inference = "bootstrap")
summary(results)
The syntax is similar to that for the IORW method shown above. However, the EMint= argument
is required to specify whether the outcome model should include an exposure-mediator interaction.
Additionally, the mval= argument is required, for each mediator variable, to specify a value at
which the variable is controlled. When the outcome is a binary variable, the CMAverse package
returns the following estimands: controlled direct effect odds ratio (labeled RCDE), pure natural
direct effect, N DE0 , odds ratio (RPNDE), total natural direct effect, N DE1 , odds ratio (RTNDE),
pure natural indirect effect, N IE0 , odds ratio (RPNIE), total natural indirect effect, N IE1 , odds
ratio (RTNIE), total effect odds ratio (RTE), and overall proportion mediated (PM).
Multiple mediator code
mediationClarity: summary
We now present the code for estimation of the weights and for the three estimators of marginal
natural (in)direct effects. We then briefly comment on using the weights to estimate conditional
effects. All the code in this section is essentially the same regardless of the number of mediators.
Weights estimation
All three of the highlighted estimators rely on the same weighting scheme, which combines same-world and cross-world weighting. Same-world weighting weights the treated and control subsamples to mimic the full-sample covariate distribution. Cross-world weighting weights the treated or control subsample (which one depends on the specific choice of TE decomposition) to mimic the full-sample covariate distribution and to mimic the mediator-given-covariates distribution in the other subsample.
w.med <- weight_med(
    data = NSDUH_female,
    cross.world = "10",
    a.c.form = "lgb~age+race+educ+income+employ",
    a.cm.form = "lgb~age+race+educ+income+employ+cig15+alc15",
    plot = TRUE,
    c.std = "age"
)
Function weight med() estimates the weights and outputs the weighted data plus a weight
distribution plot and a balance plot, using the method based on the third expression for the cross-
world weights (see section 19.3.3). Argument cross.world accepts three values: "10" for the
(NDE0 , NIE1 ) effect pair, "01" for the (NIE0 , NDE1 ) effect pair, and "both" for both pairs.
Currently the weights are estimated based on logit models for P(A | C) and P(A | C, M),
whose formulas are specified in the arguments a.c.form and a.cm.form, and the cross-world
weight is computed using the formula in Equation 19.10. The models can be made more flexible
to reduce misspecification and improve balance, for example, by using spline terms for continuous
variables or adding interaction terms among the variables. An alternative that estimates the weights
via direct optimization for balance (not using probability models) will be added later.
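For example, assuming age is continuous and the splines package is available, spline and interaction
terms can be passed directly in the formula strings (an illustrative sketch, not code from the package
documentation):
### Illustrative, more flexible specification: natural splines for age
### and a race-by-education interaction (variable types are assumptions)
library(splines)
w.med.flex <- weight_med(
    data = NSDUH_female,
    cross.world = "10",
    a.c.form = "lgb~ns(age,3)+race*educ+income+employ",
    a.cm.form = "lgb~ns(age,3)+race*educ+income+employ+cig15+alc15",
    plot = TRUE,
    c.std = "age"
)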
The following code extracts the output (weighted data and diagnostic plots).
wdat <- w.med$w.dat
w.med$plots$w.wt.distribution
w.med$plots$balance
The balance plot shows mean balance for (1) the covariates across the pseudo (i.e., weighted)
treated, control, and cross-world samples and the full sample; and (2) the mediators between the
pseudo cross-world and pseudo control samples if cross.world = "10", and between the
pseudo cross-world and pseudo treated samples if cross.world = "01". The arguments c.std
and m.std indicate for which covariates and mediators to use standardized mean differences. This
function (as well as the estimator functions below) can handle sampling weights; see the documentation
for details.
This function inherits weighting-related parameters from the weight_med() function. In addi-
tion, the argument effect.scale accepts "MD" for mean/risk difference, "MR" for mean/risk/rate
ratio, and "OR" for odds ratio. The argument y.var specifies the name of the outcome variable. Two
bootstrap methods are available, the classic resampling bootstrap [31] and a recent method using
continuous bootstrap weights [32]; these are specified by boot.method = "resample" and
boot.method = "cont-wt", respectively.
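The call that produces est.wtd below can be sketched as follows (hedged: the weighting arguments
are inherited from weight_med() as just noted, the outcome name smoke is taken from the formulas
used later in this section, and the bootstrap settings are illustrative):
### Hedged sketch of an estimate_wtd() call
est.wtd <- estimate_wtd(
    data = NSDUH_female,
    cross.world = "10",
    a.c.form = "lgb~age+race+educ+income+employ",
    a.cm.form = "lgb~age+race+educ+income+employ+cig15+alc15",
    y.var = "smoke",
    effect.scale = "MD",
    boot.num = 999,
    boot.method = "cont-wt",
    boot.seed = 77777
)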
The following code extracts the output.
est.wtd$estimates
est.wtd$plots$w.wt.distribution
est.wtd$plots$key.balance
An additional outcome-model formula needs to be specified for this model. The output structure
is exactly the same as that of estimate_wtd().
est.Y2predR <- estimate_Y2predR(   ### function and object names per the text below
    y.c0.form = "smoke~age+race+educ+income+employ",
    y.c1.form = "smoke~age+race+educ+income+employ",
    y.cm1.form = "smoke~age+race+educ+income+employ+cig15+alc15",
    y10.c.form = "smoke~age+race+educ+income+employ",
    y.link = "logit",
    boot.num = 999,
    boot.method = "cont-wt",
    boot.seed = 77777
)
est.NDEpredR <- estimate_NDEpredR(   ### object name assumed; function name per the text below
    y.c0.form = "smoke~age+race+educ+income+employ",
    y.c1.form = "smoke~age+race+educ+income+employ",
    y.cm1.form = "smoke~age+race+educ+income+employ+cig15+alc15",
    nde0.c.form = "effect~age+race+educ+income+employ",
    y.link = "logit",
    boot.num = 999,
    boot.method = "cont-wt",
    boot.seed = 77777
)
The function estimate_NDEpredR() has very similar arguments to those of estimate_Y2predR().
The difference is that instead of using the model regressing the predicted cross-world potential out-
come on covariates (y10.c.form), it uses a model regressing a proxy of the natural direct effect
on covariates (nde0.c.form). Also, this function does not have an effect.scale argument,
because it applies only to additive effects. The output structure is the same between the two functions.
est.Y2predR$estimates
est.Y2predR$plots$w.wt.distribution
est.Y2predR$plots$balance
The odds ratios associated with A0 and A1 estimate NDE0 and NIE1, respectively, if
weight_med() was run with cross.world = "10". They estimate NIE0 and NDE1, respec-
tively, if weight_med() was run with cross.world = "01".
Based on these fitted models (the treatment model a.on.c and the mediator models m.on.ac and
m.on.acl used below), we can compute the weights and form the corresponding pseudo samples.
Here we compute w00 (for control units) and w11 and w10 (for treated units). If the condition
(0, G1|C) is of interest, the code is easily modified to compute w01 (for control units).
### Compute the C-weight term
tmp <- NSDUH_female
ps <- predict(a.on.c, type = "response")
tmp$c.wt <- tmp$lgb / ps + (1-tmp$lgb) / (1-ps) ### 1/P(A=1|C) if treated, 1/P(A=0|C) if control
rm(ps, a.on.c)
### Prepare the 3 pseudo samples before computing M-weight
dat00 <- tmp[tmp$lgb==0,]
dat10 <- tmp[tmp$lgb==1,]
dat11 <- tmp[tmp$lgb==1,]
rm(tmp)
### Compute M-weight on pseudo samples 00 and 11: the ratio of the
### predicted probabilities from the two fitted mediator models
dat00$m.wt <-
predict(m.on.ac, newdata = dat00, type = "response") /
predict(m.on.acl, newdata = dat00, type = "response")
dat11$m.wt <-
predict(m.on.ac, newdata = dat11, type = "response") /
predict(m.on.acl, newdata = dat11, type = "response")
### Compute M-weight on pseudo sample 10
tmp <- dat10
tmp$lgb <- 0 ### evaluate the numerator model at A = 0 (cross-world)
dat10$m.wt <-
predict(m.on.ac, newdata = tmp, type = "response") /
predict(m.on.acl, newdata = dat10, type = "response")
rm(tmp, m.on.ac, m.on.acl)
### Multiply C-weight and M-weight
dat00$wt <- dat00$c.wt * dat00$m.wt
dat10$wt <- dat10$c.wt * dat10$m.wt
dat11$wt <- dat11$c.wt * dat11$m.wt
### Collect all weighted data
wdat <- rbind(cbind(dat11, samp = "p11"),
cbind(dat00, samp = "p00"),
cbind(dat10, samp = "p10"))
rm(dat11, dat00, dat10)
The result of this code is a dataset, wdat, that includes three pseudo samples corresponding to the
three target conditions (0, G0|C) (indicated by samp = "p00"), (1, G1|C) (samp = "p11"),
and (1, G0|C) (samp = "p10").
To use these data to estimate IDE0 and IIE1, we use the same trick of dummy coding the
pseudo samples before fitting the outcome regression used at the end of Section 19.7.6 above. Note that
here the coefficients of the dummy variables A0 and A1 estimate IDE0 and IIE1, respectively.
When targeting the other pair of interventional (in)direct effects, the coefficients of A0 and A1
estimate IIE0 and IDE1. For completeness, the code follows.
### Make 2 dummies A0, A1
wdat$A0 <- ifelse(wdat$samp=="p00", -1, 0)
wdat$A1 <- ifelse(wdat$samp=="p11", 1, 0)
### Marginal additive effects: use either model below
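### Hedged sketch of one such model: the outcome name "smoke" and the
### use of a weighted linear model are illustrative assumptions
fit.add <- lm(smoke ~ A0 + A1, weights = wt, data = wdat)
coef(fit.add)[c("A0", "A1")] ### estimates of IDE0 and IIE1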
19.8 Conclusions
The focus of this chapter has been on weighting-based estimators, but it should be noted that there are
other estimators (see Nguyen et al. (2022) [16] for an overview). These include a method proposed by
Imai et al. (2010) [33], implemented in the R package mediation; an imputation-based
approach introduced by Vansteelandt et al. (2012) [8], implemented in the medflex package;
and regression-based approaches, covered in detail in VanderWeele (2015) [5], most of
which are implemented in the CMAverse package.
As weighting is a key component in many estimators, it is crucial to understand the target
of the weighting components (the three conditions being contrasted, namely treated, control, and
cross-world), the different ways to implement them, and how to check for the desired balance that
should be achieved by the weighting. This book chapter and the code we cover here facilitate these
important tasks.
References
[1] M. S. Schuler and R. L. Collins. Sexual minority substance use disparities: Bisexual women
at elevated risk relative to other sexual minority groups. Drug Alcohol Depend, 206:107755,
2020.
[2] T. J. VanderWeele and S. Vansteelandt. Conceptual issues concerning mediation, interventions,
and composition. Statistics and Its Interface, 2:457–468, 2009.
[3] J. M. Robins and S. Greenland. Identifiability and exchangeability for direct and indirect effects.
Epidemiology, 3(2):143–55, 1992.
[4] Trang Quynh Nguyen, Ian Schmid, Elizabeth L. Ogburn, and Elizabeth A. Stuart. Clarifying
causal mediation analysis for the applied researcher: Effect identification via three assumptions
and five potential outcomes. Journal of Causal Inference, 10(1):246–279, 2022.
[5] T. J. VanderWeele. Explanation in Causal Inference: Methods for Mediation and Interaction.
Oxford University Press, New York, 2015.
[6] J. Pearl. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncer-
tainty in Artificial Intelligence. Morgan Kaufmann, 2001.
[7] G. Hong. Ratio of mediator probability weighting for estimating natural direct and indirect
effects. In Joint Statistical Meetings. American Statistical Association, 2010.
[8] S. Vansteelandt, Maarten Bekaert, and Theis Lange. Imputation strategies for the estimation of
natural direct and indirect effects. Epidemiologic Methods, 1(1):131–158, 2012.
[9] Xu Qin, Guanglei Hong, and Fan Yang. rmpw: Causal Mediation Analysis Using Weighting
Approach, 2018. R package version 0.0.4.
[10] Johan Steen, Tom Loeys, Beatrijs Moerkerke, and Stijn Vansteelandt. medflex: An R package
for flexible mediation analysis using natural effect models. Journal of Statistical Software,
76(11):1–46, 2017.
[11] Martin Huber. Identifying causal mechanisms (primarily) based on inverse probability weight-
ing. Journal of Applied Econometrics, 29(6):920–943, 2014.
[12] Dan McCaffrey, Katherine Castellano, Donna Coffman, Brian Vegetabile, and Megan Schuler.
twangMediation: twang Causal Mediation Modeling via Weighting, 2021. R package version
1.0.
[13] B. K. Lee, J. Lessler, and E. A. Stuart. Improving propensity score weighting using machine
learning. Statistics in Medicine, 29(3):337–346, 2010.
[14] Hugo Bodory and Martin Huber. causalweight: Estimation Methods for Causal Inference Based
on Inverse Probability Weighting, 2021. R package version 1.0.2.
[15] Trang Quynh Nguyen. mediationClarity: Estimation of Marginal Natural (in)direct Effects,
2022. R package version 1.0.
[16] Trang Quynh Nguyen, Elizabeth L. Ogburn, Ian Schmid, Elizabeth B. Sarker, Noah Greifer,
Ina M. Koning, and Elizabeth. A. Stuart. Causal mediation analysis: From simple to more
robust strategies for estimation of marginal natural (in)direct effects. arXiv:2102.06048, 2022.
[17] G. Hong, Jonah Deutsch, and Heather D. Hill. Ratio-of-mediator-probability weighting for
causal mediation analysis in the presence of treatment-by-mediator interaction. Journal of
Educational and Behavioral Statistics, 40(3):307–340, 2015.
[18] B. Shi, C. Choirat, B. A. Coull, T. J. VanderWeele, and L. Valeri. CMAverse: A suite of functions
for reproducible causal mediation analyses. Epidemiology, 32(5):e20–e22, 2021.
[19] E. J. Tchetgen Tchetgen. Inverse odds ratio-weighted estimation for causal mediation analysis.
Statistics in Medicine, 32(26):4567–80, 2013.
[20] T. J. VanderWeele and S. Vansteelandt. Mediation analysis with multiple mediators. Epidemio-
logic Methods, 2(1):95–115, 2014.
[21] R. M. Daniel, B. L. De Stavola, S. N. Cousens, and S. Vansteelandt. Causal mediation analysis
with multiple mediators. Biometrics, 71:1–15, 2015.
[22] T. Lange, M. Rasmussen, and L. C. Thygesen. Assessing natural direct and indirect effects
through multiple pathways. American Journal of Epidemiology, 179(4):513–8, 2014.
[23] J. M. Albert. Distribution-free mediation analysis for nonlinear models with confounding.
Epidemiology, 23(6):879–88, 2012.
[24] Q. C. Nguyen, T. L. Osypuk, N. M. Schmidt, M. M. Glymour, and E. J. Tchetgen Tchetgen.
Practical guidance for conducting mediation analysis with multiple mediators using inverse
odds ratio weighting. American Journal of Epidemiology, 181(5):349–56, 2015.
[29] T. Lange, S. Vansteelandt, and M. Bekaert. A simple unified approach for estimating natural
direct and indirect effects. American Journal of Epidemiology, 176(3):190–195, 2012.
[30] Matthew Cefalu, Greg Ridgeway, Dan McCaffrey, Andrew Morral, Beth Ann Griffin, and Lane
Burgette. twang: Toolkit for Weighting and Analysis of Nonequivalent Groups, 2021. R package
version 2.3.
[31] B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics,
7(1):1–26, 1979.
[32] Li Xu, Chris Gotwalt, Yili Hong, Caleb B. King, and William Q. Meeker. Applications of the
fractional-random-weight bootstrap. The American Statistician, 74(4):345–358, 2020.
[33] K. Imai, L. Keele, and D. Tingley. A general approach to causal mediation analysis. Psycholog-
ical Methods, 15(4):309–334, 2010.
Part IV
20 Machine Learning for Causal Inference
CONTENTS
20.1 Introduction 416
20.2 Causal Foundations 416
20.2.1 Fair comparisons 416
20.2.2 Potential outcomes and causal quantities 416
20.2.3 Assumptions 417
20.2.3.1 All confounders measured 417
20.2.3.2 Overlap 418
20.2.3.3 SUTVA 419
20.3 Regression for Causal Inference 420
20.3.1 Regression trees vs. linear regression 421
20.3.2 Boosted regression trees 423
20.4 Bayesian Additive Regression Trees 424
20.4.1 BART prior 425
20.4.1.1 Prior on the trees 425
20.4.1.2 Prior on the means 425
20.4.1.3 Prior on the error term 426
20.4.2 Gibbs sampler for BART 426
20.5 BART for Causal Inference 426
20.5.1 Basic implementation 427
20.5.2 Software: bartCause 428
20.6 BART Extensions and Other Considerations for Causal Inference 430
20.6.1 Overlap, revisited 430
20.6.2 Treatment effect heterogeneity 432
20.6.3 Treatment effect moderation 432
20.6.4 Generalizability 433
20.6.5 Grouped data structures 434
20.6.6 Sensitivity to unmeasured confounding 435
20.7 Evidence of Performance 437
20.8 Strengths and Limitations 437
20.8.1 Strengths 438
20.8.2 Limitations and potential future directions 438
20.9 Conclusion 439
References 439
20.1 Introduction
Estimation of causal effects requires making comparisons across groups of observations exposed
and not exposed to a treatment or cause (intervention, program, drug, etc.). To interpret differences
between groups causally, we need to ensure that the groups have been constructed in such a way that the
comparisons are “fair.” This can be accomplished through design, for instance, by allocating treatments
to individuals randomly. However, more often researchers have access to observational data and
are thus in the position of trying to create fair comparisons through post-hoc data restructuring or
modeling. Many chapters in this book focus on the former approach (data restructuring). This chapter
will focus on the latter (modeling) to illuminate what can be gained from such an approach. We
illustrate the case for modeling the relationship between outcomes, covariates, and a treatment to
estimate causal effects using a Bayesian machine learning algorithm known as Bayesian Additive
Regression Trees (BART) [1–3].
In practice, however, we can never measure both potential outcomes at the same point in time
for the same person because we cannot observe that person both in the world where they received a
treatment and the world where they did not. Therefore, if we denote the binary treatment received by
individual i as Zi , we can express the observed outcome, Yi , as a function of the potential outcomes,
Yi = Zi Y1i + (1 − Zi )Y0i , such that Y1i is revealed for those who receive the treatment and Y0i is
revealed for those who do not. We can perhaps imagine a situation where we could clone individual
i to create individual j just at the moment of exposure to the treatment and have version i take
the treatment and version j refrain. This would create a fair comparison when we compared their
outcomes down the line because we would be assuming that both the individual and their clone would
have had the same Y0 and Y1 . Specifically, we could assume Y0i = Y0j and Y1i = Y1j . Therefore,
to estimate the causal effect for individual i, τi = Y1i − Y0i , even though Y0i would be missing,
we could substitute Y0j , which is observed (since individual j did not receive the treatment).
Most causal inference procedures try to mimic this situation but generally aim to estimate average
treatment effects such as the mean of individual causal effects over a sample or population. We
can express such an average effect generically as E[Y1i − Y0i ]. Some researchers may be interested
instead in the average treatment effect for the type of individual who we observed to receive the
treatment (or participate in the program). This quantity is referred to as the average effect of the
treatment on the treated (ATT) and is formalized as E[Y1i − Y0i | Zi = 1]. This quantity may be of
particular interest if we don’t expect that our full control group would be eligible for or interested in
the treatment. A reciprocal version of the ATT is the effect of the treatment on the controls, ATC,
E[Y1i − Y0i | Zi = 0], which captures the effect of the treatment for those we don’t observe to take
the treatment (or participate in the program). This may be of particular interest in situations where
policy makers or practitioners would like to expand eligibility or incentivize different types of people
to participate in a program or receive a treatment.
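These estimands are easy to distinguish in a small simulation in which both potential outcomes are
known, something that never happens with real data (purely illustrative code):
### Simulated example with heterogeneous effects: ATE, ATT, and ATC differ
set.seed(1)
n <- 1000
x <- rnorm(n)
z <- rbinom(n, 1, plogis(x)) ### treatment probability increases with x
y0 <- x + rnorm(n)           ### potential outcome under control
y1 <- y0 - 10 - 2 * x        ### effect is -10 - 2x, so it varies with x
c(ATE = mean(y1 - y0),
  ATT = mean((y1 - y0)[z == 1]), ### more negative: treated units have larger x
  ATC = mean((y1 - y0)[z == 0]))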
20.2.3 Assumptions
If our goal is to estimate the average causal effect for a group of individuals, we need to create fair
comparisons for the group. What would be the most pristine way to accomplish this? Suspending
disbelief for a moment, let’s suppose we could clone everyone who was willing to participate in
the study. Each of the original participants could be exposed to the treatment but none of the clones
would be exposed. What are the implications for the assumptions needed for causal inference? In
this design, treatment assignment would be independent of the potential outcomes:
Y0 , Y1 ⊥ Z.
While it is not possible to actually implement our hypothetical clone design, a completely randomized
experiment does a good job of at least replicating this independence property, because the randomiza-
tion eliminates any systematic differences between the treatment and control groups. Unlike the clone
design, however, the randomized experiment is most useful for estimating average treatment effects;
it does not guarantee accurate individual-level treatment effect estimates.
Of course, with small sample sizes a randomized experiment may still result in groups that
differ simply by chance. While results will still be technically unbiased, any given treatment effect
estimate may still be far from the truth.1 Randomized block experiments, in which the randomization
occurs within strata or blocks defined by covariates, can help to address this by ensuring perfect
balance with respect to the blocking variables. Generally speaking, in randomized block experiments,
independence is achieved only within blocks; therefore, these experiments satisfy a more specific
assumption,
Y0 , Y1 ⊥ Z | W,
where W denotes the blocks. For example, in a diet and exercise intervention we might seek to
randomize individuals after grouping them based on their starting cholesterol levels.
Of course many questions can't be addressed by randomized experiments due to any combi-
nation of logistical, financial, and ethical reasons. In that case researchers hope that the covariates
they’ve measured essentially act like blocks in a randomized block experiment. That is, they hope
that observations that have similar values on all their covariates have potential outcomes that are
similar as well, regardless of their treatment assignment.
This assumption can be expressed formally as
Y0 , Y1 ⊥ Z | X,
where X now denotes pre-treatment covariates. The intuition here is that for two groups (treatment
and control) with the same values on all the pre-treatment variables, X, we assume they have, in
effect, been randomized to the groups. This is basically the same assumption as is invoked by the
randomized block experiment. The difference is that in a randomized block experiment the W are
known and the assumption should hold in a pristine implementation of the design. In an observational
study, on the other hand, researchers must make a leap of faith that their pre-treatment covariates, X,
are sufficient to achieve this conditional independence. If X represents all confounders (informally,
variables that predict both treatment and outcome), this assumption should be satisfied.
This assumption is referred to by many different names depending on the discipline and subfield.
These include “ignorability,” “selection on observables,” “all confounders measured,” “exchangeabil-
ity,” the “conditional independence assumption,” and “no hidden bias” [6–8, 10, 23]. In this chapter
we will refer to this as the “all-confounders-measured” assumption.
Due to the critical role of this assumption in estimating unbiased treatment effects, the first step
in many causal inference approaches is to try to ensure that it is satisfied by including all potential
confounders in X.2 The second step, which is the focus of many of the chapters in this book, is to figure
out how to condition on these variables without making excessive additional assumptions. We’ll
illustrate some of the complications involved in this step in the next section with a hypothetical
example, but first we discuss a related assumption.
20.2.3.2 Overlap
Another property of our hypothetical example with clones is that for each individual in our dataset
there would be someone else with the exact same values of all pretreatment covariates (including
confounders) but who received a different treatment. We might relax this to say that this neighbor
1 For more discussion see [4].
2 The idea of including all possible confounders is in tension with the desire to avoid “overfitting”, discussed in Sec-
tion 20.3.1: it cannot be fully known whether a variable is truly a confounder or a randomly correlated covariate, so that
including additional but unrelated pre-treatment covariates may reduce generalizability. Another concern in including many
potential confounders in a model is “bias amplification,” which can occur when some true confounders are missing [10–12]
and covariates are included that are strongly predictive of the treatment but not the outcome. However, this may be rare in
real-world data, such that it is generally preferable to condition on most pre-treatment covariates [13]. Our advice would be to
always include every pre-treatment covariate that is believed to be predictive of the outcome, particularly if it is also related to
the treatment. If overfitting remains a concern, use regularized/Bayesian models or perform a variable selection step.
would have to be in a sufficiently close neighborhood of the covariate space. Thus, by design, each
individual would have an “empirical counterfactual” in the dataset. A more general formalization of
this property is that, for every X and for z ∈ {0, 1}, 0 < Pr(Z = z | X) < 1. Conceptually we can
think of this expression as requiring that in every neighborhood of the covariate (X) space spanned
by our sample there has to be a positive (non-zero) probability of having both treated and control
units.
In theory a completely randomized experiment should create overlap by design, since the multi-
variate distributions of all pre-treatment variables (measured or not) should be the same across groups.
Of course, in practice, if one were checking, it would be difficult to achieve perfect overlap even for
a moderate number of variables simply due to sparsity [14]. However it turns out the overlap assump-
tion is technically stronger than what we need to perform inference. A more precise requirement is
that we have overlap with respect to our true confounders, X C , as in 0 < Pr(Z = z | X C ) < 1.
We’ll return to this idea later in the chapter because BART affords some advantages with regard to
this goal [15]. Since a completely randomized experiment has no confounders the requirement is
satisfied trivially. A randomized block experiment would need to satisfy the overlap assumption with
respect to its blocks.3 Again, this should be satisfied by design, since randomized block experiments
require a non-zero probability of assignment to each treatment condition in each block.
In an observational study, however, there is nothing to guarantee that overlap exists across
treatment groups with respect to confounders. There may be certain types of people who will never
participate in a program or will always be exposed to a treatment. If the overlap assumption does
not hold, there may be some observations in our data set for which we simply don't have enough
information about their counterfactual state to try to make inferences about them.
To push this to the extreme imagine what would happen in a study if all confounders are measured
but overlap is violated. For example, suppose that we have a study where individuals are assigned to
receive a treatment based on their age. Specifically, suppose that all individuals over age 50 receive
the treatment and all individuals 50 and younger do not. Further suppose that within each group,
treatment assignment has no impact on the potential outcomes; if the age restriction on treatment did
not exist, the experimental design would be valid. Thus, even though all confounders are measured
there is no overlap in the age distribution across treatment and control groups. In this situation
arguably none of the observations would have empirical counterfactuals. If you wanted to understand
the effect of the treatment for the individuals less than 50 you would be hampered by the absence
of treated units in this age range to provide data on the missing y1 ’s. If you wanted to understand
the effect of the treatment for individuals older than 50 you would be hampered by the absence of
control units in this age range to provide data on the missing y0 ’s. Without further assumptions your
only hope would be to focus on those observations closest to the threshold.4
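A simple empirical check for this problem is to compare estimated propensity scores across groups;
in the age-based example just described, the scores would pile up near 0 and 1 with no overlap. A
generic sketch (assuming a hypothetical data frame d with binary treatment z and covariate age):
### Crude overlap diagnostic via estimated propensity scores
ps.fit <- glm(z ~ age, family = binomial, data = d)
d$ps <- fitted(ps.fit)
tapply(d$ps, d$z, range) ### the two ranges should overlap substantially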
20.2.3.3 SUTVA
While we won’t devote much time to it in this chapter, one of the most important assumptions in
causal inference is the Stable Unit Treatment Value Assumption (SUTVA) [18]. The basic idea
is that we need to assume that each person’s potential outcome is a function solely of their own
treatment assignment, not the treatment assignment of anyone else. This portion of SUTVA is
sometimes referred to as the “no-interference” assumption. This assumption also encapsulates the
idea of “consistency.” This can be formalized as Yaj = Yj when Aj = a [19], reflecting the idea
that the observed value of Y for an individual j who received treatment a equals what it would
have been had the treatment been “set to” a. In other words, we assume that the manner in
which treatment was set to a is irrelevant. This is sometimes referred to as the “no-multiple-versions-
of-treatment” assumption [20].
3 Not all blocks act as confounders, however, so this is actually also too strong a statement.
4 For clever ideas of how you can estimate a causal effect in this situation, you can read about regression discontinuity
designs [4, 16, 17].
Suppose we fit the linear regression E[Y | X, Z] = β0 + β1 X + τ Z, where Y denotes running
time, Z denotes use of the hyperShoe, and X denotes age. But is the coefficient on Z an estimate
of the causal effect of the hyperShoe on running time?
Recall that in this example we are assuming that age is our only confounder. Thus the assumption
of “all confounders measured” is satisfied. If, additionally, our parametric model is correct, we can
say that
E[Y0 | X] = E[Y0 | X, Z = 0] = E[Y | X, Z = 0] = β0 + β1 X
and
E[Y1 | X] = E[Y1 | X, Z = 1] = E[Y | X, Z = 1] = β0 + β1 X + τ.
FIGURE 20.1
Hypothetical data on age and running times for a marathon. The solid line displays the true rela-
tionship between age and running time which, implausibly, is linear in this example. Lighter dots
represent “treated” units – those with the hyperShoe – and darker squares represent control observa-
tions. The lighter line represents the relationship between Y1 and age. The darker line represents the
relationship between Y0 and age.
That is, in this situation our regression model is also a causal model. Thus τ̂ = −10.21 should be an
unbiased estimate of the true treatment effect. In this case since we simulated the data we know that
the true treatment effect for this example is −10. Interpreted causally, the hyperShoe decreases race
time by 10 minutes.
In sum, if age is the only confounder and if the linear model is appropriate, then a linear regression
model can recover the causal effect of the hyperShoe on running times. Unfortunately, neither of
these assumptions is generally appropriate. We focus for the time being on the modeling component,
and return to the foundational assumptions for causal inference in a later section.
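For completeness, the fit just described amounts to a single regression call (a sketch, assuming the
simulated data frame dat with variables running_time, hyperShoe, and age used later in the chapter):
### Linear regression; the coefficient on the treatment is read causally
### under the two assumptions above
lm.fit <- lm(running_time ~ age + hyperShoe, data = dat)
coef(lm.fit)["hyperShoe"] ### approximately -10.21 in this example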
FIGURE 20.2
Hypothetical data on age and marathon running times in minutes. The solid curves (lighter for Y1
and darker for Y0 ) display the true relationship between the confounder and the potential outcomes.
The dashed lines (with the same color mapping) represent a linear regression fit. The long-dash
lines (with the same color mapping) represent the fit from a regression tree. Lighter dots are treated
observations. Darker squares are control observations.
Note the regions of the covariate space where counterfactual units are rare. While the treatment
group is concentrated in the age range from 18 to 50, the control group more closely spans the full
age range, from about 20 to 55. Thus there are more empirical counterfactuals for the treatment
group than vice versa.
Is there a simple way of providing a better fit to this response surface? Recall that regression is just
a way to summarize information about how average outcomes of the response variable vary across
subgroups defined by the covariates in our sample. In our current example we want to understand
how marathon running times vary with subgroups defined by age. Linear regression does this in
a way that places strong constraints on how these means are related to each other. Can we fit this
relationship using a regression model that makes fewer assumptions?
A regression tree fit to these data would deconstruct the problem a bit differently than linear
regression. Regression trees form subgroups within the dataset such that the within-subgroup variance
in the outcome variable is minimized (see chapter 9 of The Elements of Statistical
Learning (ESL) by Hastie, Tibshirani, and Friedman (2009) [23]). The first step is to split the dataset
into two subgroups. In a regression tree fit to our example data, the first split divides those individuals
who are younger than 46 from those individuals older than 46. If we allow for further splits, the
tree will continue to subdivide individuals by age and by who wore and did not wear the hyperShoe.
This splitting process is repeated until a stopping condition is met, in our case until eight subgroups,
or terminal nodes, are found. The tree with this stopping rule is displayed in the right panel of
Figure 20.3 which shows the decision rules for each split and the mean in each terminal node.
For instance, the rightmost terminal node shows a mean of 195, which is the average outcome for
individuals in our sample who did not receive the treatment (wear the shoe) and who are age 51 or
above. The fit from this regression tree is summarized by the mean outcomes for the terminal node
FIGURE 20.3
The left side of this figure displays the regression tree fit to the response surface, which is represented
by eight subgroup means. These means are displayed as horizontal line segments spanning the ages
of the individuals in their subgroups, with darker colors for control observations and lighter for
treated observations. The right side displays the branching and terminal nodes from the corresponding
regression tree. In this tree, “z” represents the binary treatment variable.
subgroups. These means are visualized by the horizontal lines (step function) in Figure 20.3. Notice
how this model is better able to follow the curve of the response surface displayed on the left.
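A fit of this kind can be sketched with the rpart package (the stopping parameters below are
illustrative, not the settings used in the chapter):
### Small regression tree; terminal-node means form the step function
library(rpart)
tree.fit <- rpart(running_time ~ age + hyperShoe, data = dat,
                  control = rpart.control(maxdepth = 3, minbucket = 10))
dat$tree.pred <- predict(tree.fit, newdata = dat)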
The flexibility of the regression tree fit is appealing. What are the downsides? We didn’t mention
above specifically how we decided to stop “growing the tree” (i.e. creating more subgroups). This is
a tricky decision. If we stop too early we risk creating too crude a fit to the data. This is apparent in
the fit in Figure 20.3, where it is often further from the observed data than we would like. On the
other extreme we could allow the tree to grow so large that the fit yields a different “mean” for every
observation. This model fit would be terrific at predicting the outcome within the current sample but
is not likely to predict well at all in a new sample. This phenomenon is referred to as “overfitting.”
For a discussion of overfitting, see Chapter 7 of ESL.
In practice this tension is resolved by adjusting tuning parameters (also called hyperparameters in
some fields) for the algorithm that govern features such as the number of observations required for a
terminal node to be allowed to split, the minimum deviation within a terminal node to prevent further
split, and the maximum depth for a tree. Typically these parameters are chosen via cross-validation
(see chapters 7 and 9 of ESL). However, cross-validation can be used directly only to guard against
overfitting with respect to our observed data. It cannot be used to assess potential overfitting with
respect to unobserved counterfactuals.
There are several other downsides to regression trees. First, when we have multiple covariates,
they aren't able to capture additive effects effectively and tend to overemphasize multi-way
interactions. Second, they don't directly estimate our uncertainty about our predictions or fit. This
latter issue can be addressed using bootstrapping but that comes at a high computational cost. Finally,
they tend to have a high variance – slightly different datasets can yield dramatically different trees.
FIGURE 20.4
This figure displays the same response surface as in the previous figure (light for treated and Y(1)
and dark for controls and Y(0)) but now with a boosted regression tree fit displayed as dashed lines.
The idea behind boosted regression trees is very clever! Instead of representing the fit to the data
through a single regression tree, boosted regression trees represent the fit through the sum of fits
from multiple small regression trees. How does this work?
Consider our example above. A single tree fit would form subgroups based on treatment assign-
ment and age as displayed in Figure 20.2. The residuals from that fit represent the variation in the
outcome that has yet to be explained (which is a lot at this point!). What if we fit another tree to those
residuals to try to explain more? Now new residuals can be formed by subtracting the fit from the
second tree from the residuals from the previous step. This process can be repeated many times and
at each step more of the unexplained variation in the outcome can be explained. Using this strategy,
model complexity is controlled in several ways: only small trees, also known as “weak learners,”
are used at each step; tree predictions are combined via a weighted average or sum, reducing
variance and forming an “ensemble”; and the overall number of trees is limited, so that overfitting
can be avoided.
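A boosted fit of this kind can be sketched with the gbm package (tuning values are illustrative):
### Sum of 100 shallow trees, each fit to the residuals of the
### current ensemble
library(gbm)
boost.fit <- gbm(running_time ~ age + hyperShoe, data = dat,
                 distribution = "gaussian", n.trees = 100,
                 interaction.depth = 2, shrinkage = 0.1)
dat$boost.pred <- predict(boost.fit, newdata = dat, n.trees = 100)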
Figure 20.4 displays a fit from a boosted regression tree with 100 trees. There is marked improve-
ment in the fit to the response surface relative to the smaller tree in Figure 20.3. However, note that the tree
fit starts to depart from the true response surface when the treatment or control observations become
relatively scarce. Traditional, machine-learning-style fits typically provide no way of being alerted to
the increased uncertainty in this part of the covariate space.
While boosted regression trees were an important step forward as compared to standard regression
trees, they still require choice of tuning parameters (additionally now including the number of trees
included in the sum of trees) and fail to resolve the issues regarding uncertainty quantification.
The mean structure of BART is the same as that of the boosted regression tree. However, this structure
is then embedded in a likelihood framework, so that the fit of the model is considered to be a random
variable and hence has a distribution. Given the importance of the treatment in this context we will
write the BART for causal inference model explicitly incorporating the treatment variable in the
notation. In addition, an error term is added to reflect deviations between the fit from the model and
our observed values of the outcome (which reflects our uncertainty). We can formalize the model as
Yi = f (xi , zi ) + εi ,  εi ∼iid N (0, σ 2 ),  (20.1)
f (x, z) = g(x, z, T1 , M1 ) + g(x, z, T2 , M2 ) + · · · + g(x, z, Tm , Mm ). (20.2)
In this equation each of the g functions represents the fit from an individual tree, Th represents
the structure of the hth tree, and the corresponding Mh = (µh1 , µh2 , . . . , µhbh )′ represents the set of
subgroup means corresponding to the bh terminal nodes of tree h.
FIGURE 20.5
This figure displays how BART is used for causal inference. BART is fit to the original dataset (on
the left). Predictions are made for two altered versions of that dataset (in the middle): one where all
observations are assigned to the treatment (top) and one where all observations are assigned to the
control (bottom). These represent predictions of Y1 and Y0 for each person (last column of the middle
datasets). Posterior predictive intervals for each potential outcome for the 2nd individual in the dataset
are displayed on the far right. The top plot is a histogram showing the empirical posterior distribution
of Y (1) for that individual. The darker histogram below it shows the empirical posterior distribution
of Y (0) for that individual. The difference between these distributions is the posterior distribution of
the treatment effect for the 2nd individual (bottom-most histogram). These individual-level treatment
effect distributions can be combined to estimate any of a variety of average treatment effects.
make predictions for that dataset and for a version where the treatment assignment for all observations is flipped.
7 Since these distributions condition on the covariates in our analysis sample, the average causal effect technically represents
a conditional average causal effect rather than the sample average causal effect. Sample average effects can be obtained by
using the observed factual outcome, and draws from the posterior predictive counterfactual distribution [80]. The uncertainty
of population average effects can be obtained by using the posterior predictive distributions for both potential outcomes.
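The procedure depicted in Figure 20.5 can be sketched directly with the dbarts package (a minimal
illustration on the hypothetical running data; the bartCause package described below wraps these steps):
### Fit BART, then predict with everyone treated and everyone control
library(dbarts)
xmat <- data.frame(z = dat$hyperShoe, age = dat$age)
bart.fit <- bart(x.train = xmat, y.train = dat$running_time,
                 x.test = rbind(transform(xmat, z = 1),   ### all treated
                                transform(xmat, z = 0)),  ### all control
                 verbose = FALSE)
n <- nrow(xmat)
y1.draws <- bart.fit$yhat.test[, 1:n]     ### posterior draws of E[Y1 | x]
y0.draws <- bart.fit$yhat.test[, n + 1:n] ### posterior draws of E[Y0 | x]
tau.draws <- y1.draws - y0.draws          ### individual-effect draws
ate.draws <- rowMeans(tau.draws)          ### posterior of the average effect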
FIGURE 20.6
This figure displays the same response surface as in the previous figure (darker curve for treated and
lighter curve for control) but now with a BART fit overlaid as dashed lines. 95% posterior uncertainty
intervals (vertical lines) for all individuals (again with lighter lines for the treated and darker lines for
the controls) are calculated by using normal approximations to their empirical marginal posteriors.
We present results of the BART fit to the data in our hypothetical example in Figure 20.6. This plot
reproduces the true response surface and observations from Figure 20.4 above but instead overlays a
BART fit. There are several ways to display this fit, but we choose here to display two uncertainty
intervals for each observation i in the plot: (a) a 95% uncertainty interval for E[Y0i | Xi ] and (b) a
95% uncertainty interval for E[Y1i | Xi ].
Additionally, bartCause accepts the standard model-fitting argument of a data object
in which it will look to resolve symbols; this can be a data frame or a list with named elements. In
other words, it takes inputs similar to those of the lm() and glm() functions in R. Since BART
will look for a flexible, non-parametric relationship between the confounders and the treatment or
response, the confounders themselves need merely be named. Following R model-fitting
syntax, they can be separated with a “+” sign; however, this is interpreted figuratively as “include
this variable” and does not indicate a linear relationship. The estimand argument is also important
to specify up front, because the methodology differs for some approaches when targeting the "ate",
"att", or "atc".
In the context of our illustrative example we could use the following command:
bartc_fit <- bartc(running_time, hyperShoe, age,
data = dat, estimand = "att",
seed = 0)
Once a model has been fit, calling summary on it yields inferential results, including by default
an estimate of the relevant population average treatment effect (in this case the population average
effect of the treatment on the treated, PATT):
> summary(bartc_fit)
Call: bartCause::bartc(response = running_time, treatment = hyperShoe,
confounders = age, data = dat, estimand = "att",
seed = 0)
In addition, bartCause includes a number of convenience plotting functions, all of which take a
fitted model as their first argument.
• plot_sigma: for continuous responses only, produces by-chain trace plots of the residual
standard deviation, where the x axis is the sample number and the y axis is the value; for use in
assessing model convergence
• plot_est: by-chain trace plots of the estimand
• plot_indiv: histograms of individual-level quantities, including treatment effects and poste-
rior means
• plot_support: scatter plots of individual-level quantities, with observations highlighted by
the evidence of their common support, as discussed below
Finally, advanced users can access the posterior samples directly, using the extract, fitted,
and predict generic functions. These can be useful for obtaining subgroup estimates or weighted
averages, or for conducting additional diagnostics. bartCause can now also be accessed in a more
user-friendly software package, thinkCausal, which additionally incorporates educational components
to help the user understand the foundational concepts involved (https://apsta.shinyapps.
io/thinkCausal/).
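For instance, posterior draws of the estimand can be pulled and summarized directly (a sketch; the
default behavior of extract assumed here should be checked against the package documentation):
### Posterior draws of the (P)ATT from the fitted object
att.draws <- extract(bartc_fit)
mean(att.draws)
quantile(att.draws, c(0.025, 0.975))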
FIGURE 20.7
The left panel displays the average treatment effect as it varies with age as a solid black curve. The
vertical lines represent 95% uncertainty intervals for the individual level treatment effect for each
individual in the sample. The lighter lines correspond to treatment observations. The darker lines
correspond to control observations. The right panel displays the response surface and observations
again but circles the observations that BART would flag as being at high risk of lacking common
causal support. These assessments are related to the level of uncertainty in treatment effect estimation
displayed in the left panel.
The left-hand panel of Figure 20.7 displays a curve demonstrating how the true treatment effect
varies with age. It also displays 95% uncertainty intervals for the individual-level treatment effects
for each of the observations in the sample.8 The right-hand panel displays the true response surface
and observations.
A helpful feature of the BART predictions is that the uncertainty around them expands precip-
itously once we leave the range of common support. Standard BART measures of overlap [15],
implemented in bartCause via the commonSup.rule and commonSup.cut arguments, would
suggest that the circled units in the right-hand plot are far enough outside the region of adequate
overlap that we would not trust inferences about them. Therefore, we would discard those observa-
tions. The average treatment effect for the remaining observations is −3.92 and the BART estimate
for those observations is −3.78.
The plot on the left also reminds us to be cautious in interpreting individual level treatment
effects. While all of the intervals for the observations that would remain in our analysis still cover the
true treatment effect (and some of the intervals for the discarded units cover as well!), some of them
only just barely cover. Why is that? Because it’s an exceptionally difficult inferential task to estimate
a treatment effect specific to one covariate value, even when overlap exists in that neighborhood
of the covariate space. When we estimate average effects we get to capitalize on the natural bias
cancellation that occurs when adding up a bunch of slightly imprecise estimates, which leads to more
accuracy for the average effects.
On the other hand this type of individual level prediction is much harder to do with most
matching and weighting strategies that don’t get to borrow strength across observations in the way
8 Technically these are intervals of individual-level conditional average treatment effects. For instance, the interval displayed
for a person who was 38 years old is just the interval for the average effect for anyone aged 38 in the sample. Formally, we
can express each treatment effect as E[Y1i − Y0i | Xi ], as distinct from, for example, Y1i − E[Y0i | Xi ] (for a treatment
observation) or E[Y1i | Xi ] − Y0i (for a control observation).
that regression modeling approaches can. Even if we don’t want to trust any given individual level
prediction or interval we can still capitalize on the fact that they are estimated reasonably well to
facilitate better understanding of trends in treatment effect heterogeneity, as is discussed in the next
sections.
FIGURE 20.8
The left panel shows posterior distributions of the ATEs specific to “mileage” groups. While there
is separation in the means of these distributions there is still a fair amount of overlap across these
distributions. The right panel shows individual-level treatment effects as they vary with respect to
levels of both mileage and age. If we hold age constant, we see more distinct separation in the
treatment effect posterior distributions across mileage groups.
with the covariates as predictors. This fit revealed that the modifier creating this clustering was a
measure of “urbanicity” of the schools. This is a simple and effective technique for discovering
moderators and understanding more about treatment effect heterogeneity. The thinkCausal software
mentioned previously automates implementation of this strategy (https://apsta.shinyapps.
io/thinkCausal/).
Causal BART models also yield full posterior distributions over the outcomes under treatment
and control. This can facilitate a formal decision theoretic treatment of optimal treatment selection
for individuals by selecting the treatment rule that maximizes individual posterior expected utility.
Logan et al. [41] use BART in this framework to estimate individualized treatment rules.
We note one caveat about these analyses. Exploration of treatment effect moderation and
estimation of individual-level treatment effects are most plausible in the setting of randomized
experiments. All-confounders-measured is a strong assumption in observational settings. Moreover,
the overlap assumption required is much stronger in observational settings, even if all-confounders-
measured is satisfied. Satisfying these assumptions at an individual level is more difficult than at a
group level. These complications compound in an observational setting, whereas in a randomized
experiment we mostly need to worry about only one of these issues.
20.6.4 Generalizability
When treatment effects vary across observations (e.g., students or schools) it is likely that the average
treatment effect for a given sample will not be the same as that for another sample. How can we
generalize the results from our original sample to a new sample or population that may represent
a different distribution of treatment effects? For instance, suppose we find through a randomized
experiment conducted in a dozen schools that an intervention was effective for lower performing
students but had no impact on other students. If we want to have a sense of how that same intervention
might impact a school with a different composition of lower and higher performing students, we
would need to reweight (explicitly or implicitly) our estimate to mimic the population of interest.
Of course in more real-world applications treatment effects may vary based on a much larger
number of individual-level (e.g. student) and group-level (e.g. school) characteristics. Thus simple
reweighting strategies would be more complicated to implement, particularly if the treatment effect
modifiers are not known at the outset. A BART approach capitalizes on its ability to estimate
individual-level effects. If the target population is simply made up of a different composition of the
same types of observations that the algorithm was fit to originally,9 then this task reduces to a
prediction problem for a new sample of individuals (or groups) defined by these covariate values.
If covariates are available for this new population then we can use BART to generate posterior
distributions for both potential outcomes, which in turn can be used to generate posterior distributions
for the treatment effect for each person in that new sample and any average effects based on groupings
of these people.
If outcomes are additionally available (for instance, we know test scores in the absence of an
intervention but want to predict test scores, and thus effects, that would occur given exposure to
the intervention), then the prediction problem is less difficult because only half of the information is
missing (the counterfactual outcome). BART-based strategies for generalizing treatment effects were
discussed in [43], where BART was demonstrated to have superior performance over propensity score
based approaches to generalization in the setting where all confounders are measured and we observe
the covariates that modify the treatment effect. Generalization of average treatment effects from one
sample or population to another requires an accurate portrayal of the way that treatment effects vary
across subgroups. Thus it is no surprise that BART also demonstrates superior performance in this
task.
to the potential outcomes or the difference between them, and (2) that overlap exists between these groups. For more details
see Stuart et al. [42].
10 It should be noted that unless the foundational assumptions of Section 20.2.3 apply, the term “effect” is a misleading
overstatement.
which determines whether subgroup average estimates should be calculated as the result. Within the context
of our hyperShoe example, suppose that observations were grouped by country, which might serve as
a confounder through level of interest in running or funding available through a sports program. To
fit a varying intercept model in bartCause and report subgroup average treatment effects:
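mlm_fit <- bartc(running_time, hyperShoe, age,
                 data = dat, estimand = "att",
                 group.by = country, group.effects = TRUE,
                 seed = 0)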
The result:
> summary(mlm_fit)
fitting treatment model via method 'bart'
fitting response model via method 'bart'
Call: bartCause::bartc(response = running_time, treatment = hyperShoe,
confounders = age, data = dat, estimand = "att",
group.by = country, group.effects = TRUE,
seed = 0)
More complicated grouping structures can be fit using the parametric argument of bartc.
This argument accepts a full parametric equation that is added to the treatment and response models.
Multilevel models, including nested or crossed effects and varying slopes, are defined as in the lme4
package [45], using a vertical-bar notation ((var | group)). For example, a varying intercept
and slope model for our hyperShoe example that allowed a different coefficient for age by country
would use the parametric argument of (1 + age | country). More recently, extensions
of the BART algorithm that accommodate parametric and multilevel structure have been developed
in a package called stan4bart, available in R [27, 54].
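For instance, the varying intercept and slope model just described might be called as follows (a
sketch: the chapter names the parametric argument and its value, but the rest of the call is an
assumption patterned on the example above):
### Hedged sketch: varying intercept and slope for age by country
mlm_fit2 <- bartc(running_time, hyperShoe, age,
                  data = dat, estimand = "att",
                  parametric = (1 + age | country),
                  seed = 0)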
FIGURE 20.9
Contour lines corresponding to estimates of the treatment effect corrected for bias indicated by
confounding levels at that location on the plot. The left plot displays an example that is not particularly
sensitive to unobserved confounding. The right plot corresponds to an example that is more sensitive
to unobserved confounding.
One approach to this is to evaluate the sensitivity of a study to potential unmeasured confounding
across a variety of assumptions about the potential strength of that confounding. This allows the
researcher to understand what level of confounding would be needed to substantively change the
estimate of the treatment effect. For instance, what would it take to change the sign of this estimate
or drive it to zero?
There is much work in this area, but we will briefly focus on work by Hill and co-authors [46],
extending earlier work by Carnegie, Harada, and Hill [47], that allows BART to be incorporated
into an existing sensitivity analysis framework. The original approach imposed a strict parametric
model with two parameters to capture the role of an unobserved confounder; the extension
relaxes the assumptions by allowing BART to model the response surface. This results in an easily
interpretable framework for testing the potential impact of an unmeasured confounder that also
limits the number of modeling assumptions. The performance of this approach was evaluated in a
large-scale simulation setting and its usefulness was also demonstrated with high blood pressure data
taken from the Third National Health and Nutrition Examination Survey [46].
To illustrate this approach we extended our original example to create two additional scenarios,
both of which include an unobserved confounder. We conceive of this variable as an indicator of
whether each runner had sponsorship for the race. Sponsorship is positively associated with the
probability of having the hyperShoe and negatively associated with running time (that is, if you
have sponsorship you are likely to have faster running times on average). In the first scenario the
confounding was relatively weak and in the second it was relatively strong. Figure 20.9 displays
results.
First consider the plot on the left. Each coordinate in the plot region represents a combination
of sensitivity parameters. These parameters reflect the strength of association between our binary
unobserved confounder, U (sponsorship), and the treatment (x-axis) as well as the strength of the
unobserved confounder and the outcome (y-axis). In particular we can think of the parameter on
the y axis as the difference in means in the outcome between groups with U = 0 and U = 1
after (non-parametrically) adjusting for the other covariates available; this parameter has been
standardized to be represented in standard deviation units (with respect to the outcome variable).
The parameter on the x-axis represents the coefficient on U in a probit regression model of Z on U
and the other covariates. Each contour line in the plot reflects the set of such points (combinations
of sensitivity parameters) that would result in a particular (standardized) estimate of the treatment
effect, the magnitude of which is displayed on the line. The lighter dashed contour corresponds to a
treatment effect estimate of 0 and the darker long-dashed contours show when the estimate would
lose statistical significance. Finally, the plus sign and diamond represent the actual coefficients on
the other two covariates in the model for these two equations (the triangle presents the estimate with
a reversed sign so it is in a quadrant of the space represented by the plot). These help us to provide
some context for the size of the coefficients and what might be considered a large magnitude for each
of the sensitivity parameters.
Therefore, the left-hand plot, corresponding to the situation where the confounding is weak,
accurately reflects that situation. It tells the researcher that a missing confounder would have to have
a strong, negative relationship with the treatment and an exceptionally strong relationship with the
outcome to drive the treatment effect estimate to zero; the strengths of those relationships
would far exceed those of the observed confounders. On the other hand, the plot on the right displays
a situation where the confounding is much stronger. In this case the results are much more sensitive
to the unobserved confounder. The treatment effect estimates could be driven to zero with much
more moderate associations between U and the treatment and outcome.
20.8.1 Strengths
BART has three key features that contribute to its strengths: (1) flexible modeling strategy, (2) use
of the outcome variable, and (3) the Bayesian framework. We discuss the advantages of a BART
approach to causal inference by framing them in terms of their relationship to one or more of these
features.
One of the biggest strengths of BART is that it allows for robust estimation of a wide variety of
estimands, ranging from average treatment effects, to subgroup effects, to individual-level treatment
effects. This versatility results from the combination of the flexible sum-of-trees model and the Bayesian
framework, which allows us to produce a full posterior distribution for each combination of obser-
vation and potential outcome. The ability to produce more robust estimates of individual treatment
effects not only allows us to better understand treatment effect heterogeneity but also to explore what
covariates moderate treatment effects. The fact that the BART modeling approach incorporates the
outcome variable allows for more efficient estimation of these estimands as well.
Capitalizing on the information in the outcome variable also provides BART with a way of
implicitly identifying which covariates are true confounders, based on which are most strongly
predictive of the outcome. The algorithm then weights the contributions of those variables more
strongly. In conjunction with the Bayesian framework which provides a principled strategy for
estimating uncertainty, BART can identify observations without sufficient common causal support
more easily than many other approaches to causal inference [15].
Additional advantages of the Bayesian framework include the ability to expand the model fairly
easily to accommodate extensions. Current extensions include the stan4bart multilevel BART
model [27, 54] and the integration of the BART algorithm into the treatSens sensitivity analysis
package, both described above. These are but two of many potential opportunities.
A further strength relates to the high performance of the default settings of most BART implemen-
tations. This yields an automated approach which allows the researcher to more easily pre-specify
their model which, in turn, limits researcher degrees of freedom and makes the research more easily
reproducible. This feature of BART makes it compatible with other “honest” approaches to causal
inference because the researcher will not have the opportunity to adjust their model specification as a
response to initial looks at treatment effect estimates. While many propensity score approaches also
allow for this type of honesty because it is possible to choose a strategy without making use of the
outcome variable, these strategies often still allow for many researcher degrees of freedom as the
researcher searches for an optimal specification. The full modeling path can be difficult to reproduce
for similar reasons.
Hahn et al. [35] have pointed out the potential for the most basic implementation of BART for
causal inference to induce confounding through the regularization built into the prior. They have
created an extension called Bayesian causal forests that provides a promising way to address this
problem [35]. They also suggest that an approximate solution is to simply include a reasonable
estimate of the propensity score as a covariate in the BART model.
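A rough sketch of that suggestion, in which a hypothetical probit model supplies the propensity
estimate that is then appended to the confounders:

# Hypothetical propensity model; any reasonable estimator could be substituted
dat$ps_hat <- fitted(glm(hyperShoe ~ age, family = binomial(link = "probit"), data = dat))
# Include the estimated propensity score as an additional confounder in the BART fit
ps_fit <- bartCause::bartc(response = running_time, treatment = hyperShoe,
                           confounders = age + ps_hat, data = dat,
                           estimand = "att", seed = 0)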
Currently the standard BART approach is mostly useful for studies with a single binary treatment
that occurs at one point in time. Many extensions to more complicated settings should be relatively
straightforward but don't currently exist. Moreover, the primary implementations are in the R
programming language. However, stand-alone, user-friendly software now exists that provides access
to these methods without requiring the user to program in R (https://apsta.shinyapps.io/
thinkCausal/).
Finally, BART approaches to causal inference typically assume that all confounders have been
measured. While BART has been incorporated into existing sensitivity analysis approaches [46], this
doesn’t entirely remove that problem. Use of these approaches requires humility in understanding
what conclusions can reliably be drawn and transparency about the assumptions.
20.9 Conclusion
This chapter has introduced the basics of a causal inference approach that capitalizes on a Bayesian
machine-learning-based algorithm, BART. It explains when and how flexible regression-based
approaches to causal inference can be useful and has highlighted some potential advantages of these
approaches relative to approaches that focus on data restructuring such as matching and weighting.
To our knowledge, BART was the first machine-learning-based approach to causal inference
introduced (with scholarly talks starting in 2005 and journal publication in 2011 [28]). However,
since that time several other machine-learning based approaches to causal inference have also been
developed [30,55,56,58,59]. Many of these algorithms have similar desirable features. A distinguish-
ing characteristic of BART is that the flexible model for the response surface is embedded within a
Bayesian likelihood framework. This offers advantages with regard to uncertainty quantification, de-
tection of observations that lack common causal support, simultaneous identification of a wide variety
of causal estimands, and the ability for reasonably straightforward model extensions to accommodate
features such as grouped data structures, sensitivity analysis, and varying distributional assumptions.
Both the simple BART-causal implementation and many of the additional features described in this
chapter are available in the bartCause package in R and the standalone thinkCausal software
(https://apsta.shinyapps.io/thinkCausal/).
References
[1] Hugh Chipman, Edward George, and Robert McCulloch. Bayesian ensemble learning. In
B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing
Systems 19. MIT Press, Cambridge, MA, 2007.
[2] H. A. Chipman, E. I. George, and R. E. McCulloch. BART: Bayesian additive regression trees.
Annals of Applied Statistics, 4(1):266–298, 2010.
[3] Jennifer Hill, Antonio Linero, and Jared Murray. Bayesian additive regression trees: A review
and look forward. Annual Review of Statistics and Its Application, 7(1):251–278, 2020.
[4] Andrew Gelman, Jennifer Hill, and Aki Vehtari. Regression and Other Stories. Cambridge
University Press, New York, 2020.
[5] Donald B. Rubin. Bayesian inference for causal effects: The role of randomization. Annals of
Statistics, 6:34–58, 1978.
[6] B. S. Barnow, G. G. Cain, and A. S. Goldberger. Issues in the analysis of selectivity bias. In
E. Stromsdorfer and G. Farkas, editors, Evaluation Studies, volume 5, pages 42–59. Sage, San
Francisco, 1980.
[7] Sander Greenland and James M Robins. Identifiability, exchangeability, and epidemiological
confounding. International Journal of Epidemiology, 15(3):413–419, 1986.
[8] Michael Lechner. Identification and estimation of causal effects of multiple treatments under
the conditional independence assumption. In Michael Lechner and Friedhelm Pfeiffer, editors,
Econometric Evaluation of Labour Market Policies, volume 13 of ZEW Economic Studies,
pages 43–58. Physica-Verlag HD, 2001.
[9] Paul R. Rosenbaum. Observational Studies. Springer, New York, 2002.
[10] J. Pearl. On a class of bias-amplifying variables that endanger effect estimates. In Proceedings of
the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 425–432, Catalina
Island, CA, 2010. Accessed 02/02/2016.
[11] J. Middleton, M. Scott, R. Diakow, and J. Hill. Bias amplification and bias unmasking. Political
Analysis, 24:307–323, 2016.
[12] P. Steiner and Y. Kim. The mechanics of omitted variable bias: Bias amplification and cancella-
tion of offsetting biases. Journal of Causal Inference, 4, 2016.
[13] P. Ding and L. Miratrix. To adjust or not to adjust? sensitivity analysis of m-bias and butterfly-
bias. Journal of Causal Inference, 3:41–57, 2014.
[14] Alexander D’Amour, Peng Ding, Avi Feller, Lihua Lei, and Jasjeet Sekhon. Overlap in
observational studies with high-dimensional covariates. Journal of Econometrics, 221(2):644–
654, 2021.
[15] Jennifer Hill and Yu-Sung Su. Assessing lack of common support in causal inference using
Bayesian nonparametrics: Implications for evaluating the effect of breastfeeding on children’s
cognitive outcomes. Annals of Applied Statistics, 7:1386–1420, 2013.
[16] Guido W. Imbens and Thomas Lemieux. Regression discontinuity designs: A guide to practice.
Journal of Econometrics, 142(2):615–635, 2008. The regression discontinuity design: Theory
and applications.
[17] Sebastian Calonico, Matias D. Cattaneo, Max H. Farrell, and Rocío Titiunik. rdrobust: Software
for regression-discontinuity designs. The Stata Journal, 17(2):372–404, 2017.
[18] Guido Imbens and Donald Rubin. Causal Inference in Statistics, Social, and Biomedical
Sciences. Cambridge University Press, New York, 2015.
[19] James Robins. A new approach to causal inference in mortality studies with a sustained
exposure period—application to control of the healthy worker survivor effect. Mathematical
Modelling, 7(9):1393–1512, 1986.
[20] Luke Keele. The statistics of causal inference: A view from political methodology. Political
Analysis, 23:313–35, 2015.
[21] B. Lara, J. Salinero, and J. Del Coso. The relationship between age and running time in elite
marathoners is U-shaped. Age (Dordr), 36(2):1003–1008, 2014.
[22] Niklas Lehto. Effects of age on marathon finishing time among male amateur runners in
Stockholm Marathon 1979–2014. Journal of Sport and Health Science, 5(3):349–354, 2016.
[23] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2 edition, 2003.
[24] Luke Tierney. Markov Chains for Exploring Posterior Distributions. The Annals of Statistics,
22(4):1701 – 1728, 1994.
[25] George Casella and Edward I. George. Explaining the Gibbs sampler. The American Statistician,
46(3):167–174, 1992.
[26] Vincent Dorie. stan4bart: Bayesian Additive Regression Trees with Stan-Sampled Parametric
Extensions, 2021. R package version 0.0-1.
[27] Vincent Dorie. stan4bart: Bayesian Additive Regression Trees with Stan-Sampled Parametric
Extensions, 2021. R package version 0.0-1.
[28] Jennifer Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational
and Graphical Statistics, 20(1):217–240, 2011.
[29] Hugh Chipman and Robert McCulloch. BayesTree: Bayesian Additive Regression Trees, 2016.
R package version 0.3-1.4.
[30] Adam Kapelner and Justin Bleich. bartMachine: Machine learning with Bayesian additive
regression trees. Journal of Statistical Software, 70(4):1–40, 2016.
[31] Vincent Dorie, Hugh Chipman, and Robert McCulloch. dbarts: Discrete Bayesian Additive
Regression Trees Sampler, 2014. R package version 0.8-5.
[32] Bereket Kindo. mpbart: Multinomial Probit Bayesian Additive Regression Trees, 2016. R
package version 0.2.
[33] Belinda Hernandez. bartBMA: Bayesian Additive Regression Trees Using Bayesian Model
Averaging, 2020. R package version 1.0.
[34] Robert McCulloch, Matthew Pratola, and Hugh Chipman. rbart: Bayesian Trees for Conditional
Mean and Variance, 2019. R package version 1.0.
[35] P. Richard Hahn, Jared S. Murray, and Carlos M. Carvalho. Bayesian Regression Tree Models
for Causal Inference: Regularization, Confounding, and Heterogeneous Effects (with Discus-
sion). Bayesian Analysis, 15(3):965–1056, 2020.
[36] Daniel Ho, Kosuke Imai, Gary King, and Elizabeth A. Stuart. Matchit: Nonparametric prepro-
cessing for parametric causal inference. Journal of Statistical Software, 42(8):1–28, 2011.
[37] Holger L Kern, Elizabeth A Stuart, Jennifer Hill, and Donald P Green. Assessing methods
for generalizing experimental impact estimates to target populations. Journal of Research on
Educational Effectiveness, pages 1–25, 2016.
[38] Siva Sivaganesan, Peter Müller, and Bin Huang. Subgroup finding via Bayesian additive
regression trees. Statistics in Medicine, 36(15):2391–2403, 2017.
[39] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. Annals of
Statistics, 29(5):1189–1232, 2001.
[40] N. Carnegie, V. Dorie, and J. Hill. Examining treatment effect heterogeneity using BART.
Observational Studies, 5:52–70, 2019.
[41] Brent R Logan, Rodney Sparapani, Robert E McCulloch, and Purushottam W Laud. Decision
making and uncertainty quantification for individualized treatments using bayesian additive
regression trees. Statistical Methods in Medical Research, page 0962280217746191, 2017.
[42] Elizabeth A Stuart, Stephen R Cole, Catherine P Bradshaw, and Philip J Leaf. The use of
propensity scores to assess the generalizability of results from randomized trials. Journal of the
Royal Statistical Society A, 174(2):369–386, 2011.
[43] Holger L Kern, Elizabeth A Stuart, Jennifer Hill, and Donald P Green. Assessing methods
for generalizing experimental impact estimates to target populations. Journal of Research on
Educational Effectiveness, pages 1–25, 2016.
[44] Jennifer Hill. Multilevel models and causal inference. In Marc A. Scott, Jeffrey S. Simonoff,
and Brian D. Marx, editors, The SAGE Handbook of Multilevel Modeling. Sage Publications
Ltd, 2013.
[45] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects
models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
[46] Vincent Dorie, Masataka Harada, Nicole Carnegie, and Jennifer Hill. A flexible, interpretable
framework for assessing sensitivity to unmeasured confounding. Statistics in Medicine,
35(20):3453–3470, 2016.
[47] Nicole Bohme Carnegie, Masataka Harada, and Jennifer Hill. Assessing sensitivity to unmea-
sured confounding using a simulated potential confounder. Journal of Research on Educational
Effectiveness, 9:395–420, 2016.
[48] Jennifer L. Hill, Christopher Weiss, and Fuhua Zhai. Challenges with propensity score strategies
in a high-dimensional setting and a potential alternative. Multivariate Behavioral Research,
46:477–513, 2011.
[49] T Wendling, K Jung, A Callahan, A Schuler, NH Shah, and B Gallego. Comparing methods
for estimation of heterogeneous treatment effects using observational data from health care
databases. Statistics in Medicine, 2018.
[50] V. Dorie, J. Hill, U. Shalit, M. Scott, and D. Cervone. Automated versus do-it-yourself methods
for causal inference: Lessons learned from a data analysis competition. Statistical Science,
34(1):43–68, 2019.
[51] P. Richard Hahn, Vincent Dorie, and Jared S. Murray. Atlantic Causal Inference Conference
(ACIC) Data Analysis Challenge 2017. arXiv e-prints, page arXiv:1905.09515, May 2019.
[52] M. T. Pratola, H. A. Chipman, E. I. George, and R. E. McCulloch. Heteroscedastic BART via
multiplicative regression trees. Journal of Computational and Graphical Statistics, 29(2):405–
417, 2020.
[53] Jared S. Murray. Log-linear Bayesian additive regression trees for multinomial logistic and
count regression models. Journal of the American Statistical Association, 116(534):756–769,
2021.
[54] Vincent Dorie, George Perrett, Jennifer L. Hill, and Benjamin Goodrich. Stan and bart for
causal inference: Estimating heterogeneous treatment effects using the power of stan and the
flexibility of machine learning. Entropy, 24(12):1782, Dec 2022.
[55] The H2O.ai team. h2o: R Interface for H2O, 2016. R package version 3.10.0.10.
[56] Erin LeDell. h2oEnsemble: H2O Ensemble Learning, 2016. R package version 0.1.8.
[57] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects
using random forests. Journal of the American Statistical Association, 113(523):1228–1242,
2018.
[58] Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Metalearners for estimating
heterogeneous treatment effects using machine learning. Proceedings of the National Academy
of Sciences, 116(10):4156–4165, Feb 2019.
[59] Cheng Ju, Susan Gruber, Samuel D Lendle, Antoine Chambaz, Jessica M Franklin, Richard
Wyss, Sebastian Schneeweiss, and Mark J van der Laan. Scalable collaborative targeted learning
for high-dimensional data. Statistical Methods in Medical Research, 28(2):532–554, 2019.
PMID: 28936917.
21
Treatment Heterogeneity with Survival Outcomes
Yizhe Xu, Nikolaos Ignatiadis, Erik Sverdrup, Scott Fleming, Stefan Wager, Nigam Shah
CONTENTS
21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
21.1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
21.1.2 The PATH statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
21.1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
21.2 Problem Setup, Notation, and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
21.3 Metalearners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
21.3.1 S-learner: modeling risk as a function of baseline covariates, treatment
assignments, and their interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
21.3.2 T-learner: Risk modeling stratified by treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
21.3.3 Metalearning by directly modeling treatment heterogeneity . . . . . . . . . . . . . . . 453
21.3.3.1 Censoring adjustments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
21.3.3.2 M-learner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
21.3.4 Modeling both the risk and treatment heterogeneity . . . . . . . . . . . . . . . . . . . . . . . 455
21.3.4.1 X-learner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
21.3.4.2 R-learner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
21.3.5 Summary of metalearners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
21.4 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
21.4.1 Estimators under comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
21.4.2 Performance evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
21.4.3 Data generating processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
21.4.3.1 Complexity of the baseline risk function . . . . . . . . . . . . . . . . . . . . . . 462
21.4.3.2 Complexity of the CATE function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
21.4.3.3 Magnitude of treatment heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . 464
21.4.3.4 Censoring mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
21.4.3.5 Unbalanced treatment assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
21.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
21.5.1 Description of simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
21.5.1.1 Results under varying baseline risk and CATE complexity . . . . . 465
21.5.1.2 Results under varying HTE magnitude . . . . . . . . . . . . . . . . . . . . . . . . 465
21.5.1.3 Results under varying censoring mechanisms . . . . . . . . . . . . . . . . . . 466
21.5.1.4 Results under unbalanced treatment assignment . . . . . . . . . . . . . . . 468
21.5.2 Main takeaways from simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
21.6 Case Study on SPRINT and ACCORD-BP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
21.6.1 Global null analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
21.6.2 CATE estimation in SPRINT and ACCORD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
21.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
Estimation of conditional average treatment effects (CATEs) plays an essential role in modern
medicine by informing treatment decision-making at a patient level. Several metalearners have
been proposed recently to estimate CATEs in an effective and flexible way by re-purposing pre-
dictive machine learning models for causal estimation. In this chapter we summarize the literature
on metalearners and provide concrete guidance for their application for treatment heterogeneity
estimation from randomized controlled trials’ data with survival outcomes. The guidance we provide
is supported by a comprehensive simulation study in which we vary the complexity of the underlying
baseline risk and CATE functions, the magnitude of the heterogeneity in the treatment effect, the
censoring mechanism, and the balance in treatment assignment. To demonstrate the applicability of
our findings, we reanalyze the data from the Systolic Blood Pressure Intervention Trial (SPRINT)
and the Action to Control Cardiovascular Risk in Diabetes (ACCORD) study. While recent literature
reports the existence of heterogeneous effects of intensive blood pressure treatment with multiple
treatment effect modifiers, our results suggest that many of these modifiers may be spurious discover-
ies. This chapter is accompanied by survlearners, an R package that provides well-documented
implementations of the CATE estimation strategies described in this work, to allow easy use of our
recommendations as well as reproduction of our numerical study.
21.1 Introduction
Healthcare decisions are commonly informed by a combination of average treatment effects (ATEs)
from randomized controlled trials (RCTs), which do not account for fine-grained patient heterogeneity,
and risk stratification, which identifies those at the highest need for an intervention. For example,
physicians assess patients’ ten-year atherosclerotic cardiovascular disease (ASCVD) risk based on
baseline covariates (such as age, blood pressure, and cholesterol levels) using the pooled cohort
equations (PCEs) [1] and initiate statin treatment based on clinical practice guidelines [2]. Prioritizing
treatment to patients who are at higher risk is a sensible starting point; however, it may be suboptimal
when baseline risk is not an appropriate surrogate for treatment effects. For example, a young patient
with only one risk factor of elevated cholesterol level probably has a lower ASCVD risk than an older
subject who has a normal cholesterol level but multiple other risk factors, such as high blood pressure
and smoking, but the younger patient may benefit more from statins, which reduce cholesterol [3, 4].
Therefore, a key question in personalized care remains: How should we treat each individual patient?
Consequently, a key challenge in enabling precision medicine is to go beyond estimation of
ATEs and risk stratification, and account for the varied patient response to treatment depending
on factors such as patient characteristics, baseline risk, and sensitiveness to treatment [5]. Such
heterogeneity in treatment effects (HTE) may be summarized by conditional average treatment effects
(CATEs) as a function of subject-level baseline covariates. CATE estimation, however, is a difficult
statistical task. Estimating CATEs essentially requires estimating interactions of treatment and
baseline covariates, and such interaction effects are often small compared to main effects that drive
baseline risk. Conventional one-variable-at-a-time and subgroup analyses may produce false-positive
results due to multiple testing and false-negative results due to insufficient power, i.e., small sample
size in subgroups [6].
As we discuss more in the related work section below, a promising approach towards CATE
estimation that addresses some of the above challenges is the development of estimation strategies
that use flexible machine learning methods. A particularly convenient class of such methods is
called metalearners [7, 8]. The premise is that one can decompose the CATE estimation task into
well-understood machine learning tasks (as would be conducted, e.g., for risk stratification). Statistical
and domain expertise in risk modeling can thus be repurposed for a causal task, namely estimation of
heterogeneous treatment effects.
In this chapter we build on the metalearners literature and provide concrete guidance for their
usage in estimating treatment effect heterogeneity from RCT data with right-censored survival
outcomes. Our contributions are as follows: 1) We provide an accessible summary of the mathematical
underpinning of five popular metalearners (S-, T-, X-, M-, and R-learners) when combined with two
popular machine learning strategies (Lasso and generalized random forests). While the described
metalearners have been developed for uncensored continuous or binary data, we explain how they
may be adapted to the survival setting through inverse probability of censoring weighting. 2) We
provide code in an R package called survlearners [9] that demonstrates exactly how these
methods may be implemented in practice and describe how machine learning models (e.g., risk
models) are leveraged by each metalearner. 3) We conduct a comprehensive simulation study of
the above CATE estimation strategies using several data generation processes (DGPs) in which
we systematically vary the complexity of the baseline risk function, the complexity of the CATE
function, the magnitude of the heterogeneity in treatment effects, the censoring mechanism, and the
imbalance of treated and control units. 4) Based on the results of the simulation study, we summarize
considerations that matter in choosing and applying metalearners and machine learning models for
HTE estimation. 5) We apply our findings as a case study of HTE estimation on the Systolic Blood
Pressure Intervention Trial (SPRINT) [10] and the Action to Control Cardiovascular Risk in Diabetes
(ACCORD) trial [11].
“PATH risk modeling”: A predictive risk model is identified (or developed) and HTEs are
estimated as a function of predicted risk in the RCT data.
“PATH effect modeling”: A predictive model is developed for the outcome of interest with
predictors that include the risk predictors, the treatment assignment, as well as interaction terms.
Our work is complementary to the PATH statement and provides further methodological guidance
for HTE estimation in RCTs with survival outcomes. Concretely, the “PATH effect modeling”
approach coincides with the “S-learner,” a metalearner proposed in the machine learning literature
and described in Section 21.3.1. Our numerical results and literature review confirm the caveats
of effect modeling described in the PATH Statement. Other metalearners may be preferable in
settings where those caveats apply. The PATH guidelines for the "PATH risk modeling" approach are
important for all the metalearners considered in this work. These guidelines advocate for the use of a
parsimonious set of predictors for HTEs. A risk score (developed previously, or blinded to treatment
assignment) is an important predictor for HTEs that is justified both mathematically and by clinical
experience. In this work by using modern machine learning and regularization techniques, we allow
for the possibility that Xi may be higher-dimensional. We do not provide guidance on how to choose
predictors Xi based on domain expertise but describe methods for efficiently using any such HTE
predictors chosen by the analyst (Xi may include, e.g., risk scores previously developed, but also
additional predictors determined based on domain expertise).
21.1.3 Outline
The outline of this chapter is as follows. We define the CATE estimation problem in Section 21.2.
In Section 21.3 we provide a brief tutorial on the use of metalearners and machine learning for
estimating treatment heterogeneity in RCTs with right-censored, time-to-event data. We describe our
simulation study in Section 21.4 and summarize main takeaways in Section 21.5. We then present a
case study on SPRINT and ACCORD in Section 21.6. Finally, we conclude with a discussion and
future extensions in Section 21.7.
Consistency: The observed survival time in real data is the same as the potential outcome under the
actual treatment assignment, i.e., Ti = Ti (Wi ).
RCT: The treatment assignment is randomized such that Wi is independent of (Xi, Ti(1), Ti(0)),
i.e., Wi ⊥⊥ (Xi, Ti(1), Ti(0)), and P[Wi = 1] = e with known 0 < e < 1.
Noninformative censoring: Censoring is independent of survival time conditional on treatment
assignment and covariates, i.e., Ci ⊥⊥ Ti | Xi, Wi.
Positivity: There exist subjects who are at risk beyond the time horizon t0, i.e., P[Ci > t0 |
Xi, Wi] ≥ ηC for some ηC > 0.
Remark 21.1 (Observational studies). As mentioned above, throughout this manuscript we focus
our attention on RCTs so as to provide a comprehensive discussion of issues involved in the estimation
of HTEs in the absence of confounding beyond censoring bias. Nevertheless, conceptually, the
metalearners we discuss are also applicable in the setting of observational studies under uncon-
foundedness. Concretely, we may replace the RCT assumption by the following two assumptions:
Unconfoundedness: The potential survival times are independent of the treatment assignment Wi
conditionally on baseline covariates, that is, Wi ⊥⊥ (Ti(1), Ti(0)) | Xi.
Overlap: There exists η ∈ (0, 1) such that the propensity score e(x) = P[Wi = 1|Xi = x] satisfies
η ≤ e(x) ≤ 1 − η for all x in the support of Xi .
We refer the interested readers to the original manuscripts introducing the different metalearners for
an explanation of their application to observational studies. In short, for the methods we describe, it
suffices to replace the treatment probability e (whenever it is used by a method) by ê(Xi), where ê(·)
is an estimate of the propensity score e(·) = P[Wi = 1 | Xi = ·]. The statistical consequences of
estimation error in ê(·), as well as ways to make estimators robust to this error, are discussed further
in [8] and [29].
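For instance, a minimal sketch of this substitution, assuming the propensity model is fit with a
regression forest from the grf package (which returns out-of-bag predictions by default):

library(grf)
e_fit <- regression_forest(X, W)      # propensity model for ê(·)
e_hat <- predict(e_fit)$predictions   # out-of-bag estimates ê(X_i), replacing the constant e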
21.3 Metalearners
Metalearners are specific meta-algorithms that leverage predictive models to solve the causal task of
estimating treatment heterogeneity. Metalearners are motivated by the observation that predictive
models are applied ubiquitously, that we have a good understanding of fitting models with
strong out-of-bag predictive performance, and we know how to evaluate predictive models [51–53].
Metalearners repurpose this expertise to power the effective estimation of HTEs (a task which is less
well-understood). Proposed metalearners build upon different predictive tasks and also combine these
predictive models in distinct ways to estimate HTEs. In this Section we seek to provide a short, but
instructive, introduction to commonly used metalearners in the context of CATE estimation (21.1)
with survival data.
To emphasize the flexibility of metalearners in leveraging predictive models, we abstract away
the concrete choice of predictive model for each task by introducing the notation
M(Ỹ ∼ X̃; Õ, [K̃]),    (21.2)

to denote a generic prediction model that predicts Ỹ as a function of covariates X̃ based on the
dataset Õ with (optional) sample weights K̃. Note that by default we assume that K̃i = 1 for all i, in
which case we omit K̃ from the notation and write M(Ỹ ∼ X̃; Õ). It will also be convenient to
introduce notation for predictive models that give out-of-bag (oob) predictions1 of Ỹ:

M^oob(Ỹ ∼ X̃; Õ, [K̃]).    (21.3)
Below we will describe the high-level idea of each metalearner, followed by concrete examples for
possible choices of M. We first describe two metalearners – the S- and T-learners – for which we
only need to predict the probability of the event {Ti > t0} as a function of certain covariates X̃i. We refer to
this task as risk modeling, since it amounts to modeling the survival probability P[Ti > t0 | X̃i].

21.3.1 S-learner: modeling risk as a function of baseline covariates, treatment assignments, and their interactions
τ (x) = µ([x, 1]) − µ([x, 0]) with µ([x, w]) = E[Yi | Xi = x, Wi = w]. (21.4)
In words, the S-learner first learns a risk model µ̂(·) as a function of Xi and Wi, i.e., the treatment
assignment Wi is merely treated as “just another covariate” [28, 91]. Then, given baseline covariates
x, the S-learner applies the fitted model to [x, 1] (that is, to the feature vector that appends Wi = 1 to
x) to impute the response of the treated outcomes. The fitted model is then applied to [x, 0] to impute
the response of the control outcomes, and finally the CATE is estimated as the difference thereof.
In (21.4), any predictive model for P[Ti > t0 | Xi , Wi ] may be used, for example, a random
survival forest.
Example 21.1 (S-learner with Random Survival Forest). A concrete example of a fully nonparametric
model M(Y ∼ [X, W ]; O) is the random survival forest of [55]. The basic idea of the random
survival forest, and more generally, of generalized random forests (GRF) [26], is the following: As
1 We use the term out-of-bag loosely, also using it to refer to out-of-sample or out-of-fold predictions.
in the traditional random forest of [56], a collection of trees is grown. Each tree is grown based
on a randomly drawn subsample of the training data and by recursively partitioning the covariate
space. These trees are then used to adaptively weight [57] new test points. To be precise, let x̃ be
a test covariate (e.g., x̃ = [x, w] for the S-learner); then the i-th data point in the training dataset
is assigned weight αi(x̃) = Σ_{b=1}^{B} 1({X̃i ∈ Lb(x̃)}) / (B · |Lb(x̃)|), where B is the total number of
trees, Lb(x̃) is the set of all training examples i such that X̃i falls in the same leaf of the b-th tree
as x̃, and |·| denotes the number of instances in a set. Then, given these weights αi(x̃), the
Nelson–Aalen estimator with these weights is used to predict P[Ti > t0 | X̃i]. Using the random
survival forest in the package grf [58], the S-learner may be implemented as follows:
library(grf)
m <- survival_forest(data.frame(X=X, W=W), U, D, prediction.type="Nelson-Aalen")
m1_x <- predict(m, data.frame(X=x, W=1), t0)$predictions
m0_x <- predict(m, data.frame(X=x, W=0), t0)$predictions
tau_x <- m1_x - m0_x
In the first line we load the grf package. In the second line we fit the random survival forest based
on the full dataset with augmented covariates for which we concatenate the baseline covariates X
and treatment assignment vector W and with follow-up times U and event indicators D. Then, given
test baseline covariates x, we impute (predict) the survival probability at t0 (the code variable t0
is a scalar equal to t0) using the fitted survival forest for the treated outcome (third line) and control outcome
(fourth line), and finally we take their difference (fifth line).
The random survival forest automatically captures interactions between baseline covariates Xi
and treatment assignment Wi through the tree structure.2 When a conventional regression is used as
the predictive model, one needs to explicitly specify treatment-covariate interaction terms in order to
model HTE:
Example 21.2 (S-learner with Cox-Lasso). A commonly used risk model for survival data is given
by the Cox proportional hazards (PH) model with Lasso penalization [59, 60]. Given covariates
X̃i, the PH model assumes that the log-hazard is equal to X̃i⊤β, for an unknown coefficient vector
β that is estimated by minimizing the negative log partial likelihood [61] plus the sum of absolute
values of the coefficients βj multiplied by the regularization parameter λ ≥ 0 (i.e., λ · Σj |βj|). One
of the upshots of the Cox-Lasso is that it automatically performs shrinkage and variable selection; λ
determines the sparsity of the solution (i.e., how many of the βj are equal to zero). In the context of
CATE estimation, the Cox-Lasso is typically fitted using the covariate vector X̃i = [Xi, Wi, Wi · Xi],
that is, by explicitly including interaction terms Wi · Xi for the linear predictors. The S-learner with
Cox-Lasso then takes the following form:3

µ̂(·) = M^Cox-Lasso_λ̂ (Y ∼ [X, W, W · X]; O),    τ̂(x) = µ̂([x, 1, 1 · x]) − µ̂([x, 0, 0 · x]),
2 There is a caveat to this claim: Since the treatment indicator is included in the same way as the other covariates, it is
likely to be ignored in several trees that never split on it. This can cause the S-learner with Random Survival Forest to perform
poorly in some situations.
3 In this note we provide some more details about fitting the Cox-Lasso: Given a survival dataset Õ indexed by Ĩ and with
covariates X̃ ∈ R^d̃, and a tuning parameter λ ≥ 0, the Cox-PH Lasso model for predicting the survival probability at t0 is
fitted as follows (assuming for simplicity that there are no ties in the observed event times Ũi):

β̂λ ∈ argmin_β { −Σ_{i∈Ĩ} Δ̃i [ X̃i⊤β − log Σ_{j∈Ĩ: Ũj ≥ Ũi} exp(X̃j⊤β) ] + λ Σ_{j=1}^{d̃} |βj| },

Ĥλ(t0) = Σ_{i∈Ĩ: Ũi ≤ t0} Δ̃i / ( Σ_{j∈Ĩ: Ũj ≥ Ũi} exp(X̃j⊤β̂λ) ).
where we make explicit in the notation that λ̂ is typically chosen in a data-driven way by minimizing
the cross-validated log partial likelihood [16, 63].
A further challenge (beyond explicitly modeling interactions Wi · Xi ) in applying the S-learner
with Cox-Lasso is that there are many possible choices with respect to normalization of covariates,
interaction terms, and to applying different penalty factors to different coefficients. In our implemen-
tation we do not apply any shrinkage on the coefficient of Wi . We discuss the Cox-Lasso model in
more detail, as well as our normalization/shrinkage choices in Supplement A.1.
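As a rough illustration of this workflow (a sketch only, not the survlearners implementation;
it assumes the glmnet and survival packages and sets aside the normalization details of
Supplement A.1):

library(glmnet)
library(survival)
XW <- cbind(X, W, W * X)                      # covariates with explicit interaction terms
pf <- c(rep(1, ncol(X)), 0, rep(1, ncol(X)))  # no shrinkage on the coefficient of W
fit <- cv.glmnet(XW, Surv(U, D), family = "cox", penalty.factor = pf)
# Survival probabilities at t0 under W = 1 and W = 0, via the baseline hazard estimate
sf1 <- survfit(fit, s = "lambda.min", x = XW, y = Surv(U, D),
               newx = matrix(c(x, 1, 1 * x), nrow = 1))
sf0 <- survfit(fit, s = "lambda.min", x = XW, y = Surv(U, D),
               newx = matrix(c(x, 0, 0 * x), nrow = 1))
tau_x <- summary(sf1, times = t0)$surv - summary(sf0, times = t0)$surv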
21.3.2 T-learner: risk modeling stratified by treatment

The T-learner estimates the CATE by fitting separate risk models within each treatment arm:

τ(x) = µ(1)(x) − µ(0)(x), where µ(w)(x) = P[Ti(w) > t0 | Xi = x].    (21.6)
Example 21.3 (T-learner with Random Survival Forest). Analogous to Example 21.1, the T-learner may be implemented with grf as follows:
library(grf)
m1 <- survival_forest(X[W==1,], U[W==1], D[W==1], prediction.type="Nelson-Aalen")
m0 <- survival_forest(X[W==0,], U[W==0], D[W==0], prediction.type="Nelson-Aalen")
m1_x <- predict(m1, x, t0)$predictions
m0_x <- predict(m0, x, t0)$predictions
tau_x <- m1_x - m0_x
In the second line we fit the random survival forest with covariates X only on the treated subjects
through the subsetting W==1. In the third line we fit the same model on control subjects (subsetting
W==0). Then we estimate the survival probability with covariates x under treatment (line 4) and
control (line 5), and take the difference to estimate the CATE (line 6).
The next example describes the T-learner with Cox-Lasso. In contrast to the S-learner from
Example 21.2, here no special tuning or normalization is required when fitting the Cox-Lasso since
the risks are modeled as functions of baseline covariates only.
In the main text (e.g., in Example 21.4), we use the following notation for the above predictive risk modeling procedure:
M^Cox-Lasso_λ (Ỹ ∼ X̃; Õ).
Example 21.4 (T-learner with Cox-Lasso). The CATE estimate is computed as follows:

µ̂(1)(·) = M^Cox-Lasso_λ̂₁ (Y ∼ X; O₁),  µ̂(0)(·) = M^Cox-Lasso_λ̂₀ (Y ∼ X; O₀),  τ̂(x) = µ̂(1)(x) − µ̂(0)(x),

where O₁ and O₀ denote the treated and control subsets of the data.
4 The issue of regularization-induced bias becomes even more nuanced in the observational study setting of Remark 21.1,
where F−ℓ = U \ Fℓ is the set of all subjects outside fold Fℓ and Δk = 1{min(Ti, Uk) ≤ Ci}.
If censoring may depend on (Xi, Wi), then a more complicated model is required. For example,
if one is willing to assume proportional hazards, then similarly to Examples 21.2 and 21.4, one
could estimate K̂i by running the Cox-Lasso. A more nonparametric estimate is provided by random
survival forests:
Example 21.6 (Censoring weights with Random Survival Forests). Following the notation in
Examples 21.1 and 21.3, the grf package may be used as follows for IPCW.
library(grf)
cen <- survival_forest(data.frame(X,W), U, 1-D, prediction.type="Nelson-Aalen")
K <- 1/predict(cen, failure.times=pmin(U,t0), prediction.times="time")$predictions
In Line 2, we fit the forest with flipped event indicator 1-D and covariates X,W, and in the last
line we compute the vector of censoring weights K. We note that survival_forest in the grf
package computes out-of-bag predictions by default (compute.oob.predictions = TRUE).
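For comparison, when censoring does not depend on covariates or treatment (the Kaplan–Meier
setting of Example 21.5), a minimal sketch using the survival package might be:

library(survival)
km_cens <- survfit(Surv(U, 1 - D) ~ 1)            # Kaplan-Meier fit of the censoring distribution
S_C <- stepfun(km_cens$time, c(1, km_cens$surv))  # step function t -> P(C > t)
K_hat <- 1 / S_C(pmin(U, t0))                     # IPC weights evaluated at min(U_i, t0)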
If we hypothetically had access to the oracle scores Yi*,o = Yi(1) − Yi(0) for the uncensored
samples i ∈ Icomp and weights K̂i as in (21.8), then we could estimate HTEs via weighted predictive
modeling as τ̂(·) = M(Y*,o ∼ X; Ocomp, K̂) (see examples below). In the next subsections we describe
three metalearners, the M-, X-, and R-learners, that address the challenge that – even in the absence
of censoring – the oracle scores Yi*,o = Yi(1) − Yi(0) are not available, due to the fundamental
challenge of causal inference.
The observation driving these methods is that for any given (observable) score Yi* with the
property

E[Yi* | Xi = x] = τ(x) or E[Yi* | Xi = x] ≈ τ(x),    (21.10)

one can estimate τ̂(x) by predictive modeling of Yi* as a function of Xi. The oracle score Yi*,o =
Yi(1) − Yi(0) satisfies (21.10) by definition; however, it is not the only score with this property.
Remark 21.2 (Doubly robust censoring adjustments). The IPCW (Inverse Probability of Censoring
Weights) adjustment removes censoring bias, but it can be inefficient and unstable when the censoring
7 Assuming no ties for simplicity in the formula.
rate is high in a study and a majority of the censoring events happened before t0 . It also may be
more sensitive to misspecification of the censoring model. In such cases, one can consider a doubly
robust correction [67] similar to the augmented inverse-propensity weighting estimator of [68]. We
do not pursue such a doubly robust correction here, because it is substantially more challenging to
implement with general off-the-shelf predictive models. Case-by-case constructions are possible, e.g.,
the Causal Survival Forest (CSF) of [40] uses a doubly robust censoring adjustment, and we will
compare to CSF in the simulation study.
21.3.3.2 M-learner
The modified outcome method (M-learner) [12, 14, 69, 70] leverages the aforementioned insight with
the following score based on the [71] transformation / inverse propensity weighting (IPW):
Yi*,M = Yi (Wi/e − (1 − Wi)/(1 − e)),    E[Yi*,M | Xi = x] = τ(x).    (21.11)
The M-learner then fits τ̂(x) = M(Y*,M ∼ X; Ocomp, K̂) (21.12). Any predictive model could be
used in (21.12); for example, a random (regression) forest could be used [56].
Example 21.7 (M-learner with Random Forest CATE modeling). Let K_hat be a vector of cen-
soring weights, derived as in Examples 21.5 or 21.6. Also let e be the treatment probability.
Then the M-learner with random forest CATE modeling may be implemented as follows with the
regression_forest function in the grf package.
library(grf)
idx <- (D == 1) | (U >= t0)
Y_M <- (U > t0) * (W/e - (1-W)/(1-e))
tau_hat_forest <- regression_forest(X[idx,], Y_M[idx], sample.weights = K_hat[idx])
tau_x <- predict(tau_hat_forest, x)$predictions
In Line 2 we subset to the complete cases. In Line 3 we generate the IPW response in (21.11), in Line
4 we fit the random forest and finally in Line 5 we extract the estimated CATE at x.
In the context of survival data, [36] proposed a related M-learner approach that uses a single
regression tree (rather than forest). Another alternative could be to fit the CATE with Lasso:
Example 21.8 (M-learner with Lasso CATE modeling). If we seek to approximate τ(x) as a sparse
linear function of x, we can use the Lasso [72] with squared error loss:

(β̂0, β̂1) ∈ argmin_{β0,β1} { Σ_{i∈Icomp} K̂i · (Yi*,M − β0 − β1⊤Xi)² + λ̂τ Σ_{j=1}^{d} |β1,j| },    τ̂(x) = β̂0 + β̂1⊤x.
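A sketch of this fit with glmnet (an assumed implementation choice, with Y_M, K_hat, and idx
constructed as in Example 21.7, and x a matrix of test covariates):

library(glmnet)
fit <- cv.glmnet(X[idx,], Y_M[idx], weights = K_hat[idx], alpha = 1)  # weighted Lasso
tau_x <- predict(fit, newx = x, s = "lambda.min")                     # CATE estimates at x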
21.3.4 Modeling both the risk and treatment heterogeneity

In contrast to the M-learner, the X- and R-learners described next also model the risk when estimating the
CATE. By including estimated risk models in the definition of the scores, these learners can estimate
the CATE with lower variance.
21.3.4.1 X-learner
The X-learner [7] constructs scores that satisfy the approximate identity in (21.10) and that build on the
following observation:

Yi*,X,1 = Yi(1) − µ(0)(Xi),    E[Yi*,X,1 | Xi = x] = τ(x),    (21.13)

where µ(0)(x) = E[Yi(0) | Xi = x] is defined in (21.6). Since µ(0)(x) is unknown, it needs to be
estimated in a first stage (as in the T-learner (21.7)). The X-learner thus takes the following form
In words, τ̂(1) is the CATE estimated using data from treated units, for which Yi(1) is observed,
and then the unobserved Yi(0) is imputed as µ̂(0)(Xi). The role of treatment and control groups in
(21.13) may be switched8 and hence analogously to (21.14) we could consider a control-based
estimate τ̂(0) (21.15). In a last stage, the X-learner combines the two CATE estimates (21.16).
The intuition here is that we should upweight (21.14) if there are fewer treated units, and (21.15) if
there are more treated units.
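Consistent with this intuition, the standard X-learner combination weights the two estimates by the
treatment probability e; under that assumption, (21.16) reads

τ̂(x) = e τ̂(0)(x) + (1 − e) τ̂(1)(x).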
The two CATE models in (21.14) and (21.15) may be fitted using the same methods, described
e.g., for the M-learner. We provide an example using random forests (analogous to Example 21.7):
Example 21.9 (X-learner with Random Forest CATE Model). Suppose mu1_hat, resp. mu0_hat, are
vectors of length n with i-th entry equal to µ̂(1)(Xi), resp. µ̂(0)(Xi), with the models µ̂(1)(·), µ̂(0)(·)
fitted as in the T-learner (Subsection 21.3.2).9 Furthermore, let K_hat be a vector of length n
corresponding to IPC weights estimated as in Subsection 21.3.3.1. Then the X-learner that models
the CATE using the random forest function in the grf package may be implemented as follows. First,
we fit (21.14) by only retaining complete, treated cases (Line 2 below), constructing the estimated
scores Yi∗,X,1 (21.13) (Line 3), fitting a random forest with IPCW (Line 4) and finally extracting the
model prediction at x:
library(grf)
idx_1 <- ((D == 1) | (U >= t0)) & (W == 1)
Y_X_1 <- (U > t0) - mu0_hat
tau_hat_1 <- regression_forest(X[idx_1,], Y_X_1[idx_1], sample.weights
= K_hat[idx_1])
tau_x_1 <- predict(tau_hat_1, x)$predictions
8 That is, for Yi*,X,0 = µ(1)(Xi) − Yi(0), it holds that E[Yi*,X,0 | Xi = x] = τ(x).
9 For example, µ̂(1)(·), µ̂(0)(·) could be Cox-Lasso models as in Example 21.4, or survival forests as in Example 21.3. In
the latter case, mu1_hat and mu0_hat can be computed by predicting the two fitted survival forests of Example 21.3 on the full covariate matrix X at horizon t0.
21.3.4.2 R-learner

The R-learner is a metalearner proposed by [8] that builds upon a characterization of the CATE
in terms of residualization of Wi and Yi [73]. To be concrete, we start by centering Wi and Yi
around their conditional expectation given Xi, that is, we consider Wi − e11 and Yi − m(Xi) with
m(x) = E[Yi | Xi = x] equal to:

m(x) = e µ(1)(x) + (1 − e) µ(0)(x).    (21.17)
With the above definition, the following calculations hold for the expectation of Yi − m(Xi) condi-
tionally on Xi = x, Wi = 1:

E[Yi − m(Xi) | Xi = x, Wi = 1] = µ(1)(x) − m(x) = (1 − e)(µ(1)(x) − µ(0)(x)) = (1 − e)τ(x).    (21.18)

Similarly E[Yi − m(Xi) | Xi = x, Wi = 0] = −e τ(x), and so E[Yi − m(Xi) | Xi = x, Wi = w] =
(w − e)τ(x). The preceding display enables characterization of the CATE τ(·) through the loss-based
representation [74],

τ(·) ∈ argmin_{τ̃(·)} E[ ({Yi − m(Xi)} − {Wi − e} τ̃(Xi))² ].    (21.19)
(21.19) suggests the following CATE estimation strategy. First, we estimate the unknown m(·)
out-of-bag (21.3):12

m̂(x) = e µ̂(1)(x) + (1 − e) µ̂(0)(x),    with µ̂(1)(·), µ̂(0)(·) fitted out-of-bag on the treated and control arms, respectively.    (21.20)
10 Similarly, (21.15) may be fitted on the complete control cases:
idx_0 <- ((D == 1) | (U >= t0)) & (W == 0)
Y_X_0 <- mu1_hat - (U > t0)
tau_hat_0 <- regression_forest(X[idx_0,], Y_X_0[idx_0], sample.weights = K_hat[idx_0])
tau_x_0 <- predict(tau_hat_0, x)$predictions
11 Note that E[Wi | Xi = x] = E[Wi] = P[Wi = 1] = e, due to our assumption that we are in the setting of an RCT.
12 In the case without censoring, it suffices to fit a single predictive model by regressing Yi on Xi, that is, m̂(·) =
M^oob(Y ∼ X; O), and this is the approach suggested in [8]. However, under censoring that may depend on treatment
assignment, fitting M^oob(Y ∼ X; O) becomes challenging – for example, if we naively use a survival forest with covariates
X, then the fitted model will typically be inconsistent for m(·). By fitting two separate predictive models to the two treatment
arms, as outlined in (21.20), and taking their convex combination (weighted by the treatment probability), we overcome this
challenge and may use general predictive models.
13 The reason is that the following characterization of τ (·) in lieu of Equation (21.19) also holds:
We emphasize that we use out-of-bag estimates of m̂(·) in (21.20). If we fit µ̂(1)(·), µ̂(0)(·) with
grf survival forests, as in Example 21.3, then we obtain out-of-bag estimates by default. If the
Cox-Lasso is used, as in Example 21.4, then we can get out-of-bag estimates by splitting I into 10
folds and using out-of-fold predictions.
Second, we let K̂i be IPC weights as in (21.9) and Icomp be the index set of complete cases.
Finally, we estimate τ(·) by fitting a model that leads to small values of the R-learner loss
Σ_{i∈Icomp} K̂i ({Yi − m̂(Xi)} − {Wi − e} τ̂(Xi))². Such a fitting procedure is not directly accom-
modated by models of the form (21.2). Below, we will describe how we may achieve this task with
general predictive models (21.2). However, before doing so, we provide some examples of procedures
that directly operate on the R-learner loss. We start with the simplest case:
Example 21.10 (R-learner for estimating a constant treatment effect). Let m̂(·) be as in (21.20)
and K̂i as in (21.9). Suppose there is no treatment heterogeneity, that is, τ(x) = constant for all
x.14 Then, estimating the constant τ̂ with the R-learner loss boils down to fitting a weighted linear
regression on the complete cases with response Yi − m̂(Xi), predictor Wi − e, without an intercept,
and with weights K̂i, and letting τ̂ be equal to the slope of Wi − e in the above regression.
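A minimal sketch of this regression (variable names as in the earlier examples; idx marks the
complete cases):

idx <- (D == 1) | (U >= t0)
Y <- U > t0
fit <- lm(I(Y - m_hat) ~ 0 + I(W - e), weights = K_hat, subset = idx)
tau_const <- coef(fit)[1]   # estimated constant treatment effect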
Generalizing the above example, if we seek to approximate τ(·) as a sparse linear function of Xi,
then we may use the Lasso, as in Example 21.11 (compare to Example 21.8).
Example 21.12 (R-learner with Causal Forest). The causal_forest function in the grf package
directly targets the R-learner objective:

library(grf)
idx <- (D == 1) | (U >= t0)
m_hat <- mu1_hat * e + mu0_hat * (1 - e)
Y <- U > t0
tau_hat <- causal_forest(X[idx,], Y[idx], W[idx], Y.hat = m_hat[idx],
W.hat = e[idx], sample.weights = K_hat[idx])
tau_x <- predict(tau_hat, x)$predictions
14 The procedure described in this example is valid also in the presence of HTEs and asymptotically recovers the overlap-
weighted average treatment effect (see e.g., [75]). In the setting without censoring, the procedure is described also in [76].
In Line 2 we get the indices of complete cases. In Line 3 we combine the estimates of µ̂(1)(·), µ̂(0)(·)
to get an estimate of m(·) as in (21.20). In Line 4 we compute the survival indicator Y. In Lines 5
and 6 we fit a causal forest that targets the R-learner objective. Note that here we subset only to
complete cases given by idx, and we also specify the censoring weights K_hat, as well as the
expected responses m_hat for Y, and e for W. Finally, in Line 7, we extract the causal forest estimate
of the CATE at x.
The preceding examples presented three approaches that directly operate on the R-learner loss
function. It is possible, however, to cast R-learner based estimation of τ̂(·) in the form (21.2). To do
so, we rewrite (21.19) equivalently as:

τ(·) ∈ argmin_{τ̃(·)} E[ (Wi − e)² ((Yi − m(Xi))/(Wi − e) − τ̃(Xi))² ].    (21.21)
This is a weighted least squares objective with weights (Wi − e)². Under unbalanced treatment
assignment, e.g., when there are fewer treated units (e < 0.5), the R-learner upweights the
treated units compared to control units by the factor (1 − e)²/e². The upweighting of treated units by
the R-learner is similar to the behavior of the X-learner (compare to (21.16)). In fact, the predictions
of R-learner and X-learner (when used with the same predictive models) are almost identical in the
case of strong imbalance (e ≈ 0) [8].
(21.21) justifies the following equivalent implementation of the R-learner. Let m̂(·) be as
in (21.20) and let K̂i be IPC weights; then we may estimate τ̂(·) as follows:

Ŷ*,R = (Y − m̂(X)) / (W − e),    τ̂(·) = M(Ŷ*,R ∼ X; Ocomp, K̂ · (W − e)²).    (21.22)
We make two observations: First, Ŷ*,R approximates a score in the sense of (21.10).15 Second, the
weights used by predictive models are the product of (Wi − e)² and the IPC weights K̂i.
To further illustrate how to apply (21.22), we revisit Example 21.11 and provide an equivalent
way of implementing the R-learner with the Lasso.
Example 21.13 (Weighted representation of R-learner with Lasso). The estimator τ̂(·) in Exam-
ple 21.11 may be equivalently expressed as16

(β̂0, β̂1) ∈ argmin_{β0,β1} { Σ_{i∈Icomp} K̂i · (Wi − e)² · (Ŷi*,R − β0 − β1⊤Xi)² + λ̂τ Σ_{j=1}^{d} |β1,j| },    τ̂(x) = β̂0 + β̂1⊤x.
16 Common implementations of the Lasso first standardize all covariates to unit variance (e.g., option standardize=TRUE in the glmnet package [77]). In
that case, one would get conflicting answers when implementing the R-learner as described in Example 21.11, respectively
Example 21.13. In our implementation of the R-learner with the Lasso, we enable standardization and follow Example 21.13.
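A sketch of such an R-learner fit with XGBoost (an assumed reconstruction; variable names follow
the earlier examples, and the line numbers in the description below refer to the original listing):

library(xgboost)
idx <- (D == 1) | (U >= t0)
Y_R <- ((U > t0) - m_hat) / (W - e)   # R-learner score, as in (21.22)
w_R <- K_hat * (W - e)^2              # combined IPC and (W - e)^2 weights
dtrain <- xgb.DMatrix(data = X[idx,], label = Y_R[idx], weight = w_R[idx])
fit <- xgb.train(params = list(objective = "reg:squarederror"), data = dtrain, nrounds = 100)
tau_x <- predict(fit, xgb.DMatrix(data = x))   # x: matrix of test covariates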
In Line 1 we load the xgboost package. In Line 2 we get the indices of complete cases. In Lines 3 and 4 we construct the R-learner score (21.23). In Lines 5–8 we fit the XGBoost model with sample weights K_hat[idx] * (W[idx] - e[idx])^2. Finally, in Line 9, we extract the XGBoost estimate of the CATE at x.
M    not applicable    $Y^{*,M} = W Y / e + (1 - W) Y / (1 - e)$;   $\hat{\tau}(x) = M\big(Y^{*,M} \sim X;\ O_{\mathrm{comp}},\ \hat{K}\big)$
TABLE 21.2
Metalearner combinations considered in the simulation study. For risk and CATE models, we apply either the Cox-Lasso regression or the generalized random forest approach. For censoring models, we either employ the Kaplan-Meier estimator without covariate adjustment or the random survival forest method with covariate adjustment.
Cox proportional hazards model (CPH): This represents our baseline approach, as it is very widely used in practice [81–83]. This method is the same as the S-learner in Example 21.2 but without Lasso penalization, i.e., λ = 0.
Causal Survival Forest (CSF): This estimator, proposed by [40], is similar to the RFF estimator we consider, with the censoring model estimated with survival forests. The main difference, as discussed in more detail in Remark 21.2, is that instead of adjusting for censoring via IPCW, it implements a doubly robust adjustment. We use the CSF implementation in the package grf.
where we set the coefficients β1 = 1 and γ1 = 0.5. The censoring time Ci is independent of
(Ti , Wi , Xi ) and follows a Weibull distribution with hazard function
$$ \lambda_C(t) = \frac{\rho}{\kappa} \left( \frac{t}{\kappa} \right)^{\rho - 1}, \qquad (21.25) $$
where we set κ = 4 and ρ = 2 for the scale and shape parameters, respectively.
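For concreteness, censoring times with the hazard in (21.25) can be drawn with R's rweibull, whose shape and scale arguments play the roles of ρ and κ (a sketch; T stands for event times drawn from the survival model above):

n <- 10000                                     # illustrative sample size
kappa <- 4; rho <- 2
C <- rweibull(n, shape = rho, scale = kappa)   # hazard (rho/kappa) * (t/kappa)^(rho - 1)
# U <- pmin(T, C); D <- as.numeric(T <= C)     # observed follow-up time and event indicator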
models for Ti . To do so, we start by increasing the complexity of the baseline risk by including a
larger number of predictors in fR (21.24), by utilizing nonlinear fR such as indicator functions, or
both:
$$ (\mathrm{Lin.},\ p_R = 25):\ f_R(X_i) = \sum_{j=1}^{25} \beta_1 X_{ij} / \sqrt{p}, \qquad (\mathrm{Nonlin.},\ p_R = 1):\ f_R(X_i) = \beta_1 \mathbf{1}\{X_{i1} > 0.5\}, $$
$$ (\mathrm{Nonlin.},\ p_R = 25):\ f_R(X_i) = \tilde{\beta}_1 \mathbf{1}\{X_{i1} > 0.5\} + \tilde{\beta}_2 \sum_{j=1}^{12} \mathbf{1}\{X_{i(2j)} > 0.5\}\, \mathbf{1}\{X_{i(2j+1)} > 0.5\}. $$
We set β1 = 1 (as in the baseline DGP), and β̃1 = 0.99, β̃2 = 0.33. We note that the last specification
(Nonlin., pR = 25) also includes second-order interactions of baseline covariates in the log-hazard.
Furthermore, in all cases, Xi ∈ R25 , i.e., we do not change the dimension of the baseline covariates
(e.g., in the case pR = 1, only the first feature influences the baseline risk, yet this information is not
“revealed” to the different methods).
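In code, these three baseline-risk specifications may be written as follows (a sketch; X is the n × 25 covariate matrix, with the coefficient values as set above):

p <- 25
f_R_lin_25    <- function(X) as.vector(X %*% rep(1, p)) / sqrt(p)    # (Lin., pR = 25), beta1 = 1
f_R_nonlin_1  <- function(X) 1 * (X[, 1] > 0.5)                      # (Nonlin., pR = 1)
f_R_nonlin_25 <- function(X) {                                       # (Nonlin., pR = 25)
  out <- 0.99 * (X[, 1] > 0.5)
  for (j in 1:12) out <- out + 0.33 * (X[, 2 * j] > 0.5) * (X[, 2 * j + 1] > 0.5)
  out
}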
We set γ1 = 0.5, γ̃1 = 0.99, and γ̃2 = 0.33. In varying both the baseline risk through fR and the CATE complexity through fτ, we only consider combinations such that fR is at least as complex as fτ (that is, when pτ = 25, then also pR = 25, and when fτ is nonlinear, then we also take fR to be nonlinear). This reflects the fact that the baseline risk could be arbitrarily complicated, but HTEs may be less so [7, 8].
where α = 2 and δ = 2. Under this setting, subjects' censoring depends on their characteristics and treatment type.
A more interesting scenario that builds on top of heterogeneous censoring is unbalanced censoring; that is, subjects in the treated or untreated arm may be more likely to get censored. While κi already depends on W in the above setting, we make the censoring more unbalanced by also including a main effect of Wi.
Under this formulation, the censoring rate in the untreated arm is much higher than that in the treated arm (60% vs. 30%), which may happen, e.g., if patients realize they are on the inactive treatment and drop out early.
[Figure 21.1: three panels for (pR = 1, pτ = 1), (pR = 25, pτ = 1), and (pR = 25, pτ = 25); y-axis RMSE / SD[τ(X)] on a log scale (0.3–3.0); methods labeled by the acronyms of Table 21.2.]
FIGURE 21.1
Rescaled root mean squared errors of metalearners across various levels of complexity of baseline
risk functions (R) and treatment-covariate interactions (τ ). The function forms (fR and fτ ) vary
between linear (Lin) and nonlinear (Nonlin) across columns, and the numbers of predictors (pR and
pτ) vary across rows. We use three-letter acronyms for each method, wherein the first letter corresponds to the metalearner, the second to the risk model, and the third to the CATE model. For example, XFF is the X-learner with risk models fitted by GRF survival forests and the CATE fitted by GRF random forests.
(γ = 1), most estimators yield smaller RRMSEs; more importantly, metalearners, such as RFF or
XFF, that apply machine learning approaches that are misaligned with the underlying linear risk
(Figure 21.2, fR = Lin, fτ = Lin) or CATE functional forms (Figure 21.2, fR = Nonlin, fτ = Lin)
now perform similarly to the estimators whose predictive models match the true functional forms. Moreover, methods show similar performance in terms of Kendall's τ correlation when γ = 1
and fτ is linear (Figure S5).
[Figure 21.2: two panels for γ = 0 and γ = 1; y-axis RMSE / SD[τ(X)] on a log scale (0.1–10.0); methods labeled as in Figure 21.1.]
FIGURE 21.2
Rescaled root mean squared errors of metalearners under various levels of treatment heterogeneity.
γ = 0 corresponds to zero treatment heterogeneity on the relative scale, and γ = 1 yields larger
heterogeneity than in the base case (Table 21.3). The function forms vary across three combinations
of linear and nonlinear. pR = 25 and pτ = 1. Censoring is modeled using the Kaplan-Meier estimator.
The metalearners are labeled in the same way as in Figure 21.1.
[Figure 21.3: five panels, C ⊥ (X, W) (κ = 4, ρ = 2), C ⊥ (X, W) (κ = 7, ρ = 2), C ⊥ (X, W) (κ = 4, ρ = 1), C ∼ X1 + X2·W, and C ∼ X1 + X2·W + W; y-axis RMSE / SD[τ(X)] on a log scale (0.3–3.0); methods labeled as in Figure 21.1.]
FIGURE 21.3
Rescaled root mean squared errors of metalearners under varied censoring mechanisms. C ⊥ (X, W) and C ∼ (X, W) indicate that censoring is not, and is, a function of covariates and treatment, respectively. κ and ρ are the scale and shape parameters in the hazard function of censoring. Censoring is modeled using the Kaplan-Meier estimator (except for CSF). pR = 1, pτ = 1, fR = Nonlin and fτ = Lin. The metalearners are labeled in the same way as in Figure 21.1.
[Figure 21.4: same five censoring panels as Figure 21.3; y-axis RMSE(GRF) / RMSE(KM) on a log scale (0.5–2.0).]
FIGURE 21.4
Ratios (log scale) of rescaled root mean squared errors of metalearners under varied censoring mechanisms. The contrast is formed between using a random survival forest model versus the Kaplan-Meier estimator for modeling censoring. C ⊥ (X, W) and C ∼ (X, W) indicate that censoring is not, and is, a function of covariates and treatment, respectively. κ and ρ are the scale and shape parameters in the hazard function of censoring. pR = 1, pτ = 1, fR = Nonlin and fτ = Lin. The metalearners are labeled in the same way as in Figure 21.1.
[Figure 21.5: three panels of functional-form combinations; y-axis RMSE / SD[τ(X)] on a log scale (0.3–3.0); methods labeled as in Figure 21.1.]
FIGURE 21.5
Rescaled root mean squared errors of metalearners under unbalanced treatment assignment. Only 8% of subjects are treated. The function forms vary across three combinations of linear and nonlinear. pR = 25 and pτ = 1. Censoring is modeled using the Kaplan-Meier estimator. The metalearners are labeled in the same way as in Figure 21.1.
FIGURE 21.6
Main takeaways from the simulation study. For each listed item on the left, we summarize metalearn-
ers by assigning three types of labels: Required/Yes, Not required/No, and Not applicable, depending
on the specific requirements or recommendations described. “and” indicates that both conditions
need to be satisfied, and “or” means that only one of the two conditions is necessary.
accurate CATE estimates. Meeting this requirement is not easy in general and becomes particularly challenging in situations with unbalanced treatment assignment, with a small number of subjects in one of the arms. The X- and R-learners are less sensitive to unbalanced treatment assignment. In the case of unbalanced treatment assignment, the X-learner performs well as long as the risk model for the arm with the most subjects (typically, the control arm) is well estimated. The R-learner depends on the two risk models only to decrease variance, and so is even more robust than the X-learner.
Second, we discuss the requirements for the CATE model. Estimating CATE functions is a
crucial step for metalearners that directly model treatment effect heterogeneity, including M-, X-, and
R-learners. The general intuition is that CATE functions are often simpler than risk functions, and
we recommend applying parsimonious models to ensure the interpretability of CATE estimates. All
three metalearners can flexibly estimate CATE functions by fitting a separate model, but M-learners
tend to be unstable compared to X- and R-learners.
Third, we discuss the censoring model requirement. All approaches need to account for censoring
one way or the other: S- and T-learners need to account for censoring in the process of fitting the
risk models, while the other metalearners require explicit models for the censoring weights. When
censoring functions are independent of baseline covariates and treatment, the Kaplan-Meier estimator
is appropriate to use. However, the Kaplan-Meier estimator can induce performance degradation when the censoring depends on baseline covariates or treatment. If it is unclear whether censoring is completely independent of treatment and covariates, we suggest using random survival forests to model censoring as a function of relevant predictors, as sketched below; this choice appears to perform well even when the simpler Kaplan-Meier censoring model is correctly specified. When censoring rates are high, we recommend applying the CSF [40] method instead of RFF.
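To make the two censoring-model choices concrete, here is a minimal R sketch of the IPC weights $\hat{K}_i = 1/\hat{S}_C(\min(U_i, t_0) \mid X_i, W_i)$, first with a Kaplan-Meier model and then with a random survival forest; the grid-indexing of the forest's predicted curves is a simplification of this sketch:

library(survival)
library(grf)

# Kaplan-Meier censoring model (censoring indicator is 1 - D), no covariates
km <- survfit(Surv(U, 1 - D) ~ 1)
S_C_km <- stepfun(km$time, c(1, km$surv))              # S_C(t)
K_hat_km <- 1 / S_C_km(pmin(U, t0))

# Random survival forest censoring model, a function of (X, W)
cens_forest <- survival_forest(cbind(X, W), U, 1 - D)
pred <- predict(cens_forest)                           # OOB survival curves on a time grid
grid <- pred$failure.times
col <- pmax(findInterval(pmin(U, t0), grid), 1)        # last grid point <= min(U_i, t0), clamped
S_C_rsf <- pred$predictions[cbind(seq_along(U), col)]
K_hat_rsf <- 1 / S_C_rsf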
We note that S-learners may perform well, though typically not when applied with off-the-shelf predictive models, but only with tailored models [14, 85, 91]. For instance, when used with flexible machine learning approaches such as random forests, S-learners do not give the treatment variable a special role (and so, by default, some trees will not split on treatment assignment, even if HTEs
a special role (and so by default, some trees will not split on treatment assignment, even if HTEs
are strong). When used with regression, analysts have to specify interaction terms, which typically
requires substantial domain expertise [47].
To conclude, we recommend applying R- and X-learners for CATE estimation as strong default choices, as any off-the-shelf machine learning models can be used. In addition, they allow imposing separate structural assumptions on the CATE and risk functions (stratified by treatment assignment), which is an important feature when these two functions are of different complexity. When background information on the control arm risk is available, analysts can apply X-learners with carefully chosen machine learning approaches that match the possible underlying functional form. But if little is known about suitable risk models, we recommend implementing the R-learner with risk models estimated with survival forests (e.g., RFL or CSF) as default approaches, as they are robust to misspecified risk models and provide stable CATE estimation.
In all our analyses we seek to estimate the CATEs, defined as the difference in survival probabili-
ties at the median follow-up time (i.e., 3.26 years). Our outcome is a composite of CVD events and
deaths, which includes nonfatal myocardial infarction (MI), acute coronary syndrome not resulting in MI, nonfatal stroke, acute decompensated congestive heart failure, or CVD-related death. We identified 13 predictors by reviewing prior work [1, 10, 91]: age, sex, Black race, systolic BP, diastolic BP, prior subclinical CVD, subclinical chronic kidney disease (CKD), number of antihypertensives, serum creatinine level, total cholesterol, high-density lipoprotein, triglycerides, and current smoking status. After retaining only subjects with no missing data on any covariate, the SPRINT sample includes 9,206 participants (98.3% of the original cohort).
In addition to the main predictor set (13 predictors) used above, we also consider a second, reduced
predictor set with only two predictors. The first predictor is the 10-year probability of developing a
major ASCVD event (“ASCVD risk”) predicted from Pooled Cohort Equations (PCE) [1], which
may be computed as a function of a subset of the aforementioned 13 predictors. Including the PCE
score as a predictor is justified by the “PATH risk modeling” framework (Subsection 21.1.2 and [47])
according to which absolute treatment effects are expected to be larger for larger values of the
PCE score. Such evidence was provided for the SPRINT trial by [94], who conducted subgroup
analyses with subgroups stratified by quartiles of PCE scores. The second predictor is one of the 13
original predictors, namely the binary indicator of subclinical CKD. A subgroup analysis by CKD
was pre-specified in the SPRINT RCT and considered, e.g., in [86, 87].
Beyond the reanalysis of SPRINT, we also apply some of our analyses to the ACCORD-BP trial [11], which was conducted at 77 clinical sites across the U.S. and Canada. ACCORD-BP also evaluated the effectiveness of intensive BP control, as in SPRINT, but one major difference is that all subjects in ACCORD-BP have type 2 diabetes mellitus. Moreover, ACCORD-BP participants are slightly younger (mean age 62.2 years) than SPRINT subjects (mean age 67.9 years) and are followed for a longer time on average (mean follow-up 3.3 years for SPRINT vs. 4.7 years for ACCORD-BP). Our study sample contains patient-level data on 4,711 subjects (99.5% of the original data).
                    Standard                                Intensive
  Survival Forest      Kaplan-Meier         Survival Forest      Kaplan-Meier
 Estimator   RMSE     Estimator   RMSE     Estimator   RMSE     Estimator   RMSE
SPRINT
SF* 0.0005 SF* 0.0005 SF* 0.0003 SF* 0.0003
M*L 0.0008 M*L 0.0007 M*L 0.0021 M*L 0.0017
RFL 0.0011 RFL 0.0015 RLL 0.0024 RFL 0.0022
RLL 0.0021 RLL 0.0112 RFL 0.0027 RLL 0.0022
XLL 0.0085 XFF 0.0120 SL* 0.0090 XLL 0.0088
XFF 0.0114 RFF 0.0123 XFL 0.0094 SL* 0.0090
RFF 0.0118 TF* 0.0139 XLL 0.0127 XFL 0.0116
TF* 0.0139 CSF 0.0145 RFF 0.0136 RFF 0.0137
M*F 0.0141 M*F 0.0145 TF* 0.0141 TF* 0.0141
CSF 0.0145 SL* 0.0161 XFF 0.0146 CSF 0.0151
SL* 0.0161 XFL 0.0179 M*F 0.0147 XFF 0.0152
XFL 0.0176 XLL 0.0205 CSF 0.0151 M*F 0.0152
TL* 0.0351 TL* 0.0351 TL* 0.0280 TL* 0.0280
ACCORD-BP
SF* 0.0010 SF* 0.0010 SF* 0.0013 SF* 0.0013
M*L 0.0103 M*L 0.0104 RLL 0.0026 RLL 0.0027
RFL 0.0106 RFL 0.0108 RFL 0.0034 RFL 0.0038
RLL 0.0121 RLL 0.0167 SL* 0.0047 SL* 0.0047
XLL 0.0132 XFF 0.0177 M*L 0.0062 M*L 0.0065
XFF 0.0178 XLL 0.0189 XLL 0.0144 XLL 0.0157
TF* 0.0204 TF* 0.0204 XFL 0.0283 XFL 0.0305
RFF 0.0212 RFF 0.0213 XFF 0.0329 XFF 0.0320
CSF 0.0220 CSF 0.0220 CSF 0.0336 CSF 0.0336
M*F 0.0253 XFL 0.0223 TL* 0.0338 TL* 0.0338
XFL 0.0265 M*F 0.0245 RFF 0.0347 RFF 0.0346
SL* 0.0291 SL* 0.0291 TF* 0.0367 TF* 0.0367
TL* 0.0296 TL* 0.0296 M*F 0.0438 M*F 0.0421
*The asterisk is part of the method's label; please refer to Table 21.2, where we introduce the abbreviations/labels for the methods we considered.
concordant with the “PATH risk modeling” recommendations [47]. In contrast, the learners that used
GRF for the CATE function have a lower RMSE when the main predictor covariate set is used. A
possible explanation is that the presence of multiple uninformative predictors drives different trees
of the forest to split on different variables; upon averaging across trees the estimated CATEs are
approximately zero (as they should be under the global null).
Figure 21.7 displays the relationship between the estimated CATEs in untreated SPRINT subjects
and the ten-year PCE scores for a subset of the methods that use survival forest IPCW. We see
[Figure 21.7: scatter panels; y-axis Estimated CATEs (−0.2 to 0.2); points by CKD status (No CKD / CKD); panel row shown: Original Covariates.]
FIGURE 21.7
Scatter plot of ten-year PCE scores and estimated CATEs in SPRINT under a global null model. The CATE models were derived using 70% of the untreated patients with an artificial randomized treatment assignment, and then used to make predictions on the remaining 30% test data. The censoring weights are estimated using a survival forest model. The analysis was replicated with the original covariates in SPRINT as the predictors (Row 1) and with the estimated ten-year CVD risk (using the pooled cohort equations) and subclinical CKD as the predictors (Row 2).
that RFL estimates constant CATEs (that is, all CATEs are equal to the estimated ATE) that are nearly zero. Under the global null, this behavior is desirable, as it avoids detection of spurious HTEs, and it showcases the benefit of R-learners in enabling direct modeling of CATEs and imposing, e.g., sparsity assumptions. Other approaches yielded nonzero CATEs with large variations. In particular, when the PCE scores and CKD are used as predictors, the T-learner with Lasso (TL*) shows an increasing trend as a function of the PCE score for subjects both with and without prior subclinical CKD. This illustrates the regularization bias of the T-learner, as pointed out by [8], and explains the large RMSE of TL*.
Figures S8–S14 are analogous to Figure 21.7 and present the other three settings (treated units in
SPRINT, untreated and treated units in ACCORD-BP) for both Kaplan-Meier IPCW and survival
forest IPCW.
rule with a large, positive AUTOC effectively distinguishes patients with greater treatment benefits
from those with lesser treatment benefits by assigning them a high versus low treatment priority, re-
spectively. The Expected Calibration Error for predictors of Treatment Heterogeneity (ECETH) [95]
is another novel metric for quantifying the ℓ2 calibration error of a CATE estimator. The calibration
function of treatment effects is estimated using an AIPW (augmented inverse propensity weighted)
score, which makes the metric robust to overfitting and high-dimensionality issues.
Table 21.5 shows that none of the methods detect significant treatment heterogeneity in SPRINT
or ACCORD-BP. The insignificant results from the RATE evaluation suggest that there may be a
lack of evidence for treatment heterogeneity. In Figure 21.8, the CATEs estimated using RFL also appear independent of the PCE score when the original predictor set is used. But when the PCE score is used as a predictor, the CATE estimates from all methods show an overall increasing trend with the ten-year PCE score, consistent with, e.g., the finding of [94]. Such a trend is less obvious in the external validation results (Figure S15), and all the resulting AUTOCs are nonpositive (under PCE + CKD), which suggests that prioritization rules based on estimated CATEs in ACCORD-BP lead to treatment benefits similar to average treatment effects.
TABLE 21.5
Internal and external validation performance of CATE estimation in SPRINT and ACCORD-BP,
respectively. None of the five metalearners showed significant AUTOCs, suggesting the lack of
treatment heterogeneity of intensive BP therapy. RECETH is the square root of the default calibration
error given by ECETH.
21.7 Discussion
Given the increasing interest in personalized medicine, a number of advanced statistical methods
have been developed for estimating CATEs, often referred to as personalized treatment effects.
We focused on characterizing the empirical performance of metalearners in an RCT setting. An
important direction for future work is to extend our tutorial and benchmark to the observational study
setting where confounding plays a crucial role. In (21.1) we considered treatment effects in terms of
differences in survival probabilities. CATEs can also be defined on a relative scale, such as hazard ratios, or as the difference in restricted mean survival times. The latter is a popular estimand, as it can be measured under any distribution of survival times and has a straightforward interpretation, namely the expected survival time between an index date and a particular time horizon [96]. Further
[Figure 21.8: scatter panels; y-axis Estimated CATEs; points by CKD status (No CKD / CKD); panel row shown: Original Covariates.]
FIGURE 21.8
Scatter plot of ten-year PCE score and estimated CATEs in SPRINT. This figure is analogous to
Figure 21.7, but with the full SPRINT cohort and using the true treatment assignment.
investigations on metalearners targeting such estimands may improve HTE estimation in clinical
settings.
The current work aims to provide guidance on when and how to apply each approach for time-
to-event outcomes in light of the specific characteristics of a dataset. We conducted comprehensive
simulation studies to compare five state-of-the-art metalearners coupled with two predictive modeling
approaches. We designed a spectrum of data generating processes to explore several important factors
of a data structure and summarized their impacts on CATE estimation, as well as the weaknesses and strengths of each CATE estimator. Based on our findings from the simulation study, we highlighted
main takeaways as a list of requirements, recommendations, and considerations for modeling CATEs,
which provides practical guidance on how to identify appropriate CATE approaches for a given
setting, as well as a strategy for designing CATE analyses. Finally, we reanalyzed the SPRINT and
ACCORD-BP studies to demonstrate that some prior findings on heterogeneous effects of intensive
blood pressure therapy are likely to be spurious, and we presented a case study to demonstrate a proper way of estimating CATEs based on these findings. To facilitate the implementation of our
recommendations for all the CATE estimation approaches that we investigated and to enable the
reproduction of our results, we created the package survlearners [9] as an off-the-shelf tool for
estimating heterogeneous treatment effects for survival outcomes.
Acknowledgments
This work was supported by R01 HL144555 from the National Heart, Lung, and Blood Institute
(NHLBI).
References
[1] David C. Goff, Donald M. Lloyd-Jones, Glen Bennett, Sean Coady, Ralph B. D’Agostino, Ray-
mond Gibbons, Philip Greenland, Daniel T. Lackland, Daniel Levy, Christopher J. O’Donnell,
Jennifer G. Robinson, J. Sanford Schwartz, Susan T. Shero, Sidney C. Smith, Paul Sorlie, Neil J.
476 Treatment Heterogeneity with Survival Outcomes
Stone, and Peter W. F. Wilson. 2013 ACC/AHA guideline on the assessment of cardiovascular
risk. Circulation, 129(25 suppl 2):S49–S73, 2014.
[2] Donna K Arnett, Roger S Blumenthal, Michelle A Albert, Andrew B Buroker, Zachary D
Goldberger, Ellen J Hahn, Cheryl Dennison Himmelfarb, Amit Khera, Donald Lloyd-Jones,
J William McEvoy, et al. 2019 ACC/AHA guideline on the primary prevention of cardiovascular disease: A report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Journal of the American College of Cardiology, 74(10):e177–e232,
2019.
[3] Handrean Soran, Jonathan D Schofield, and Paul N Durrington. Cholesterol, not just cardio-
vascular risk, is important in deciding who should receive statin treatment. European Heart
Journal, 36(43):2975–2983, 2015.
[4] George Thanassoulis, Michael J Pencina, and Allan D Sniderman. The benefit model for
prevention of cardiovascular disease: an opportunity to harmonize guidelines. JAMA Cardiology,
2(11):1175–1176, 2017.
[5] RL Kravitz, N Duan, and J Braslow. Evidence-based medicine, heterogeneity of treatment
effects, and the trouble with averages. The Milbank Quarterly, 82:661–687, 2004.
[6] Rui Wang, Stephen Lagakos, James Ware, David Hunter, and Jeffrey Drazen. Statistics in
medicine — reporting of subgroup analyses in clinical trials. The New England Journal of
Medicine, 357:2189–94, 12 2007.
[7] Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. Metalearners for estimating
heterogeneous treatment effects using machine learning. Proceedings of the national academy
of sciences, 116(10):4156–4165, 2019.
[8] Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects.
Biometrika, 108(2):299–319, 2021.
[9] Yizhe Xu, Nikolaos Ignatiadis, Erik Sverdrup, Scott Fleming, Stefan Wager, and Nigam Shah.
survlearners: Metalearners for Survival Data, 2022. R package version 0.0.1.
[10] Jackson Wright, Jeff Williamson, Paul Whelton, Joni Snyder, Kaycee Sink, Michael Rocco,
David Reboussin, Mahboob Rahman, Suzanne Oparil, Cora Lewis, Paul Kimmel, Karen
Johnson, Goff Jr, Lawrence Fine, Jeffrey Cutler, William Cushman, Alfred Cheung, Walter
Ambrosius, Mohammad Sabati, and Kasa Niesner. A randomized trial of intensive versus
standard blood-pressure control. New England Journal of Medicine, 373:2103–2116, 2015.
[11] William Cushman, Gregory Evans, Robert Byington, David Jr, Richard Jr, Jeffrey Cutler, Denise
Simons-Morton, Jan Basile, Marshall Corson, Jeffrey Probstfield, Lois Katz, Kevin Peterson,
William Friedewald, John Buse, Thomas Bigger, Hertzel Gerstein, Faramarz Ismail-Beigi, and
Elias Siraj. Effects of intensive blood-pressure control in type 2 diabetes mellitus. New England
Journal of Medicine, 362:1575–1585, 04 2010.
[12] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113:7353–7360, 2016.
[13] Alberto Caron, Gianluca Baio, and Ioanna Manolopoulou. Estimating individual treatment
effects using non-parametric regression models: A review. Journal of the Royal Statistical
Society: Series A (Statistics in Society), 2022.
Discussion 477
[14] Scott Powers, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam Shah, Trevor Hastie,
and Robert Tibshirani. Some methods for heterogeneous treatment effect estimation in high-
dimensions. Statistics in Medicine, 37, 07 2017.
[15] T. Wendling, K. Jung, A. Callahan, A. Schuler, N. Shah, and Blanca Gallego. Comparing
methods for estimation of heterogeneous treatment effects using observational data from health
care databases. Statistics in Medicine, 37, 06 2018.
[16] Avi Feller, Jared Murray, Spencer Woody, and David Yeager. Assessing treatment effect
variation in observational studies: Results from a data challenge. Observational Studies,
5:21–35, 01 2019.
[17] Yuta Saito, Hayato Sakata, and Kazuhide Nakata. Doubly robust prediction and evaluation
methods improve uplift modeling for observational data. In Proceedings of the 2019 SIAM
International Conference on Data Mining, pages 468–476. SIAM, 2019.
[18] Alicia Curth and Mihaela van der Schaar. Nonparametric estimation of heterogeneous treat-
ment effects: From theory to learning algorithms. In International Conference on Artificial
Intelligence and Statistics, pages 1810–1818. PMLR, 2021.
[19] Alicia Curth, David Svensson, Jim Weatherall, and Mihaela van der Schaar. Really doing
great at estimating CATE? a critical look at ML benchmarking practices in treatment effect
estimation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and
Benchmarks Track, 2021.
[20] Daniel Jacob. CATE meets ML–The Conditional Average Treatment Effect and Machine
learning. arXiv preprint arXiv:2104.09935, 2021.
[21] Michael C Knaus, Michael Lechner, and Anthony Strittmatter. Machine learning estimation
of heterogeneous causal effects: Empirical monte carlo evidence. The Econometrics Journal,
24(1):134–161, 2021.
[22] Yaobin Ling, Pulakesh Upadhyaya, Luyao Chen, Xiaoqian Jiang, and Yejin Kim. Heterogeneous
treatment effect estimation using machine learning for healthcare application: tutorial and
benchmark. arXiv preprint arXiv:2109.12769, 2021.
[23] Andrea A. Naghi and Christian P. Wirths. Finite sample evaluation of causal machine learning
methods: Guidelines for the applied researcher. Tinbergen Institute Discussion Paper TI
2021-090/III, 2021.
[24] Weijia Zhang, Jiuyong Li, and Lin Liu. A unified survey of treatment effect heterogeneity
modelling and uplift modelling. ACM Computing Surveys (CSUR), 54(8):1–36, 2021.
[25] Gabriel Okasa. Meta-Learners for estimation of causal effects: Finite sample cross-fit perfor-
mance. arXiv preprint arXiv:2201.12692, 2022.
[26] Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. The Annals of
Statistics, 47(2):1148–1178, 2019.
[27] P Richard Hahn, Jared S Murray, and Carlos M Carvalho. Bayesian regression tree models for
causal inference: Regularization, confounding, and heterogeneous effects (with discussion).
Bayesian Analysis, 15(3):965–1056, 2020.
[28] JL Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and
Graphical Statistics, 20(1):217–240, 2011.
478 Treatment Heterogeneity with Survival Outcomes
[29] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint
arXiv:1901.09036, 2019.
[30] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects
using random forests. Journal of the American Statistical Association, 113(523):1228–1242,
2018.
[31] Jiabei Yang and Jon Steingrimsson. Causal interaction trees: Finding subgroups with heteroge-
neous treatment effects in observational data. Biometrics, 2021.
[32] Zijun Gao and Yanjun Han. Minimax optimal nonparametric estimation of heterogeneous
treatment effects. Advances in Neural Information Processing Systems, 33:21751–21762, 2020.
[33] Edward H Kennedy. Optimal doubly robust estimation of heterogeneous causal effects. arXiv
preprint arXiv:2004.14497, 2020.
[34] Edward H Kennedy, Sivaraman Balakrishnan, and Larry Wasserman. Minimax rates for
heterogeneous causal effect estimation. arXiv preprint arXiv:2203.00837, 2022.
[35] Szymon Jaroszewicz and Piotr Rzepakowski. Uplift modeling with survival data. In ACM
SIGKDD Workshop on Health Informatics (HI-KDD–14), New York City, 2014.
[36] Weijia Zhang, Thuc Duy Le, Lin Liu, Zhi-Hua Zhou, and Jiuyong Li. Mining heterogeneous
causal effects for personalized cancer treatment. Bioinformatics, 33(15):2372–2378, 2017.
[37] Nicholas C Henderson, Thomas A Louis, Gary L Rosner, and Ravi Varadhan. Individualized
treatment effects with censored data via fully nonparametric bayesian accelerated failure time
models. Biostatistics, 21(1):50–68, 2020.
[38] Sami Tabib and Denis Larocque. Non-parametric individual treatment effect estimation for
survival data with random forests. Bioinformatics, 36(2):629–636, 2020.
[39] Jie Zhu and Blanca Gallego. Targeted estimation of heterogeneous treatment effect in observa-
tional survival analysis. Journal of Biomedical Informatics, 107:103474, 2020.
[40] Yifan Cui, Michael R Kosorok, Erik Sverdrup, Stefan Wager, and Ruoqing Zhu. Estimating
heterogeneous treatment effects with right-censored data via causal survival forests. arXiv
preprint arXiv:2001.09887, 2022.
[41] Paidamoyo Chapfuwa, Serge Assaad, Shuxi Zeng, Michael J Pencina, Lawrence Carin, and
Ricardo Henao. Enabling counterfactual survival analysis with balanced representations. In
Proceedings of the Conference on Health, Inference, and Learning, pages 133–145, 2021.
[42] Alicia Curth, Changhee Lee, and Mihaela van der Schaar. SurvITE: Learning heterogeneous
treatment effects from time-to-event data. Advances in Neural Information Processing Systems,
34, 2021.
[43] Liangyuan Hu, Jung-Yi Lin, Keith Sigel, and Minal Kale. Estimating heterogeneous survival
treatment effects of lung cancer screening approaches: A causal machine learning analysis.
Annals of Epidemiology, 62:36–42, 2021.
[44] Liangyuan Hu, Jiayi Ji, and Fan Li. Estimating heterogeneous survival treatment effect in
observational data using machine learning. Statistics in Medicine, 40(21):4691–4713, 2021.
[45] Tony Duan, Pranav Rajpurkar, Dillon Laird, Andrew Y Ng, and Sanjay Basu. Clinical value
of predicting individual treatment effects for intensive blood pressure therapy: a machine
learning experiment to estimate treatment effects from randomized trial data. Circulation:
Cardiovascular Quality and Outcomes, 12(3):e005010, 2019.
Discussion 479
[46] Jeroen Hoogland, Joanna IntHout, Michail Belias, Maroeska M Rovers, Richard D Riley, Frank
E. Harrell Jr, Karel GM Moons, Thomas PA Debray, and Johannes B Reitsma. A tutorial
on individualized treatment effect prediction from randomized trials with a binary endpoint.
Statistics in Medicine, 40(26):5961–5981, 2021.
[47] David M. Kent, Jessica K. Paulus, David van Klaveren, Ralph B. D’Agostino, Steve Goodman,
Rodney A. Hayward, John P. A. Ioannidis, Bray Patrick-Lake, Sally C. Morton, Michael J.
Pencina, Gowri Raman, Joseph Ross, Harry P. Selker, Ravi Varadhan, Andrew Julian Vickers,
John B. Wong, and Ewout Willem Steyerberg. The predictive approaches to treatment effect
heterogeneity (PATH) statement. Annals of Internal Medicine, 172(1):35–45, 2020.
[48] J. Neyman. Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. Roczniki Nauk Rolniczych, 10, 1923.
[49] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of Educational Psychology, 66(5):688, 1974.
[50] Paul Rosenbaum and Donald Rubin. The central role of the propensity score in observational
studies for causal effects. Biometrika, 70:41–55, 04 1983.
[51] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer New York,
2009.
[52] Hans van Houwelingen and Hein Putter. Dynamic prediction in clinical survival analysis. CRC
Press, 2011.
[53] Frank E Harrell. Regression modeling strategies: with applications to linear models, logistic
and ordinal regression, and survival analysis, volume 3. Springer, 2015.
[54] Gary S Collins, Johannes B Reitsma, Douglas G Altman, and Karel GM Moons. Transparent
reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD):
the TRIPOD statement. Journal of British Surgery, 102(3):148–158, 2015.
[55] Hemant Ishwaran, Udaya B Kogalur, Eugene H Blackstone, and Michael S Lauer. Random
survival forests. The Annals of Applied Statistics, 2(3):841–860, 2008.
[56] L Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[57] Yi Lin and Yongho Jeon. Random forests and adaptive nearest neighbors. Journal of the
American Statistical Association, 101(474):578–590, 2006.
[58] Julie Tibshirani, Susan Athey, Erik Sverdrup, and Stefan Wager. grf: Generalized Random
Forests, 2022. R package version 2.1.0.
[59] Robert Tibshirani. The lasso method for variable selection in the Cox model. Statistics in
Medicine, 16(4):385–395, 1997.
[60] Jelle J Goeman. L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52(1):70–84, 2010.
[61] David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society:
Series B (Methodological), 34(2):187–202, 1972.
[62] Hans C Van Houwelingen, Tako Bruinsma, Augustinus AM Hart, Laura J Van’t Veer, and
Lodewyk FA Wessels. Cross-validated Cox regression on microarray gene expression data.
Statistics in Medicine, 25(18):3201–3216, 2006.
480 Treatment Heterogeneity with Survival Outcomes
[63] Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for
Cox’s proportional hazards model via coordinate descent. Journal of Statistical Software,
39(5):1, 2011.
[64] Michael Kohler, Kinga Máthé, and Márta Pintér. Prediction from randomly right censored data.
Journal of Multivariate Analysis, 80(1):73–100, 2002.
[65] Mark J Van der Laan and James M Robins. Unified methods for censored longitudinal data
and causality, volume 5. Springer, 2003.
[66] Humza Haider, Bret Hoehn, Sarah Davis, and Russell Greiner. Effective ways to build and
evaluate individual survival distributions. Journal of Machine Learning Research, 21(85):1–63,
2020.
[80] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pages 785–794, 2016.
[81] Aaron Baum, Joseph Scarpa, Emilie Bruzelius, Ronald Tamler, Sanjay Basu, and James Fagh-
mous. Targeting weight loss interventions to reduce cardiovascular complications of type 2
diabetes: a machine learning-based post-hoc analysis of heterogeneous treatment effects in the
look ahead trial. The Lancet Diabetes and Endocrinology, 5(10):808–815, 2017.
[82] Ann Lazar, Bernard Cole, Marco Bonetti, and Richard Gelber. Evaluation of treatment-effect
heterogeneity using biomarkers measured on a continuous scale: Subpopulation treatment effect
pattern plot. Journal of Clinical Oncology: Official Journal of the American Society of Clinical
Oncology, 28:4539–44, 2010.
[83] Adam Bress, Tom Greene, Catherine Derington, Jincheng Shen, Yizhe Xu, Yiyi Zhang, Jian
Ying, Brandon Bellows, William Cushman, Paul Whelton, Nicholas Pajewski, David Reboussin,
Srinivasan Beddu, Rachel Hess, Jennifer Herrick, Zugui Zhang, Paul Kolm, Robert Yeh, Sanjay
Basu, and Andrew Moran. Patient selection for intensive blood pressure management based on
benefit and adverse events. Journal of the American College of Cardiology, 77:1977–1990, 04
2021.
[84] David Kent, Jason Nelson, Peter Rothwell, Douglas Altman, and Rodney Hayward. Risk and
treatment effect heterogeneity: Re-analysis of individual participant data from 32 large clinical
trials. International Journal of Epidemiology, 45:dyw118, 07 2016.
[85] Kosuke Imai and Marc Ratkovic. Estimating treatment effect heterogeneity in randomized
program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013.
[86] Alfred K. Cheung, Mahboob Rahman, David M. Reboussin, Timothy E. Craven, Tom Greene,
Paul L. Kimmel, William C. Cushman, Amret T. Hawfield, Karen C. Johnson, Cora E. Lewis,
Suzanne Oparil, Michael V. Rocco, Kaycee M. Sink, Paul K. Whelton, Jackson T. Wright, Jan
Basile, Srinivasan Beddhu, Udayan Bhatt, Tara I. Chang, Glenn M. Chertow, Michel Chonchol,
Barry I. Freedman, William Haley, Joachim H. Ix, Lois A. Katz, Anthony A. Killeen, Vasilios
Papademetriou, Ana C. Ricardo, Karen Servilla, Barry Wall, Dawn Wolfgram, and Jerry Yee.
Effects of intensive BP control in CKD. Journal of the American Society of Nephrology,
28(9):2812–2823, 2017.
[87] Ara H Rostomian, Maxine C Tang, Jonathan Soverow, and Daniel R Sanchez. Heterogeneity of treatment effect in SPRINT by age and baseline comorbidities: The greatest impact of intensive blood pressure treatment is observed among younger patients without CKD or CVD and in older patients with CKD or CVD. The Journal of Clinical Hypertension, 22(9):1723–1726, 2020.
[88] Joseph Scarpa, Emilie Bruzelius, Patrick Doupe, Matthew Le, James Faghmous, and Aaron
Baum. Assessment of risk of harm associated with intensive blood pressure management among
patients with hypertension who smoke: a secondary analysis of the systolic blood pressure
intervention trial. JAMA network open, 2(3):e190005–e190005, 2019.
[89] Yaqian Wu, Jianling Bai, Mingzhi Zhang, Fang Shao, Honggang Yi, Dongfang You, and Yang
Zhao. Heterogeneity of treatment effects for intensive blood pressure therapy by individual
components of FRS: An unsupervised data-driven subgroup analysis in SPRINT and ACCORD.
Frontiers in cardiovascular medicine, 9, 2022.
482 Treatment Heterogeneity with Survival Outcomes
[90] Krishna K Patel, Suzanne V Arnold, Paul S Chan, Yuanyuan Tang, Yashashwi Pokharel, Philip G
Jones, and John A Spertus. Personalizing the intensity of blood pressure control: modeling the
heterogeneity of risks and benefits from SPRINT (Systolic Blood Pressure Intervention Trial).
Circulation: Cardiovascular Quality and Outcomes, 10(4):e003624, 2017.
[91] Sanjay Basu, Jeremy B Sussman, Joseph Rigdon, Lauren Steimle, Brian T Denton, and Rod-
ney A Hayward. Benefit and harm of intensive blood pressure treatment: derivation and
validation of risk models using data from the SPRINT and ACCORD trials. PLoS Medicine,
14(10):e1002410, 2017.
[92] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
[93] Steve Yadlowsky, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. Evaluating
treatment prioritization rules via rank-weighted average treatment effects. arXiv preprint
arXiv:2111.07966, 2021.
[94] Robert A Phillips, Jiaqiong Xu, Leif E Peterson, Ryan M Arnold, Joseph A Diamond, and
Adam E Schussheim. Impact of cardiovascular risk on the relative benefit and harm of intensive
treatment of hypertension. Journal of the American College of Cardiology, 71(15):1601–1610,
2018.
[95] Yizhe Xu and Steve Yadlowsky. Calibration error for heterogeneous treatment effects. In
International Conference on Artificial Intelligence and Statistics, pages 9280–9303. PMLR,
2022.
[96] Lu Tian, Hua Jin, Hajime Uno, Ying Lu, Bo Huang, Keaven Anderson, and Leejen Wei. On the
empirical choice of the time window for restricted mean survival time. Biometrics, 76, 02 2020.
22
Why Machine Learning Cannot Ignore Maximum
Likelihood Estimation
CONTENTS
22.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
22.2 Nonparametric MLE and Sieve MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
22.2.1 Nonparametric MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
22.2.2 Sieve MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
22.2.3 Score equations solved by sieve MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
22.3 Special Sieve MLE: HAL-MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
22.3.1 Definition of HAL-MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
22.3.2 Score equations solved by HAL-MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
22.4 Statistical Properties of the HAL-MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
22.4.1 Rate of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
22.4.2 Asymptotic efficiency for smooth target features of target function . . . . . . . . 488
22.4.3 Global asymptotic efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
22.4.4 Nonparametric bootstrap for inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
22.4.5 Higher-order optimal for smooth target features . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
22.5 Contrasting HAL-MLE with Other Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
22.5.1 HAL-MLE vs. other sieve MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
22.5.2 HAL-MLE vs. general machine learning algorithms . . . . . . . . . . . . . . . . . . . . . . . 491
22.6 Considerations for Implementing HAL-MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
22.6.1 HAL-MLE provides interpretable machine learning . . . . . . . . . . . . . . . . . . . . . . . . 491
22.6.2 Estimating the target function itself . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
22.6.3 Using the HAL-MLE for smooth feature inference . . . . . . . . . . . . . . . . . . . . . . . . 493
22.7 HAL and TMLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
22.7.1 Targeting a class of target features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
22.7.2 Score equation preserving TMLE update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
22.8 Considerations for Implementing HAL-TMLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
22.8.1 Double robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
22.8.2 HAL for treatment and censoring mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
22.8.3 Collaborative TMLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
22.9 Closing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
22.1 Introduction
The statistics literature proposed the familiar maximum likelihood estimators (MLEs) for parametric models and established that these estimators are asymptotically linear, such that their $\sqrt{n}$-scaled and mean-zero-centered versions are asymptotically normal [1]. This allowed the construction of
confidence intervals and formal hypothesis testing. However, due to the restrictive form of these
models, these parametric model-based MLEs target a projection of the true density on the parametric
model that is hard to interpret. In response to this concern with misspecified parametric models, a rich
statistical literature has studied so-called sieve-based or sieve MLEs involving specifying a sequence
of parametric models that grow toward the desired true statistical model [2, 3]. Such a sieve will rely
on a tuning parameter that will then need to be selected with some method such as cross-validation.
Simultaneously, and to a large degree independently from the statistics literature, the computer
science literature developed a rich set of tools for constructing data-adaptive estimators of functional
parameters, such as a density of the data, although this literature mostly focused on learning prediction
functions. This has resulted in a wealth of machine learning algorithms using a variety of strategies
to learn the target function from the data. In addition, a more recent literature in statistics established
super learning based on cross-validation as a general approach to optimally combine a library of such
candidate machine learning algorithms into an ensemble that performs asymptotically as well as an
oracle ensemble under specified conditions [4, 5]. Although the latter framework is optimal for fast
learning of the target function, it lacks formal statistical inference for smooth features of the target
function and for the target function itself.
In this chapter, we will argue that, in order to preserve statistical inference, we should prefer sieve MLEs as much as possible. If the target function is not a data density, one can, more generally, use minimum loss estimation with an appropriate loss function. For example, if the goal is to learn the conditional mean of an outcome, one could use the squared-error loss and minimum least squares estimation. Therefore, in this chapter, the abbreviation MLE also represents the more general minimum loss estimator.
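To fix ideas, with data $O = (X, Y)$ and target function $Q_0(x) = E(Y \mid X = x)$, one may take

$$ L(Q)(O) = (Y - Q(X))^2, \qquad Q_0 = \arg\min_{Q} P_0 L(Q), $$

so that minimizing the empirical risk $P_n L(Q)$ is exactly least squares.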
We will demonstrate the power of sieve MLEs with a particular sieve MLE that also relies
on the least absolute shrinkage and selection operator (LASSO) [38] and is termed the highly
adaptive LASSO minimum loss estimator (HAL-MLE) [7, 8]. We will argue that the HAL-MLE is
a particularly powerful sieve MLE, generally theoretically superior to other types of sieve MLEs.
Moreover, for obtaining statistical inference for estimands that aim to equal or approximate a
causal quantity defined in an underlying causal model, we discuss the combination of HAL-MLE
with targeted maximum likelihood estimators (TMLEs) [9–11] in HAL-TMLEs as well as using
undersmoothed HAL-MLE by itself as a powerful statistical approach [8, 10, 12, 13]. HALs can be
used for the initial estimator in TMLEs as well as for nuisance functions representing the treatment
and missingness/censoring mechanism. We will also clarify why many machine learning algorithms
are not suited for statistical inference by having deviated from sieve MLEs. This chapter is a compact
summary of recent and ongoing research; further background and references can be found in the
literature cited. Sections 22.2–22.4 and 22.7 are more technical in nature, introducing core definitions
and properties. Some readers may be interested in jumping to the material in Sections 22.5, 22.6, and
22.8 for technical but narrative discussions on contrasting HAL-MLE with other estimators as well
as implementation of HAL-MLE and HAL-TMLE.
However, for realistic statistical models M, the parameter space is generally infinite dimensional and so flexible that the minimizer of the empirical risk will overfit the data, resulting in an ill-defined Q̂(Pn) or a statistically poor estimator.
A key property of a sieve MLE is that it solves score equations. That is, one can construct a family of one-dimensional paths $\{Q^h_{n,\lambda,\delta} : \delta\} \subset \mathcal{Q}$ through $Q_{n,\lambda}$ at $\delta = \delta_0 \equiv 0$, indexed by a direction $h \in \mathcal{H}$, and we have
$$ 0 = \frac{d}{d\delta_0} P_n L(Q^h_{n,\lambda,\delta_0}) \quad \text{for all paths } h \in \mathcal{H}. $$
We can define a path-specific score $A_{Q_{n,\lambda}}(h) \equiv \frac{d}{d\delta_0} L(Q^h_{n,\lambda,\delta_0})$ for all $h$. This mapping $A_{Q_{n,\lambda}}(h)$ will generally be a linear operator in $h$ and can therefore be extended to a linear operator on a Hilbert space generated or spanned by these directions $\mathcal{H}$. One will then also have that $P_n A_{Q_{n,\lambda}}(h) = 0$ for any $h$ in the linear span of $\mathcal{H}$, thereby obtaining that $Q_{n,\lambda}$ solves a space of scores. For scores $S_{Q_{n,\lambda}}$ that can be approximated by scores $A_{Q_{n,\lambda}}(h)$ for certain $h$, this might provide the basis for showing that $P_n S_{Q_{n,\lambda}} = o_P(n^{-1/2})$ is solved up to the desired approximation. In particular, one can apply this class of score equations at $\lambda = \lambda_{n,\mathrm{cv}}$ to obtain the score equations solved by the cross-validated sieve MLE $Q_{n,\lambda_{n,\mathrm{cv}}}$.
This representation shows that any cadlag function can be represented as a linear combination of indicator basis functions $x \mapsto I_{u_s \le x_s}$, indexed by knot points $u_s$, across all subsets $s$ of $\{1, \ldots, d\}$, with coefficients given by $Q(du_s, 0_{-s})$. In particular, if $Q$ is such that all its sections are like discrete cumulative distribution functions, then this representation of $Q$ is literally a finite linear combination of these indicator basis functions. We note that each basis function $I_{u_s \le x_s}$ is a tensor product $\prod_{j \in s} I(u_j \le x_j)$ of univariate indicator basis functions $I(u_j \le x_j)$, functions that jump from 0 to 1 at knot point $u_j$. Moreover, the $L_1$-norm of the coefficients in this representation is the so-called sectional variation norm:
$$ \|Q\|_v^* = |Q(0)| + \sum_{s \subset \{1, \ldots, d\}} \int |Q(du_s, 0_{-s})|. $$
For $d = 1$ this is simply $|Q(0)|$ plus the total variation of $Q$.
and resulting $Q_{n,\lambda} = Q_{\beta_{n,\lambda}}$. As above, λ is then selected with the cross-validation selector $\lambda_{n,\mathrm{cv}}$. This can generally be implemented with LASSO software such as glmnet in R [19]. Additionally, the hal9001 package implements HAL for linear, logistic, Cox, and Poisson regression, which also yields HAL estimators of conditional hazards and intensities [20].
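A minimal sketch of such a fit with the hal9001 package (simulated data; the L1-norm tuning parameter is selected internally by cross-validation):

library(hal9001)
set.seed(1)
n <- 500; d <- 3
X <- matrix(runif(n * d), n, d)
y <- sin(4 * X[, 1]) + X[, 2] * X[, 3] + rnorm(n, sd = 0.3)
# HAL-MLE for the conditional mean (squared-error loss)
hal_fit <- fit_hal(X = X, Y = y, family = "gaussian")
y_hat <- predict(hal_fit, new_data = X)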
It can be verified that now $\|\beta^h_\delta\|_1 = \|\beta\|_1$ for δ in a neighborhood of 0. Therefore, we know that the HAL-MLE solves the score equation for each of these paths. This shows that the HAL-MLE solves a class of score equations. Moreover, this result can be used to prove that
$$ \frac{d}{d\beta_{n,\lambda}(j)} P_n L(Q_{\beta_{n,\lambda}}) = O_P(n^{-2/3}) $$
for all $j$ for which $\beta_{n,\lambda}(j) \neq 0$. That is, the $L_1$-norm constrained HAL-MLE also solves the unconstrained scores solved by the MLE over the finite linear model $\{Q_\beta : \beta(j) = 0 \text{ if } \beta_{n,\lambda}(j) = 0\}$ implied by the fit $\beta_{n,\lambda}$. By selecting the $L_1$-norm λ to be larger, this set of score equations approximates any score, thereby establishing that the HAL-MLE, if slightly undersmoothed, will solve score equations up to $O_P(n^{-2/3})$ uniformly over all scores. For further details we refer to prior work [13].
This capability of the HAL-MLE to solve all score equations, even uniformly over a class that
will contain any desired efficient influence curve of a target parameter, provides the fundamental
basis for establishing its remarkably strong asymptotic statistical performance in estimation of
smooth features of Q0 as well as of Q0 itself.
we have weak convergence of $n^{1/2}(Q_{n,h} - Q_{0,h})$ to a Gaussian process. That is, the undersmoothed HAL-MLE is an efficient estimator of the kernel-smoothed functional of $Q_0$, for any $h$.
This may make one wonder whether $Q_n(x)$ itself is asymptotically normally distributed as well. While this is not a currently solved problem, we conjecture that, indeed, $(Q_n(x) - Q_0(x))/\sigma_n(x)$ converges weakly to a normal distribution, where the rate of convergence may be as good as $n^{-1/3}(\log n)^{d/2}$. If this result holds, then the HAL-MLE also allows formal statistical inference for the function itself!
This demonstrates that solving score equations has enormous implications for first- and higher-order behavior of a plug-in estimator. Typical machine learning algorithms are generally not tailored to solve score equations and, thereby, will not be able to achieve such statistical performance for their plug-in estimators. In fact, most machine learning algorithms are not grounded in any asymptotic limit distribution theory.
The simple answer is that these sieve MLEs generally do not have the (essentially) dimension-free/smoothness-free rate of convergence $n^{-1/3}(\log n)^{d/2}$; instead, their rates of convergence rely heavily on assumed smoothness. HAL-MLE uses a global complexity property, the sectional variation norm, rather than relying on local smoothness. The global bound on the sectional variation norm provides a class of cadlag functions $\mathcal{F}$ that has a remarkably nice covering number, $\log N(\varepsilon, \mathcal{F}, L_2) \lesssim \varepsilon^{-1}(\log(1/\varepsilon))^{2(d-1)}$, hardly depending on the dimension $d$. Due to this covering number, the HAL-MLE has this powerful rate of convergence, only assuming that the true target function is a cadlag function with finite sectional variation norm.
A related advantage of the special HAL sieve is that the union of all indicator basis functions is a
small Donsker class, even though it is able to span any cadlag function. Most sets of basis functions
include “high frequency” type basis functions that have a variation norm approximating infinity. As
a consequence, these basis functions do not form a nice Donsker class. In particular, this implies
that the HAL-MLE does not overfit, as long as the sectional variation norm is controlled. The fact
that HAL-MLE itself is situated in this Donsker class also means that the efficient influence curves
at HAL-MLE fits will fall in a similar size Donsker class. As a consequence, the Donsker class
condition for asymptotic efficiency of plug-in MLE and TMLE is naturally satisfied when using
the HAL-MLE, while other sieve-based estimators easily cause a violation of the Donsker class
condition.
This same powerful property of the Donsker class spanned by these indicator basis functions also allows one to prove that the nonparametric bootstrap works for the HAL-MLE, while the nonparametric bootstrap generally fails to be consistent for machine learning algorithms.
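Continuing the hal9001 sketch above, a nonparametric bootstrap for a smooth (here, linear) feature of the target function might look as follows (B and the evaluation grid are illustrative choices):

X0 <- matrix(runif(50 * d), 50, d)     # fixed evaluation points
B <- 200
boot_est <- replicate(B, {
  i <- sample.int(n, replace = TRUE)   # resample rows with replacement
  fit_b <- fit_hal(X = X[i, ], Y = y[i], family = "gaussian")
  mean(predict(fit_b, new_data = X0))  # average of Q over the fixed grid
})
quantile(boot_est, c(0.025, 0.975))    # percentile bootstrap interval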
Another appealing feature of the HAL sieve is that it is only indexed by the $L_1$-norm, while many sieve MLEs rely on a precise specification of a sequence of parametric models that grow in dimension. It should be expected that the choice of this sequence can have a real impact in practice. HAL-MLE does not rely on an ordering of basis functions; rather, it relies only on a complexity measure. For each value of the complexity measure, it includes all basis functions and represents an infinite-dimensional class of functions rich enough to approximate any function with complexity satisfying this bound. A typical sieve MLE can only approximate the true target function, while the HAL-MLE includes the true target function once the sectional variation norm bound exceeds the sectional variation norm of the true target function.
As mentioned, the HAL-MLE solves the regular score equations from the data-adaptively selected HAL model at rate $O_P(n^{-2/3})$. As a consequence, the HAL-MLE is able to uniformly solve the class of all score equations, restricted only by some sectional variation norm bound, where this bound can go to infinity as sample size increases. This strong capability for solving score equations appears to be unique to HAL-MLE relative to other sieve-based estimators. It may be mostly due to its actually being an MLE over an infinite function class (for a particular variation norm bound). We also note that a parametric model-based sieve MLE would be forced to select a small dimension to avoid overfitting. However, the HAL-MLE adaptively selects such a model among all possible basis functions, and the dimension of this data-adaptively selected model will generally be larger. The latter is because the HAL-MLE is only an $L_1$-regularized MLE for the adaptively selected parametric model and thereby does not overfit the score equations, whereas a typical sieve MLE represents a full MLE for the selected model.
By solving many more score equations approximately, the HAL-MLE can span a much larger space of scores than a sieve MLE that solves far fewer score equations perfectly.
Many machine learning algorithms fail to be an MLE over any subspace of the parameter space.
Such algorithms will have poor performance in solving score equations. As a consequence, they
will not result in asymptotically linear plug-in estimators and will generally be overly biased
and nonrobust.
The HAL-MLE can also be implemented with higher-order spline basis functions, enforcing that the fitted function have a certain smoothness. These higher-order spline HAL-MLEs have the same statistical properties as reviewed above, as long as the true function satisfies the enforced level of smoothness. Such smooth HAL-MLEs can be expected to result in even sparser fits; e.g., a smooth monotone function can be fitted with a few first-order splines (piecewise linear), while it needs relatively many knot points when fitted with a piecewise constant function. Either way, the HAL-MLE is a closed-form
object that can be evaluated and is thus interpretable. Therefore, the HAL-MLE has the potential to
play a key role in interpretable machine learning.
HAL-MLE requires selecting the set of knot points and, thereby, the collection of spline basis functions. The largest set of knot points we have recommended (and which suffices for obtaining the nonparametric rate of convergence $n^{-1/3}(\log n)^{d/2}$) is given by $\{X_{s,i} : i = 1, \ldots, n,\ s \subset \{1, \ldots, d\}\}$, which corresponds (for continuous covariates) to $N = n(2^d - 1)$ basis functions. The LASSO-based fit will then select a relatively small subset ($\ll n$) of this user-supplied collection.
Rather than selecting this full set of basis functions, one can incorporate model assumptions. This could include only selecting up to two-way tensor products; ranking the basis functions by their sparsity (i.e., the proportion of 1's among the $n$ observations) and selecting the top $k$; or specifying a specific additive model using standard glm-formula notation, such as $f(x_1, x_2, x_3) = f_1(x_1) + f_2(x_2, x_3) + f_3(x_3)$, and selecting the knot points accordingly. In particular, one could use a glmnet fit with main terms and standard interactions to decide on this type of additive model. In addition, one can specify that the coefficients of a certain set of basis functions should be nonnegative and others non-positive, thereby enforcing monotonicity of functions in the additive model. Finally, one can select among zero-order and, more generally, $k$-th order spline basis functions, thereby specifying a desired smoothness of the HAL-MLE.
To reduce the computational burden of implementing HAL-MLE, one can randomly subset from a large set of basis functions or subset in a deterministic manner. For example, for a continuous covariate $X_j$, rather than using all knot points $X_{j,i}$, $i = 1, \ldots, n$, one selects only five knot points, namely five quantiles of the $n$ observations $X_{j,i}$. In this manner, the number of basis functions per covariate no longer grows linearly in $n$ but is fixed at five. Similarly, the above restriction of the degree of the tensor products to 2 reduces the $2^d$ factor to around $d^2$. Together, such choices reduce the total number of basis functions from $n(2^d - 1)$ to around $5d^2$.
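As a concrete illustration, here is a minimal sketch of such a reduced-basis fit using the hal9001 package [20]. The simulated data are our own, and the arguments max_degree, num_knots, and smoothness_orders follow recent releases of the package; they should be checked against the installed version.

# A minimal sketch of a reduced-basis HAL-MLE fit with hal9001 [20].
library(hal9001)

set.seed(1)
n <- 500; d <- 4
X <- matrix(runif(n * d), nrow = n)
Y <- sin(3 * X[, 1]) + X[, 2] * X[, 3] + rnorm(n, sd = 0.5)

fit <- fit_hal(
  X = X, Y = Y,
  max_degree = 2,        # tensor products of at most two covariates (~d^2, not 2^d)
  num_knots = 5,         # five quantile-based knots per covariate instead of all n
  smoothness_orders = 1  # first-order (piecewise linear) spline basis
)
preds <- predict(fit, new_data = X)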
In certain cases, one might view some of these restrictions as a specification of the actual
statistical model. For example, the statistical model might assume that Q0 is an additive model
including any one-way, two-way, and three-way function. However, in general, many of these model
choices for the HAL-MLE will be hard to defend based on prior knowledge, although they might
result in a statistically improved estimator relative to using the most nonparametric HAL-MLE.
Therefore, we recommend building a super learner whose library contains a variety of such HAL-
MLE specifications. In addition, by ranking these HAL-MLE fits by their complexity, one could
implement a cross-validation scheme that evaluates estimators from least complex to increasingly complex and stops when the cross-validated risk of the next HAL-MLE drops below the performance of the less complex previous HAL-MLE. In this manner, the resulting super learner avoids having to implement the highly computationally intensive HAL-MLEs except when they are really needed. Because the discrete super learner is asymptotically equivalent to the oracle selector, the resulting discrete super learner will perform as well as the best possible HAL-MLE, thereby also inheriting the rate of convergence of the most nonparametric HAL-MLE.
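To make this concrete, the following sketch (reusing X and Y and the same assumed hal9001 interface as above) builds a discrete super learner over HAL specifications ordered from least to most complex via V-fold cross-validation; for simplicity it minimizes cross-validated risk outright rather than implementing the early-stopping rule just described.

# A sketch of a discrete super learner over HAL-MLE specifications of
# increasing complexity, selected by V-fold cross-validated MSE.
cv_mse_hal <- function(spec, X, Y, V = 5) {
  folds <- sample(rep(1:V, length.out = length(Y)))
  mean(sapply(1:V, function(v) {
    train <- folds != v
    fit <- do.call(hal9001::fit_hal,
                   c(list(X = X[train, , drop = FALSE], Y = Y[train]), spec))
    mean((Y[!train] - predict(fit, new_data = X[!train, , drop = FALSE]))^2)
  }))
}

specs <- list(
  list(max_degree = 1, num_knots = 5),   # least complex
  list(max_degree = 2, num_knots = 5),
  list(max_degree = 2, num_knots = 20)   # most complex
)
risks  <- sapply(specs, cv_mse_hal, X = X, Y = Y)
winner <- specs[[which.min(risks)]]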
Let $Q_n^0$ be a HAL-MLE, possibly a discrete super learner based on a library of different HAL-MLE specifications. Let $(\Psi_t(P_0) : t)$ be our class of target features and let $(D^*_{t,Q,G} : t)$ be the corresponding class of canonical gradients. We can then construct a universal least favorable path $\{Q^0_{n,\epsilon} : \epsilon\}$ through $Q_n^0$ at $\epsilon = 0$, possibly using an initial estimator $G_n$, and resulting MLE $\epsilon_n = \arg\min_\epsilon P_n L(Q^0_{n,\epsilon})$, so that $\sup_t |P_n D^*_{t,Q^0_{n,\epsilon_n},G_n}| = o_P(n^{-1/2})$. Such a TMLE update defines $Q_n^* = Q^0_{n,\epsilon_n}$.
Because $d_0(Q_n^*, Q_0) = O_P(n^{-2/3}(\log n)^d)$, and assuming a HAL-MLE $G_n$ also has $d_{01}(G_n, G_0) = O_P(n^{-2/3}(\log n)^d)$, under a strong positivity assumption the relevant second-order remainder term will be $o_P(n^{-1/2})$.
In addition, because HAL-MLEs fall in the well-understood Donsker class of cadlag functions with a uniform bound on their sectional variation norm [21], it generally also follows that $\{D^*_{t,Q,G} : Q \in \mathcal{Q}, G \in \mathcal{G}\}$ is a Donsker class with the same covering number rate as this cadlag function class.
Therefore, we will also have
$$(P_n - P_0) D^*_{t, Q_n^*, G_n} = (P_n - P_0) D^*_{t, Q_0, G_0} + O_P(n^{-2/3}(\log n)^d),$$
where the rate of the remainder would follow from the finite sample asymptotic equicontinuity results from empirical process theory [21]. As a consequence, the TMLE is asymptotically linear with influence curve $D^*_{t,Q_0,G_0}$.
Therefore, in great generality, we have that TMLEs that use HAL as an initial estimator are asymptotically efficient for the target features targeted by the TMLE, assuming only that the nuisance functions are cadlag with finite sectional variation norm and that a strong positivity assumption holds.
It is not required that the HAL-MLE is undersmoothed for TMLE. The solving of the score equations
is carried out by the TMLE so that the HAL-MLE can be optimized for estimating Q0 and G0 . In
particular, one can now use the discrete super learner discussed above with a library of HAL-MLEs.
This motivates us to generalize the TMLE update step to a TMLE update that is not only solving
the target score equations but also preserves the score equations already solved by the initial
estimator.
In particular, this is a motivation for using universal least favorable paths in the definition of
the TMLE update, because such paths are only maximizing the likelihood in the direction of the
target score equations, thereby not affecting any score equation orthogonal to these target score
equations. However, in addition, one might do the following. We already specified the score equations exactly solved by the HAL-MLE above: one score for each nonzero coefficient, corresponding with a path that keeps the $L_1$-norm constant. Given this specified set of score equations and the linear span $H_n$ of these scores, we could compute the projection of the first-order canonical gradient $D^*_{t,Q,G}$ onto the space $H_n$ and subtract it from $D^*_{t,Q,G}$, resulting in an orthogonalized $\tilde{D}^*_{t,Q,G} = D^*_{t,Q,G} - \Pi(D^*_{t,Q,G} \mid H_n)$.

One now defines the TMLE using the universal least favorable path based on this orthogonalized set of scores $(\tilde{D}^*_{t,Q,G} : t)$ rather than $(D^*_{t,Q,G} : t)$. In this case the TMLE update $Q_n^*$ will still solve the score equations in $H_n$ that were solved by the initial HAL-MLE $Q_n^0$, and, in addition, it will solve the score equations $P_n \tilde{D}^*_{t,Q_n^*,G_n} = 0$. As a consequence, it will also solve $P_n D^*_{t,Q_n^*,G_n} = 0$.
So in this way, we have used the general TMLE to obtain a TMLE that not only targets a new set of
score equations but also preserves the score equations already solved by the initial estimator Q0n .
In future work, we will study, implement, and evaluate this score equation preserving TMLE,
and other variations of such score equation preserving TMLEs. The key message is that
we will have further robustified the TMLE by not only solving the target score equations
and preserving the rate of convergence of the initial estimator, but also preserving the
score equations solved by the initial estimator with all its important additional statistical
benefits.
If the HAL-MLE $G_n$ is undersmoothed and its model is nonparametric, then the TMLE will even be efficient, despite the inconsistency of $Q_n$. Similarly, the inverse probability of treatment and censoring weighted estimator using an undersmoothed HAL-MLE is asymptotically linear in these causal inference problems, and its efficiency is maximized by using an undersmoothed nonparametric HAL-MLE. On the other hand, in such a setting, if one were to use other machine learning algorithms, including a super learner, these same estimators would lose their asymptotic linearity and not even converge at the desired $n^{-1/2}$-rate, failing to provide valid inference.
Therefore, we learn that it is not only beneficial to use HAL-MLE as the initial estimator of $Q_0$ for the reasons mentioned above, but it is also highly beneficial to use a HAL-MLE for $G_n$.
Let's consider the case where it is known that the treatment mechanism depends on only two confounders. By estimating $G_0$ with a model-based HAL-MLE, perhaps undersmoothed, the above arguments show important gains. However, the above arguments also imply that ignoring the model for $G_0$ and using a fully nonparametric (undersmoothed) HAL-MLE can make the estimator even more efficient. Therefore, there is an important selection problem among candidate HAL-MLEs of $G_0$ of varying complexity, ranging from the actual model to a fully nonparametric HAL-MLE. Cross-validation would select the model-based HAL-MLE with high probability and would thus be ignorant of the subtle bias-variance trade-offs at stake.
More generally, selecting among candidate estimators of G0 based on the log-likelihood loss can
be problematic when the positivity assumption is practically violated. For example, this could result
in an estimator Gn that approaches zero, and, as a consequence, results in erratic TMLE updates.
This is typically resolved by truncation, but one still needs to select the truncation level. Similarly,
the adjustment set used by Gn might need to be tailored toward the MSE of the TMLE of the target
estimand using Gn , rather than toward the estimation of G0 . For example, some baseline covariates
might be highly predictive of treatment while not being real confounders (like instrumental variables),
and it is well established that adjustment for such variables can easily increase both variance and
bias [28–31,42]. Therefore, an important feature of causal inference problems is the targeted selection
among candidate estimators of G0 .
One can use a collaborative TMLE (C-TMLE) procedure [33] to select the $L_1$-norm in the HAL-MLE of $Q_0$ and the resulting HAL-MLE of $G_0$ (as well as the truncation level). The latter type of estimator was termed the outcome-adaptive HAL-TMLE [12], which built on prior work on outcome-adaptive LASSOs [34], providing a powerful tool for variable selection in $G_n$, and was shown to have strong practical performance.
22.9 Closing
There is a tendency for the machine learning literature to focus on piecemeal and small extensions of the "flashy" estimator of the moment; this was recently random forests but is currently deep learning. However, the statistical theory and empirical process literature have offered strong guidance on the development of data-adaptive estimators for features of the data distribution that provide inference, while fully utilizing the knowledge of a statistical model.
In particular, for optimal robust (higher-order) estimation of target estimands, we need to solve
specific first- and higher-order canonical gradients of the target estimand. Also, having the functional parameter of the data distribution needed for estimation of the target estimand be a member of a function class with a well-behaved entropy integral (as implied by the covering number of the function class) provides good rates of convergence and allows for bootstrap-based inference. HAL-
MLE and HAL-TMLE appear to satisfy these key fundamental properties. It would be exciting for
the general machine learning literature to build on these areas.
References
[1] Stephen Stigler. The epic story of maximum likelihood. Statistical Science, 22(4):598–620, 2007.
[2] Whitney Newey. Convergence rates and asymptotic normality for series estimators. Journal of
Econometrics, 79(1):147–168, 1997.
[3] Xiaotong Shen. On methods of sieves and penalization. The Annals of Statistics, 25(6):2555–2591, 1997.
[4] Eric Polley, Sherri Rose, and Mark van der Laan. Super learning. In Targeted Learning: Causal
Inference for Observational and Experimental Data, pages 43–66. Springer, 2011.
[5] Mark van der Laan, Eric Polley, and Alan Hubbard. Super learner. Statistical Applications in
Genetics and Molecular Biology, 6(1), 2007.
[6] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[7] David Benkeser and Mark van der Laan. The highly adaptive lasso estimator. In 2016 IEEE
International Conference on Data Science and Advanced Analytics (DSAA), pages 689–696.
IEEE, 2016.
[8] Mark van der Laan. A generally efficient targeted minimum loss based estimator based on the highly adaptive LASSO. The International Journal of Biostatistics, 13(2), 2017. DOI: 10.1515/ijb-2015-0097.
[9] Mark van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational
and Experimental Data. Springer Science & Business Media, 2011.
[10] Mark van der Laan and Sherri Rose. Targeted Learning in Data Science: Causal Inference for
Complex Longitudinal Studies. Springer, 2018.
[11] Mark van der Laan and Daniel Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), 2006.
[12] Cheng Ju, David Benkeser, and Mark van der Laan. Robust inference on the average treatment
effect using the outcome highly adaptive LASSO. Biometrics, 76(1):109–118, 2020.
[13] Mark van der Laan, David Benkeser, and Weixin Cai. Efficient estimation of pathwise differ-
entiable target parameters with the undersmoothed highly adaptive LASSO. arXiv preprint
arXiv:1908.05607, 2019.
[14] Georg Neuhaus. On weak convergence of stochastic processes with multidimensional time
parameter. The Annals of Mathematical Statistics, 42(4):1285–1295, 1971.
[15] Mark van der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection
among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample
oracle inequalities and examples. Technical Report 130, Division of Biostatistics, University of
California, Berkeley, 2003.
[16] Mark van der Laan, Sandrine Dudoit, and Aad van der Vaart. The cross-validated adaptive
epsilon-net estimator. Statistics & Decisions, 24(3):373–395, 2006.
[17] Aad van der Vaart, Sandrine Dudoit, and Mark van der Laan. Oracle inequalities for multi-fold
cross validation. Statistics & Decisions, 24(3):351–371, 2006.
[18] Richard Gill, Mark van der Laan, and Jon Wellner. Inefficient estimators of the bivariate survival
function for three models. Annales de l’IHP Probabilités et statistiques, 31(3):545–597, 1995.
[19] Jerome Friedman, Trevor Hastie, Rob Tibshirani, Balasubramanian Narasimhan, Kenneth Tay,
Noah Simon, and Junyang Qian. Package ‘glmnet’. CRAN R Repository, 2021.
[20] Nima Hejazi, Jeremy Coyle, and Mark van der Laan. hal9001: Scalable highly adaptive lasso
regression in R. Journal of Open Source Software, 5(53):2526, 2020.
[21] Aurélien Bibaut and Mark van der Laan. Fast rates for empirical risk minimization over cadlag
functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244, 2019.
[22] Weixin Cai and Mark van der Laan. Nonparametric bootstrap inference for the targeted highly adaptive least absolute shrinkage and selection operator (LASSO) estimator. The International Journal of Biostatistics, 16(2), 2020. DOI: 10.1515/ijb-2017-0070.
[23] Mark van der Laan, Zeyi Wang, and Lars van der Laan. Higher order targeted maximum
likelihood estimation. arXiv preprint arXiv:2101.06290, 2021.
[24] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions
and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
[25] Judea Pearl. Causality. Cambridge University Press, 2009.
[26] Maya Petersen and Mark van der Laan. Causal models and learning from data: integrating
causal modeling and statistical estimation. Epidemiology, 25(3):418, 2014.
[27] Megan Schuler and Sherri Rose. Targeted maximum likelihood estimation for causal inference
in observational studies. American Journal of Epidemiology, 185(1):65–73, 2017.
[28] Sander Greenland. Invited commentary: variable selection versus shrinkage in the control of
multiple confounders. American Journal of Epidemiology, 167(5):523–529, 2008.
[29] Jessica Myers, Jeremy Rassen, Joshua Gagne, Krista Huybrechts, Sebastian Schneeweiss,
Kenneth Rothman, Marshall Joffe, and Robert Glynn. Effects of adjusting for instrumen-
tal variables on bias and precision of effect estimates. American Journal of Epidemiology,
174(11):1213–1222, 2011.
[30] Andrea Rotnitzky, Lingling Li, and Xiaochun Li. A note on overadjustment in inverse probabil-
ity weighted estimation. Biometrika, 97(4):997–1001, 2010.
[31] Enrique Schisterman, Stephen Cole, and Robert Platt. Overadjustment bias and unnecessary
adjustment in epidemiologic studies. Epidemiology, 20(4):488, 2009.
[32] Sebastian Schneeweiss, Jeremy Rassen, Robert Glynn, Jerry Avorn, Helen Mogun, and M. Alan
Brookhart. High-dimensional propensity score adjustment in studies of treatment effects using
health care claims data. Epidemiology, 20(4):512, 2009.
[33] Mark van der Laan, Antoine Chambaz, and Cheng Ju. C-tmle for continuous tuning. In Targeted
Learning in Data Science, pages 143–161. Springer, 2018.
[34] Susan M Shortreed and Ashkan Ertefaie. Outcome-adaptive lasso: variable selection for causal
inference. Biometrics, 73(4):1111–1122, 2017.
23
Bayesian Propensity Score Methods and Related
Approaches for Confounding Adjustment
Joseph Antonelli
The University of Florida (USA)
CONTENTS
23.1 Introduction to Bayesian Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
23.1.1 Why use Bayesian methods for causal inference? . . . . . . . . . . . . . . . . . . . . . . . . . 503
23.1.2 The progression of Bayesian causal inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
23.2 Bayesian Analysis of Propensity Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
23.2.1 Issues with model feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
23.2.2 Accounting for uncertainty in propensity score estimation . . . . . . . . . . . . . . . . . 507
23.3 Covariate Selection in Propensity Score Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
23.3.1 The goal of Bayesian model averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
23.3.2 Bayesian model averaging in propensity score models . . . . . . . . . . . . . . . . . . . . 510
23.3.3 Bayesian model averaging for related causal estimators . . . . . . . . . . . . . . . . . . . 510
23.4 Doubly Robust Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
23.5 Other Issues at the Intersection of Confounding Adjustment and Bayesian Analysis 515
23.5.1 Sample estimands and fully Bayesian analysis of potential outcomes . . . . . . 516
23.5.2 Incorporating nonparametric Bayesian prior distributions . . . . . . . . . . . . . . . . . 517
23.5.2.1 Modeling the mean function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
23.5.2.2 Modeling the joint distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
23.5.3 Treatment effect heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
The goal of this chapter is to introduce readers to the different ways in which Bayesian inference
has been successfully applied to causal inference problems. Our emphasis will be on approaches
that utilize the propensity score, though we briefly discuss how other estimation approaches can be
improved by using the Bayesian paradigm. As the chapter focuses on propensity scores, we will focus
on the scenario where the treatment W is binary and the estimand of interest is the average treatment
effect. Similar ideas apply to multilevel or continuous treatments as well as other estimands, and
we will discuss some of these extensions toward the end of the chapter. We begin by introducing
readers to the Bayesian paradigm and why it can be useful to adopt in the context of causal inference.
Then we detail various ways in which the Bayesian paradigm has been used with propensity scores
and doubly robust estimators, and we finish with extensions to Bayesian nonparametrics and other
estimands.
23.1 Introduction to Bayesian Analysis
In the Bayesian paradigm, unknown parameters $\theta$ are assigned a prior distribution $p(\theta)$, which is combined with the likelihood $p(y \mid \theta)$ of observed data $y$ through Bayes' theorem to yield the posterior distribution
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{p(y \mid \theta)\, p(\theta)}{\int_{\theta} p(y \mid \theta)\, p(\theta)\, d\theta} \propto p(y \mid \theta)\, p(\theta),$$
where ∝ implies that two quantities are proportional to each other and differ by a constant factor that
does not depend on θ. The posterior distribution reflects our uncertainty in the unknown parameter
θ after observing data y. We can see that the posterior distribution is a function of both the prior
distribution and the likelihood and therefore combines information from both of these sources. Figure
23.1 illustrates the relationship between the prior distribution, likelihood, and posterior distribution.
In the left panel, the prior distribution is relatively informative with a small variance. This leads to
the prior distribution having a large influence on the resulting posterior distribution, which is pulled
toward the prior distribution. In the middle and right panels, as we increase the variance of the prior
distribution, the posterior looks closer to the likelihood, resulting in the likelihood and posterior
being almost indistinguishable when the prior variance is large enough.
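For intuition, consider a minimal sketch of this behavior in a conjugate normal model with known data variance; this toy example is our own, not the model underlying Figure 23.1.

# As the prior variance grows, the posterior mean moves from the prior
# mean toward the sample mean, mirroring the three panels of Figure 23.1.
set.seed(1)
y <- rnorm(20, mean = 2, sd = 1)      # data with known sd = 1
prior_mean <- 0
for (prior_var in c(0.1, 1, 100)) {   # informative -> non-informative
  post_var  <- 1 / (1 / prior_var + length(y))               # conjugate update, sigma^2 = 1
  post_mean <- post_var * (prior_mean / prior_var + sum(y))
  cat(sprintf("prior variance %6.1f: posterior mean %.3f\n", prior_var, post_mean))
}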
Once the posterior distribution is obtained, it is straightforward to perform inference for any
function of θ. Point estimates of unknown quantities can be obtained using the corresponding
posterior mean or median. $(1 - \alpha)100\%$ credible intervals can be obtained in a number of ways, the most common of which is to construct a credible interval using the $\alpha/2$ and $1 - \alpha/2$ quantiles of the posterior distribution for the quantity of interest. Unfortunately, the posterior distribution does
not typically follow a known probability distribution and does not have a closed form solution that
can be used to perform inference. Markov chain Monte Carlo (MCMC) methods are used to sample
draws from the posterior distribution, and we can use these draws to approximate functionals of
the posterior distribution. We won’t discuss sampling considerations in this chapter, but it will be
assumed throughout that we have obtained B draws from the posterior distribution of the unknown
parameters, denoted by θ (b) for b = 1, . . . , B. For additional information on computational issues
and MCMC sampling, we point interested readers to existing textbooks on this topic [4, 5]. Given these posterior draws, the posterior mean can be approximated as $\frac{1}{B}\sum_{b=1}^{B} \theta^{(b)}$, and analogous operations can be used to approximate other functionals, such as the quantiles of the distribution for inference.

FIGURE 23.1
Interplay between the prior distribution, likelihood, and posterior distribution. (Three panels, left to right: informative, vaguely informative, and non-informative prior distribution; each plots density against θ.)
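For concreteness, a brief sketch of these posterior summaries, with simulated draws standing in for MCMC output:

# Summaries of B posterior draws of a scalar parameter theta; in practice
# theta_draws would come from an MCMC sampler rather than rnorm().
B <- 5000
theta_draws <- rnorm(B, mean = 1.2, sd = 0.4)

post_mean <- mean(theta_draws)                        # point estimate
ci_95     <- quantile(theta_draws, c(0.025, 0.975))   # equal-tailed 95% credible interval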
Another Bayesian quantity that is relevant to causal inference problems is the posterior predictive
distribution, i.e., the distribution of a new, unobserved data point given the observed data. This can
be defined as
Z Z
y |y) =
p(e y |y, θ)p(θ|y)dθ =
p(e y |θ)p(θ|y)dθ.
p(e
θ θ
This is useful for prediction as it allows the analyst to provide estimates and uncertainty assessments around predictions of new observations. It is particularly useful in causal inference, where we may be interested in predicting the outcome for an individual under a treatment level that is not observed.
23.1.1 Why use Bayesian methods for causal inference?
Bayesian inference requires computation of the posterior distribution and can be more computationally intensive than frequentist inference, so it must provide something different from frequentist inference; otherwise, why bother? We will see in this
chapter that there are a number of desirable features that Bayesian inference can provide relatively
easily that are useful in causal inference problems. Bayesian inference provides natural solutions to
model averaging and variable selection, which is useful when the number of confounders is relatively
large compared to the sample size. Sensitivity analysis, which is an important issue in causal
inference, has natural solutions within the Bayesian paradigm where prior distributions on sensitivity
parameters can be used to assess the robustness to key, untestable assumptions. Additionally, there
are nice features of Bayesian inference that are not unique to causal inference problems, but are
useful in general. These include the use of highly flexible Bayesian nonparametric approaches, easily
constructing hierarchical models, and the ability to perform inference without relying on asymptotic
approximations.
Part of this chapter focuses on propensity score estimation, and there has been some debate in the
causal inference literature about whether an analysis that incorporates the propensity score can ever
be fully Bayesian [6–9]. From our perspective, what determines whether an analysis is fully Bayesian
or not is not well defined, nor is it clear why this would be advantageous. We do not attempt to
clarify these arguments or claim that any of the methods discussed in this chapter are fully Bayesian.
We believe that there are certain beneficial features of Bayesian methods that can be incorporated
into causal analyses to improve their performance. Whether or not an analysis is fully Bayesian is
irrelevant as long as the resulting inferential procedure is valid and the operating characteristics are
well understood.
23.2 Bayesian Analysis of Propensity Scores
We now describe how propensity score analyses can be conducted within the Bayesian paradigm. For a nice review of these ideas, we point readers to [11]. First we must specify the
assumptions under which we can identify the causal effect in this setting. We will be assuming the
stable unit treatment value assumption that states that the treatment status of one unit does not affect
outcomes for other units and that there are not multiple versions of the treatment. Additionally, we
assume unconfoundedness and positivity, which are defined as:
Positivity: $0 < P(W = 1 \mid X = x) < 1$ with probability 1.

Unconfoundedness: $Y_w \perp\!\!\!\perp W \mid X$ for $w = 0, 1$.
Before discussing Bayesian propensity score analysis, it is important to first clarify the different
ways in which propensity scores can be used in an analysis, as this can greatly affect how easily the
Bayesian paradigm can be incorporated. Throughout we will denote propensity scores by e(X, α) =
P (W = 1|X, α), where we now include α to represent unknown parameters of the propensity score
model. While nonparametric models for the propensity score are possible and many of the same
ideas will apply, we will focus on the following generalized linear model for the propensity score:
$$g_w^{-1}\{P(W = 1 \mid X = x, \alpha)\} = \alpha_0 + \sum_{j=1}^{p} \alpha_j x_j, \qquad (23.1)$$
where gw (·) is a standard link function such as the logistic or probit link functions. Once the
propensity score is obtained, there are a myriad of ways in which they can be used to estimate the
effect of the treatment. Matching on the propensity score is one common approach [12, 13], where
two individuals are considered good matches if they have similar values of the propensity score.
Inverse probability weighting is another estimator that estimates the average treatment effect as
$$\hat{\Delta} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{W_i Y_i}{e(X_i, \alpha)} - \frac{(1 - W_i) Y_i}{1 - e(X_i, \alpha)} \right],$$
which is effectively a weighted average of the outcomes in the treated and control groups, where
weights are used to construct a population that has no association between W and X and is therefore
unconfounded [14]. Different numerators in this expression can be used to target different estimands
[15], though similar ideas apply. A third approach, which we will focus on for much of this section,
is to include the propensity score into an outcome regression model as follows:
$$g_y^{-1}\{E(Y \mid W = w, X = x, \beta)\} = \beta_0 + \beta_w w + h(e(x, \alpha)),$$
where gy is a standard link function. Here the h(·) function controls how the propensity score
is included in the outcome model. It could be included linearly, using nonlinear functions of the
propensity score, or through indicator functions that indicate membership into particular strata of the
propensity score distribution [16]. Matching, IPW, and related estimators are not likelihood based
and therefore don’t have an immediate Bayesian formulation. For this reason, we will begin our
discussion of Bayesian propensity score analysis with a focus on outcome regression as a method
of estimating causal effects with propensity scores. We will discuss extensions of these ideas to
non-likelihood-based estimators at the end of this section.
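To fix ideas, here is a small sketch of the inverse probability weighting estimator above on simulated data, with the propensity score fit by ordinary logistic regression (a simplification; the Bayesian treatment of α follows below).

# IPW estimate of the average treatment effect with an estimated
# propensity score; data are simulated so the true effect is 1.
set.seed(2)
n <- 2000
X <- matrix(rnorm(n * 3), nrow = n)
W <- rbinom(n, 1, plogis(0.5 * X[, 1] - 0.25 * X[, 2]))
Y <- W + X[, 1] + rnorm(n)

e_hat   <- fitted(glm(W ~ X, family = binomial()))   # estimated e(X, alpha)
ipw_ate <- mean(W * Y / e_hat - (1 - W) * Y / (1 - e_hat))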
Now that our focus is on outcome regression models, we can discuss the Bayesian implementation
of these ideas, which are discussed in detail in [17]. The two key components of any Bayesian analysis
are the likelihood and prior distribution. Independent and non-informative prior distributions can be
assigned for α and β that are commonly used in Bayesian generalized linear models. Here, we need
to model both the outcome and treatment data generating processes, and therefore the likelihood is
given by
$$\prod_{i=1}^{n} p(y_i \mid w_i, x_i, \beta, \alpha)\; p(w_i \mid x_i, \alpha).$$
The likelihood factorizes into one component from the outcome model and one component from
the treatment or propensity score model. One key distinction here is that α is in both the likelihood
for the treatment and for the outcome, despite the fact that these are the parameters of the propensity
score model. The reason for this is that the propensity score, which is a function of α, is included in
the outcome model. This is a crucial point that will make a Bayesian analysis of propensity scores
more nuanced than traditional propensity score analysis. We will discuss the difficulties that arise
from this complication in the following section, but first it is important to understand that this means
the propensity score fit will be affected by the outcome since inference for α will be based on X, W ,
and Y . It has been argued that the design phase of a causal analysis, which the propensity score is
one component of, should be separate from the analysis stage and should not use information from
the outcome [39]. A traditional frequentist propensity score analysis would first estimate the MLE
of α, using only information from the treatment, and then estimate the causal effect in a second
stage conditional on these propensity score estimates. A fully Bayesian analysis on the other hand
incorporates all information simultaneously to update parameters in a single analysis. While it may
lead to additional complications in a propensity score analysis, there are certain reasons why one
might wish to proceed with Bayesian inference here. One important reason is that it allows us to
account for difficult sources of uncertainty, which we explore in Section 23.2.2. Another reason
is that it permits the use of uniquely Bayesian tools such as Bayesian model averaging, which is
beneficial when the number of confounders is large, or Bayesian nonparametric models that have
been shown to work well in a wide range of settings.
23.2.1 Issues with model feedback
Under the joint model above, the full conditional distribution used to update $\alpha$ is proportional to both components of the likelihood,
$$p(\alpha \mid Y, W, X, \beta) \propto p(Y \mid W, X, \beta, \alpha)\; p(W \mid X, \alpha)\; p(\alpha).$$
Clearly if we sample from this distribution, the resulting $\alpha$ values will be a function of both the treatment and outcome, as both components of the likelihood appear in this expression. Cutting the feedback would amount to an approximately Bayesian approach that simply updates from the modified full conditional distribution given by
$$p(\alpha \mid W, X) \propto p(W \mid X, \alpha)\; p(\alpha).$$
This distribution will only use information from the treatment to update α and will therefore be
unaffected by any model misspecification from the outcome model component of the likelihood. The
problem of model misspecification in Bayesian propensity score analysis was first considered in [22].
Their motivation was to better understand the potential impact of model misspecification in these
situations, and they compared the fully Bayesian approach to the approximately Bayesian one that
cuts feedback from the outcome model. They found similar results between the two approaches and
suggested that authors use the fully Bayesian approach if they are confident in their outcome model
or are able to use a flexible model that is less susceptible to misspecification.
More recent work [23] has identified a much larger issue inherent to a fully Bayesian analysis
of propensity scores and an outcome regression that incorporates the propensity score. A central
property of propensity scores that enables them to identify causal effects is the so-called balancing property [52], which states that
$$W \perp\!\!\!\perp X \mid e(X, \alpha).$$
In [23] the authors noted that it is not clear whether the propensity scores in a fully Bayesian analysis
that incorporate the outcome will possess this property, and therefore, it is not clear how useful they
are as a measure to remove confounding bias. Further, they showed that a fully Bayesian analysis
can only lead to unbiased estimates of the treatment effect in extremely specific situations that are
unlikely to hold. Intuitively, an outcome model that conditions on the propensity score is by definition
misspecified as we don’t typically think the true outcome process is a function of e(X, α). Given
that model feedback is problematic under model misspecification, this implies that model feedback
will always be a problem for propensity score estimation. It can be shown that the only way in which
this approach will provide unbiased estimates is if the true outcome model coefficients are a simple
re-scaling of the propensity score coefficients, α. Clearly this is overly restrictive, and effectively
renders the fully Bayesian approach as defined above useless.
In light of this, there are two ways to incorporate Bayesian analysis into propensity score analysis
if an outcome model is used. The first is to use an approximately Bayesian approach that cuts
the feedback and does not use information from the outcome when updating the propensity score
parameters. Related two step Bayesian procedures to propensity score analysis are described in detail
in [25]. While these ideas work, and can still propagate uncertainty from propensity score estimation
into causal estimates, they are not always feasible. One key instance is when doing variable selection,
which we cover in the following section, as we want the chosen set of variables to depend on both
the treatment and outcome. [23] found that another way to solve the issue of feedback between the
propensity score and outcome model is to include additional covariate adjustment in the outcome
model as
$$g_y^{-1}\{E(Y \mid W = w, X = x, \beta)\} = \beta_0 + \beta_w w + h(e(x, \alpha)) + \sum_{j=1}^{p} \beta_j x_j. \qquad (23.3)$$
This additional covariate adjustment has been shown to work well in propensity score analyses
[13, 26]. Not only does this address the issue of model feedback and allow users to utilize a fully
Bayesian analysis of propensity scores, but it has also been shown to be doubly robust in the sense
that only the additional covariate adjustment or propensity score need to be correctly specified in
order to estimate causal effects. Lastly, as we will see in the following section, this approach will
allow users to perform model averaging or variable selection in a Bayesian framework that utilizes
both the treatment and outcome information, which is desirable for variable selection in causal
inference problems [27–30].
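As an illustration of the cut-feedback strategy combined with the additional covariate adjustment in (23.3), the following sketch uses a normal approximation to the posterior of α as a stand-in for full MCMC; the simulated data and all variable names are our own.

# Approximately Bayesian ("cut feedback") propensity score analysis:
# alpha is updated using only (W, X); each posterior draw of alpha yields
# a propensity score that enters an outcome model of the form (23.3).
library(MASS)  # for mvrnorm

set.seed(3)
n <- 1000
X <- matrix(rnorm(n * 3), nrow = n)
W <- rbinom(n, 1, plogis(X[, 1] - 0.5 * X[, 2]))
Y <- W + X[, 1] + X[, 3] + rnorm(n)

ps_fit      <- glm(W ~ X, family = binomial())
alpha_draws <- mvrnorm(500, coef(ps_fit), vcov(ps_fit))  # approx. posterior of alpha

effects <- apply(alpha_draws, 1, function(a) {
  e <- as.vector(plogis(cbind(1, X) %*% a))  # propensity score at this draw
  coef(lm(Y ~ W + e + X))["W"]               # PS included linearly, h(e) = e
})
c(estimate = mean(effects), post_sd = sd(effects))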
23.2.2 Accounting for uncertainty in propensity score estimation
A further appeal of the Bayesian paradigm is its ability to account for sources of uncertainty that are not easily dealt with in traditional analyses, such as model uncertainty [29] or uncertainty in which observations are dropped in matching analyses [34, 35].
Despite the work described above, there is not a clear consensus on the overall benefits of
Bayesian analysis of propensity scores. Certainly there are benefits in being able to use some of the
unique features of Bayesian inference, though in terms of uncertainty quantification the story is less
clear. Many authors found that the Bayesian approaches provided similar interval coverage rates in
simulations to their frequentist counterparts. Additionally, the approaches considered above were
restricted to a specific type of estimation procedure, frequently an outcome model based procedure as
this is the easiest to imbed in the Bayesian paradigm. Other estimators that are not likelihood based
such as IPW estimators do not easily extend to Bayesian inference. Though work has been done to
provide Bayesian versions of these estimators [7], there is some debate about their utility [8].
Recent progress has been made at bridging the gap between Bayesian inference and more
general causal estimators that rely on propensity score approaches in [36]. This paper uses Bayesian
inference for parameter estimation along with a broad class of causal estimators, and aims to provide
an inferential strategy that works across all estimators. Using similar notation, we can define ν
to be the output of the design stage of the study, or the output from a particular propensity score
implementation. This can represent weights for an IPW estimator, a set of matched observations after
matching on the propensity score, or a partition of the data into propensity score strata, among others.
Letting ∆ represent the treatment effect of interest, the goal is to find
$$P(\Delta \mid Y, W, X) = \int_{\nu} P(\Delta \mid Y, W, X, \nu)\; P(\nu \mid W, X)\; d\nu.$$
Note that the distribution of ν here does not depend on the outcome, following the principle that the
outcome should not influence the design stage of a study. Interestingly, this posterior distribution can
be decomposed even further as:
$$P(\Delta \mid Y, W, X) = \int_{\nu} \int_{\alpha} P(\Delta \mid Y, W, X, \nu)\; P(\nu \mid \alpha, W, X)\; P(\alpha \mid W, X)\; d\alpha\; d\nu.$$
This suggests a sequential strategy to sampling from the posterior distribution of ∆. First, the propen-
sity score parameters can be sampled from the posterior distribution P (α|W , X). Conditional on α,
the design stage can be sampled from P (ν|α, W , X). Lastly, the causal effect can be sampled from
the posterior distribution of P (∆|Y , W , X, ν). These three sources of uncertainty were referred to
in [36] as analysis estimation uncertainty, design decision uncertainty, and design estimation uncer-
tainty, respectively. In many cases, such as for IPW weights, there is no design decision uncertainty
as the weights used in IPW are a deterministic function of α and X. Design decision uncertainty is
more prevalent in matching estimators where there is uncertainty in the matched sets, and different
replications of the same matching algorithm can lead to different analyses. If the final analysis
strategy is likelihood based, such as through an outcome model, the analysis estimation uncertainty
can be accounted for by the posterior distribution of the parameters in this model. If instead the
analysis is done using non-likelihood based estimators such as IPW, doubly robust estimators, or
matching, then this distribution can be replaced by the asymptotic distribution of that estimator
leading to an approximately Bayesian analysis.
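A sketch of this sequential strategy for an IPW analysis follows, continuing with alpha_draws, X, W, and Y from the previous sketch. For IPW the design output ν is a deterministic function of α and X, so there is no design decision uncertainty; a full treatment would additionally layer in analysis estimation uncertainty via the estimator's asymptotic distribution.

# Sequential sampling: (1) draw alpha from its posterior, (2) form the
# design output nu (here, IPW weights), (3) compute the effect draw.
delta_draws <- apply(alpha_draws, 1, function(a) {
  e <- as.vector(plogis(cbind(1, X) %*% a))   # stage 1: propensity scores
  mean(W * Y / e - (1 - W) * Y / (1 - e))     # stages 2-3: weights and effect
})
quantile(delta_draws, c(0.025, 0.975))        # interval reflecting design estimation uncertainty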
23.3 Covariate Selection in Propensity Score Models
Modern observational studies frequently measure a large number of covariates, which in one sense is beneficial, as it makes the assumption of no unmeasured confounding more plausible than if fewer covariates were measured. From an estimation
perspective, it complicates analyses for a number of reasons. The most commonly used approach
to model selection in propensity score models is to simply include all available covariates in the
propensity score model to ensure that all important confounders are included. Unfortunately, when
the number of possible confounders is moderate or high-dimensional then this approach is inefficient
at best, and infeasible at worst. In some cases the number of covariates is so large that traditional
approaches to propensity score estimation do not apply, or they lead to severely overfit propensity
scores that are effectively equal to 1 for treated subjects and 0 for control subjects. Many ad-hoc
approaches to variable selection could be used in this setting such as stepwise regression approaches
on the propensity score model, or simply looking at univariate correlations between predictors and
the treatment or outcome [37]. High-dimensional models such as the lasso [38] and related models
could be used to model the propensity score, but it has been noted in many cases that this can
lead to substantial finite sample bias of the treatment effect [28–30, 39]. The crucial point is that
confounders will be associated with both the treatment and outcome, and therefore the outcome
should be used to help identify the important confounders [12, 40, 41]. The problem stems from the
fact that a variable can be strongly associated with the outcome, and only weakly associated with the
treatment, yet still induce non-negligible bias for the treatment effect. These are likely to be excluded
by algorithms that select confounders based only on the treatment and ignore information from the
outcome. An additional reason to incorporate the outcome into the confounder selection process is
that variables associated with only the outcome can help improve the efficiency of resulting treatment
effect estimates if they are included in the propensity score model. Note that this problem has spurred
interest in model selection for causal inference [27, 30, 39, 42, 43], as well as interest in extending
these ideas to the truly high-dimensional setting where p > n [28, 44–50]. In this section we restrict
attention to Bayesian approaches to model averaging and confounder selection in this context, but
refer readers to the aforementioned papers for frequentist estimators in this setting.
23.3.1 The goal of Bayesian model averaging
Letting $M$ index candidate models (subsets of covariates) and $\mathcal{M}^*$ denote the set of models that adjust for a sufficient set of confounders, the main goal of Bayesian model averaging for causal inference is to assign as much posterior probability as possible to models in $\mathcal{M}^*$. Specifically, the goal is to obtain a posterior distribution such that $P(M \notin \mathcal{M}^* \mid Y, W, X) \approx 0$, so that nearly all posterior probability is placed on models
that satisfy the no unmeasured confounding assumption. For the rest of this section, we provide
details on specific implementations of Bayesian model averaging in causal inference.
23.3.2 Bayesian model averaging in propensity score models
One strategy places spike-and-slab prior distributions on the coefficients of the propensity score and outcome models, of the general form
$$p(\alpha_j \mid \gamma_j) = \gamma_j\, g(\alpha_j) + (1 - \gamma_j)\, \delta_0, \qquad p(\beta_j \mid \gamma_j) = \gamma_j\, g(\beta_j) + (1 - \gamma_j)\, \delta_0, \qquad \gamma_j \sim \text{Bernoulli}(p),$$
where $g(\cdot)$ is a continuous slab density, $\delta_0$ is a point mass at zero, and $p$ is the prior probability of including a confounder into the models. This distribution is a mixture between a continuous density (the slab) and a discrete point mass at zero (the spike). The crucial parameter is $\gamma_j \in \{0, 1\}$, which indicates whether a
parameter is nonzero or not. A key difference here between standard spike-and-slab prior distribution
implementations is that γj is shared between both models, which means that a confounder is either
included in both models or excluded from both models. This induces two critical features for this
model to perform well: 1) It eliminates model feedback as described in Section 23.2.1 and 2) it
ensures that information from both the treatment and outcome is used when updating the posterior
distribution of γj . Variables that have a strong association with either the treatment or outcome,
but only a weak association with the other, are much more likely to be included using this strategy,
which should increase the probability of including all the necessary confounders. This strategy has
been shown to have strong empirical performance when the number of confounders is large, and
interestingly can outperform (in terms of mean squared error) the approach of simply including all
covariates even when the number of covariates is not prohibitively large.
Spike-and-slab prior distributions could simply be applied to the $\beta_j$ coefficients, but this would ignore each variable's association with the treatment and could lead to substantial bias in finite samples. For this reason, a related approach estimates both the outcome model above and the propensity score model defined in (23.1). Spike-and-slab prior distributions are then placed on the parameters of both models, now with distinct inclusion indicators, e.g.,
$$p(\alpha_j \mid \gamma_j^w) = \gamma_j^w\, g(\alpha_j) + (1 - \gamma_j^w)\, \delta_0, \qquad p(\beta_j \mid \gamma_j^y) = \gamma_j^y\, g(\beta_j) + (1 - \gamma_j^y)\, \delta_0.$$
Note here that there are distinct γjw and γjy parameters indicating that the covariates in the treatment
and outcome models can differ. To increase the likelihood that all confounders have γjy = 1, the
following prior distribution is used for the binary inclusion parameters:
$$P(\gamma_j^w = 0, \gamma_j^y = 0) = P(\gamma_j^w = 0, \gamma_j^y = 1) = P(\gamma_j^w = 1, \gamma_j^y = 1) = \frac{\omega}{3\omega + 1}, \qquad P(\gamma_j^w = 1, \gamma_j^y = 0) = \frac{1}{3\omega + 1},$$
where ω ≥ 1 is a tuning parameter that controls the degree of linkage between the propensity score
and outcome models. If ω = 1, then the prior distributions for γjw and γjy are independent; however,
as ω increases the dependence between the two grows. It is easier to see this dependence by looking
at the conditional odds implied by this prior distribution:
$$\frac{P(\gamma_j^y = 1 \mid \gamma_j^w = 1)}{P(\gamma_j^y = 0 \mid \gamma_j^w = 1)} = \omega, \qquad \frac{P(\gamma_j^y = 1 \mid \gamma_j^w = 0)}{P(\gamma_j^y = 0 \mid \gamma_j^w = 0)} = 1.$$
This shows that if a covariate is included in the treatment model, then it is far more likely to
be included in the outcome model. This reduces the finite sample bias of the treatment effect by
increasing the probability that all confounders are included into the outcome model. These ideas have
been explored in a range of contexts in causal inference such as treatment effect heterogeneity [53],
missing data [54], exposure-response curve estimation [55], and multiple exposures [56].
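The linkage can be checked numerically; with ω = 5, for example, the implied prior odds of inclusion in the outcome model are ω when γjw = 1 and 1 when γjw = 0.

# Numeric check of the prior linkage for omega = 5.
omega <- 5
p00 <- p01 <- p11 <- omega / (3 * omega + 1)
p10 <- 1 / (3 * omega + 1)

p11 / p10  # odds gamma_j^y = 1 given gamma_j^w = 1: equals omega
p01 / p00  # odds gamma_j^y = 1 given gamma_j^w = 0: equals 1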
23.3.3 Bayesian model averaging for related causal estimators
Now that Bayesian model averaging has been utilized in both propensity score and outcome regression models for estimating causal effects, it is natural to assume that it can be used for doubly robust estimation as well. We will cover doubly robust estimation in detail in the following
section, but loosely it allows for consistent estimates of causal effects if either (but not necessarily
both) of the treatment and outcome model are correctly specified. Clearly this is a desirable feature
and was therefore adopted in [57], in which the authors utilized Bayesian model averaging within the
context of doubly robust estimators. Let Mom and Mps represent the model space for the outcome
model and propensity score model, respectively. Further, let Mps 1 be the null model that includes
no predictors in the propensity score. They assign a uniform prior on the space of outcome models,
but use a prior distribution for the propensity score model space that is conditional on the outcome
model, thereby linking the two models. The prior distribution for the propensity score model space is
given by
$$\frac{P(M_i^{ps} \mid M_j^{om})}{P(M_1^{ps} \mid M_j^{om})} = \begin{cases} 1, & M_i^{ps} \subseteq M_j^{om} \\ \tau, & \text{otherwise.} \end{cases}$$
As with the BAC prior distribution, there is a tuning parameter τ ∈ [0, 1] that controls the amount
of linkage between the propensity score and outcome models. When τ = 0, the propensity score
can only include variables that are included in the outcome model. When τ is between 0 and 1,
smaller weight is given to propensity score models that include terms that are not included in the
outcome model. For each combination of the possible treatment and outcome models, the doubly
robust estimator is calculated, and the overall estimator is a weighted average of these individual
estimates, using the respective weights assigned to each model combination. Formally, their estimator is given by
$$\hat{\Delta} = \sum_{i,j} w_{ij} \hat{\Delta}_{ij},$$
where $\hat{\Delta}_{ij}$ is the doubly robust estimator computed under propensity score model $M_i^{ps}$ and outcome model $M_j^{om}$.

23.4 Doubly Robust Estimation
A standard doubly robust estimator of the average treatment effect takes the form
$$\Delta(\Psi, D) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{W_i (Y_i - m_{1i})}{p_{1i}} - \frac{(1 - W_i)(Y_i - m_{0i})}{p_{0i}} + m_{1i} - m_{0i} \right], \qquad (23.4)$$
where $p_{wi} = P(W_i = w \mid X = X_i)$ and $m_{wi} = E(Y \mid W = w, X = X_i)$ are the treatment and outcome models, respectively. Note that we have written this estimator as a function of both $D = (Y, W, X)$ and unknown parameters $\Psi$, which encapsulate all unknown parameters of both the propensity score and outcome regression models. This is not a likelihood based estimator and
therefore does not have a natural Bayesian formulation. Recent work in [9] aims to construct doubly
robust estimators using Bayesian posterior predictive distributions and a change of measure using importance sampling, which they argue admits a natural Bayesian interpretation. In this chapter we
will not address whether any of the following approaches are fully Bayesian, nor do we attempt to
construct a fully Bayesian estimator. Rather we will highlight how certain Bayesian ideas can be
very useful for certain aspects of doubly robust estimation, and how finite sample inference can be
improved using Bayesian models for the nuisance parameters in a doubly robust estimator. In the
previous section we saw how Bayesian model averaging could improve inference in doubly robust
estimation, but there are a multitude of other useful properties that we would ideally be able to imbed
within doubly robust estimation such as Bayesian nonparametrics, easy handling of missing data, or
not needing to rely on asymptotic theory for inference.
While the doubly robust estimator described above does not have a Bayesian interpretation, both
the propensity score and outcome regression models, which the doubly robust estimator is a function
of, can be estimated from likelihood-based procedures and therefore within the Bayesian paradigm.
Using the notation above, we can estimate Ψ within the Bayesian paradigm and obtain the posterior
distribution of these parameters, which we denote by P (Ψ|D). There are two important questions
that are left to be answered: 1) Once we have the posterior distribution of the propensity score and
outcome model parameters, how do we construct point estimates and confidence intervals, and 2)
why is this a useful pursuit? Do these estimators actually provide something that existing frequentist
estimators do not, or are we simply performing frequentist-pursuit as some critics of Bayesian causal
inference like to state? We answer both of these questions in what follows.
There are two possible estimators of $\Delta$ once the posterior distribution is obtained. The more common approach is to use an estimate of both $p_{wi}$ and $m_{wi}$. We could use the posterior means by setting $\hat{p}_{wi} = E_{\Psi|D}[p_{wi}]$ and $\hat{m}_{wi} = E_{\Psi|D}[m_{wi}]$. Then, we plug
these values into (23.4) to estimate the average treatment effect. Using the notation above, this can be defined as $\hat{\Delta} = \Delta(\hat{\Psi}, D)$ with $\hat{\Psi} = E_{\Psi|D}[\Psi]$. This is the common approach taken in
frequentist analyses, and while this strategy is a reasonable one, inference is more challenging. Either
the bootstrap can be used to account for uncertainty in both stages of the estimator, or the estimator’s
asymptotic distribution is obtained from which inference can proceed. These approaches, however,
may not be valid or may not perform well in finite samples with complex or high-dimensional models
for the propensity score and outcome regression. It is natural to assume that the posterior distribution
can be used to provide measures of uncertainty, but it is not clear how it can be used for this estimator.
In light of this, a second approach can be used for estimating the ATE, which is to construct an estimator as
$$\hat{\Delta} = E_{\Psi|D}[\Delta(\Psi, D)], \qquad (23.5)$$
which is the posterior mean of the ∆(Ψ, D) function. Intuitively, for every posterior draw of the
propensity score and outcome regression models we evaluate (23.4), and the mean of these values is
our estimator. The posterior mean can be approximated using B posterior draws as
$$E_{\Psi|D}[\Delta(\Psi, D)] \approx \frac{1}{B} \sum_{b=1}^{B} \Delta(\Psi^{(b)}, D).$$
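In code, with psi_draws a list of B posterior draws and delta_fun(psi, data) a hypothetical helper that evaluates (23.4) at a single draw (both names are assumptions of this sketch, not part of the chapter):

# Posterior-mean doubly robust estimator (23.5), averaging the estimator
# in (23.4) over posterior draws of the nuisance parameters.
delta_hat <- mean(sapply(psi_draws, delta_fun, data = data))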
This is also a somewhat obvious choice for an estimator, but we’ll see that this estimator leads
to a strategy for inference that works in difficult settings such as when the covariate space is high-
dimensional, or when the models used to estimate the propensity score and outcome regression are
highly flexible. The goal will be to construct an estimate of the variance of the effect estimate that
accounts for all sources of uncertainty. The target variance is the variance of the sampling distribution of the estimator, defined by $\mathrm{Var}_D\{E_{\Psi|D}[\Delta(\Psi, D)]\}$. There are two main sources of variability in this
estimator: 1) the uncertainty in parameter estimation for the propensity score and outcome regression
models and 2) sampling variability in Di that is present even if we knew the true outcome and
propensity score models. Following ideas seen in [50], we will show that the posterior distribution
of model parameters can be combined with a simple resampling procedure to provide a variance
estimator that is consistent when both the propensity score and outcome regression models are
correctly specified and contract at sufficiently fast rates and is conservative in finite samples or under
model misspecification.
Before defining our variance estimator, we must introduce additional notation. Let D (m) be a
resampled version of our original data D, where resampling is done with replacement as in the
nonparametric bootstrap. The variance estimator is defined as
$$\widehat{\mathrm{Var}}(\hat{\Delta}) = \mathrm{Var}_{D^{(m)}}\left\{E_{\Psi|D}[\Delta(\Psi, D^{(m)})]\right\} + \mathrm{Var}_{\Psi|D}[\Delta(\Psi, D)]. \qquad (23.6)$$
The first of these two terms resembles the true variance, except the outer variance is no longer with
respect to D, but is now with respect to D (m) . This is a crucial difference, however, as the first
term does not account for variability due to parameter estimation. The inner expectation of the first
term is with respect to the posterior distribution of Ψ given the observed data D, not the resampled
data D (m) . This means that this variance term does not account for variability that is caused by the
fact that different data sets would lead to different posterior distributions. Ignoring this source of
variability will likely lead to anti-conservative inference as our estimated variance will be smaller
than the true variance of our estimator. To fix this issue, the second term is introduced, which is the
variability of the estimator due to parameter uncertainty.
Computing this variance estimator is relatively straightforward once we have the posterior distri-
bution of Ψ. The second term, given by VarΨ|D [∆(Ψ, D)], is simply the variability across posterior
samples of the doubly robust estimator in (23.4) evaluated at the observed data. To calculate the
first term in (23.6), we can create M new data sets, $D^{(1)}, \ldots, D^{(M)}$, by sampling with replacement from the empirical distribution of the data. For each combination of resampled data set and posterior sample, we can calculate the doubly robust estimator defined in (23.4). This creates an $M \times B$ matrix of treatment effect estimates as given below:
$$\begin{pmatrix} \Delta(\Psi^{(1)}, D^{(1)}) & \cdots & \Delta(\Psi^{(B)}, D^{(1)}) \\ \vdots & \ddots & \vdots \\ \Delta(\Psi^{(1)}, D^{(M)}) & \cdots & \Delta(\Psi^{(B)}, D^{(M)}) \end{pmatrix}$$
Once this matrix of estimates is obtained, we can take the mean within each row, which corresponds
to the posterior mean of the estimator at each of the M resampled data sets. Taking the variance of
these M estimators leads to an estimate of $\mathrm{Var}_{D^{(m)}}\{E_{\Psi|D}[\Delta(\Psi, D^{(m)})]\}$.
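Continuing with the hypothetical psi_draws and delta_fun from the sketch above, and assuming data is a data frame with n rows, the full variance estimator (23.6) can be computed as follows:

# Variance estimator (23.6): term 1 resamples the data M times and takes
# the variance of the posterior-mean estimator across resamples; term 2
# is the posterior variance of the estimator at the observed data.
var_dr <- function(psi_draws, data, delta_fun, M = 200) {
  n <- nrow(data)
  term2 <- var(sapply(psi_draws, delta_fun, data = data))
  row_means <- sapply(seq_len(M), function(m) {
    data_m <- data[sample(n, n, replace = TRUE), ]       # resampled D^(m)
    mean(sapply(psi_draws, delta_fun, data = data_m))    # E_{Psi|D}[Delta(Psi, D^(m))]
  })
  term1 <- var(row_means)
  term1 + term2
}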
The variance estimator in (23.6) makes sense intuitively as it involves adding posterior variability
to the first term, which was ignoring uncertainty from parameter estimation. However, it is not clear
that the sum of these two terms leads to a valid variance estimator. Under general settings, this estimate
of the variance was shown to be conservative in that it gives estimates of the variance that are too
large on average, leading to more conservative inference. This would seem potentially problematic
at first as it is not known just how large this variance estimate could be, and it may lead to overly
wide confidence intervals that lead to low statistical power. It was shown, however, that under certain
conditions on the posterior distributions for the treatment and outcome models, this is a consistent
variance estimator. In Bayesian statistics, asymptotics are frequently expressed in terms of posterior
contraction rates instead of convergence rates used for point estimators. Posterior contraction rates
detail the behavior of the entire posterior distribution instead of a simple measure of centrality such
as the posterior mean or median. We say that the treatment and outcome models contract at rates
$\epsilon_{n_w}$ and $\epsilon_{n_y}$ if the following holds:
\[
\text{(i)}\;\; \sup_{P_0} E_{P_0}\!\left[\Pi_n\!\left(\frac{1}{\sqrt{n}}\,\lVert p_w - p_w^{*}\rVert_2 > M_w\,\epsilon_{n_w} \,\Big|\, D\right)\right] \to 0,
\]
\[
\text{(ii)}\;\; \sup_{P_0} E_{P_0}\!\left[\Pi_n\!\left(\frac{1}{\sqrt{n}}\,\lVert m_w - m_w^{*}\rVert_2 > M_y\,\epsilon_{n_y} \,\Big|\, D\right)\right] \to 0,
\]
where $M_w$ and $M_y$ are constants, $p_w = (p_{w1}, \ldots, p_{wn})$, $m_w = (m_{w1}, \ldots, m_{wn})$, and $p_w^{*}$ and
$m_w^{*}$ denote their unknown, true values. Here, $\epsilon_{n_w}$ and $\epsilon_{n_y}$ determine how quickly the posterior
distribution centers around the true values for the propensity score and outcome regression models.
If we assume correctly specified parametric models, then these rates of contraction would be $n^{-1/2}$.
For high-dimensional or nonparametric models, we expect slower rates of contraction, with
$\epsilon_{n_w} \geq n^{-1/2}$ and $\epsilon_{n_y} \geq n^{-1/2}$. It was shown that if both the treatment and outcome model posterior
distributions contract at rates such that $\epsilon_{n_w} \leq n^{-1/4}$ and $\epsilon_{n_y} \leq n^{-1/4}$, i.e., faster than $n^{-1/4}$,
then the variance estimator is consistent for the true variance. In addition to the variance being
consistent, under these same assumptions, the estimator of the treatment effect defined in (23.5)
is consistent at the $\sqrt{n}$ rate. The intuition behind this result is relatively straightforward. The first
term in the variance estimator in (23.6) is effectively the variance of the doubly robust estimator
that ignores parameter uncertainty. It is well known that one desirable feature of the doubly robust
estimator is that its asymptotic variance is the same whether the nuisance parameters are known or
estimated, and therefore it does not need to account for parameter uncertainty. This implies that the first term in
(23.6) is asymptotically equivalent to the true variance of interest. The second term in (23.6), which
amounts to uncertainty in the doubly robust estimator from parameter estimation, can be shown
to be asymptotically negligible under the conditions above, and therefore the variance estimator is
consistent.
These results are similar to those seen in the frequentist causal inference literature for semipara-
metric or high-dimensional models [44, 66], so a natural question is what Bayesian
inference provides in this setting. The key to answering that question lies in the situations where
the conditions of the theoretical results above do not hold. What happens in small sample sizes, or
when one of the propensity score and outcome models is misspecified? In [50] it was shown that
existing estimators that rely on asymptotic theory for inference, and which therefore ignore uncertainty from
parameter estimation, perform poorly in small samples or under model misspecification in terms of
interval coverage, obtaining levels well below the nominal level. The approach above, which uses
the posterior distribution of Ψ to account for parameter uncertainty instead of ignoring it, performs
well in these scenarios, achieving coverage rates at or slightly above the nominal level. In simpler
situations, such as finite-dimensional parametric models for the two regression models and large
sample sizes, the two approaches perform very similarly, leading to valid inference on the treatment
effect. In the more difficult settings of high dimensions, finite samples, or complex models for the
two regression functions, the Bayesian approach provides a solution that works in general, and at
worst is slightly conservative. Note again that this approach is not fully Bayesian, but rather Bayesian
ideas are used to account for a difficult source of uncertainty that is commonly ignored.
The Bayesian approach has been used in a wide variety of other contexts such as mediation analysis [71, 72],
sensitivity analysis for unmeasured confounders [73], estimation in panel data settings [74], estimation
in regression discontinuity designs [75, 76], dealing with high-dimensional confounders [49, 77], and
instrumental variable analysis [78], among others. In this chapter, we are focused on propensity score
methods, and more generally the issue of confounding adjustment. Along these lines we discuss
three main issues in this section: 1) sample estimands and uncertainty quantification, 2) the use of
nonparametric Bayesian approaches for flexible confounding adjustment and effect estimation, and
3) treatment effect heterogeneity.
When updating all unknown parameters in an MCMC algorithm, the unknown potential outcome
for each individual can be treated as an unknown parameter and updated from its full conditional
distribution. For control individuals with Wi = 0, we update their missing potential outcome from
the following distribution:
\[
Y_{1i} \mid \cdot \;\sim\; N\!\left(X_i\beta_1 + \frac{\rho\sigma_1}{\sigma_0}\,\big(Y_i - X_i\beta_0\big),\; \sigma_1^2\,(1 - \rho^2)\right),
\]
where an analogous distribution holds for treated individuals that require imputation of Y0i . The
main issue with this model is that Y0i and Y1i are never jointly observed and therefore there is no
information in the data to inform the correlation between these two potential outcomes, denoted by
ρ. One strategy is to vary ρ from 0 to 1 and assess how results vary as a function of the correlation
between the potential outcomes. Another approach is to place a prior distribution on ρ that assigns
probability to plausible values and allows the resulting posterior distribution to average over possible
values of this correlation.
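As a concrete illustration, one imputation step of this kind might look as follows in Python, assuming the current MCMC draws of β0, β1, σ0, σ1, and ρ are available for the control units; all variable names are illustrative.

import numpy as np

def impute_y1_for_controls(X, y0, beta0, beta1, sigma0, sigma1, rho, rng=None):
    # Draw the missing Y1 for control units (W = 0) from its full
    # conditional distribution under the bivariate normal model above.
    # X is the (n0, p) covariate matrix and y0 the observed outcomes
    # for the n0 control units.
    rng = np.random.default_rng(rng)
    mean = X @ beta1 + (rho * sigma1 / sigma0) * (y0 - X @ beta0)
    sd = sigma1 * np.sqrt(1.0 - rho ** 2)
    return rng.normal(mean, sd)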
It is useful to take this approach from the Bayesian paradigm as it is easy to account for uncertainty
in the imputations of the missing potential outcomes and to average over uncertainty about ρ; however,
there are other potential benefits in this setting. If the dimension of the covariate space is relatively
large then shrinkage priors or Bayesian model averaging can be used for β0 and β1 . Additionally, the
model above assumes both a linear association between the covariates and the mean of the potential
outcomes, as well as normality of the potential outcomes. One or both of these assumptions can easily
be alleviated using nonparametric Bayesian prior distributions, which do not make strong parametric
assumptions about either the mean or distribution of the potential outcomes. These nonparametric
Bayesian approaches have been shown to be very effective in a variety of settings, and therefore we detail
their usage in a variety of causal estimators in the following section.
To see the contrast with a parametric Bayesian analysis, suppose we observe data $Z_i$ and specify
\[
Z_i \mid \mu, \sigma^2 \sim N(\mu, \sigma^2), \qquad \mu \sim N(\mu_0, \sigma_0^2), \qquad \sigma^2 \sim IG(a_0, b_0).
\]
Here we have assumed that Zi follows a normal distribution with mean µ and variance σ 2 , and then
we assigned a normal prior distribution for the mean, and an inverse-gamma prior distribution for the
variance. A nonparametric Bayesian alternative would be to assume
\[
Z_i \mid G \sim G, \qquad G \sim DP(\alpha, M),
\]
where $DP(\alpha, M)$ represents a Dirichlet process with concentration parameter $\alpha$ and base distribution $M$.
Here we could take M to be a normal distribution as above, but this formulation allows for deviations
away from normality and the degree of flexibility is governed by α. We won’t discuss this in further
detail here, but the main idea is that Bayesian nonparametrics place prior distributions on infinite
dimensional parameters so that parametric assumptions do not need to be made. The Dirichlet process
example above is simply one such Bayesian nonparametric solution to this problem, though many
others exist. For an accessible introduction to Bayesian nonparametrics, we point readers to [79].
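For readers who find a constructive view helpful, the following Python sketch draws a sample Z1, ..., Zn from the Dirichlet process model above using a truncated stick-breaking representation of G, taking the base distribution M to be normal; the truncation level K and the default values are illustrative choices, not prescriptions.

import numpy as np

def draw_dp_sample(alpha, n, mu0=0.0, sigma0=1.0, K=500, rng=None):
    # Truncated stick-breaking construction: G is approximated by K atoms
    # drawn from the base distribution M = N(mu0, sigma0^2), with weights
    # built from Beta(1, alpha) stick-breaking fractions.
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=K)
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    atoms = rng.normal(mu0, sigma0, size=K)
    # Z_i | G ~ G: sample n observations from the discrete realization of G.
    return rng.choice(atoms, size=n, p=w / w.sum())

Smaller values of α concentrate the weights on a few atoms, while larger values spread them out so that draws look closer to the base distribution.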
Now we can discuss how these ideas have been successfully applied in causal inference problems.
First note that the nonparametric Bayesian ideas that follow can be applied to either propensity
score estimation or outcome regression models. We choose to focus on outcome regression models
here as they are far more commonly combined with Bayesian nonparametrics. Within outcome
model estimation, there are two distinct ways in which Bayesian nonparametrics can be applied.
The first is to use flexible nonparametric Bayesian prior distributions to improve estimation of the
mean of the outcome regression surface, while the rest of the outcome distribution is allowed to be
fully parametric. The second approach is to adopt a fully nonparametric solution to the entire data
generating process rather than simply the mean of the outcome.
A leading example of the first approach is the Bayesian additive regression trees (BART) model [81], which represents the regression function as a sum of trees,
\[
\mu(w, x) = \sum_{k=1}^{K} g(w, x;\, T_k, M_k).
\]
Here, $T_k$ corresponds to a particular tree structure or partition of the data, while $M_k$ corresponds
to the value of the responses in each terminal node of the tree given by Tk . Each of the individual
trees is a weak-learner in that they do not predict the outcome well on their own, but the sum of these
trees leads to a very strong predictive model. This approach was shown to work remarkably well at
estimating causal effects. When the true regression models are nonlinear, the BART model greatly
outperforms simple parametric models. When the true regression models are linear, the BART model
performs nearly as well as an estimate of the true linear regression model. This is an overarching
theme of Bayesian nonparametrics and why they can be so effective: they adapt well to complex
situations, but also perform nearly as well as parametric models when the truth is indeed parametric.
Given a function µ(w, x), inference can proceed for a number of distinct estimands. Assuming
unconfoundedness and consistency, the conditional average treatment effect, which we highlight in
the following section, is given by
\[
\tau(x) = E[Y_1 - Y_0 \mid X = x] = \mu(1, x) - \mu(0, x).
\]
Inference can proceed automatically using the relevant quantiles of the posterior distribution of
µ(1, x) − µ(0, x). This shows one advantage of BART over more algorithmic tree-based approaches
such as random forests [82]. Prior distributions are assigned to Tk and Mk and uncertainty is
automatically captured by their posterior distribution. Under the same assumptions, the average
treatment effect can be identified from the observed data as
\[
\int_x \big\{\mu(1, x) - \mu(0, x)\big\}\, f(x)\, dx \;\approx\; \frac{1}{n}\sum_{i=1}^{n} \big\{\mu(1, x_i) - \mu(0, x_i)\big\}.
\]
Inference in this situation is slightly more nuanced, because we must also account for uncertainty in
the distribution of the covariates, which we are approximating with the empirical distribution from
our observed sample. The posterior distribution of µ(w, x) does not account for this uncertainty, but
we can account for it using the Bayesian bootstrap [83]. Suppose we have B posterior draws from
our model, given by µ(b) (w, x). We can define weights as ξi = ui − ui−1 where u0 = 0, un = 1
and u1 , . . . , un−1 are the order statistics from n − 1 draws from a standard uniform distribution. We
can do this for every posterior sample to obtain $\xi_i^{(b)}$ and create posterior draws of the average causal
effect as
\[
\sum_{i=1}^{n} \xi_i^{(b)} \left[\mu^{(b)}(1, x_i) - \mu^{(b)}(0, x_i)\right] \quad \text{for } b = 1, \ldots, B.
\]
Inference can then proceed in a traditional Bayesian framework once a posterior distribution is
obtained. Note that if we were interested in sample average treatment effects, then we would not
have to do this additional uncertainty assessment as this is only done to account for uncertainty from
using a sample average to estimate a population expectation.
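A minimal Python sketch of this Bayesian bootstrap step follows. It assumes that (B × n) arrays of posterior draws of µ(1, xi) and µ(0, xi), evaluated at the observed covariates, are already available from the fitted model; the array names are hypothetical.

import numpy as np

def bb_ate_draws(mu1_draws, mu0_draws, rng=None):
    # mu1_draws and mu0_draws have one row per posterior sample b and one
    # column per unit i, holding mu^(b)(1, x_i) and mu^(b)(0, x_i).
    rng = np.random.default_rng(rng)
    B, n = mu1_draws.shape
    draws = np.empty(B)
    for b in range(B):
        # Gaps of n - 1 uniform order statistics: a flat Dirichlet
        # weight vector that sums to one.
        u = np.sort(rng.uniform(size=n - 1))
        xi = np.diff(np.concatenate(([0.0], u, [1.0])))
        draws[b] = xi @ (mu1_draws[b] - mu0_draws[b])
    return draws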
While BART has been used extensively for causal inference, it is not the only approach to
estimating µ(w, x). A popular approach in Bayesian nonparametrics is to use the Gaussian process
prior distribution, which amounts to specifying a prior for µ(w, x) as
\[
\mu(w, x) \sim \mathcal{GP}\big(\mu_0(w, x),\, C(z, z')\big).
\]
Here $\mu_0(\cdot)$ is the mean function of the Gaussian process, which is usually either assumed to be
$\mu_0(w, x) = 0$ or a linear function given by $\mu_0(w, x) = \beta_0 + \beta_w w + \sum_{j=1}^{p} \beta_j x_j$. The Gaussian
process allows for deviations away from this mean function, and this is dictated by the kernel function
C(z, z 0 ), where z = [w, x0 ]0 . This function describes how similar two covariate vectors are, and
there are a number of choices for this function, such as
\[
C(z, z') = \sigma^2 \exp\Big\{-\delta \sum_{j=1}^{p+1} (z_j - z_j')^2\Big\}.
\]
The Gaussian process prior works under an assumption that similar covariate vectors should have
similar values of the regression function µ(·). The only real assumption on the regression function is
that of smoothness, and the degree of smoothness is dictated by δ. An easier way to understand the
Gaussian process is to see that the function evaluated at a finite set of locations follows a multivariate
normal distribution. In our setting, we have that the prior distribution at our observed n data points is
given by
\[
\big(\mu(w_1, x_1), \ldots, \mu(w_n, x_n)\big)' \sim N\Big(\big(\mu_0(w_1, x_1), \ldots, \mu_0(w_n, x_n)\big)',\; \Sigma\Big),
\]
where the (i, j) element of Σ is given by C(zi , zj ). While Gaussian processes are widely used and
have very strong predictive performance, there are certain drawbacks relative to the BART model
described earlier. The main drawback is the computational burden of Gaussian processes, which
require the inversion of an $n \times n$ matrix at every MCMC iteration. This burden can be substantially
alleviated through certain approximations [84–86], but the BART model is extremely fast computationally by comparison.
Additionally, the default specifications for BART perform remarkably well across a wide range of
scenarios and therefore the model requires little to no tuning or prior expertise. This is likely
one of the driving reasons for BART regularly performing well in causal inference data analysis
competitions [87].
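To illustrate the finite-dimensional view of the Gaussian process described above, the following Python sketch draws the regression function at the observed inputs from the prior with the squared exponential kernel; the hyperparameter values and the small diagonal jitter added for numerical stability are illustrative.

import numpy as np

def gp_prior_draw(Z, sigma2=1.0, delta=1.0, jitter=1e-8, rng=None):
    # Z is the n x (p + 1) matrix whose rows are z_i = [w_i, x_i'], and
    # Sigma_{ij} = sigma2 * exp(-delta * ||z_i - z_j||^2).
    rng = np.random.default_rng(rng)
    n = Z.shape[0]
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    Sigma = sigma2 * np.exp(-delta * sq_dists)
    # Zero mean function mu_0 = 0; a draw is one multivariate normal vector.
    return rng.multivariate_normal(np.zeros(n), Sigma + jitter * np.eye(n))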
One such specification models the joint distribution of the outcome and covariates as a Dirichlet process mixture,
\[
(Y_i, Z_i) \mid \phi_i \sim f(y \mid z, \beta_i)\, g(z \mid \theta_i), \qquad \phi_i \mid G \sim G, \qquad G \sim DP(\alpha, M),
\]
where $\phi_i = [\beta_i, \theta_i]$ and $M$ is an appropriate base measure of the Dirichlet process. Effectively,
this is simply a mixture model where each individual has their own parameters that come from a
distribution G. The Dirichlet process prior on G necessarily leads to a discrete distribution on these
parameters, where some individuals will have the same parameters as others. While this model is
quite complex, extensions have been proposed and used in the causal inference literature, such as
the enriched Dirichlet process model [89]. Of more interest is what advantages these models have
over the nonparametric mean functions of the previous section. The first is that this specifies a model
for the full joint distribution of the data, and therefore any missingness in the covariates can easily
be addressed by imputing the missing values within the MCMC algorithm [70]. A second benefit is
that we are no longer restricted to estimands that examine the mean of the potential outcome surface.
Approaches such as these can easily estimate average treatment effects, but immediately extend to
more complex estimands such as quantile treatment effects [90]. If we let F1 (y) and F0 (y) be the
cumulative distribution functions for the potential outcomes under treatment and control, respectively,
we can define quantile treatment effects as
\[
\Delta_q = F_1^{-1}(q) - F_0^{-1}(q) \quad \text{for } q \in (0, 1).
\]
Once the posterior distribution is obtained for the joint density of $(Y_i, Z_i)$, these quantities can
easily be obtained through appropriate standardization, and inference can proceed through
quantiles of the posterior distribution.
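Computationally, quantile treatment effects are straightforward once draws from the two potential-outcome distributions are available. A minimal Python sketch, with hypothetical array names, is given below.

import numpy as np

def qte_draws(y1_samples, y0_samples, q=0.5):
    # y1_samples and y0_samples are (B, n) arrays holding, for each
    # posterior sample b, draws from the potential-outcome distributions
    # under treatment and control; each row yields one draw of the
    # quantile treatment effect F_1^{-1}(q) - F_0^{-1}(q).
    return (np.quantile(y1_samples, q, axis=1)
            - np.quantile(y0_samples, q, axis=1))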
The last topic we highlight is treatment effect heterogeneity, which can be targeted by reparameterizing the outcome regression as $\mu(w, x) = f(x) + \tau(x)w$, where $f(x)$ is a prognostic function and $\tau(x)$ directly captures treatment effect heterogeneity. In principle, priors can be placed on either of
these functions, including the nonparametric ones discussed in the previous section, though the
Bayesian causal forest (BCF) approach that introduced this idea was restricted to BART priors on
both functions. To reflect the fact that they believe heterogeneity to typically be small to moderate
in magnitude, they use a different BART prior for τ (x) that prioritizes smaller trees and simpler
functions. In addition to the reparameterization of this model, a key insight of this paper is that
overall estimation is improved by including an estimate of the propensity score into the prognostic
function, f (x). Specifically, they fit the following model:
\[
\mu(w, x) = f(\widehat{\pi}, x) + \tau(x)\, w,
\]
where $\widehat{\pi}$ is an initial, likely frequentist, estimate of the propensity score. It was shown that this can
help to reduce bias that the authors refer to as regularization-induced confounding bias. Essentially
this is bias that occurs when regularization is applied in an outcome model without regard to the
fact that the goal is treatment effect estimation and not prediction of the outcome. To see the utility
of these different approaches, we simulated a data set with a constant treatment effect of 0.3 that
does not vary by observed characteristics. We applied all three of the approaches considered here
for estimating E(Y1 − Y0 |X = Xi ) for i = 1, . . . , n, and the results can be seen in Figure 23.2.
We see that the approach that fits two separate models leads to substantial amounts of variability
in the individual treatment effect estimates. The 95% credible intervals still work well in that they
typically cover the true parameter, which is given by the grey line at 0.3. The BCF and single BART
approaches perform similarly well, but the BCF approach has estimates that are closer to the truth
and intervals that are generally smaller in width than the single BART approach. This is likely due to
the explicit parameterization that allows them to enforce more shrinkage and simpler BART models
for the function governing treatment effect heterogeneity.
[Scatter plot: treatment effect (y-axis, roughly −4 to 4) against subject index (x-axis), with legend entries Single BART, Separate BART, and BCF.]
FIGURE 23.2
Conditional average treatment effect estimates evaluated at each Xi from the three models for
treatment effect heterogeneity. The solid lines refer to posterior means, while the corresponding dots
represent upper and lower 95% credible intervals.
Overall, the Bayesian paradigm provides an elegant solution to this problem, offering sufficient
flexibility for estimation of conditional average treatment effects while still
being able to perform inference automatically on any estimand of interest. Future work could
look into the best choice of prior distribution in this framework for estimating heterogeneous treat-
ment effects. Recent work has shown that other priors such as Gaussian processes may perform
better with respect to uncertainty quantification than BART, particularly when there is less propen-
sity score overlap [92]. Other extensions of these ideas could involve non-binary treatments or
higher-dimensional covariate spaces. The latter of these two is a difficult problem, particularly with
respect to uncertainty quantification, but is one for which Bayesian inference is well-suited.
References
[1] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian data
analysis. CRC Press, 2013.
[2] J. Berger, “The case for objective Bayesian analysis,” Bayesian Analysis, vol. 1, no. 3, pp. 385–402, 2006.
[3] G. Consonni, D. Fouskakis, B. Liseo, and I. Ntzoufras, “Prior distributions for objective Bayesian analysis,” Bayesian Analysis, vol. 13, no. 2, pp. 627–679, 2018.
[4] W. R. Gilks, S. Richardson, and D. Spiegelhalter, Markov chain Monte Carlo in practice. CRC Press, 1995.
[5] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Handbook of Markov chain Monte Carlo. CRC Press, 2011.
[6] J. M. Robins and Y. Ritov, “Toward a curse of dimensionality appropriate (coda) asymptotic
theory for semi-parametric models,” Statistics in Medicine, vol. 16, no. 3, pp. 285–319, 1997.
[7] O. Saarela, D. A. Stephens, E. E. Moodie, and M. B. Klein, “On Bayesian estimation of marginal structural models,” Biometrics, vol. 71, no. 2, pp. 279–288, 2015.
[8] J. M. Robins, M. A. Hernán, and L. Wasserman, “On Bayesian estimation of marginal structural models,” Biometrics, vol. 71, no. 2, p. 296, 2015.
[9] O. Saarela, L. R. Belzile, and D. A. Stephens, “A Bayesian view of doubly robust causal inference,” Biometrika, vol. 103, no. 3, pp. 667–681, 2016.
[10] D. B. Rubin, “Bayesian inference for causal effects: The role of randomization,” The Annals of Statistics, vol. 6, no. 1, pp. 34–58, 1978.
[11] C. M. Zigler, “The central role of Bayes’ theorem for joint estimation of causal effects and propensity scores,” The American Statistician, vol. 70, no. 1, pp. 47–54, 2016.
[12] D. B. Rubin and N. Thomas, “Matching using estimated propensity scores: relating theory to
practice,” Biometrics, pp. 249–264, 1996.
[13] E. A. Stuart, “Matching methods for causal inference: A review and a look forward,” Statistical
Science: A Review Journal of the Institute of Mathematical Statistics, vol. 25, no. 1, p. 1, 2010.
[14] J. M. Robins, M. A. Hernán, and B. Brumback, “Marginal structural models and causal inference in epidemiology,” Epidemiology, vol. 11, no. 5, pp. 550–560, 2000.
[15] F. Li, K. L. Morgan, and A. M. Zaslavsky, “Balancing covariates via propensity score weighting,”
Journal of the American Statistical Association, vol. 113, no. 521, pp. 390–400, 2018.
[16] J. K. Lunceford and M. Davidian, “Stratification and weighting via the propensity score in
estimation of causal treatment effects: A comparative study,” Statistics in Medicine, vol. 23,
no. 19, pp. 2937–2960, 2004.
[17] L. C. McCandless, P. Gustafson, and P. C. Austin, “Bayesian propensity score analysis for
observational data,” Statistics in Medicine, vol. 28, no. 1, pp. 94–112, 2009.
[18] D. B. Rubin et al., “For objective causal inference, design trumps analysis,” Annals of Applied
Statistics, vol. 2, no. 3, pp. 808–840, 2008.
[19] D. Lunn, N. Best, D. Spiegelhalter, G. Graham, and B. Neuenschwander, “Combining MCMC with ‘sequential’ PKPD modelling,” Journal of Pharmacokinetics and Pharmacodynamics, vol. 36, no. 1, p. 19, 2009.
[20] P. E. Jacob, L. M. Murray, C. C. Holmes, and C. P. Robert, “Better together? Statistical learning in models made of modules,” arXiv preprint arXiv:1708.08719, 2017.
[21] F. Liu, M. Bayarri, J. Berger, et al., “Modularization in Bayesian analysis, with emphasis on analysis of computer models,” Bayesian Analysis, vol. 4, no. 1, pp. 119–150, 2009.
[22] L. C. McCandless, I. J. Douglas, S. J. Evans, and L. Smeeth, “Cutting feedback in Bayesian regression adjustment for the propensity score,” The International Journal of Biostatistics, vol. 6, no. 2, 2010.
[23] C. M. Zigler, K. Watts, R. W. Yeh, Y. Wang, B. A. Coull, and F. Dominici, “Model feedback in Bayesian propensity score estimation,” Biometrics, vol. 69, no. 1, pp. 263–273, 2013.
[24] P. R. Rosenbaum and D. B. Rubin, “The central role of the propensity score in observational
studies for causal effects,” Biometrika, vol. 70, no. 1, pp. 41–55, 1983.
[25] D. Kaplan and J. Chen, “A two-step Bayesian approach for propensity score analysis: Simulations and case study,” Psychometrika, vol. 77, no. 3, pp. 581–609, 2012.
[26] D. B. Rubin, “The use of propensity scores in applied Bayesian inference,” Bayesian Statistics, vol. 2, pp. 463–472, 1985.
[27] C. Wang, G. Parmigiani, and F. Dominici, “Bayesian effect estimation accounting for adjustment
uncertainty,” Biometrics, vol. 68, no. 3, pp. 661–671, 2012.
[28] A. Belloni, V. Chernozhukov, and C. Hansen, “Inference on treatment effects after selection
among high-dimensional controls,” The Review of Economic Studies, vol. 81, no. 2, pp. 608–650,
2014.
[29] C. M. Zigler and F. Dominici, “Uncertainty in propensity score estimation: Bayesian methods
for variable selection and model-averaged causal effects,” Journal of the American Statistical
Association, vol. 109, no. 505, pp. 95–107, 2014.
[30] S. M. Shortreed and A. Ertefaie, “Outcome-adaptive lasso: Variable selection for causal infer-
ence,” Biometrics, vol. 73, no. 4, pp. 1111–1122, 2017.
[31] A. Gelman and J. Hill, Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, 2006.
[32] A. Abadie and G. W. Imbens, “Matching on the estimated propensity score,” Econometrica,
vol. 84, no. 2, pp. 781–807, 2016.
[33] B. A. Brumback, “A note on using the estimated versus the known propensity score to estimate
the average treatment effect,” Statistics & Probability Letters, vol. 79, no. 4, pp. 537–542, 2009.
[34] C. M. Zigler and M. Cefalu, “Posterior predictive treatment assignment for estimating causal
effects with limited overlap,” arXiv preprint arXiv:1710.08749, 2017.
[35] R. M. Alvarez and I. Levin, “Uncertain neighbors: Bayesian propensity score matching for
causal inference,” arXiv preprint arXiv:2105.02362, 2021.
[36] S. X. Liao and C. M. Zigler, “Uncertainty in the design stage of two-stage bayesian propensity
score analysis,” Statistics in Medicine, vol. 39, no. 17, pp. 2265–2290, 2020.
[37] P. C. Austin, “A critical appraisal of propensity-score matching in the medical literature between
1996 and 2003,” Statistics in Medicine, vol. 27, no. 12, pp. 2037–2049, 2008.
[38] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical
Society: Series B (Statistical Methodology), vol. 58, no. 1, pp. 267–288, 1996.
[39] S. Vansteelandt, M. Bekaert, and G. Claeskens, “On model selection and model misspecification
in causal inference,” Statistical Methods in Medical Research , vol. 21, no. 1, pp. 7–30, 2012.
[40] D. B. Rubin, “Estimating causal effects from large data sets using propensity scores,” Annals of
Internal Medicine, vol. 127, no. 8 Part 2, pp. 757–763, 1997.
[41] M. A. Brookhart, S. Schneeweiss, K. J. Rothman, R. J. Glynn, J. Avorn, and T. Stürmer,
“Variable selection for propensity score models,” American Journal of Epidemiology, vol. 163,
no. 12, pp. 1149–1156, 2006.
[42] S. Schneeweiss, J. A. Rassen, R. J. Glynn, J. Avorn, H. Mogun, and M. A. Brookhart, “High-
dimensional propensity score adjustment in studies of treatment effects using health care claims
data,” Epidemiology (Cambridge, Mass.), vol. 20, no. 4, p. 512, 2009.
[43] X. De Luna, I. Waernbaum, and T. S. Richardson, “Covariate selection for the nonparametric
estimation of an average treatment effect,” Biometrika, p. asr041, 2011.
[44] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins,
“Double/debiased machine learning for treatment and structural parameters,” The Econometrics
Journal, vol. 21, pp. C1–C68, 01 2018.
[45] S. Athey, G. Imbens, and S. Wager, “Approximate residual balancing: debiased inference of
average treatment effects in high dimensions,” Journal of the Royal Statistical Society Series B,
vol. 80, no. 4, pp. 597–623, 2018.
[46] J. Antonelli, M. Cefalu, N. Palmer, and D. Agniel, “Doubly robust matching estimators for high
dimensional confounding adjustment,” Biometrics, vol. 74, no. 4, pp. 1171–1179, 2018.
[47] J. Antonelli and M. Cefalu, “Averaging causal estimators in high dimensions,” Journal of
Causal Inference, vol. 8, no. 1, pp. 92–107, 2020.
[48] A. Ertefaie, M. Asgharian, and D. A. Stephens, “Variable selection in causal inference using a
simultaneous penalization method,” Journal of Causal Inference, vol. 6, no. 1, 2018.
[49] J. Antonelli, G. Parmigiani, and F. Dominici, “High-dimensional confounding adjustment using
continuous spike and slab priors,” Bayesian Analysis, vol. 14, no. 3, p. 805, 2019.
[67] F. Mealli and A. Mattei, “A refreshing account of principal stratification,” The International
Journal of Biostatistics, vol. 8, no. 1, 2012.
[68] A. Mattei, F. Li, and F. Mealli, “Exploiting multiple outcomes in Bayesian principal stratification analysis with application to the evaluation of a job training program,” The Annals of Applied Statistics, vol. 7, no. 4, pp. 2336–2360, 2013.
[69] L. Forastiere, F. Mealli, and T. J. VanderWeele, “Identification and estimation of causal mechanisms in clustered encouragement designs: Disentangling bed nets using Bayesian principal stratification,” Journal of the American Statistical Association, vol. 111, no. 514, pp. 510–525, 2016.
2016.
[70] J. Roy, K. J. Lum, B. Zeldow, J. D. Dworkin, V. L. Re III, and M. J. Daniels, “Bayesian
nonparametric generative models for causal inference with missing at random covariates,”
Biometrics, vol. 74, no. 4, pp. 1193–1202, 2018.
[71] M. J. Daniels, J. A. Roy, C. Kim, J. W. Hogan, and M. G. Perri, “Bayesian inference for the
causal effect of mediation,” Biometrics, vol. 68, no. 4, pp. 1028–1036, 2012.
[72] C. Kim, M. J. Daniels, B. H. Marcus, and J. A. Roy, “A framework for Bayesian nonparametric inference for causal effects of mediation,” Biometrics, vol. 73, no. 2, pp. 401–409, 2017.
[73] L. C. McCandless, P. Gustafson, and A. Levy, “Bayesian sensitivity analysis for unmeasured
confounding in observational studies,” Statistics in Medicine, vol. 26, no. 11, pp. 2331–2347,
2007.
[74] J. Antonelli and B. Beck, “Estimating heterogeneous causal effects in time series settings with
staggered adoption: An application to neighborhood policing,” arXiv e-prints, pp. arXiv–2006,
2020.
[75] S. Chib and L. Jacobi, “Bayesian fuzzy regression discontinuity analysis and returns to compul-
sory schooling,” Journal of Applied Econometrics, vol. 31, no. 6, pp. 1026–1047, 2016.
[76] Z. Branson, M. Rischard, L. Bornn, and L. W. Miratrix, “A nonparametric Bayesian methodology for regression discontinuity designs,” Journal of Statistical Planning and Inference, vol. 202, pp. 14–30, 2019.
[77] P. R. Hahn, C. M. Carvalho, D. Puelz, J. He, et al., “Regularization and confounding in linear
regression for treatment effect estimation,” Bayesian Analysis, vol. 13, no. 1, pp. 163–182,
2018.
[78] S. Adhikari, S. Rose, and S.-L. Normand, “Nonparametric Bayesian instrumental variable analysis: Evaluating heterogeneous effects of coronary arterial access site strategies,” Journal of the American Statistical Association, vol. 115, no. 532, pp. 1635–1644, 2020.
[79] P. Müller, F. A. Quintana, A. Jara, and T. Hanson, Bayesian nonparametric data analysis.
Springer, 2015.
[80] J. L. Hill, “Bayesian nonparametric modeling for causal inference,” Journal of Computational
and Graphical Statistics, vol. 20, no. 1, pp. 217–240, 2011.
[81] H. A. Chipman, E. I. George, and R. E. McCulloch, “BART: Bayesian additive regression trees,”
The Annals of Applied Statistics, vol. 4, no. 1, pp. 266–298, 2010.
[82] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[83] D. B. Rubin, “The Bayesian bootstrap,” The Annals of Statistics, pp. 130–134, 1981.
[84] R. B. Gramacy and H. K. H. Lee, “Bayesian treed Gaussian process models with an application to computer modeling,” Journal of the American Statistical Association, vol. 103, no. 483, pp. 1119–1130, 2008.
[85] S. Banerjee, A. E. Gelfand, A. O. Finley, and H. Sang, “Gaussian predictive process mod-
els for large spatial data sets,” Journal of the Royal Statistical Society: Series B (Statistical
Methodology), vol. 70, no. 4, pp. 825–848, 2008.
[86] A. Banerjee, D. B. Dunson, and S. T. Tokdar, “Efficient Gaussian process regression for large datasets,” Biometrika, vol. 100, no. 1, pp. 75–89, 2013.
[87] V. Dorie, J. Hill, U. Shalit, M. Scott, and D. Cervone, “Automated versus do-it-yourself methods
for causal inference: Lessons learned from a data analysis competition (with discussion),”
Statistical Science, vol. 34, pp. 43–99, 02 2019.
[88] A. Oganisian and J. A. Roy, “A practical introduction to Bayesian estimation of causal effects: Parametric and nonparametric approaches,” Statistics in Medicine, vol. 40, no. 2, pp. 518–551, 2021.
[89] S. Wade, S. Mongelluzzo, and S. Petrone, “An enriched conjugate prior for Bayesian nonparametric inference,” Bayesian Analysis, vol. 6, no. 3, pp. 359–385, 2011.
[90] D. Xu, M. J. Daniels, and A. G. Winterstein, “A Bayesian nonparametric approach to causal inference on quantiles,” Biometrics, vol. 74, no. 3, pp. 986–996, 2018.
[91] P. R. Hahn, J. S. Murray, and C. M. Carvalho, “Bayesian regression tree models for causal
inference: regularization, confounding, and heterogeneous effects,” Bayesian Analysis, 2020.
[92] R. Papadogeorgou and F. Li, “Discussion of “Bayesian regression tree models for causal
inference: Regularization, confounding, and heterogeneous effects”,” Bayesian Analysis, vol. 15,
no. 3, pp. 1007–1013, 2020.
Part V
Beyond Adjustments
24
How to Be a Good Critic of an Observational Study
Dylan S. Small
CONTENTS
24.1 Smoking and Lung Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
24.2 Bross’s Criterion for Good Criticism: Show that Counterhypothesis is Tenable . . . . . 532
24.3 Bross’s Types of Bad Criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
24.3.1 Hit-and-run criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
24.3.2 Dogmatic criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
24.3.3 Speculative criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
24.3.4 Tubular criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
24.4 Less Stringent Criteria for Criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
24.4.1 Welcoming all criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
24.4.2 Welcoming more criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
24.4.3 Sensitivity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
24.5 Evaluating Criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
24.6 Self-Criticism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
the increase, and the focus of the discussion shifted. What was the cause of the epidemic? Smoking
was one theory. However, other experts thought that emissions from gas works were the cause. Still
others believed that fumes from the tarring of roads were responsible.”
There were early papers that reported an association between smoking and lung cancer and
suggested a causal effect [2–4]. But two papers published in 1950, Wynder and Graham [5] in the
US and Doll and Hill [6] in the UK, attracted more attention [1]. Many papers in the 1950s followed;
see [7] for a review. In 1957 a Study Group appointed by the National Cancer Institute, the National
Heart Institute, the American Cancer Society, and the American Heart Association, examined the
scientific evidence on the effects of smoking on health and arrived at the following conclusion: “The
sum total of scientific evidence establishes beyond reasonable doubt that cigarette smoking is a
causative factor in the rapidly increasing incidence of human epidermoid carcinoma of the lung.” [8]
But the public was not convinced – cigarette sales were at an all-time high in 1957 [9]. And there
remained fierce scientific critics including the famous statisticians R.A. Fisher and Joseph Berkson.
Motivated by what he saw as poor criticism in the debate over smoking and lung cancer, Bross [10]
formulated ground rules for good statistical criticism.
Bross [10] argued that, just as it is a requirement of the scientific method that the hypothesis be clearly stated so that it can be tested, good criticism of a scientific study should explicitly state a
counterhypothesis so that the criticism can be evaluated using the scientific method.
Bross’s Types of Bad Criticism 533
urban-rural residence would make the counterhypothesis that smoking does not cause lung cancer
tenable.
Postmenopausal hormone replacement therapy (HRT) is use of a drug containing female hormones
to replace ones a woman’s body no longer makes after menopause. Observational studies during the
1970s–1990s reported evidence that HRT protects against heart disease. For example, Stampfer et
al. [18] published an observational study of nurses in the New England Journal of Medicine that
compared postmenopausal women who had currently used HRT vs. those who had never used it
and reported that after controlling for a number of measured covariates – including age, body mass
index (BMI), and smoking – current use of HRT was associated with an estimated 70% reduction
in the risk of coronary heart disease and a 95% confidence interval of 36% to 86% reduction in
risk with a p-value for testing the null hypothesis of no effect of less than 0.002. Use of HRT
was being advocated by many in the medical community and users of HRT could be thought of
as complying with medical advice [19–23]. Petitti [24] raised the concern that the observational
studies of HRT might be biased because compliers with medical advice often have healthier lifestyles
than noncompliers [25]. Petitti considered the counterhypothesis that there was no causal effect
of HRT on heart disease and the lower observed risk of heart disease was fully accounted for by
this “compliance bias.” To evaluate whether this counterhypothesis was tenable, Petitti studied how
much of a difference in heart disease risk there was among compliers with placebo vs. noncompliers
with placebo in randomized trials of two drug treatments for coronary heart disease, beta blockers
and clofibrate. Petitti found that compliers had lower mortality (which was mostly due to heart
disease) and that the reduction in mortality was similar to the reduction among
users vs. non-users of HRT. Furthermore, Petitti found that after adjusting for similar covariates as
those adjusted for in the observational studies of HRT, the reduction in mortality among compliers
remained the same. While it can be argued whether the bias among compliers vs. noncompliers in
drug trials is comparable to that of compliers vs. noncompliers with medical advice about taking
HRT, it is plausibly so, and Petitti has at least made a serious case that her counterhypothesis is
tenable. Petitti’s criticism was a good criticism.
Bross [10] describes several classes of what he considers bad criticism, which do not satisfy his
criterion of showing that the counterhypothesis is tenable – hit-and-run criticism, dogmatic criticism,
speculative criticism, and tubular criticism. We review these next.
convincing evidence for a diagnosis of lung cancer. Thus, misclassification arguably enhanced rather
than dented the evidence Hammond and Horn’s study provided for smoking causing lung cancer.2
The second example of hit-and-run criticism that Bross [10] gives is Berkson’s [30] paper
that criticizes Hammond and Horn’s [26] study on the grounds of selection bias. Hammond and
Horn found a strong association between smoking and lung cancer and suggested that an active
educational campaign be conducted to warn the public that smoking increases the risk of lung cancer.
Berkson [30] asserted that “such a proposal seems poorly founded” because smokers in the study had
equal or lower cancer death rates than the general U.S. population3 and warned that “the operation
of selective forces in statistical investigations can be a very subtle process.” Berkson provided a
model and hypothetical numerical example of how selection bias in Hammond and Horn’s [26]
study could produce a spurious (i.e., non-causal) association between smoking and lung cancer
through the selection processes of (i) some seriously ill people at the start of the prospective study not
participating in the study and (ii) smokers being more reluctant to participate in the study. While these
selection processes may be present, Korteweg [31] showed that in Berkson’s numerical example,
if the arbitrarily chosen mortality rates in Berkson’s hypothetical example are substituted by rates
from Hammond and Horn [26] and U.S. official mortality statistics, “only a small part of the excess
in death rates for lung cancer and for coronary disease with smokers can be explained as being
spurious.” Furthermore, under Berkson’s model, if there was no causal effect of smoking on lung
cancer and the association in Hammond and Horn’s study arose entirely from selection bias, the
association between smoking and lung cancer should diminish over the course of the study but in
fact it increased [1, 10]. Berkson’s criticism is like the lawyer’s tactic of cross-examining an expert
witness by asking a hypothetical question removed from the facts of the case. The lawyer asks the
physician witness, “Would you do an MRI scan for a patient who makes repeated complaints of
severe neck pain?” If there were no evidence of such complaints, a good reply by the witness would
be “Yes, but in this case, the charts and the admission questionnaire show no such complaints.”
Another example of hit-and-run criticism was pointed out by Kodlin [32] concerning studies
that showed an association between blood group and diseases such as gastric cancer, malaria and
peptic ulcers. Alexander Wiener, a winner of the prestigious Lasker Prize, criticized these studies as
“fallacious” and founded on a “bias in the collection of data” that in borderline cases of classifying a
case as a certain disease or not, the investigator is subconsciously influenced by knowledge of the
patient’s blood-group [33, 34]. Wiener asserted that “such bias actually occurs in practice is proved
by Billington [the author of a study that found an association between gastric cancer and blood
group [35]]’s admission when he was questioned about possible bias when classifying a series of
cases of carcinoma of the stomach according to site of the lesion” [34]. In fact, Wiener asked “Can Dr.
Billington exclude this possibility [bias from differential misclassification of borderline cases]?” [33]
to which Dr. Billington replied “I admit I am unable to exclude the possibility.” Hardly proof that
such bias occurred! In fact, Wiener’s counterhypothesis that bias from differential misclassification of
borderline cases explains all the associations between blood group and disease does not seem tenable.
It would require that for certain diseases (e.g., peptic ulcers) a person with blood type O was more likely
to be called a disease case when borderline, but that for other diseases (e.g., gastric cancer) a person
with blood type A was more likely to be called a disease case. Furthermore, since the associations
between blood group and disease have been repeatedly found, it would require that independent
2 Sterling et al. [28] discussed the possibility that there is nondifferential misclassification of the cause of death where
lung cancer is more likely to be listed as the cause of death for a smoker. Flanders [29] argued that the hypothesized bias
from nondifferential misclassification would probably be of small magnitude and affect only a select subgroup of the many
investigations of smoking and lung cancer so that consequently, “Even if the bias should prove real, current ideas about
smoking and its adverse effects would change little.”
3 The study population consisted of white U.S. males from nine states and the follow-up period contained more low
death summer months than high death winter months; Hammond and Horn [26] did not claim the study population was
representative of the U.S. population. See Korteweg [31] for discussion.
Bross’s Types of Bad Criticism 535
teams, who are not necessarily particularly eager to confirm previous findings, are all biased in the
same way [32].
average relatively longevous, and this implies that death rates generally in this segment of the
population will be relatively low. After all, the small group of persons who successfully resist
the incessantly applied blandishments and reflex conditioning of the cigarette advertisers
are a hardy lot, and, if they can withstand these assaults, they should have relatively little
difficulty in fending off tuberculosis or even cancer! If it seems difficult to visualize how
such a constitutional influence can carry over to manifest itself as a graded increase of
death rate with a graded increase of intensity of smoking, then we must remember that we
are wandering in a wilderness of unknowns. I do not profess to be able to track out the
implications of the constitutional theory or to defend it, but it cannot be disposed of merely
by fiat denial.
Bross credits Berkson for labeling this constitutional hypothesis for the association between smoking
and lung cancer as speculative, but expresses concern that this speculative criticism appears to play an
important role in Berkson’s subsequent rejection of the claim that the observational studies provide
strong evidence for smoking causing lung cancer. In the summary of his paper, Berkson states this
rejection and restates the speculative criticism without the caution that it is speculative.
Bross anticipates arguments “that it is too stringent to require a critic to show that his substantive
counterhypothesis is tenable because he is not actually asserting it but merely suggesting it as a
possible line for future research.” However, “I fail to see how a critic contributes to the scientific
process if the suggested avenue for research is, in fact, a dead end road. Nor can I see how a critic
can expect to point out a sensible direction for research unless he explores the tenability of his
counterhypothesis – for example [in studies of smoking and lung cancer] whether his notion jibes
with the incidence pattern for lung cancer.”
Writing in 1959, Cornfield et al. [1] pointed out four ways in which the constitutional hypothesis
does not entirely jibe with the incidence pattern for lung cancer: (i) the rapid rise in lung cancer in
the first half of the 20th century; (ii) the carcinogenicity of tobacco tars for experimental mice; (iii)
the existence of a large association of pipe and cigar tobacco smoking with cancers of the buccal
cavity and larynx but not with cancers of the lung; and (iv) the reduced lung-cancer mortality among
discontinued cigarette smokers. To explain (i) with the constitutional hypothesis, one could assert
that the constitution of people got worse during the first half of the 20th century;5 to explain (ii), one
could assert that there is a difference between mice and humans; to explain (iii), one could assert
that the constitutions of pipe and cigar tobacco smokers tend to protect them against lung cancer
but leaves them vulnerable to cancers of the buccal cavity and larynx; and to explain (iv), one could
assert that the constitutions of cigarette smokers who quit are better than those of continuing smokers. While it is possible
that each of these assertions is true, all of them being true so that the constitutional hypothesis jibes
with the data does not seem particularly tenable. Additional features of the incidence pattern for
lung cancer that have arisen since 1959 are also hard for the constitutional hypothesis to explain. For
example, prior to the 1960s, women were much less likely to smoke than men but in the 1960s, the
tobacco industry sought to change that with products and advertising aimed at women. For instance,
beginning in 1967, a time when the women’s rights movement was gaining steam, the Virginia
Slims brand was marketed under the slogan “You’ve come a long way baby” with ads that showed
independent, stylish, confident and liberated women smoking. Women increased their smoking but
men did not [42]. Thirty years later, Bailar and Gornik [43] wrote: “For lung cancer, death rates for
women 55 or older have increased to almost four times the 1970 rate,” but rates for males over 55
and rates for other cancer sites showed no such dramatic increase.6
5 One could also augment the constitutional hypothesis with the hypothesis that some other factor that rose during the first
• Hack jobs – criticism that is motivated by a desire to muddy the waters rather than to get at the
truth. Gelman says “Even a hack can make a good point, and hacks will use legitimate arguments
where available.” The problem Gelman says with hack criticism is not in the criticism itself but
in the critical process, “a critic who aims at truth should welcome a strong response, while a hack
will be motivated to avoid any productive resolution.”
Gelman summarizes his argument by paraphrasing Al Smith’s quote “The cure for the evils of
democracy is more democracy,” by “the ills of criticism can be cured by more criticism.” That said,
Gelman recognizes “that any system based on open exchange can be hijacked by hacks, trolls, and
other insincere actors.” How to have open exchange of ideas without it being hijacked by trolls and
people seeking to just muddy the water is a major challenge. See [49] for some perspectives on
possible approaches.
Gelman makes a good point that we can learn from any statistical criticism, whatever its source.
The difficulty is that we all have limits on our time and cognitive capacity. It would be helpful if
critics labeled the type of criticism they are providing so that authors and the scientific community can
prioritize their time in thinking about the criticism.
Another interesting example Reichardt brings up, to question whether Bross's criterion for a
hypothesis or counterhypothesis to be tenable is too stringent, is the work of Ignaz Semmelweis.
Giving birth was perilous for mothers in the 1800s and childbed fever was one common cause of
death. Childbed fever was thought to be due to multiple causes and the causes of illnesses were
thought to be as unique as individuals themselves and determinable only on a case by case basis [52].
When Ignaz Semmelweis was hired as a physician at the Vienna General Hospital’s First Obstetrical
Clinic in 1846, as many as twenty percent of the women giving birth to a child in the hospital’s
First Clinic died from childbed fever; high death rates from childbed fever were common in many
hospitals [52]. Semmelweis observed that the Vienna General Hospital’s Second Obstetrical Clinic
had around a 2% death rate from childbed fever. Semmelweis was severely troubled that his First
Clinic had a much higher mortality rate than the Second Clinic. It “made me so miserable that
life seemed worthless” [57]. Semmelweis was determined to find the cause of the difference. The
First Clinic and Second Clinic admitted women on alternate days so differences in the humours of
the body did not seem a likely explanation for the difference in death rates. Semmelweis started
eliminating all possible differences between the two clinics, including even religious practices. The
only major difference was the individuals who worked there. The First Clinic was the teaching
service for medical students, while the Second Clinic had been selected in 1841 for the instruction
of midwives only. Semmelweis uncovered a telling clue when his friend Jakob Kolletschka cut his
finger while performing an autopsy and died from symptoms similar to childbed fever. Semmelweis
hypothesized that childbed fever was caused by contamination from “cadaverous material” from
doctors who often performed autopsies before serving on the obstetrics ward. This would explain the
lower rate in the Second Clinic because midwives do not perform autopsies. Semmelweis instituted a
policy that physicians wash their hands in a solution of chlorinated lime before examining patients.
The First Clinic’s death rate from childbed fever dropped 90% and became comparable to the Second
Clinic.
Instead of being lauded as the “savior of mothers,” as he is now called, Semmelweis was ridiculed
and his ideas were rejected by the medical community [58]. An example is a paper by Carl Edvard
Marius Levy, head of the Danish Maternity Institute at Copenhagen, who wrote:
If Dr. Semmelweis had limited his opinion regarding infections from corpses to puerperal
corpses, I would have been less disposed to denial than I am... the specific contagium seems
to be of little importance to Dr. Semmelweis. Indeed it is so little considered that he does not
even discuss the direct transmission of the disease from those who are ill to healthy persons
lying nearby. He is concerned only with general infection from corpses without respect to
the disease that led to death. In this respect his opinion seems improbable...a rapidly fatal
putrid infection, even if the putrid matter is introduced directly into the blood, requires
more than homeopathic doses of the poison. And, with due respect for the cleanliness of
the Viennese students, it seems improbable that enough infective matter or vapor could be
secluded around the fingernails to kill a patient...To prove his opinion, Dr. Semmelweis
ordered chlorine washings to destroy every trace of cadaverous residue on the fingers. Would
not the experiment have been simpler and more reliable if it had been arranged, at least during
the experiment, that all anatomical work would be avoided?...In spite of these reservations,
one must admit that the results of the experiment appear to support Dr. Semmelweis’s opinion,
but certainly one must admit no more. Everyone who has had the opportunity to observe the
periodic variations in the mortality rate of maternity clinics will agree that his findings lack
certain important confirmation...In the absence of more precise statistical information, it is
conceivable that the results of the last seven months depend partially on periodic accidental
factors...that insofar as they are laid out, his [Semmelweis’] views appear too unclear, his
observations too volatile, his experiences too uncertain, to deduce scientific results therefrom.
Would application of Bross’s criterion that a hypothesis or counterhypothesis should be tenable to
be publishable have suggested that in light of the concerns raised by Levy, Semmelweis’s findings
should not have been published or his suggested policy of doctors washing their hands should not
have been implemented? Levy raises some concerns that were legitimate at the time. Semmelweis’s
hypothesis that a small dose of invisible cadaverous particles cause childbed fever was less plausible
before the acceptance of the germ theory of disease. Furthermore, Semmelweis had claimed at the
time that Levy was writing that only cadaveric matter from corpses could cause childbed fever, a
claim that was incorrect [59]. Also, mortality rates of childbed fever fluctuated dramatically within
hospitals and within towns more generally [59], and consequently the data Semmelweis had compiled
at the time Levy was writing, shown in Figure 24.1, suggested handwashing was beneficial but
perhaps did not make an overwhelming case.7 However, Semmelweis did make a tenable case that
handwashing was beneficial that should have been taken seriously. Levy’s wholesale dismissal
of Semmelweis’s hypothesis – “his views appear too unclear, his observations too volatile, his
experiences too uncertain, to deduce scientific results therefrom” – is based on hit-and-run criticism
and tubular criticism. Other criticisms of Semmelweis were dogmatic. For example, some critics
blamed Semmelweis's hypothesis that invisible particles from cadaverous material cause childbed
fever on his Catholic faith. They said his idea that invisible particles could cause disease and death
was simply a product of his Catholic superstition, and argued that the presence of Catholic priests
bearing the Eucharist to dying patients was deeply frightening, and that this fright induced childbed
fever. Semmelweis tested this theory by keeping priests out of one ward while admitting them to a
second: no difference in illness or mortality was observed. Despite this, the critics continued to hold
that Semmelweis’s religion was the actual cause of the deadly disease [62].
FIGURE 24.1
Data Semmelweis compiled on the monthly proportion of maternal deaths from childbed fever in
births at the First Clinic of the Vienna Maternity Institution at the time Levy was writing his criticism
of Semmelweis's hypothesis that chlorine handwashing prevented childbed fever deaths.
In welcoming more criticism, Reichardt also makes the good point that for a counterhypothesis
to be tenable, it need not, by itself, account for all of the observed results: “I’ve seen instances where
an estimate of a treatment effect, for example, is said to be immune to a rival hypothesis because the
alternative explanation was insufficient to account for the entirety of the estimate. But more than one
bias can be present. And perhaps together they could account for the whole treatment effect estimate.
So the tenability of alternative explanations needs to be considered en masse rather than one at a
time.”
7 Levy wrote his article based on a letter written in December 1847 by Heinrich Hermann Schwartz to Professor Gustav
Adolph Michaelis, which Michaelis forwarded to Levy [60]. We have shown the data through December 1847. Data from [61].
Like Reichardt [52], the authors Ho [63], Rindskopf [64], Rosenbaum and Small [65] and Hill
and Hoggatt [66] are all generally supportive of Bross’s position that there should be standards for
criticism – “requiring critics to do more than sling random criticisms without some backing for their
statements was, and still is, a reasonable standard to meet" [64] – but they all express concerns that
the standards not be so stringent as to stifle useful criticism. Ho [63] and Rindskopf [64] point
out that the underlying data needed to do a reanalysis may not be publicly available. Making data
publicly available should be encouraged. In contrast, dismissing criticism because a reanalysis was
not done, when the data are not publicly available, might encourage making data unavailable.
Rosenbaum and Small [65] argue that a criticism can advance understanding when it points out a
logical inconsistency among data, a proponent’s assumptions, and scientific knowledge from other
sources, even if it does not make a definitive case that the proponent’s hypothesis is wrong. For
example, “Yang et al. [67] considered a plausible instrumental variable (IV) and used it to estimate a
plausible beneficial treatment effect in one population in which the true treatment effect is unknown.
They then applied the same IV to a second population in which current medical opinion holds that
this same treatment confers no benefit, finding that this IV suggests a benefit in this second population
also. Specifically, there is debate about whether delivery by caesarean section improves the survival
of extremely premature infants, but current medical opinion holds that it is of no benefit for otherwise
healthy but slightly premature infants. In contrast, the IV analysis suggested a substantial benefit
for both types of infants. In light of this, there is logical incompatibility between four items: (i)
the data, (ii) the claim that extremely premature infants benefit from delivery by caesarean section,
(iii) the claim that otherwise healthy, slightly premature infants do not benefit, and (iv) the claim that
the IV is valid in both groups of babies. Removal of any one of (i), (ii), (iii) or (iv) would remove
the inconsistency, but there is no basis for removing one and accepting the others.” This situation
where several propositions are logically inconsistent so that they cannot all be true, yet at the present
moment we are not in a position to identify which proposition(s) are false is called an aporia [68].
An aporia, though uncomfortable, is an advance in understanding: it can spur further investigation
and further advances in understanding. Socrates, in Plato’s Meno [69], thought that demonstrating
an aporia in a curious person’s thinking would spur discovery. Socrates said of a befuddled young
interlocutor whom he put in an aporia:
At first he did not know what [he thought he knew], and he does not know even now: but
at any rate he thought he knew then, and confidently answered as though he knew, and was
aware of no difficulty; whereas now he feels the difficulty he is in, and besides not knowing
does not think he knows...[W]e have certainly given him some assistance, it would seem,
towards finding out the truth of the matter: for now he will push on in the search gladly, as
lacking knowledge; whereas then he would have been only too ready to suppose he was
right...[Having] been reduced to the perplexity of realizing that he did not know...he will go
on and discover something.
Hill and Hoggatt [66] point out that we should not have harsher standards for the critic than the
proponent. If the proponent makes a claim based on an observational study which may be biased
because of unmeasured confounders, we should not reflexively dismiss a criticism based on an
alternative observational study just because the alternative observational study may be biased because
of having its own unmeasured confounders. “If we are requiring that a counter-hypothesis be tenable,
it seems the criteria should include a reasonable assessment of the plausibility of such assumptions.
However, if we are comparing competing sets of untestable assumptions (corresponding to the
proponent’s original analysis and the critic’s analysis in support of a counter-hypothesis) how should
we assess which of the sets of assumptions are most plausible? Would it be better, for instance, to
use an instrumental variables approach where the instrument is weak and the exclusion restriction
is questionable or to use an observational study where we are uncertain that we have measured all
confounders?” Hill and Hoggatt [66] propose sensitivity analyses as a way to tackle this problem.
Evidence may refute a theorem, not the theorem’s logic, but its relevance...Evidence, unlike
proof, is both a matter of degree and multifaceted. Useful evidence may resolve or shed light
on certain issues while leaving other equally important issues entirely unresolved. This is but
one of many ways that evidence differs from proof...
Evidence, even extensive evidence, does not compel belief. Rather than being forced to a
conclusion by evidence, a scientist is responsible and answerable for conclusions reached
in light of evidence, responsible to his conscience and answerable to the community of
scientists
Some of the criteria used for evaluating the credibility of news sources are also useful for
evaluating the quality of evidence presented by a proponent or a critic. For evaluating news sources,
[84] proposes the CRAAP test, a list of questions to ask about the news source related to Currency,
Relevance, Authority, Accuracy, Purpose. For evaluating the quality of evidence, trying to directly
evaluate accuracy is clearly an important consideration, but because there is typically uncertainty in
reading a criticism about its accuracy, authority and purpose are also worth considering. Questions
one should ask about authority include: has the author demonstrated expertise on the topic? It is
important, though, in evaluating authority to keep in mind that well-known does not always mean
authoritative, and that understandings of authority can themselves be biased and leave out important
voices [85]. One should try to be open to new voices. Of the self-taught mathematical genius Srinivasa Ramanujan,
who was toiling away as a shipping clerk, Eysenck [86] wrote, “He tried to interest the leading
professional mathematicians in his work, but failed for the most part. What he had to show them was
too novel, too unfamiliar, and additionally presented in unusual ways; they could not be bothered.”
One mathematician, G.H. Hardy, took Ramanujan seriously; Hardy called their fruitful collaboration
the “one romantic incident in my life” [87]. In evaluating the purpose of a criticism, one should ask,
does the author have an agenda or bias, e.g., is the author part of a group whose interests would be
affected by the research question? One should keep in mind though that as Gelman [48] mentions,
even biased sources can make a good point and even “objective” sources may have a point of view
that slips in [88]. In general, it is best to always try to think critically no matter the source. Thomas
Jefferson emphasized the importance of critical thinking for the public when reading the news and it
might equally apply to the scientific community when reading research [89]:
The basis of our governments being the opinion of the people, the very first object should
be to keep that right; and were it left to me to decide whether we should have a government
without newspapers, or newspapers without a government, I should not hesitate a moment to
prefer the latter. But I should mean that every man should receive those papers & be capable
of reading them.
One useful criterion for evaluating statistical criticism is Bross’s criterion that it should present a
tenable counterhypothesis, along with modifications discussed above that relax it. Gastwirth [90]
presents examples of how the use (lack of use) of Bross’s criterion was helpful in producing fair
(unfair) decisions in employment discrimination cases. Gastwirth also mentions an interesting public
policy example about Reye syndrome where the use of Bross’s criterion might have saved lives.
Reye syndrome is a rapidly developing serious brain disease. Pediatric specialists had suspected
that giving children aspirin to alleviate symptoms of a cold or similar childhood disease might
increase the risk of contracting Reye syndrome. In 1982, Halpin et al. [91] published a case control
study in which aspirin use in cases of Reye syndrome was compared with that in controls who were
in the case’s class or homeroom and who were of the same sex, race, and age (±1 year) and were
recently absent with an acute illness or appeared ill to the teacher or school nurse. Controlling for the
presence of fever, headache and sore throat, Halpin et al. estimated that aspirin use increased the odds
of Reye syndrome 11.5 times (p < 0.001). Based on the findings by Halpin et al. and two earlier
studies, the U.S. government initiated the process of warning the public, submitting the proposed
warning and background studies to the Office of Information and Regulatory Analysis for review.
During the review period, The Aspirin Institute, which represented the interests of the industry,
criticized the Halpin et al. study. The Institute argued that the association between aspirin use and
Reye syndrome could be due to the parents of cases having a stress-induced heightened recall of
the medications they administered, including aspirin. The Institute suggested that instead of the
controls being children who had been absent from school with an illness, the controls be formed
from children hospitalized for other diseases or who visited the emergency room, since the parents
of these children would be under a stress level more similar to that of the cases. The government
decided that another study should be conducted before warning the public about aspirin and Reye’s
syndrome. A Public Health Task Force was formed and planned the study during 1983 and a pilot
study was undertaken during mid-February through May 1984. The data analysis was reviewed and
made available in December 1984. The logistic regression model that controlled for fever and other
symptoms using all control groups yielded an estimated odds ratio of 19.0 (lower 95% confidence
limit: 4.8) and the two control groups suggested by the Aspirin Institute had the highest estimated
odds ratios (28.5 for emergency room controls and 70.2 for inpatient controls) [92]. Citing these
findings, on January 9, 1985, Health and Human Services Secretary Margaret Heckler asked all
aspirin manufacturers to begin using warning labels. The CDC reported that the number of cases of
Reye syndrome reported to it dropped from 204 in 1984 before the warning to 98 in 1985 after the
warning. See Gastwirth [90, 93] for further details and discussion of the aspirin-Reye syndrome story.
Gastwirth [90] argues that had Bross’s criterion for criticism been applied, the Aspirin Institute’s
criticism of the 1982 study – that an estimated odds ratio of 11.5 (p < 0.001) was due to recall
bias from the caregiver being under stress rather than an effect of aspirin – should not have been
considered tenable since no supporting evidence was provided for there being sufficient recall bias
to create this high an estimated odds ratio and low a p-value. Consequently, Gastwirth [90] argues,
the U.S. government should have issued a warning about aspirin causing Reye’s syndrome in 1982
rather than 1985, and lives would have been saved. Gastwirth [93] makes a similar argument from a
Bayesian perspective.
In interpreting statistical criticism, it is important to try to spot and avoid falling prey to “argument
from ignorance” (argumentum ad ignorantiam), the trap of fallacious reasoning in which a premise
is claimed to be true only because it has not been proven false or that it is false because it has not
been proven true. Hit-and-run criticism (see Section 24.3.1) is an appeal to argument from ignorance.
Other examples of argument from ignorance include the “margin of error folly” – if it could be (e.g.,
is within the margin of error), it is – or in a hypothesis testing context assuming that if a difference
isn’t significant, it is zero [50, 94]. One should not confuse a statement of “no evidence of an effect”
with one of “evidence of no effect.” In 2016, US News and World Report ran a story “Health Buzz:
Flossing Doesn’t Actually Work, Report Says” which reported on an Associated Press investigation
that found that randomized trials provide only weak evidence of flossing's benefits. While this is
true, the story's suggestion that flossing doesn't work wasn't justified. No well-powered randomized
trials of flossing's long-term effects have been conducted. A 2011 Cochrane review concluded,
“Twelve trials were included in this review which reported data on two outcomes (dental plaque and
gum disease). Trials were of poor quality and conclusions must be viewed as unreliable. The review
showed that people who brush and floss regularly have less gum bleeding compared to toothbrushing
alone. There was weak, very unreliable evidence of a possible small reduction in plaque. There was
no information on other measurements such as tooth decay because the trials were not long enough
and detecting early stage decay between teeth is difficult” [95]. Why haven’t high quality studies of
flossing’s long term effects been conducted? For one thing, it’s unlikely that an Institutional Review
Board would approve as ethical a trial in which, for example, people don’t floss for three years since
flossing is widely believed by dentists to be effective [96]. Dr. Tim Iafolla, a dental health expert
at the National Institutes of Health, said, "Every dentist in the country can look in someone's mouth
and tell whether or not they floss" [97]. Red or swollen gums that bleed easily are considered a clear
sign that flossing and better dental habits are needed [97]. Another challenge in conducting a
well-powered, long-run trial is that it would be difficult to monitor people's flossing habits over a long
period, and instead such a trial might need to rely on self-report. And people tend to report what
they think is the "right" answer when it comes to their health behaviors – e.g., say they are flossing
regardless of whether they are – creating measurement error and reducing power [97]. “The fact that
there hasn’t been a huge population-based study of flossing doesn’t mean that flossing’s not effective,”
Dr. Iafolla said. “It simply suggests that large studies are difficult and expensive to conduct when
you’re monitoring health behaviors of any kind” [97]. Arguments from ignorance are sometimes
made by supporting what is purported to be true not by direct evidence but by attacking an alternative
possibility, e.g., a clinician might say “because the research results indicate a great deal of uncertainty
about what to do, my expert judgment can do better in prescribing treatment than these results” [50].
Another way arguments from ignorance are sometimes made is through personal incredulity:
because a person finds a premise unlikely or unbelievable, it is claimed that the premise
can be assumed false, or that another preferred but unproved premise is true instead [50].
A reasoning fallacy related to argument from ignorance is falsum in uno, falsum in omnibus (false
in one thing, false in everything), which implies that someone found to be wrong on one issue must be
wrong on all others. Falsum in uno, falsum in omnibus is a common law principle for judging the
reliability of witnesses dating back to at least the Stuart Treason Trials in the late 1600s. Today,
many jurisdictions have abandoned the principle as a formal rule of evidence and instead apply the
rule as a “permissible inference that the jury may or may not draw” [98]. Judge Richard Posner drew
a distinction between “the mistakes that witnesses make in all innocence...(witnesses are prone to
fudge, to fumble, to misspeak, to misstate, to exaggerate)” and “slips that, whether or not they go
to the core of the witness’s testimony, show that the witness is a liar” [99]. In scientific arguments,
hit-and-run criticism is often an attempt to discredit a proponent’s hypothesis based on the equivalent
of a fudge, fumble, misstatement, or exaggeration rather than something that goes to the core of the
data supporting the proponent’s hypothesis or the reliability of the proponent’s work. For example,
the Intergovernmental Panel on Climate Change (IPCC), which shared the Nobel Peace Prize in
2007, wrote an over a thousand page report with 676 authors assessing the evidence for climate
change [100]. The IPCC had a procedure of only relying on data that has passed quality assurance
mechanisms such as peer review. However, it was found that an estimate of how fast the glaciers
in the Himalayas will melt – “if the present rate continues, the likelihood of them disappearing by
the year 2035 and perhaps sooner is very high” – was based on a magazine writer’s phone interview
with an Indian scientist [101]. Newspaper headlines blared “World misled over Himalayan glacier
meltdown" (Sunday Times, Jan. 7, 2010) and "UN report on glaciers melting is based on 'speculation'"
(Daily Telegraph, Jan. 7, 2010). The IPCC’s chairman R.K. Pachauri argued that falsum in uno,
falsum in omnibus should not be used to derail the IPCC’s whole science based argument: “we
slipped up on one number, I don’t think it takes anything away from the overwhelming scientific
evidence of what’s happening with the climate of this earth” [101]. See Hubert and Wainer [50] for
further discussion.
24.6 Self-Criticism
In his seminal paper on observational studies, Cochran [18] advocated that the proponent of a
hypothesis should also try to be its critic:
When summarizing the results of a study that shows an association consistent with the causal
hypothesis, the investigator should always list and discuss all alternative explanations of his
[sic] results (including different hypotheses and biases in the results) that occur to him. This
advice may sound trite, but in practice is often neglected. A model is the section "Validity of
the results" by Doll and Hill (1952) [103], in which they present and discuss six alternative
explanations of their results in a study.
While it is commonplace in scientific papers to report a study’s limitations and weaknesses, Reichardt
(2018) [52] argues that often such sections do not go to the heart of the limitations of a study:
I’ve read reports where the limitations of a study include such obvious reflections as that
the results should not be generalized beyond the population of participants and the outcome
measures that were used. But the same reports ignore warnings of much more insidious
concerns such as omitted variables and hidden biases.
One reason authors may be reluctant to acknowledge the weaknesses of a study is that the authors
are afraid that acknowledging such weaknesses would disqualify their research from publication.
Reviewers and editors should “explicitly disavow such disqualification when the research is otherwise
of high quality – because such weaknesses are simply an inherent feature in some realms of research,
such as observational studies. Even sensitivity analyses are not guaranteed to bracket the true sizes of
treatment effects” [52]. To highlight the limitations and weaknesses of a study, Reichardt advocates
that they should be put in their own section as Doll and Hill [103] did rather than in the discussion
section.
In conducting observational studies, it may be good to keep in mind the great UCLA basketball
coach John Wooden’s advice [104]:
You can’t let praise or criticism get to you. It’s a weakness to get caught up in either one.
References
[1] David Freedman. From association to causation: some remarks on the history of statistics.
Journal de la société française de statistique, 140(3):5–32, 1999.
[2] Herbert L Lombard and Carl R Doering. Cancer studies in Massachusetts: Habits, characteristics and environment of individuals with and without cancer. New England Journal of
Medicine, 198(10):481–487, 1928.
[3] Raymond Pearl. Tobacco smoking and longevity. Science, 87(2253):216–217, 1938.
[4] F.H. Müller. Tabakmissbrauch und Lungencarcinom [Tobacco abuse and lung carcinoma].
Zeitschrift für Krebsforschung, 49:57–84, 1939.
[5] Ernest L Wynder and Evarts A Graham. Tobacco smoking as a possible etiologic factor in
bronchiogenic carcinoma: a study of six hundred and eighty-four proved cases. Journal of the
American Medical Association, 143(4):329–336, 1950.
[6] Richard Doll and A Bradford Hill. Smoking and carcinoma of the lung. British Medical
Journal, 2(4682):739, 1950.
[7] Dean F Davies. A review of the evidence on the relationship between smoking and lung
cancer. Journal of Chronic Diseases, 11(6):579–614, 1960.
[8] American Association for the Advancement of Science et al. Smoking and health: Joint report
of the study group on smoking and health. Science, 125(3258):1129–1133, 1957.
[9] US Congress. False and misleading advertising (filter-tip cigarettes). In Hearings before a
Subcommittee of the Committee on Government Operations. Washington, DC: US House of
Representatives, 85th Congress, First Session, July, volume 18, pages 23–26, 1957.
[10] Irwin DJ Bross. Statistical criticism. Cancer, 13(2):394–400, 1960.
[11] Wilhelm C Hueper. A quest into the environmental causes of cancer of the lung. Number 452.
US Department of Health, Education, and Welfare, Public Health Service, 1955.
[12] Paul Kotin. The role of atmospheric pollution in the pathogenesis of pulmonary cancer: A
review. Cancer Research, 16(5):375–393, 1956.
[13] Ian Macdonald. Contributed comment: Chinks in the statistical armor. CA: A Cancer Journal
for Clinicians, 8(2):70–70, 1958.
[14] Jerome Cornfield, William Haenszel, E Cuyler Hammond, Abraham M Lilienfeld, Michael B
Shimkin, and Ernst L Wynder. Smoking and lung cancer: recent evidence and a discussion of
some questions. Journal of the National Cancer Institute, 22(1):173–203, 1959.
[15] Percy Stocks and John M Campbell. Lung cancer death rates among non-smokers and pipe
and cigarette smokers. British Medical Journal, 2(4945):923, 1955.
[16] Clarence A Mills and Marjorie Mills Porter. Tobacco smoking, motor exhaust fumes, and
general air pollution in relation to lung cancer incidence. Cancer Research, 17(10):981–990,
1957.
[17] EC Hammond and D Horn. Smoking and death rates. Part I: Total mortality. Part II: Death
rates by cause. Journal of the American Medical Association, 166:1159–1172, 1958.
[18] Meir J Stampfer, Walter C Willett, Graham A Colditz, Bernard Rosner, Frank E Speizer, and
Charles H Hennekens. A prospective study of postmenopausal estrogen therapy and coronary
heart disease. New England Journal of Medicine, 313(17):1044–1049, 1985.
[19] Veronica A Ravnikar. Compliance with hormone therapy. American Journal of Obstetrics
and Gynecology, 156(5):1332–1334, 1987.
[20] C Lauritzen. Clinical use of oestrogens and progestogens. Maturitas, 12(3):199–214, 1990.
[21] PJ Ryan, R Harrison, GM Blake, and I Fogelman. Compliance with hormone replacement
therapy (HRT) after screening for post menopausal osteoporosis. BJOG: An International
Journal of Obstetrics & Gynaecology, 99(4):325–328, 1992.
[22] Kathryn A Martin and Mason W Freeman. Postmenopausal hormone-replacement therapy.
New England Journal of Medicine, 328:1115–1117, 1993.
[23] Daniel M Witt and Tammy R Lousberg. Controversies surrounding estrogen use in post-
menopausal women. Annals of Pharmacotherapy, 31(6):745–755, 1997.
[24] Diana B Petitti. Coronary heart disease and estrogen replacement therapy can compliance bias
explain the results of observational studies? Annals of Epidemiology, 4(2):115–118, 1994.
[25] William H Shrank, Amanda R Patrick, and M Alan Brookhart. Healthy user and related
biases in observational studies of preventive interventions: a primer for physicians. Journal of
General Internal Medicine, 26(5):546–550, 2011.
[26] E Cuyler Hammond and Daniel Horn. The relationship between human smoking habits and
death rates: a follow-up study of 187,766 men. Journal of the American Medical Association,
155(15):1316–1328, 1954.
[47] Oscar Auerbach, AP Stout, E Cuyler Hammond, and Lawrence Garfinkel. Changes in
bronchial epithelium in relation to cigarette smoking and in relation to lung cancer. New
England Journal of Medicine, 265(6):253–267, 1961.
[48] Andrew Gelman. Learning from and responding to statistical criticism. Observational Studies,
4:32–33, 2018.
[49] Harrison Rainie, Janna Quitney Anderson, and Jonathan Albright. The future of free speech,
trolls, anonymity and fake news online, 2017.
[50] Lawrence Hubert and Howard Wainer. A statistical guide for the ethically perplexed. CRC
Press, 2012.
[51] W.B. Fairley and W.A. Huber. Statistical criticism and causality in prima facie proof of
disparate impact discrimination. Observational Studies, 4:11–16, 2018.
[52] Charles S Reichardt. Another ground rule. Observational Studies, 4:57–60, 2018.
[53] William Paley. Natural Theology: or, Evidences of the Existence and Attributes of the Deity,
Collected from the Appearances of Nature. Lincoln and Edmands, 1829.
[54] Bill Bryson. A short history of nearly everything. Broadway, 2004.
[55] Charles Darwin. The origin of species. John Murray, London, 1859.
[56] National Academy of Sciences. Science and Creationism: A View from the National Academy
of Sciences, Second Edition. National Academies Press, 1999.
[57] Ignaz Semmelweis. The etiology, concept, and prophylaxis of childbed fever. Univ of
Wisconsin Press, 1983.
[58] Caroline M De Costa. “the contagiousness of childbed fever”: A short history of puerperal
sepsis and its treatment. Medical Journal of Australia, 177(11):668–671, 2002.
[59] Dana Tulodziecki. Shattering the myth of Semmelweis. Philosophy of Science, 80(5):
1065–1075, 2013.
[60] K Codell Carter and George S Tate. The earliest-known account of Semmelweis's initiation
of disinfection at Vienna's Allgemeines Krankenhaus. Bulletin of the History of Medicine,
65(2):252–257, 1991.
[61] Flynn Tran. Kaggle contributor.
[62] Peter Gay. The Cultivation of Hatred: The Bourgeois Experience: Victoria to Freud (The
Bourgeois Experience: Victoria to Freud). WW Norton & Company, 1993.
[63] Daniel E Ho. Judging statistical criticism. Observational Studies, 4:42–56, 2018.
[64] D Rindskopf. Statistical criticism, self-criticism and the scientific method. Observational
Studies, 4:61–64, 2018.
[65] Paul R Rosenbaum and Dylan S Small. Beyond statistical criticism. Observational Studies,
4:34–41, 2018.
[66] J Hill and KJ Hoggatt. The tenability of counterhypotheses: A comment on bross’ discussion
of statistical criticism. Observational Studies, 4:34–41, 2018.
[67] Fan Yang, José R Zubizarreta, Dylan S Small, Scott Lorch, and Paul R Rosenbaum. Dissonant
conclusions when testing the validity of an instrumental variable. The American Statistician,
68(4):253–263, 2014.
[68] Nicholas Rescher. Aporetics: Rational deliberation in the face of inconsistency. University of
Pittsburgh Press, 2009.
[69] Plato, WRM Lamb, Robert Gregg Bury, Paul Shorey, and Harold North Fowler. Plato in
twelve volumes. Heinemann, 1923.
[70] Irwin DJ Bross. Spurious effects from an extraneous variable. Journal of Chronic Diseases,
19(6):637–647, 1966.
[71] Irwin DJ Bross. Pertinency of an extraneous variable. Journal of Chronic Diseases, 20(7):
487–495, 1967.
[72] Paul R Rosenbaum. Sensitivity analysis for certain permutation inferences in matched
observational studies. Biometrika, 74(1):13–26, 1987.
[73] Paul R Rosenbaum. Design of observational studies. Springer, 2010.
[74] James M Robins. Association, causation, and marginal structural models. Synthese, pages
151–179, 1999.
[75] Hyejin Ko, Joseph W Hogan, and Kenneth H Mayer. Estimating causal treatment effects
from longitudinal HIV natural history studies using marginal structural models. Biometrics,
59(1):152–162, 2003.
[76] Bryan E Shepherd, Mary W Redman, and Donna P Ankerst. Does finasteride affect the
severity of prostate cancer? a causal sensitivity analysis. Journal of the American Statistical
Association, 103(484):1392–1404, 2008.
[77] Vincent Dorie, Masataka Harada, Nicole Bohme Carnegie, and Jennifer Hill. A flexible,
interpretable framework for assessing sensitivity to unmeasured confounding. Statistics in
Medicine, 35(20):3453–3470, 2016.
[78] Tyler J VanderWeele and Peng Ding. Sensitivity analysis in observational research: introducing
the e-value. Annals of Internal Medicine, 167(4):268–274, 2017.
[79] Qingyuan Zhao, Dylan S Small, and Bhaswar B Bhattacharya. Sensitivity analysis for inverse
probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 81(4):735–761, 2019.
[80] E Cuyler Hammond. Smoking in relation to mortality and morbidity: Findings in first thirty-four
months of follow-up in a prospective study started in 1959. Journal of the National
Cancer Institute, 32(5):1161–1188, 1964.
[81] Paul R Rosenbaum. Observational studies, 2nd Edition. Springer, 2002.
[82] Paul R Rosenbaum. Choice as an alternative to control in observational studies. Statistical
Science, pages 259–278, 1999.
[83] Karl Popper. The logic of scientific discovery. Harper and Row, 1968. English translation of
Popper’s 1934 Logik der Forschung.
[84] Meriam Library, California State University, Chico. Evaluating information – applying the
CRAAP test, 2010.
25
Sensitivity Analysis
C.B. Fogarty
CONTENTS
25.1 Why Conduct a Sensitivity Analysis? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
25.2 Sensitivity Analysis for Matched Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
25.2.1 A model for biased assignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
25.2.2 Randomization distributions for test statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
25.2.3 Reference distributions for sharp null hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . 556
25.3 Sensitivity Analysis for Sharp Null Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
25.3.1 Bounds on p-values for sum statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
25.3.2 Sensitivity analysis for point estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
25.3.3 Sensitivity analysis for confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
25.4 Design Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
25.4.1 Bias trumps variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
25.4.2 A favorable reality unknown to the practitioner . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
25.4.3 An illustration: design sensitivities for m-statistics . . . . . . . . . . . . . . . . . . . . . . . . . 561
25.5 Multiple Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
25.5.1 Multivariate sensitivity analysis as a two-person game . . . . . . . . . . . . . . . . . . . . 564
25.5.2 Testing the intersection null . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
25.5.3 Testing individual nulls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
25.5.4 Design sensitivity with multiple comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
25.5.5 The power of a sensitivity analysis with multiple comparisons and moderate
sample sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
25.6 Sensitivity Analysis for Average Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
25.7 Sensitivity Analysis with Instrumental Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
25.8 Sensitivity Analysis after Inverse Probability Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
25.9 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
25.10 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
expedient of strong ignorability, the critic need merely suggest the existence of bias to cast doubt
upon the posited causal mechanism. It is thus incumbent upon researchers not only to anticipate such
criticism, but also to arm themselves with a suitable rejoinder.
A sensitivity analysis assesses the robustness of an observational study’s conclusions to controlled
departures from strong ignorability, determining the strength of hidden bias needed to materially
alter its findings. Insensitivity instills confidence in the findings of an observational study, while
skepticism is warranted should only a trifling degree of hidden bias be required. Through conducting
a sensitivity analysis, a critic may no longer undermine confidence in an observational study by
simply stating that hidden bias may exist; rather, the critic must specifically argue that the strength of
a proposed lurking variable could realistically exceed the maximum degree of bias which a study
could absorb while leaving the study’s conclusions intact.
The first sensitivity analysis in an observational study was performed by [1] in an article
assessing evidence for cigarette smoking causing lung cancer, quantifying the degree of imbalance on
an unobserved covariate required to produce, in the absence of a causal relationship, an association
between smoking and lung cancer to the extent observed. The approach presented therein possesses
a few drawbacks: it does not account for sampling uncertainty; it does not allow for adjustment
for measured confounders; and it is limited to binary outcome variables. This chapter describes an
approach to sensitivity analysis first developed by Paul Rosenbaum and his collaborators appropriate
for matched designs which overcomes these limitations. It then briefly presents a related approach
by [2] appropriate when using inverse probability weighting.
Alternative frameworks for sensitivity analysis have been proposed through the years. See, among
several, [3–12].
where $\mathbf{W}$ is the lexicographically ordered vector of treatment assignments $\mathbf{W} = (W_{11}, W_{12}, \ldots, W_{In_I})^T$,
and the analogous notation holds for other boldfaced quantities in what follows.
In a sensitivity analysis we imagine that treatment assignment is not strongly ignorable given the
observed covariates x alone, but that treatment would be strongly ignorable given (x, u), i.e.
$$0 < \operatorname{pr}(W_{ij} = 1 \mid y_{0ij}, y_{1ij}, x_{ij}, u_{ij}) = \operatorname{pr}(W_{ij} = 1 \mid x_{ij}, u_{ij}) < 1. \quad (25.1)$$
Observe that (25.1), along with the assumption that $\operatorname{pr}(W_{ij} = 1 \mid x_{ij}, u_{ij}) = \operatorname{pr}(W_{ij} = 1 \mid x_{ij})$, implies that
$0 < \operatorname{pr}(W_{ij} = 1 \mid y_{0ij}, y_{1ij}, x_{ij}) = \operatorname{pr}(W_{ij} = 1 \mid x_{ij}) < 1$, such that (25.1) and irrelevance of u for
treatment assignment given x provide strong ignorability given x alone.
The model for a sensitivity analysis introduced in [22] bounds the odds ratio of $\pi_{ij}$ and $\pi_{ik}$ for
two individuals j, k in the same matched set i,
$$\frac{1}{\Gamma} \le \frac{\pi_{ij}(1-\pi_{ik})}{\pi_{ik}(1-\pi_{ij})} \le \Gamma \quad (i = 1, \ldots, I;\ j, k = 1, \ldots, n_i). \quad (25.2)$$
The bound (25.2) holds if and only if the assignment probabilities admit the logit representation
$$\log\left\{\frac{\pi_{ij}}{1 - \pi_{ij}}\right\} = \kappa(x_{ij}) + \gamma u_{ij}, \quad (25.3)$$
where $\gamma = \log(\Gamma)$, $\kappa(x_{ij}) = \kappa(x_{ik})$ for all $i = 1, \ldots, I;\ j, k = 1, \ldots, n_i$, and $0 \le u_{ij} \le 1$ for
$i = 1, \ldots, I;\ j = 1, \ldots, n_i$; see [15] for a proof of this equivalence. That is, imagining a logit
form with a scalar unmeasured covariate bounded between 0 and 1 as in (25.3) imposes the same
restrictions as does (25.2), which makes no reference to the dimension of the unmeasured covariate.
Attention may be returned to the matched structure at hand by conditioning upon the event
that the observed treatment assignment satisfies the matched design. Let $\Omega$ be the set of treatment
assignments $\mathbf{w}$ adhering to the matched design, i.e. satisfying $\sum_{j=1}^{n_i} w_{ij} = 1$ for all i, and let $\mathcal{W}$ be
the event that $\mathbf{W} \in \Omega$. When (25.3) holds at $\Gamma$, the conditional distribution for $\mathbf{W} \mid \mathcal{F}, \mathcal{W}$ may be
expressed as
$$\operatorname{pr}(\mathbf{W} = \mathbf{w} \mid \mathcal{F}, \mathcal{W}) = \frac{\exp(\gamma \mathbf{w}^T \mathbf{u})}{\sum_{\mathbf{b} \in \Omega} \exp(\gamma \mathbf{b}^T \mathbf{u})} \quad (25.4)$$
$$= \prod_{i=1}^{I} \frac{\exp\left(\gamma \sum_{j=1}^{n_i} w_{ij} u_{ij}\right)}{\sum_{j=1}^{n_i} \exp(\gamma u_{ij})}.$$
We see that through conditioning upon $\mathcal{W}$, we have removed dependence upon the nuisance
parameters $\kappa(\mathbf{x})$, but that dependence upon $\mathbf{u}$ remains. At $\Gamma = 1$ (equivalently, $\gamma = 0$),
$\operatorname{pr}(W_{ij} = 1 \mid y_{0ij}, y_{1ij}, x_{ij}, u_{ij}) = \operatorname{pr}(W_{ij} = 1 \mid x_{ij})$, such that the study is free of hidden bias. In this case,
the conditional distribution (25.4) is precisely the distribution for treatment assignments in a finely
stratified experiment with 1 treated individual and $n_i - 1$ controls in each stratum [16, 17]. This
planner of an observational study should always ask himself the question, ‘how would the study
be conducted if it were possible to do it by controlled experimentation?”’ [18, p. 236]. At Γ = 1,
matching provides a reference distribution for inference on treatment effects that aligns with what
we would have attained through a finely stratified experiment.
Taking Γ > 1 in (25.2) allows for departures from equal assignment probabilities within a
matched set, be they due to the impact of hidden bias or to residual discrepancies on the basis of
observed covariates. We attain a family of conditional treatment assignment distributions (25.4) for
each Γ > 1, indexed by the length-N vector of unmeasured confounders u ∈ U, where U is the
N -dimensional unit cube.
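To make this family of distributions concrete, the following minimal sketch (in Python; the function name and interface are ours for illustration, not from any particular package) computes the within-set probability that each unit is the treated one under (25.4), for a single matched set with a supposed u and Γ.

```python
import numpy as np

def assignment_probabilities(u_i, Gamma):
    """Within one matched set, pr(unit j is the treated unit) under (25.4):
    proportional to exp(gamma * u_ij), with gamma = log(Gamma)."""
    gamma = np.log(Gamma)
    weights = np.exp(gamma * np.asarray(u_i, dtype=float))
    return weights / weights.sum()

# A matched triple at Gamma = 2: the unit with u = 1 has twice the odds of
# treatment of either unit with u = 0, the most extreme departure allowed.
print(assignment_probabilities([0.0, 0.0, 1.0], Gamma=2.0))  # [0.25 0.25 0.5]
```

At Γ = 1 the probabilities are uniform within each matched set, recovering the finely stratified experiment described above.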
$$\operatorname{pr}\{G(\mathbf{W}, \mathbf{Y}) \le k \mid \mathcal{F}, \mathcal{W}\} = \frac{\sum_{\mathbf{b} \in \Omega} \exp(\gamma \mathbf{b}^T \mathbf{u})\, 1\{G(\mathbf{b}, \mathbf{Y}) \le k\}}{\sum_{\mathbf{b} \in \Omega} \exp(\gamma \mathbf{b}^T \mathbf{u})}, \quad (25.5)$$
with the inequality $G(\mathbf{W}, \mathbf{Y}) \le k$ interpreted coordinate-wise in the multivariate case and $1\{A\}$
being an indicator that the event A occurred. The randomization distribution in (25.5) is generally
unknown to the practitioner for all k without further assumptions. For any value of $\Gamma \ge 1$, (25.5)
depends upon values for the potential outcomes that are unknown: if $W_{ij} = 1$, then $y_{0ij}$ is unknown,
and if $W_{ij} = 0$, then $y_{1ij}$ is unknown. As a result, the value $G(\mathbf{w}, \mathbf{Y})$ that would be observed under
treatment assignment $\mathbf{w} \in \Omega$ is generally unknown for all $\mathbf{w} \in \Omega$ save the observed assignment
$\mathbf{W}$ without additional assumptions. For $\Gamma > 1$, (25.5) further depends upon the unknown vector of
unmeasured confounders $\mathbf{u} \in \mathcal{U}$.
A sharp null hypothesis asserts that known functions of the potential outcomes agree for every individual in the study,
$$H_{sharp}: f_{0ij}(y_{0ij}) = f_{1ij}(y_{1ij}) \quad (i = 1, \ldots, I;\ j = 1, \ldots, n_i), \quad (25.6)$$
for pre-specified functions $f_{0ij}(\cdot)$ and $f_{1ij}(\cdot)$. In so doing the observed outcome $Y_{ij}$ imputes the missing value for the function of the potential
outcomes under the treatment assignment that was not observed. Perhaps the most famous sharp
null hypothesis is Fisher's sharp null hypothesis of no treatment effect for any individual in the
study, which can be tested by choosing $f_{0ij}(y_{0ij}) = y_{0ij}$ and $f_{1ij}(y_{1ij}) = y_{1ij}$. The supposition
that the treatment effect is constant at some value $\tau_0$ for all individuals can be tested by setting
$f_{0ij}(y_{0ij}) = y_{0ij}$ and $f_{1ij}(y_{1ij}) = y_{1ij} - \tau_0$. Other choices of $f_{1ij}(\cdot)$ and $f_{0ij}(\cdot)$ can yield tests
allowing for subject-specific causal effects such as dilated treatment effects, displacement effects and
tobit effects; see [23, §5] and [19, §§2.4-2.5] for an overview.
From our data alone we observe
$$F_{ij} = W_{ij} f_{1ij}(y_{1ij}) + (1 - W_{ij}) f_{0ij}(y_{0ij}). \quad (25.7)$$
Under the assumption of the sharp null hypothesis (25.6), we further have that $\mathbf{F} = \mathbf{f}_0 = \mathbf{f}_1$, where
$\mathbf{F}$, $\mathbf{f}_0$, and $\mathbf{f}_1$ are the lexicographically ordered vectors of length N containing $F_{ij}$, $f_{0ij}(y_{0ij})$, and
$f_{1ij}(y_{1ij})$ in their entries. Let $T(\mathbf{W}, \mathbf{F})$ be a scalar-valued test statistic, and suppose that larger
values for the test statistic reflect evidence against the null hypothesis. The right-tail probability for
$T(\mathbf{W}, \mathbf{F})$ is
$$\operatorname{pr}\{T(\mathbf{W}, \mathbf{F}) \ge t \mid \mathcal{F}, \mathcal{W}, H_{sharp}\} = \frac{\sum_{\mathbf{b} \in \Omega} \exp(\gamma \mathbf{b}^T \mathbf{u})\, 1\{T(\mathbf{b}, \mathbf{F}) \ge t\}}{\sum_{\mathbf{b} \in \Omega} \exp(\gamma \mathbf{b}^T \mathbf{u})}. \quad (25.8)$$
Attention is restricted in what follows to test statistics of the form
$$T(\mathbf{W}, \mathbf{F}) = \mathbf{W}^T \mathbf{q} \quad (25.9)$$
for some vector $\mathbf{q} = \mathbf{q}(\mathbf{F})$, referred to as a sum statistic. Most common statistics take this form, and
a few examples for testing Fisher's sharp null of no effect follow to help build intuition. In paired
observational studies, McNemar's test statistic with binary outcomes takes $q_{ij} = Y_{ij}$, while the
difference in means with any outcome variable type can be attained by choosing $q_{ij} = (Y_{ij} - Y_{ij'})/I$
in a paired design. To recover Wilcoxon's signed rank statistic, let $d_i$ be the ranks of $|Y_{i1} - Y_{i2}|$ from
1 to I, and let $q_{ij} = d_i 1\{F_{ij} > F_{ij'}\}$. With multiple controls, $q_{ij} = \sum_{j' \ne j} (Y_{ij} - Y_{ij'})/\{I(n_i - 1)\}$
returns the treated minus control difference in means. The aligned rank test of [21] first forms
aligned responses in each stratum i as $Y_{ij} - \sum_{j'=1}^{n_i} Y_{ij'}/n_i$, ranks the aligned responses from 1 to
N (temporarily ignoring the stratification), and then sets $q_{ij}$ equal to the rank of the ijth aligned
response.
When $T(\mathbf{W}, \mathbf{F}) = \mathbf{W}^T \mathbf{q}(\mathbf{F})$ is a sum statistic, intuition suggests that (25.8) will be larger when
large values for $q_{ij}$ correspond to higher treatment assignment probabilities in a matched set, and
will be smaller when large values for $q_{ij}$ correspond to lower treatment assignment probabilities. In
matched pair designs we can construct random variables $T_\Gamma^-$ and $T_\Gamma^+$ whose upper tail probabilities
bound (25.8) from below and from above for any t at any value of $\Gamma$ in (25.2). These random variables
accord with the above intuition: let $T_\Gamma^+$ have the distribution (25.8) with $u_{ij} = 1\{q_{ij} \ge q_{ij'}\}$, and let
$T_\Gamma^-$ have the distribution (25.8) with $u_{ij} = 1\{q_{ij} \le q_{ij'}\}$ for $i = 1, \ldots, I$ and $j \ne j' = 1, 2$. In each
matched set, $T_\Gamma^+$ gives the largest possible probability, $\Gamma/(1+\Gamma)$, to the larger value of $\{q_{i1}, q_{i2}\}$ for
$i = 1, \ldots, I$, and $T_\Gamma^-$ gives the lowest possible probability, $1/(1+\Gamma)$, to the larger value of $\{q_{i1}, q_{i2}\}$
for $i = 1, \ldots, I$. Then, for any t and any sample size, if (25.2) holds at $\Gamma$,
$$\operatorname{pr}(T_\Gamma^- \ge t) \le \operatorname{pr}(\mathbf{W}^T \mathbf{q} \ge t \mid \mathcal{F}, \mathcal{W}, H_{sharp}) \le \operatorname{pr}(T_\Gamma^+ \ge t). \quad (25.10)$$
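For matched pairs, the bounding variable $T_\Gamma^+$ is easy to simulate directly. The sketch below is an illustration under stated assumptions (Fisher's sharp null and no ties among the absolute pair differences; the function name is ours), giving a Monte Carlo approximation to the upper bound in (25.10) for Wilcoxon's signed rank statistic.

```python
import numpy as np

rng = np.random.default_rng(0)

def wilcoxon_upper_bound_pvalue(y_treated, y_control, Gamma, draws=100_000):
    """Monte Carlo tail probability of T_Gamma^+ for Wilcoxon's signed rank
    statistic in matched pairs (assumes no ties among |differences|)."""
    d = np.asarray(y_treated, dtype=float) - np.asarray(y_control, dtype=float)
    ranks = (np.argsort(np.argsort(np.abs(d))) + 1).astype(float)  # ranks 1..I
    t_obs = ranks[d > 0].sum()                 # observed signed rank statistic
    p_plus = Gamma / (1 + Gamma)               # worst-case per-pair probability
    # T_Gamma^+ adds rank_i with probability Gamma/(1+Gamma) in pair i, else 0
    sims = (rng.random((draws, d.size)) < p_plus).astype(float) @ ranks
    return np.mean(sims >= t_obs)
```

At Γ = 1 this recovers, up to simulation error, the usual randomization p-value; as Γ grows the bound increases, and the largest Γ at which the bound stays below a chosen level summarizes the study's sensitivity to hidden bias.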
For a subclass of sum statistics called sign-score statistics, we can similarly construct bounding
random variables $T_\Gamma^-$ and $T_\Gamma^+$ such that (25.10) holds after matching with multiple controls; see [23,
§4.4] for details. When using other sum statistics after matching with multiple controls, [28] show
that for any t, $\operatorname{pr}(\mathbf{W}^T \mathbf{q} \ge t \mid \mathcal{F}, \mathcal{W}, H_{sharp})$ is minimized or maximized for distinct vectors $\mathbf{u}$ in
$\mathcal{U}^-$ and $\mathcal{U}^+$ respectively, where both $\mathcal{U}^-$ and $\mathcal{U}^+$ are sets containing $\prod_{i=1}^I (n_i - 1)$ binary vectors.
Unlike in the paired case, however, the values for $\mathbf{u}$ attaining the lower and upper bounds may vary as
a function of t. More troublesome, finding these exact bounds becomes computationally intractable
due to the requirement of enumerating the $\prod_{i=1}^I (n_i - 1)$ elements of $\mathcal{U}^-$ and $\mathcal{U}^+$.
Owing to these computational limitations, [25] turned to an asymptotic approximation to provide
upper and lower bounds on $\operatorname{pr}(\mathbf{W}^T \mathbf{q} \ge t \mid \mathcal{F}, \mathcal{W}, H_{sharp})$ based upon a normal approximation for
the distribution of $\mathbf{W}^T \mathbf{q}$. These bounds are not valid for all t, but are valid for the values of t relevant
to conducting inference: the lower bound is valid for any t less than the smallest possible expectation
for $\mathbf{W}^T \mathbf{q}$, and the upper bound is valid for any t greater than the largest possible expectation for
$\mathbf{W}^T \mathbf{q}$. As this random variable may be expressed as the sum of I independent random variables,
the approximation is justified under mild regularity conditions on the constants $\mathbf{q}$ ensuring that
the Lindeberg-Feller central limit theorem holds. We will consider the problem of upper bounding
$\operatorname{pr}(\mathbf{W}^T \mathbf{q} \ge t \mid \mathcal{F}, \mathcal{W}, H_{sharp})$ when (25.2) is assumed to hold at $\Gamma$ by constructing a normal random
variable $\tilde{T}_\Gamma^+$ with a suitable mean and variance; the procedure for finding the lower bound $\tilde{T}_\Gamma^-$
is analogous. Roughly stated, we proceed by finding the vector $\mathbf{u}$ which maximizes the expected
value of $\mathbf{W}^T \mathbf{q} \mid \mathcal{F}, \mathcal{W}$ when (25.2) is assumed to hold at $\Gamma$; if multiple vectors $\mathbf{u}$ yield the same
maximal expectation, we choose the one which maximizes the variance. Then, we simply compute
the probability that a normal random variable with this mean and variance exceeds a given value t
larger than the maximal expectation. Importantly, the mean and variance for this normal variable
may be pieced together separately on a stratum-by-stratum basis, requiring an optimization over
$n_i - 1$ candidate solutions for each matched set i rather than requiring global optimization over
$\prod_{i=1}^I (n_i - 1)$ binary vectors. [25] refer to this optimization as possessing asymptotic separability.
To proceed, rearrange the values $q_{ij}$ in each matched set i such that $q_{i1} \le q_{i2} \le \ldots \le q_{in_i}$. Let
$\mathcal{U}_i^+$ denote the collection of $n_i - 1$ binary vectors for the ith stratum of the form $u_{i1} = \ldots = u_{ia_i} = 0$
and $u_{ia_i+1} = \ldots = u_{in_i} = 1$ for some $a_i = 1, \ldots, n_i - 1$. Let $\mu_i^+$ be the largest possible expectation
for $u_i \in \mathcal{U}_i^+$,
$$\mu_i^+ = \max_{u_i \in \mathcal{U}_i^+} \frac{\sum_{j=1}^{n_i} \exp(\gamma u_{ij}) q_{ij}}{\sum_{j=1}^{n_i} \exp(\gamma u_{ij})} = \max_{a_i = 1, \ldots, n_i - 1} \frac{\sum_{j=1}^{a_i} q_{ij} + \Gamma \sum_{j=a_i+1}^{n_i} q_{ij}}{a_i + \Gamma(n_i - a_i)},$$
and let $\nu_i^+$ be the corresponding variance at the maximizing $u_i$, with ties in the expectation broken
by taking the largest variance.
By independence of conditional treatment assignments across matched sets, the expectation and
variance of the asymptotically bounding normal random variable $\tilde{T}_\Gamma^+$ are
$$E(\tilde{T}_\Gamma^+) = \sum_{i=1}^I \mu_i^+; \qquad \operatorname{var}(\tilde{T}_\Gamma^+) = \sum_{i=1}^I \nu_i^+. \quad (25.11)$$
The asymptotic upper bound for $\operatorname{pr}(\mathbf{W}^T \mathbf{q} \ge t \mid \mathcal{F}, \mathcal{W})$ returned by this procedure whenever
$t \ge E(\tilde{T}_\Gamma^+)$ is
$$1 - \Phi\left\{\frac{t - E(\tilde{T}_\Gamma^+)}{\sqrt{\operatorname{var}(\tilde{T}_\Gamma^+)}}\right\}, \quad (25.12)$$
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal. As seen through its
construction, this asymptotic approximation reduces an optimization problem with $\prod_{i=1}^I (n_i - 1)$
candidate solutions to I tractable optimization problems which may be solved in isolation, each
requiring enumeration of only $n_i - 1$ candidate solutions.
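The separable procedure lends itself to a short implementation. The sketch below (assuming numpy and scipy are available; the function names are ours) enumerates the $n_i - 1$ candidates within each matched set to assemble the moments (25.11), and returns the upper bound (25.12).

```python
import numpy as np
from scipy.stats import norm

def stratum_worst_case(q_i, Gamma):
    """Largest expectation of the treated unit's q over the n_i - 1 candidate
    u-vectors in U_i^+, with ties broken by the largest variance."""
    q = np.sort(np.asarray(q_i, dtype=float))
    best_mu, best_nu = -np.inf, -np.inf
    for a in range(1, q.size):            # u = 0 for the lowest a units, 1 above
        w = np.ones(q.size)
        w[a:] = Gamma                     # exp(gamma * u_ij) = Gamma when u_ij = 1
        p = w / w.sum()                   # within-set assignment probabilities
        mu = p @ q
        nu = p @ q**2 - mu**2
        if mu > best_mu or (np.isclose(mu, best_mu) and nu > best_nu):
            best_mu, best_nu = mu, nu
    return best_mu, best_nu

def upper_bound_pvalue(q_sets, t_obs, Gamma):
    """Asymptotic upper bound (25.12) on pr(W'q >= t | F, W, H_sharp) at Gamma,
    for t_obs at least as large as the worst-case expectation."""
    moments = [stratum_worst_case(q_i, Gamma) for q_i in q_sets]
    mean = sum(m for m, _ in moments)
    var = sum(v for _, v in moments)
    return 1 - norm.cdf((t_obs - mean) / np.sqrt(var))
```

Each matched set contributes its own small optimization, so the overall cost grows linearly in the number of sets, reflecting the asymptotic separability noted above.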
To find the moments $E(\tilde{T}_\Gamma^-)$ and $\operatorname{var}(\tilde{T}_\Gamma^-)$ needed for the asymptotic lower bound, simply replace
$q_{ij}$ with $\tilde{q}_{ij} = -q_{ij}$ for all ij and follow the above procedure, finding expectations $\tilde{\mu}_i^+$ and variances $\tilde{\nu}_i^+$ for
$i = 1, \ldots, I$ for the test statistic $\mathbf{W}^T \tilde{\mathbf{q}}$. Then, take $E(\tilde{T}_\Gamma^-) = -\sum_{i=1}^I \tilde{\mu}_i^+$ and $\operatorname{var}(\tilde{T}_\Gamma^-) = \sum_{i=1}^I \tilde{\nu}_i^+$.
For any $t \le E(\tilde{T}_\Gamma^-)$, $\operatorname{pr}(\mathbf{W}^T \mathbf{q} \ge t \mid \mathcal{F}, \mathcal{W}, H_{sharp})$ is asymptotically lower bounded by (25.12)
with $\tilde{T}_\Gamma^+$ replaced by $\tilde{T}_\Gamma^-$.
Suppose the treatment effect is additive and constant across individuals,
$$y_{1ij} = y_{0ij} + \tau \quad (i = 1, \ldots, I;\ j = 1, \ldots, n_i). \quad (25.13)$$
Consider first testing the null hypothesis that $\tau = \tau_0$ in (25.13). By setting $f_{0ij}(y_{0ij}) = y_{0ij}$ and
$f_{1ij}(y_{1ij}) = y_{1ij} - \tau_0$ in (25.6) for all ij, we have that $F_{ij}$ in (25.7) are the adjusted responses
$Y_{ij} - W_{ij}\tau_0$, and that $\mathbf{F} = \mathbf{f}_0 = \mathbf{f}_1$. Therefore, we are entitled to the reference distribution (25.8) for
conducting inference, and the methods for sensitivity analysis described in §25.3 may be deployed
whenever $T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau_0)$ is a sum statistic. As we now describe, this facilitates sensitivity analyses
for point estimates and confidence intervals. We assume moving forward that $T(\mathbf{w}, \mathbf{Y} - \mathbf{w}\tau)$ is
both a sum statistic and a monotone decreasing function of $\tau$ for any $\mathbf{Y}$ and any $\mathbf{w} \in \Omega$. The latter
condition simply requires that the statistic T measures the size of the treatment effect. We will also
assume that the potential outcomes are continuous, as a constant treatment effect model makes little
sense with discrete outcomes.
In [24], Hodges and Lehmann consider a general approach to producing point estimates based
upon hypothesis tests. Their idea, loosely stated, is to find the value of $\tau_0$ such that the adjusted
responses $\mathbf{Y} - \mathbf{W}\tau_0$ appear to be exactly free of a treatment effect as measured by the statistic T.
When $T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau_0)$ is continuous as a function of $\tau_0$, this is accomplished by finding the value
of $\tau_0$ such that the observed value of $T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau_0)$ exactly equals its expectation when assuming
(25.13) holds at $\tau_0$. The approach can be modified to accommodate discontinuous statistics T such as
rank statistics as follows. Letting $M_{\Gamma,u,\tau}$ be the expectation of $T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau_0)$ when the treatment
effect is constant at $\tau_0$ and (25.2) holds at $\Gamma$ with vector of hidden bias $\mathbf{u}$, the Hodges-Lehmann
estimator $\hat{\tau}_{HL}$ is
$$\hat{\tau}_{HL} = \left[\inf\{\tau : T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau) < M_{\Gamma,u,\tau}\} + \sup\{\tau : T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau) > M_{\Gamma,u,\tau}\}\right]/2, \quad (25.14)$$
i.e. the average of the smallest value of $\tau$ such that the test statistic falls below its expectation and the
largest $\tau$ such that the test statistic falls above its expectation.
At Γ = 1 in (25.2) there is a single Hodges-Lehmann estimator for each choice of test statistic.
For Γ > 1, the single estimator instead becomes an interval of estimates: each value for the vector
of hidden bias u can potentially provide a unique value for MΓ,u,τ . This interval of point estimates
is straightforward to construct given our assumption of a sum statistic that is monotone decreasing
in τ , as we can leverage steps involved in the procedure of §25.3.1 for constructing asymptotic tail
bounds. Recalling that $T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau)$ declines with $\tau$, to find the lower bound for the interval of
Hodges-Lehmann estimates we compute (25.14) replacing $M_{\Gamma,u,\tau}$ with $E(\tilde{T}_{\Gamma,\tau}^+)$, the largest possible
expectation for $T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau)$ as given by (25.11). For the upper bound, we instead compute
(25.14) with $M_{\Gamma,u,\tau}$ replaced by $E(\tilde{T}_{\Gamma,\tau}^-)$, the smallest possible expectation for $T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau)$.
Here $\tilde{T}_{\Gamma,\tau}^+$ and $\tilde{T}_{\Gamma,\tau}^-$ are the asymptotically upper and lower bounding normal random variables for
$T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau)$ returned through the procedure in §25.3.1.
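A grid search makes the interval of Hodges-Lehmann estimates concrete. The sketch below is ours and restricts to matched pairs with the difference in means as the sum statistic, in which case the extreme expectations reduce to $\pm\frac{\Gamma-1}{\Gamma+1}$ times the mean absolute adjusted difference; the grid should be wide and dense enough to cover both endpoints.

```python
import numpy as np

def hl_estimate_interval(tau_hat, Gamma, grid):
    """Interval of Hodges-Lehmann estimates at Gamma for matched pairs, using
    T(tau) = mean(tau_hat - tau); `grid` is a wide, dense array of tau values."""
    tau_hat, grid = np.asarray(tau_hat, dtype=float), np.asarray(grid, dtype=float)
    lam = (Gamma - 1) / (Gamma + 1)
    T = np.array([np.mean(tau_hat - t) for t in grid])
    E_plus = lam * np.array([np.mean(np.abs(tau_hat - t)) for t in grid])
    # Lower endpoint: (25.14) with M_{Gamma,u,tau} set to the largest possible
    # expectation E_plus; upper endpoint: with the smallest expectation, -E_plus.
    lower = (grid[T < E_plus].min() + grid[T > E_plus].max()) / 2
    upper = (grid[T < -E_plus].min() + grid[T > -E_plus].max()) / 2
    return lower, upper
```

At Γ = 1 the two endpoints coincide at the usual Hodges-Lehmann estimate; the interval of point estimates widens as Γ grows.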
$$\operatorname{pr}\left\{\frac{I^{1/2}(T_I - \mu)}{\sigma} \ge k\right\} \to 1 - \Phi(k).$$
Because the data are observational, the researcher conducts a sensitivity analysis using T as a
test statistic when assuming that the sharp null holds. For any value of Γ in (25.2) the practitioner
uses the method in §25.3.1 to calculate the moments of a normal random variable which provides,
asymptotically, an upper bound on the upper tail probability of T under the assumption that the sharp
null holds. Let $\mu_\Gamma = E(\tilde{T}_\Gamma^+)/I$ and $\sigma_\Gamma^2/I = \operatorname{var}(\tilde{T}_\Gamma^+)/I^2$ be the expectation and variance
for $T_I = T/I$ returned by the procedure in §25.3.1. If $\alpha$ is the desired level of the procedure, using a normal
approximation the sensitivity analysis rejects at $\Gamma$ if
$$\frac{I^{1/2}(T_I - \mu_\Gamma)}{\sigma_\Gamma} \ge k_\alpha,$$
where $k_\alpha = \Phi^{-1}(1 - \alpha)$, an event whose probability tends to 1 if $\mu > \mu_\Gamma$ and to 0 otherwise. That is to say, because bias due to unmeasured
confounding is O(1), whether or not the sensitivity analysis rejects depends solely upon whether or
not the worst-case expectation at Γ under the null, µΓ , exceeds the true value of the test statistic’s
expectation, µ, with neither σ nor σΓ playing a role. The value of Γ for which µ = µΓ is called the
design sensitivity and is denoted by Γ̃. It is thus desirable to use test statistics with larger values for
the design sensitivity when the alternative is true, as these test statistics provide evidence which is
more robust to adversarially aligned hidden bias.
The influence of outlying observations on the test statistic can be limited through the choice of an estimating equation. For simplicity we restrict attention to matched
pairs in this illustration; the concepts extend to matching with multiple controls without issue, albeit
with messier formulae. We closely follow the presentation in [28, 31]. Suppose we are interested
in performing a sensitivity analysis for the null hypothesis that the additive treatment effect model
(25.13) holds with effect τ0 . Let τ̂i = (Wi1 −Wi2 )(Yi1 −Yi2 ) be the treated-minus-control difference
in means in the ith pair. We will consider using an m-statistic to conduct inference, which takes the
form
$$T(\mathbf{W}, \mathbf{Y} - \mathbf{W}\tau_0) = \sum_{i=1}^{I} \psi\left(\frac{\hat{\tau}_i - \tau_0}{h_{\tau_0}}\right),$$
where ψ is an odd function, ψ(x) = −ψ(−x), which is nonnegative for x > 0; and hτ0 is a scaling
factor, typically taken to be a quantile of the absolute differences |τ̂i − τ0 |. For any member of this
class, if (25.2) holds at Γ then under the null the right-tail probabilities for T are bounded above and
below by variables TΓ+ and TΓ− with an intuitive construction as in (25.10): let TΓ+ be the sum of I
independent random variables that take the value ψi = ψ(|τ̂i − τ0 |/hτ0 ) with probability Γ/(1 + Γ)
and −ψi otherwise. Likewise, let TΓ− be the sum of I independent random variables that take the
value $\psi_i$ with probability $1/(1+\Gamma)$ and $-\psi_i$ otherwise. $T^{+}_{\Gamma}$ thus puts the largest possible mass in each pair on $\psi_i$, the positive value, while $T^{-}_{\Gamma}$ puts the largest mass on $-\psi_i$.
From this construction we see that the expectation and variance of $T^{+}_{\Gamma}$ take the form
\[ E(T^{+}_{\Gamma}) = \frac{\Gamma - 1}{1 + \Gamma}\sum_{i=1}^{I} \psi_i, \qquad \mathrm{var}(T^{+}_{\Gamma}) = \frac{4\Gamma}{(1 + \Gamma)^{2}}\sum_{i=1}^{I} \psi_i^{2}, \]
such that under mild regularity conditions on ψi , asymptotically we can reject the null hypothesis
with a greater-than alternative at level α whenever
\[ \frac{T - \dfrac{\Gamma - 1}{1 + \Gamma}\sum_{i=1}^{I}\psi_i}{\sqrt{\dfrac{4\Gamma}{(1 + \Gamma)^{2}}\sum_{i=1}^{I}\psi_i^{2}}} \ge \Phi^{-1}(1 - \alpha). \]
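To make the rejection rule concrete, here is a minimal R sketch, not taken from the chapter's software; the function name sens_mstat and the Huber-type ψ with truncation at cut are illustrative choices, and the normal approximation is used exactly as in the display above.

```r
# Minimal sketch (not the chapter's software): worst-case deviate at Gamma for a
# paired m-statistic, using the normal approximation in the display above.
sens_mstat <- function(tau_hat, tau0 = 0, Gamma = 1, cut = 2.5, alpha = 0.05) {
  h <- median(abs(tau_hat - tau0))               # scaling factor h_{tau0}
  x <- (tau_hat - tau0) / h
  psi <- sign(x) * pmin(abs(x), cut)             # an odd, Huber-type psi (illustrative)
  # For odd psi nonnegative on x > 0, psi(|x|) = |psi(x)|, giving psi_i below
  psi_i <- abs(psi)
  mu <- (Gamma - 1) / (1 + Gamma) * sum(psi_i)   # worst-case null expectation E(T_Gamma^+)
  v  <- 4 * Gamma / (1 + Gamma)^2 * sum(psi_i^2) # worst-case null variance var(T_Gamma^+)
  deviate <- (sum(psi) - mu) / sqrt(v)
  list(deviate = deviate, reject = deviate >= qnorm(1 - alpha))
}
```

Inverting this test over a grid of Γ values yields the largest Γ at which rejection still occurs, the so-called sensitivity value.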
Until this point inference has been performed conditional upon F, and a generative model for F has been neither required nor assumed. For calculating design sensitivity it is convenient to assume a superpopulation model, imagining that the responses are themselves drawn from a distribution. We imagine we are in the favorable situation described in §25.4.2 of an effect equal to $\tau > \tau_0$ and no hidden bias, and imagine that the $\hat\tau_i$ are generated through
\[ \hat\tau_i = \varepsilon_i + \tau, \qquad (25.16) \]
where the $\varepsilon_i$ are drawn iid from a distribution P with mean zero and finite variance.
From the discussion in §25.4.1, we see that the design sensitivity will be the value Γ such that $E(T) = \{(\Gamma-1)/(1+\Gamma)\}\sum_{i=1}^{I} E\{\psi(|\hat\tau_i - \tau_0|/\eta)\}$, where η is the probability limit of $h_{\tau_0}$. Solving for Γ, we find that
\[ \tilde\Gamma = \frac{E\{\psi(|\hat\tau_i - \tau_0|/\eta)\} + E\{\psi((\hat\tau_i - \tau_0)/\eta)\}}{E\{\psi(|\hat\tau_i - \tau_0|/\eta)\} - E\{\psi((\hat\tau_i - \tau_0)/\eta)\}} = \frac{\int_{0}^{\infty} \psi(|y|/\eta)\, dF(y)}{\int_{-\infty}^{0} \psi(|y|/\eta)\, dF(y)}, \]
where F(·) is the distribution function of $\hat\tau_i - \tau_0$, and η is the median of the distribution of $|\hat\tau_i - \tau_0|$.

TABLE 25.1
Design sensitivities for three m-statistics under different generative models and different magnitudes of treatment effect. [Entries not reproduced.]
We now compare design sensitivities for three m-statistics under different distributional assumptions for $\varepsilon_i$. The competing test statistics use $h_{\tau_0}$ equal to the median value of $|\hat\tau_i - \tau_0|$, and have ψ functions
\[ \psi_t(x) = x, \qquad \psi_{hu}(x) = \mathrm{sgn}(x)\min(|x|, 2), \qquad \psi_{in}(x) = (4/3)\,\mathrm{sgn}(x)\max\{0, \min(2, |x|) - 1/2\}. \]
The function $\psi_t(\cdot)$ simply returns the permutational t-test based upon the treated-minus-control difference in means. We consider performing “outer trimming” with $\psi_{hu}$, which uses Huber's weighting function with weights leveling off at 2, hence limiting the impact of outlying points. We also consider “inner trimming” with $\psi_{in}$, which levels off at 2 (outer trimming) but is modified such that any points between 0 and 1/2 have no influence (inner trimming) [28]. For each of these ψ functions, we compare results when $\tau_0 = 0$, $\tau = 1/4, 1/2, 3/4, 1$ and (a) $\varepsilon_i$ are iid standard normal; (b) $\sqrt{3}\varepsilon_i$ are t-distributed with 3 degrees of freedom; and (c) $\sqrt{2}\varepsilon_i$ are Laplace, or double exponential, distributed. The scalings of $\varepsilon_i$ in (b) and (c) ensure that $\varepsilon_i$ have variance 1 in all three settings.
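The design sensitivities compared below can be approximated by Monte Carlo using the integral formula above. The following R sketch is illustrative (function and argument names are ours, not the chapter's), with $\tau_0 = 0$ as in the text:

```r
# Monte Carlo sketch of the design sensitivity formula, assuming tau0 = 0.
design_sensitivity <- function(psi, tau, rdist = rnorm, n = 1e6) {
  y   <- rdist(n) + tau              # Y = tau_hat_i - tau0 under model (25.16)
  eta <- median(abs(y))              # probability limit of h_{tau0}: median of |Y|
  a   <- mean(psi(abs(y) / eta))     # E psi(|Y|/eta)
  b   <- mean(psi(y / eta))          # E psi(Y/eta)
  (a + b) / (a - b)                  # Gamma-tilde = (a + b) / (a - b)
}
psi_t  <- function(x) x
psi_hu <- function(x) sign(x) * pmin(abs(x), 2)
psi_in <- function(x) (4 / 3) * sign(x) * pmax(0, pmin(2, abs(x)) - 1 / 2)
design_sensitivity(psi_hu, tau = 1 / 2)                                          # normal errors
design_sensitivity(psi_in, tau = 1 / 2, rdist = function(n) rt(n, 3) / sqrt(3))  # t3 errors
```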
Table 25.1 shows the results, with the columns with normal errors replicating the first set of
columns in Table 3 of [28]. We see first that the design sensitivity increases with τ in each column,
illustrating the intuitive fact that larger treatment effects are more difficult to explain away as the
result of hidden bias. Observe next that for all three distributions, the function ψin produces the
largest design sensitivities. This may be particularly striking when considering the results with
normally distributed errors, where it is known that the test ψt has the largest Pitman efficiency in
the context of inference at Γ = 1. This highlights that the considerations when designing a test
statistic that performs well in a sensitivity analysis deviate from those when performing inference in
a randomized experiment or under strong ignorability.
providing a sensitivity analysis with improved robustness properties. Another common scenario is
when the researcher wants to investigate the potential causal effects of a single treatment on multiple
outcome variables, with distinct test statistics for each outcome variable under consideration.
Suppose that there are K outcome variables, with $y_0$, $y_1$, and Y now representing N × K matrices of potential outcomes under control, under treatment, and observed outcomes respectively. Suppose that for each outcome, there is a sharp null hypothesis of the form (25.6) under consideration, denoted $H_1, \ldots, H_K$. From these K outcomes, we form L sum statistics, with $T_\ell = W^T q_\ell$ for $\ell = 1, \ldots, L$ and $q_\ell = q_\ell(F)$ being a length-N vector formed as a function of the matrix F whose form is determined by the sharp null hypotheses (25.6) under consideration for each outcome variable.
We first consider a level-α sensitivity analysis for the global null hypothesis that all K of these
hypotheses are true,
\[ H_0: \bigwedge_{k=1}^{K} H_k, \qquad (25.17) \]
while assuming that (25.2) holds at Γ. An extremely simple way to conduct a sensitivity analysis for (25.17) would be to first separately conduct L sensitivity analyses, one for each test statistic, and to record the worst-case p-value for the $\ell$th statistic, call it $P_{\Gamma,\ell}$. Then, we could simply employ a Bonferroni correction and reject the null if $\min_{\ell=1,\ldots,L} P_{\Gamma,\ell} \le \alpha/L$. While straightforward to implement, this approach ends up being unduly conservative because it allows the worst-case vector of hidden bias u used to furnish the worst-case p-value to vary from one test statistic to the next. If the sensitivity model holds at Γ, there is a single vector $u \in \mathcal{U}$ which determines the true, but unknowable, conditional assignment distribution for W given by (25.4). This in turn determines the reference distribution (25.8) for each test statistic under the sharp null. Substantially more powerful tests of (25.17) can be attained by enforcing the requirement that u cannot vary from one null to the next.
see [33–35] for details. For a given vector of probabilities $\varrho$, under suitable regularity conditions on the constants $q_{ij\ell}$ the distribution of T is asymptotically multivariate normal through an application of the Cramér–Wold device. That is, for any fixed nonzero vector $\lambda = (\lambda_1, \ldots, \lambda_L)^T$ the standardized deviate
\[ \frac{\lambda^T\{T - \mu(\varrho)\}}{\{\lambda^T \Sigma(\varrho)\lambda\}^{1/2}} \]
is asymptotically standard normal.
The constraints imposed on $\varrho$ by the sensitivity model (25.2) holding at Γ can be represented by a polyhedral set. For a particular Γ this set, call it $\mathcal{P}_\Gamma$, contains vectors $\varrho$ such that
(i) $\varrho_{ij} \ge 0$ $(i = 1, \ldots, I;\ j = 1, \ldots, n_i)$;
(ii) $\sum_{j=1}^{n_i} \varrho_{ij} = 1$ $(i = 1, \ldots, I)$;
(iii) $\varrho_{ij} \le \Gamma \varrho_{ij'}$ $(i = 1, \ldots, I;\ j, j' = 1, \ldots, n_i)$.
A worst-case standardized deviate may then be computed as
\[ a^{*}_{\Gamma,\Lambda} = \min_{\varrho \in \mathcal{P}_\Gamma} \sup_{\lambda \in \Lambda} \frac{\lambda^T\{t - \mu(\varrho)\}}{\{\lambda^T \Sigma(\varrho)\lambda\}^{1/2}}, \qquad (25.18) \]
where Λ is some subset of $\mathbb{R}^L$ excluding the zero vector. By allowing for maximization over Λ, the practitioner is allowed additional flexibility to choose a linear combination of test statistics which is more robust to the impact of hidden bias, all the while imposing the condition that the vector of unmeasured covariates u cannot vary from one test statistic to the next. Setting $\Lambda = \{e_\ell\}$, where $e_\ell$ is a vector with a 1 in the $\ell$th coordinate and zeroes elsewhere, returns a univariate sensitivity analysis for the $\ell$th outcome with a greater-than alternative, while $-e_\ell$ would return the less-than alternative. When the test statistics $T_\ell$ are rank tests, setting $\Lambda = \{1_L\}$, where $1_L$ is a vector containing L ones, returns the coherent rank test of [37]. When $\Lambda = \{e_1, \ldots, e_L\}$, the collection of standard basis vectors, (25.18) returns the method of [35] with greater-than alternatives, and $\Lambda = \{\pm e_1, \ldots, \pm e_L\}$ gives the same method with two-sided alternatives. The method of [34] amounts to a choice of $\Lambda = \mathbb{R}^L \setminus \{0_L\}$, i.e. all possible linear combinations except the vector $0_L$ containing L zeroes, while the method of [33] takes Λ equal to the nonnegative orthant, again excluding the zero vector.
The problem (25.18) is not convex, making the problem challenging to solve in practice; however,
consider replacing it with the modified optimization problem
\[ b^{*}_{\Gamma,\Lambda} = \min_{\varrho \in \mathcal{P}_\Gamma} \sup_{\lambda \in \Lambda} \max\left[ 0, \frac{\lambda^T\{t - \mu(\varrho)\}}{\{\lambda^T \Sigma(\varrho)\lambda\}^{1/2}} \right]^{2}. \qquad (25.19) \]
This replaces negative values for the standardized deviate with zero, and then takes the square of the result. Note that this does not preclude directional testing (for instance, testing a less-than alternative for a single hypothesis) due to flexibility in designing the elements contained within Λ. The benefit of this transformation is that the function $g(\varrho) = \sup_{\lambda \in \Lambda} \max[0, \lambda^T\{t - \mu(\varrho)\}/\{\lambda^T \Sigma(\varrho)\lambda\}^{1/2}]^{2}$ is convex in $\varrho$ for any set Λ not containing the zero vector. This allows for efficient minimization over the polyhedral set $\mathcal{P}_\Gamma$ using methods such as projected subgradient descent, made practicable by the fact that the constraints in $\mathcal{P}_\Gamma$ are blockwise in nature, with distinct blocks of constraints for each matched set and with no constraints spanning multiple matched sets.
For concreteness, consider K = 3. Then, we can reject H1 if we can reject the null hypotheses
H1 ∧ H2 ∧ H3 , H1 ∧ H2 , H1 ∧ H3 , and H1 using tests that are each level α. When conducting
a sensitivity analysis at Γ, each required test of an intersection null may be performed using the
optimization problem (25.19).
conservative (i.e. larger than necessary) critical value. Through this, we see that the critical value takes a back seat in design sensitivity calculations, with the determining factor instead being the robustness of the chosen test statistic, as measured by how large Γ in (25.2) can be while the test statistic's true expectation remains above its worst-case expectation under the null.
We now turn to calculations under the favorable setting described in §25.4.2 to formalize this intuition in the context of multiple comparisons. For a given generative model for the potential outcomes, let $\tilde\Gamma_\lambda$ be the design sensitivity for the linear combination of test statistics $T_\lambda = W^T(\sum_{\ell=1}^{L} \lambda_\ell q_\ell)$. Now, consider the design sensitivity when using $b^{*}_{\Gamma,\Lambda}$ in (25.19) as the test statistic and rejecting based upon an appropriate critical value $c_{\alpha,\Lambda}$. First, the design sensitivity, say $\tilde\Gamma^{*}$, for this test of the intersection null satisfies
\[ \tilde\Gamma^{*} \ge \max_{\lambda \in \Lambda} \tilde\Gamma_\lambda. \]
In words, the design sensitivity for testing the intersection null using $b^{*}_{\Gamma,\Lambda}$ as a test statistic is at least as large as that of the most robust linear combination of test statistics; see [38, Theorem 2]. Stated another way, imagine the researcher had oracle access to $\tilde\lambda = \arg\max_{\lambda \in \Lambda} \tilde\Gamma_\lambda$ and performed a sensitivity analysis with this most robust linear combination $T_{\tilde\lambda} = W^T(\sum_{\ell=1}^{L} \tilde\lambda_\ell q_\ell)$, rather than choosing the linear combination based upon the data. In this case the researcher would not have to pay a price for multiple comparisons, and could simply use a critical value from a standard normal for inference. The above result shows that there is no loss, and a potential for gain, in design sensitivity from adaptively choosing the linear combination based upon the observed data, as the sensitivity analysis using (25.19) with a suitable critical value has design sensitivity no smaller than $\max_{\lambda \in \Lambda} \tilde\Gamma_\lambda$.
Next, it is straightforward to show that whenever $\Lambda_2 \subseteq \Lambda_1$, the design sensitivity for a test using $\Lambda_2$ in (25.19) is no larger than that of a test using $\Lambda_1$; see [33, Theorem 2] for a proof. Hence, the richer the set Λ, the larger the design sensitivity, and asymptotically the benefits of a larger set Λ outweigh the costs of a larger critical value. This has implications for testing individual null hypotheses through familywise error control procedures built upon tests of intersections of hypotheses. Consider closed testing as an illustration. When proceeding with closed testing, we reject the individual null $H_\ell$ only when all intersections of null hypotheses containing $H_\ell$ can be rejected. As a concrete example, the test for $H_1 \wedge H_2 \wedge H_3$ might take $\Lambda_{123}$ to be all of $\mathbb{R}^3$ without the zero vector, while the test of $H_1 \wedge H_2$ may be thought of as constructing $\Lambda_{12}$ by setting the third coordinate of λ equal to zero, but otherwise allowing the first two coordinates to range over all of $\mathbb{R}^2$ without the zero vector, such that $\Lambda_{12} \subseteq \Lambda_{123}$. The design sensitivities for intersections of hypotheses will thus be no smaller than those for their individual component hypotheses. As a result, the design sensitivities for testing individual nulls after closed testing will equal the design sensitivities for testing individual nulls had we not accounted for multiple comparisons at all.
25.5.5 The power of a sensitivity analysis with multiple comparisons and moderate
sample sizes
Design sensitivity calculations are asymptotic, imagining the limiting power of a sensitivity analysis
as I → ∞. In small and moderate sample sizes, issues such as the null variance of the chosen test
statistic and the critical value used to perform inference play a larger role in determining whether or
not the null may be rejected for a given Γ in (25.2). How well do the insights gleaned from design
sensitivity with multiple comparisons translate to small and moderate samples?
When discussing the power of a sensitivity analysis, we again imagine we are in the favorable
setting in §25.4.2 of a treatment effect and no hidden bias. For a given sample size I and a given value
of Γ, we assess the probability of correctly rejecting the null hypothesis of no treatment effect. This
probability will decrease as a function of Γ for fixed I, and as I increases the power as a function of
Γ will converge to a step function with the point of discontinuity equal to the design sensitivity.
Through a simulation study, we now investigate the loss in power from controlling for multiple
comparisons. We imagine we have a paired observational study with K = 3 outcome variables.
The treated-minus-control differences in outcomes in each pair $(\hat\tau_{i1}, \hat\tau_{i2}, \hat\tau_{i3})^T$, $i = 1, \ldots, I$, are iid distributed as exchangeable normals with common mean 0.2, marginal variances 1, and correlation 0.5. For each outcome variable k, we apply Huber's ψ with truncation at 2.5 to $\hat\tau_{ik}/h_0$, where $h_0$ is the median of $|\hat\tau_{ik}|$ and the ψ function is
\[ \psi(x) = \mathrm{sgn}(x)\min(|x|, 2.5). \]
For a range of values of Γ, we consider power for rejecting both the global null of no effect on
any of the three outcome variables; and rejecting the null of no effect for the first outcome variable
while controlling the familywise error rate at α = 0.05. The procedures for sensitivity analysis we
will compare are
1. Use closed testing. For each subset $\mathcal{K} \subseteq \{1, 2, 3\}$, compute $b^{*}_{\Gamma,\Lambda_{\mathcal{K}}}$ in (25.19) with $\Lambda_{\mathcal{K}} = \{\pm e_k, k \in \mathcal{K}\}$. Reject the intersection null for all outcomes in $\mathcal{K}$ if $1 - \Phi\{(b^{*}_{\Gamma,\Lambda_{\mathcal{K}}})^{1/2}\} \le \alpha/|\Lambda_{\mathcal{K}}|$, recovering the procedure in [35].
2. Combine individual sensitivity analyses, applying the techniques for univariate outcomes separately to each outcome variable within a closed testing procedure. This amounts to using a Bonferroni correction on the individual p-values for testing the global null hypothesis, and using Holm–Bonferroni to assess individual null hypotheses. Letting $P_{\Gamma,k}$ be the worst-case two-sided p-value for the kth outcome, reject the global null if $\min_{k=1,2,3} P_{\Gamma,k} \le \alpha/3$. For each subset $\mathcal{K} \subseteq \{1, 2, 3\}$, reject the intersection null if $\min_{k \in \mathcal{K}} P_{\Gamma,k} \le \alpha/|\mathcal{K}|$; a sketch of this step appears below.
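The closed testing step of procedure 2 can be written compactly. The R sketch below (function name hypothetical) takes worst-case p-values $P_{\Gamma,k}$ computed separately for each outcome at a fixed Γ and returns, for each individual null, whether closed testing rejects it:

```r
# Sketch: closed testing with Bonferroni local tests (equivalently, Holm for the
# individual hypotheses), applied to per-outcome worst-case p-values at fixed Gamma.
closed_testing_bonferroni <- function(pvals, alpha = 0.05) {
  K <- length(pvals)
  subsets <- lapply(seq_len(2^K - 1),
                    function(s) which(bitwAnd(s, 2^(seq_len(K) - 1)) > 0))
  # Reject an intersection null over subset S if min_k p_k <= alpha / |S|
  rej_int <- vapply(subsets, function(S) min(pvals[S]) <= alpha / length(S), logical(1))
  # H_k is rejected iff every intersection containing k is rejected
  vapply(seq_len(K), function(k)
    all(rej_int[vapply(subsets, function(S) k %in% S, logical(1))]), logical(1))
}
closed_testing_bonferroni(c(0.004, 0.030, 0.200))  # hypothetical worst-case p-values
```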
Both procedures will control the familywise error rate asymptotically. From the discussion in §25.5.3,
we know that approach 2 will be unduly conservative, as it allows different patterns of hidden bias
for each outcome variable when testing intersection nulls.
As a baseline method, we additionally consider the following modification to procedure 2:
2’. Conduct individual sensitivity analyses as in procedure 2. Reject the global null if the smallest
two-sided p-value is below α, and reject an individual null if its two-sided p-value is below α.
Note that procedure 2′ does not account for multiple comparisons: it simply rejects each individual test if its two-sided p-value falls below α. Procedure 2′ does not control the familywise error rate, giving it an unfair advantage in the simulations that follow. When testing individual null hypotheses, procedure 2′ provides the best possible power for a given choice of test statistic. We thus investigate how close procedures 1 and 2 come to attaining this optimal power for testing nulls on individual outcomes while, unlike procedure 2′, providing familywise error control.
Figures 25.1 and 25.2 show the power of these procedures for the global null hypothesis and for
rejecting the null for the first outcome variable, respectively, as a function of Γ for I = 250, 500, 1000,
and 2000 when testing at α = 0.05. We see in both figures that procedure 1 uniformly dominates
procedure 2 in terms of power, due to the conservativeness of combining individual sensitivity
analyses; see [35] for additional discussion. In Figure 25.1, we see that procedure 2’ initially
performs best for testing the overall null for I = 250 and I = 500. That is, despite procedure 1
having superior design sensitivity, the use of a larger critical value to control the familywise error
rate impacts performance in small samples. At I = 1000 and beyond procedure 1 has superior power
to both procedure 2 and 2’. This is in spite of the fact that procedure 2’ does not provide familywise
error control and is once again a reflection of the conservativeness of combining individual sensitivity
analyses while allowing different patterns of hidden bias for each outcome. By I = 2000 we begin to
see convergence of the power curves to step functions dictated by the procedures' design sensitivities, reflecting that procedure 1 has the superior design sensitivity.
FIGURE 25.1
Power of a sensitivity analysis for testing the global null hypothesis of no effect for K = 3 outcomes as a function of Γ at I = 250, 500, 1000, 2000 using procedures 1, 2, and 2′. Only procedures 1 and 2 provide familywise error control.
When testing the sharp null of no effect for the first outcome, at I = 250 and I = 500 the power
of procedure 1 lags that of procedure 2’ (which does not account for multiple comparisons). Note that
the gap decreases as the sample size increases, and by I = 1000 the power profiles for procedures
1 and 2’ are in near perfect alignment. The improvements provided by procedure 1 when testing
intersection null hypotheses trickle down to improving power for testing individual nulls after closed
testing, essentially providing the same power for testing individual null hypotheses as what would
have been attained had we not controlled for multiple comparisons at all. Observational studies with
I = 1000 and above are commonplace in practice: observational data is cheap and plentiful but
prone to bias. Once we account for hidden bias in a sensitivity analysis, we see that in large enough samples there is no further loss from controlling for multiple comparisons when closed testing is used
in conjunction with a suitable method for combining test statistics. Rather, it is primarily a statistic’s
robustness to hidden bias that governs the sensitivity analysis.
FIGURE 25.2
Power of a sensitivity analysis for testing the null of no effect for the first outcome as a function of Γ
at I = 250, 500, 1000, 2000. Only procedures 1 and 2 provide familywise error control.
potential outcomes. Sharp nulls are viewed by some as overly restrictive. Moreover, a particular fear
is that sensitivity analyses conducted assuming Fisher’s sharp null of no effect at all may paint an
overly optimistic picture of the study’s sensitivity to hidden bias if effects are instead heterogeneous
but average to zero, with unmeasured confounding conspiring with the unidentified aspects of the
constant effects model to render the analysis assuming constant effects inadequate.
In this section we focus attention on matched pair designs, and consider an approach for sensitivity analysis for the average of the N = 2I treatment effects in the matched sample, $\bar\tau = N^{-1}\sum_{i=1}^{I}\sum_{j=1}^{2} \tau_{ij}$, while allowing for heterogeneity in the individual treatment effects. The procedure, developed in [24], provides a sensitivity analysis that is simultaneously (i) asymptotically valid for inference on the average treatment effect; and (ii) valid for any sample size if treatment effects are constant at the hypothesized value.
Assume that the model (25.2) holds at Γ and consider testing the weak null hypothesis
\[ H_0: \bar\tau = \bar\tau_0 \qquad (25.20) \]
with a greater-than alternative. Let $\hat\tau_i = (W_{i1} - W_{i2})(Y_{i1} - Y_{i2})$ be the treated-minus-control difference in means in the ith of I pairs. Define $D_{\Gamma i}$ as
\[ D_{\Gamma i} = \hat\tau_i - \bar\tau_0 - \frac{\Gamma - 1}{1 + \Gamma}|\hat\tau_i - \bar\tau_0|, \qquad (25.21) \]
and let $\bar{D}_\Gamma = I^{-1}\sum_{i=1}^{I} D_{\Gamma i}$. The term $\{(\Gamma - 1)/(1 + \Gamma)\}|\hat\tau_i - \bar\tau_0|$ in (25.21) is the largest possible expectation for $\hat\tau_i - \bar\tau_0$ at Γ when treatment effects are assumed to be constant at $\bar\tau_0$, such that $E(D_{\Gamma i} \mid F, W) \le 0$ under constant effects. Perhaps surprisingly, this centered variable $D_{\Gamma i}$ continues to be useful for sensitivity analysis even when effects are heterogeneous.
Define $se(\bar{D}_\Gamma)$ as the usual standard error estimate for a paired design,
\[ se(\bar{D}_\Gamma) = \sqrt{\frac{1}{I(I-1)} \sum_{i=1}^{I} (D_{\Gamma i} - \bar{D}_\Gamma)^2}, \]
and consider the test statistic
\[ S_\Gamma(W, Y - W\bar\tau_0) = \max\left\{ 0, \frac{\bar{D}_\Gamma}{se(\bar{D}_\Gamma)} \right\}, \]
which studentizes $\bar{D}_\Gamma$ by the standard error $se(\bar{D}_\Gamma)$ and replaces negative values of the standardized deviate by zero.
Despite effects being heterogeneous under (25.20), consider using the worst-case distribution for $S_\Gamma(W, Y - W\bar\tau_0)$ under the assumption of constant effects as a reference distribution for conducting inference. More precisely, let $a = Y - W\bar\tau_0$ be the observed vector of adjusted responses, let $s^{obs}_\Gamma$ be the observed value of the test statistic, let $u_{ij} = 1\{a_{ij} \ge a_{ij'}\}$ for $j \ne j'$, $i = 1, \ldots, I$, and use as a p-value
\[ p_\Gamma = \mathrm{pr}\{ S_\Gamma(W, a) \ge s^{obs}_\Gamma \}, \qquad (25.22) \]
with the probability computed under the assignment distribution (25.4) at Γ evaluated at this u.
[24] shows that under mild regularity conditions, rejecting the weak null hypothesis (25.20) when $p_\Gamma$ falls at or below α yields a sensitivity analysis whose Type I error rate is, asymptotically, no larger than α if (25.2) holds at Γ and the weak null (25.20) holds. Furthermore, if treatment effects are constant at $\bar\tau_0$ as in (25.13), using $p_\Gamma$ as a p-value produces a sensitivity analysis whose Type I error rate is less than or equal to α at any sample size when (25.2) holds at Γ.
If we are indifferent to maintaining finite-sample Type I error control under constant effects, the following large-sample procedure also produces an asymptotically valid sensitivity analysis for any Γ: rather than constructing a p-value using (25.22), simply reject the weak null (25.20) when
\[ \frac{\bar{D}_\Gamma}{se(\bar{D}_\Gamma)} \ge \Phi^{-1}(1 - \alpha). \qquad (25.23) \]
In the usual way the procedure may be inverted to produce sensitivity intervals for the sample average
treatment effect at any Γ.
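A hedged R sketch of the large-sample test (25.23) follows; the function name is ours, and the exact finite-sample version based on (25.22) would instead simulate the worst-case reference distribution.

```r
# Sketch: studentized sensitivity analysis for the weak null on the sample average
# treatment effect in matched pairs, rejecting via (25.23).
sens_sate <- function(tau_hat, tau0bar = 0, Gamma = 1, alpha = 0.05) {
  I <- length(tau_hat)
  D <- tau_hat - tau0bar - (Gamma - 1) / (1 + Gamma) * abs(tau_hat - tau0bar)  # (25.21)
  Dbar <- mean(D)
  se <- sqrt(sum((D - Dbar)^2) / (I * (I - 1)))  # paired-design standard error
  list(deviate = Dbar / se, reject = Dbar / se >= qnorm(1 - alpha))
}
```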
Note that the procedure using (25.22) as a p-value differs from a sensitivity analysis based solely upon the average of the treated-minus-control differences in means due to the use of studentization. [24] shows that while a sensitivity analysis based upon the difference in means and assuming constant effects uses a reference distribution whose expectation bounds that of $I^{-1}\sum_{i=1}^{I}(\hat\tau_i - \bar\tau_0)$ even when effects are heterogeneous, the resulting reference distribution may have too small a variance when (25.2) holds at Γ > 1. As a result, the sensitivity analysis using the unstudentized difference in means and assuming constant effects might be anti-conservative when conducted at Γ > 1. That said, by virtue of the reference distribution properly bounding the worst-case expectation, it is shown that this sensitivity analysis will be asymptotically valid if conducted at $\Gamma + \epsilon$ for any $\epsilon > 0$, as in that case the candidate worst-case expectation will exceed the true expectation if (25.2) actually holds at Γ. This provides assurances that sensitivity analyses based upon the difference
valid under the sharp null hypothesis of no effect, when effects are actually heterogeneous but average
to zero. They show that sensitivity analyses based upon McNemar’s statistic are asymptotically valid
as sensitivity analyses of the null of no sample average treatment effect. Using terminology common
in causal inference with binary outcomes, sensitivity analyses using McNemar’s test under the
assumption of Fisher’s sharp null maintain asymptotic validity as tests that the sample causal risk
difference is zero, or equivalently that the sample causal risk ratio is 1, when (25.2) is assumed to
hold at Γ. This provides both a useful fortification and further interpretations for sensitivity analyses
which have used this statistic in the past: any sensitivity analysis using McNemar's test may be interpreted as an asymptotically valid sensitivity analysis testing that the average effect is zero, while additionally providing an exact sensitivity analysis for testing Fisher's sharp null.
Should the assumption of proportional doses prove unpalatable, in paired designs we may instead conduct a sensitivity analysis for a parameter known as the effect ratio [49],
\[ \beta = \frac{N^{-1}\sum_{i=1}^{I}\sum_{j=1}^{2}(y_{1ij} - y_{0ij})}{N^{-1}\sum_{i=1}^{I}\sum_{j=1}^{2}(d_{1ij} - d_{0ij})}, \qquad (25.25) \]
where we assume that $\sum_{i=1}^{I}\sum_{j=1}^{2}(d_{1ij} - d_{0ij}) \ne 0$. The effect ratio is the ratio of two treatment effects. In the numerator we have the effect of the encouragement $W_{ij}$ on the outcome variable, while in the denominator we have the effect of the encouragement on the level of the treatment actually received.
The method to be presented will provide a valid sensitivity analysis for the effect ratio β when
interpreted as the ratio of two treatment effects. That said, in instrumental variable studies it is
common to make additional assumptions about the relationships between the potential levels of
treatment received and the potential outcomes. In defining an instrumental variable, [50] require
that the variable satisfies the exclusion restriction: the instrument can only affect the outcome by
influencing the treatment received, stated formally as $d_{wij} = d_{w'ij} \Rightarrow y_{wij} = y_{w'ij}$ for $w, w' = 0, 1$. Observe that this assumption holds under the proportional dose model (25.24). Typically, IV studies further invoke a monotonicity assumption of the following form: for all individuals, $d_{w'ij} \ge d_{wij}$ for $w' \ge w$. When the potential treatment levels $d_{wij}$ are binary variables reflecting whether or not
the treatment would actually be received, the exclusion restriction and monotonicity imply that the
causal estimand β may then be interpreted as the sample average treatment effect among compliers,
those individuals for whom d0ij = 0 and d1ij = 1. These assumptions are not needed for inference
on the effect ratio β; rather, by making these assumptions, we confer additional interpretations unto
β.
Suppose we want to conduct a sensitivity analysis for the null that the effect ratio in (25.25) equals some value $\beta_0$ without imposing the proportional dose model (25.24). The methods for sensitivity analysis for the sample average treatment effect described in §25.6 may also be used for the effect ratio through a simple redefinition of $D_{\Gamma i}$ in (25.21). First let $\hat\zeta_i$ be the treated-minus-control difference in the adjusted responses $Y_{ij} - \beta_0 D_{ij}$, and redefine $D_{\Gamma i} = \hat\zeta_i - \{(\Gamma-1)/(1+\Gamma)\}|\hat\zeta_i|$. With this change in the definition of $D_{\Gamma i}$, the developments in §25.6 may be used without further modification. The p-value in (25.22) provides a sensitivity analysis that is simultaneously asymptotically
valid for testing the null that the effect ratio in (25.25) equals β0 , and is further valid for any sample
size when the proportional dose model (25.24) holds at β0 . Should we be indifferent to performance
under the proportional dose model, an alternative asymptotically valid approach for inference on the
effect ratio can be attained by rejecting the null based upon (25.23). The latter approach extends the
proposal of [49] for inference on the effect ratio at Γ = 1 to providing inference for any value of Γ in
(25.2).
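Continuing the sketch from §25.6, inference for the effect ratio only requires swapping in the adjusted pair differences; the data below are purely hypothetical, with the first unit in each pair taken to be the encouraged unit, and the snippet reuses the sens_sate() sketch above.

```r
# Hypothetical paired IV data; reuses sens_sate() from the Section 25.6 sketch.
set.seed(42)
I <- 200; beta0 <- 0
D1 <- rbinom(I, 1, 0.7); D2 <- rbinom(I, 1, 0.3)    # treatment levels actually received
Y1 <- D1 + rnorm(I);     Y2 <- D2 + rnorm(I)        # observed outcomes
zeta_hat <- (Y1 - beta0 * D1) - (Y2 - beta0 * D2)   # encouraged-minus-control adjusted difference
sens_sate(zeta_hat, tau0bar = 0, Gamma = 1.5)
```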
Note that this presentation differs slightly from that presented in [51] and [2], who choose U to be
one of the potential outcomes; this distinction has no practical consequences. Further note that Λ in
(25.27) has no connection with Λ as used in §25.5. Equivalently, we imagine that the following logit
model holds:
\[ \log\frac{\mathrm{pr}(W = 1 \mid X = x, U = u)}{\mathrm{pr}(W = 0 \mid X = x, U = u)} = \log\frac{e(x)}{1 - e(x)} + \lambda(2u - 1), \qquad (25.28) \]
where λ = log(Λ) and 0 ≤ u ≤ 1. The models (25.27) and (25.28) are equivalent in the sense that
there is a model of the form (25.27) describing assignment probabilities if and only if (25.28) is
satisfied. To see that (25.27) implies (25.28), for any U define, for Λ > 1,
\[ U^{*} = \log\left[ \frac{\pi(X, U)\{1 - e(X)\}}{e(X)\{1 - \pi(X, U)\}} \right] \Big/ (2\lambda) + 1/2, \qquad (25.29) \]
and set U ∗ = 0 when Λ = 1. Observe that 0 ≤ U ∗ ≤ 1 when (25.27) holds at Λ. Moreover, observe
that (X, U ∗ ) is a balancing score in the terminology of [52], as by inspection of (25.29) we can
express π(X, U ) as a function of X and U ∗ for any fixed Λ. By the assumption of strong ignorability
given (X, U ), this implies by Theorem 3 of [52] that pr(W = 1 | X, U ∗ , Y1 , Y0 ) = pr(W = 1 |
X, U ∗ ) = pr(W = 1 | X, U ). Therefore, both (25.28) and (25.27) hold when we replace U with U ∗ .
Showing that (25.28) implies (25.27) is straightforward, and the proof is omitted.
Compare the logit form (25.28) with that used in sensitivity analysis for matched designs,
\[ \log\frac{\mathrm{pr}(W = 1 \mid X = x, U = u)}{\mathrm{pr}(W = 0 \mid X = x, U = u)} = \kappa(x) + \gamma u, \qquad (25.30) \]
with $0 \le u \le 1$ and $\kappa(x_{ij}) = \kappa(x_{ij'})$ for each $i = 1, \ldots, I$, $j, j' = 1, \ldots, n_i$. If individuals in the same matched set have the same covariates $x_{ij}$, then (25.28) holding at Λ implies that (25.30) holds at $\Gamma = \Lambda^{2}$ with $\kappa(x_{ij}) = \log[e(x)/\{1 - e(x)\}] - \lambda$, such that the previously described methods for sensitivity analysis when (25.2) holds at Γ are also applicable whenever (25.27) is assumed to hold at $\sqrt{\Gamma}$. The motivation for using this slightly different model comes from the different approaches to
dealing with nuisance parameters in matching versus weighting. When matching, by conditioning on
the matched structure we remove dependence upon the parameters κ(x) under the model (25.30)
whenever all individuals in the same matched set have the same value for κ(x). When using a
weighting estimator such as inverse probability weighting, we instead use a plug-in estimate for
the nuisance parameter. The model (25.30) does not specify a particular form for κ(x). As a result,
we cannot estimate κ(x) from the observed data. The modification (25.28) sidesteps this difficulty by relating κ(x) to the propensity score e(x), such that we may proceed with a plug-in estimator for the
propensity score.
Under this modification, [2] describe an approach for sensitivity analysis for stabilized inverse
probability weighted estimators and stabilized augmented inverse probability weighted estimators.
For now, imagine that the propensity score e(x) is known to the researcher. For each $i = 1, \ldots, N$, let $e_i = e(x_i)$ and let $g_i = \log\{e_i/(1 - e_i)\}$ be the log odds of the assignment probabilities given $x_i$, i.e. the log odds transform of the propensity score. If (25.28) holds at Λ, we have that
\[ \frac{1}{1 + \Lambda \exp(-g_i)} \le \mathrm{pr}(W_i = 1 \mid X_i = x_i, U_i = u_i) \le \frac{1}{1 + \Lambda^{-1}\exp(-g_i)}. \]
Problem (25.31) is a fractional linear program, and may be converted to a standard linear program
through the use of a Charnes-Cooper transformation [36]. As a result, the problem may be solved
efficiently using off-the-shelf solvers for linear programs.
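Although (25.31) itself is not reproduced above, its fractional structure can be illustrated on a simpler piece: the worst-case weighted mean of the treated units' outcomes when each inverse-probability weight is boxed by the bounds implied by (25.28). The R sketch below is our construction, not the estimator of [2]; it finds the optimum by a one-dimensional root search, exploiting the fact that a box-constrained fractional program of this kind is solved by pushing each weight to a bound.

```r
# Sketch: extreme values of sum(w * y) / sum(w) over w_i in
# [1 + exp(-g_i)/Lambda, 1 + Lambda * exp(-g_i)], for treated units' outcomes y.
worst_case_weighted_mean <- function(y, g, Lambda, maximize = TRUE) {
  lo <- 1 + exp(-g) / Lambda
  hi <- 1 + Lambda * exp(-g)
  h <- function(t) {                      # optimal value of sum w * (y - t)
    w <- ifelse((y - t > 0) == maximize, hi, lo)  # bound choice favoring the objective
    s <- sum(w * (y - t))
    if (maximize) s else -s
  }
  uniroot(h, interval = range(y))$root    # the extreme weighted mean is the root of h
}
```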
Let $\hat\tau_{\Lambda,\min}$ and $\hat\tau_{\Lambda,\max}$ be the optimal objective values when performing either minimization or maximization in (25.31). [2] additionally describe how 100(1 − α)% sensitivity intervals may be attained through the use of the percentile bootstrap. In short, in the bth of B bootstrap iterations we solve the optimization problems in (25.31) for the bootstrap sample, storing these values as $\hat\tau^{(b)}_{\Lambda,\min}$ and $\hat\tau^{(b)}_{\Lambda,\max}$. The lower bound of the sensitivity interval is then the α/2 quantile of $\{\hat\tau^{(b)}_{\Lambda,\min} : b = 1, \ldots, B\}$, and the upper bound is the (1 − α/2) quantile of $\{\hat\tau^{(b)}_{\Lambda,\max} : b = 1, \ldots, B\}$. In practice the
propensity scores ei must be estimated for each individual, with the resulting estimators needing to
be sufficiently smooth such that the bootstrap sensitivity intervals maintain their asymptotic validity.
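A percentile-bootstrap wrapper in the spirit just described might look as follows; it builds on the toy routine above and sketches only the resampling logic, not the intervals of [2].

```r
# Sketch: percentile bootstrap sensitivity interval from B resampled optimizations.
boot_sens_interval <- function(y, g, Lambda, B = 1000, alpha = 0.05) {
  n <- length(y)
  reps <- replicate(B, {
    idx <- sample.int(n, replace = TRUE)
    c(worst_case_weighted_mean(y[idx], g[idx], Lambda, maximize = FALSE),
      worst_case_weighted_mean(y[idx], g[idx], Lambda, maximize = TRUE))
  })
  c(lower = unname(quantile(reps[1, ], alpha / 2)),
    upper = unname(quantile(reps[2, ], 1 - alpha / 2)))
}
```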
Estimation introduces the possibility of model misspecification, particularly when parametric models
such as logistic regression are used. [2] consider a slight modification to the model (25.27) that
instead bounds the odds ratio of π(X, U ) and the best parametric approximation to e(X) between
Λ−1 and Λ. This modification allows for discrepancies due to both unmeasured confounding and
misspecification of the propensity score model and is similar in spirit to the model (25.2), which
allowed for differences in assignment probabilities for two individuals in the same matched set due
to both hidden bias and discrepancies on observed covariates.
• When conducting a sensitivity analysis using the model (25.2), we bound the impact that hidden
bias may have on the assignment probabilities for two individuals in the same matched set.
The model imposes no restrictions on the relationship between hidden bias and the potential
outcomes; indeed, the potential outcomes are fixed quantities through conditioning. When
performing a sensitivity analysis, the worst-case vector of unmeasured covariates bounding the
upper tail probabilities for a given test statistic generally has near perfect correlation with the
potential outcomes. This may seem implausible and overly pessimistic. An amplification of a sensitivity analysis maps the one-parameter sensitivity analysis considered here to a set of two-parameter sensitivity analyses separately bounding the impact of hidden bias on the treatment and the potential outcomes. No new calculations are required; rather, for each Γ we are entitled
to an alternative set of interpretations of the strength of biases under consideration. See [30]
and [19, §3.6] for further discussion.
• Evidence factors are tests of the same null hypothesis about treatment effects which may be
treated as statistically independent, by virtue of either being truly independent or producing
p-values whose joint distribution stochastically dominates that of the multivariate uniform
distribution. Evidence factors are, by their nature, susceptible to distinct patterns of hidden bias. With evidence factors a given counterargument may overturn the finding of a given test, yet
have little bearing on the findings of another. For a comprehensive overview of evidence factors
and their role in strengthening evidence rather than replicating biases, see [21].
• There is much qualitative advice on what can be done to improve the design and analysis
of observational studies. Through a sensitivity analysis, we can often quantify the benefits
of utilizing quasi-experimental devices. Some examples are coherence, or multiple opera-
tionalism [26, 33, 35, 37]; and the incorporation of either a negative control outcome (i.e. an
outcome known to be unaffected by the treatment) [56] or an outcome with a known direction
of effect [23, §6] to rule out certain patterns of hidden bias.
• The choice of test statistic plays a large role in the sensitivity value returned by a sensitivity
analysis. While a unifying theory for choosing test statistics for optimal performance in a
sensitivity analysis does not exist, many statistics have been designed which have superior
performance to the usual choices such as the difference in means or Wilcoxon’s signed rank test.
Certain m-tests, described in §25.4.3, provide one example. Members of the class of u-statistics provide yet another; see [57] for details.
• Other metrics beyond design sensitivity exist for guiding the choice of a test statistic. Bahadur efficiency, introduced in [58] and developed for sensitivity analysis in [59], provides comparisons between test statistics below the minimum of their design sensitivities, providing an assessment of which test statistic best distinguishes a moderately large treatment effect from a degree of bias below the minimal design sensitivities.
• For worked examples of sensitivity analyses, see [60, §9] and [23, §4] among several re-
sources. Most of the references cited in this chapter also have data examples accompany-
ing them. For a more hands-on and visual tutorial of how to conduct a sensitivity analysis,
the following shiny app, created by Paul Rosenbaum, provides an interactive illustration:
https://rosenbap.shinyapps.io/learnsenShiny/
25.10 Software
Many R packages and scripts implement the methods described in this chapter. Below are descriptions
of some of these resources.
References
[1] Jerome Cornfield, William Haenszel, E Cuyler Hammond, Abraham M Lilienfeld, Michael B
Shimkin, and Ernst L Wynder. Smoking and lung cancer: Recent evidence and a discussion of
some questions. Journal of the National Cancer Institute, 22:173–203, 1959.
[2] Qingyuan Zhao, Dylan S Small, and Bhaswar B Bhattacharya. Sensitivity analysis for inverse
probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 81(4):735–761, 2019.
[3] Sue M Marcus. Using omitted variable bias to assess uncertainty in the estimation of an aids
education treatment effect. Journal of Educational and Behavioral Statistics, 22(2):193–201,
1997.
[4] James M Robins, Andrea Rotnitzky, and Daniel O Scharfstein. Sensitivity analysis for selection
bias and unmeasured confounding in missing data and causal inference models. In Statistical
models in epidemiology, the environment, and clinical trials, pages 1–94. Springer, 2000.
[5] Guido W Imbens. Sensitivity to exogeneity assumptions in program evaluation. American
Economic Review, 93(2):126–132, 2003.
[6] Binbing Yu and Joseph L Gastwirth. Sensitivity analysis for trend tests: application to the risk
of radiation exposure. Biostatistics, 6(2):201–209, 2005.
[7] Liansheng Wang and Abba M Krieger. Causal conclusions are most sensitive to unobserved
binary covariates. Statistics in Medicine, 25(13):2257–2271, 2006.
[8] Brian L Egleston, Daniel O Scharfstein, and Ellen MacKenzie. On estimation of the survivor
average causal effect in observational studies when important confounders are missing due to
death. Biometrics, 65(2):497–504, 2009.
[9] Carrie A Hosman, Ben B Hansen, and Paul W Holland. The sensitivity of linear regression
coefficients’ confidence limits to the omission of a confounder. The Annals of Applied Statistics,
4(2):849–870, 2010.
[10] José R Zubizarreta, Magdalena Cerdá, and Paul R Rosenbaum. Effect of the 2010 Chilean
earthquake on posttraumatic stress reducing sensitivity to unmeasured bias through study design.
Epidemiology, 24(1):79–87, 2013.
[11] Weiwei Liu, S Janet Kuramoto, and Elizabeth A Stuart. An introduction to sensitivity analysis
for unobserved confounding in nonexperimental prevention research. Prevention Science,
14(6):570–580, 2013.
[12] Tyler J VanderWeele and Peng Ding. Sensitivity analysis in observational research: introducing
the e-value. Annals of Internal Medicine, 167(4):268–274, 2017.
[13] Paul R Rosenbaum. Observational studies. Springer, New York, 2002.
[14] Paul R. Rosenbaum. Sensitivity analysis for certain permutation inferences in matched observa-
tional studies. Biometrika, 74(1):13–26, 1987.
[15] Paul R Rosenbaum. Quantiles in nonrandom samples and observational studies. Journal of the
American Statistical Association, 90(432):1424–1431, 1995.
[16] Colin B Fogarty. On mitigating the analytical limitations of finely stratified experiments.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80:1035–1056,
2018.
[17] Nicole E Pashley and Luke W Miratrix. Insights on variance estimation for blocked and
matched pairs designs. Journal of Educational and Behavioral Statistics, 46(3):271–296, 2021.
[18] William G Cochran. The planning of observational studies of human populations. Journal of
the Royal Statistical Society. Series A (General), 128(2):234–266, 1965.
[19] Paul R Rosenbaum. Design of observational studies. Springer, New York, 2010.
[20] Qingyuan Zhao. On sensitivity value of pair-matched observational studies. Journal of the
American Statistical Association, 114(526):713–722, 2019.
[21] Joseph L Hodges and Erich L Lehmann. Rank methods for combination of independent
experiments in analysis of variance. The Annals of Mathematical Statistics, 33(2):482–497,
1962.
[22] Paul R Rosenbaum and Abba M Krieger. Sensitivity of two-sample permutation inferences in
observational studies. Journal of the American Statistical Association, 85(410):493–498, 1990.
[23] Joseph L Gastwirth, Abba M Krieger, and Paul R Rosenbaum. Asymptotic separability in
sensitivity analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
62(3):545–555, 2000.
[24] Joseph L Hodges and Erich L Lehmann. Estimates of location based on rank tests. The Annals
of Mathematical Statistics, 34(2):598–611, 1963.
[25] Elizabeth A Stuart and David B Hanna. Commentary: Should epidemiologists be more sensitive
to design sensitivity? Epidemiology, 24(1):88–89, 2013.
[26] Paul R Rosenbaum. Design sensitivity in observational studies. Biometrika, 91(1):153–164,
2004.
[27] Paul R Rosenbaum. Heterogeneity and causality. The American Statistician, 59(1):147–152,
2005.
[28] Paul R Rosenbaum. Impact of multiple matched controls on design sensitivity in observational
studies. Biometrics, 69(1):118–127, 2013.
[29] Paul R Rosenbaum. Design sensitivity and efficiency in observational studies. Journal of the
American Statistical Association, 105(490):692–702, 2010.
[30] JS Maritz. A note on exact robust confidence intervals for location. Biometrika, 66(1):163–170,
1979.
[31] Paul R Rosenbaum. Sensitivity analysis for M-estimates, tests, and confidence intervals in
matched observational studies. Biometrics, 63(2):456–464, 2007.
[32] Paul R Rosenbaum. Testing one hypothesis twice in observational studies. Biometrika,
99(4):763–774, 2012.
[33] Peter L Cohen, Matt A Olson, and Colin B Fogarty. Multivariate one-sided testing in matched
observational studies as an adversarial game. Biometrika, 107(4):809–825, 2020.
[34] Paul R Rosenbaum. Using Scheffé projections for multiple outcomes in an observational study
of smoking and periodontal disease. The Annals of Applied Statistics, 10(3):1447–1471, 2016.
[35] Colin B Fogarty and Dylan S Small. Sensitivity analysis for multiple comparisons in matched
observational studies through quadratically constrained linear programming. Journal of the
American Statistical Association, 111(516):1820–1830, 2016.
[36] Abraham Charnes and William W Cooper. Programming with linear fractional functionals.
Naval Research Logistics Quarterly, 9(3-4):181–186, 1962.
[37] Paul R Rosenbaum. Signed rank statistics for coherent predictions. Biometrics, pages 556–566,
1997.
[38] Siyu Heng, Hyunseung Kang, Dylan S Small, and Colin B Fogarty. Increasing power for obser-
vational studies of aberrant response: An adaptive approach. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 2021.
[39] Henry Scheffé. A method for judging all contrasts in the analysis of variance. Biometrika,
40(1-2):87–110, 1953.
[40] Jelle J Goeman and Aldo Solari. The sequential rejection principle of familywise error control.
Annals of Statistics, 38(6):3782–3810, 2010.
[41] Ruth Marcus, Eric Peritz, and K Ruben Gabriel. On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655–660, 1976.
[42] Nicolai Meinshausen. Hierarchical testing of variable importance. Biometrika, 95(2):265–278,
2008.
[43] Jelle J Goeman and Livio Finos. The inheritance procedure: multiple testing of tree-structured
hypotheses. Statistical Applications in Genetics and Molecular Biology, 11(1):1–18, 2012.
[44] Colin B Fogarty. Studentized sensitivity analysis for the sample average treatment effect in
paired observational studies. Journal of the American Statistical Association, 115(531):1518–
1530, 2020.
[45] Colin B Fogarty, Kwonsang Lee, Rachel R Kelz, and Luke J Keele. Biased encouragements and
heterogeneous effects in an instrumental variable study of emergency general surgical outcomes.
Journal of the American Statistical Association, pages 1–12, 2021.
[46] Paul R Rosenbaum. Identification of causal effects using instrumental variables: Comment.
Journal of the American Statistical Association, 91(434):465–468, 1996.
[47] Paul R Rosenbaum. Choice as an alternative to control in observational studies. Statistical
Science, 14(3):259–278, 1999.
[48] Guido W Imbens and Paul R Rosenbaum. Robust, accurate confidence intervals with a weak
instrument: quarter of birth and education. Journal of the Royal Statistical Society: Series A
(Statistics in Society), 168(1):109–126, 2005.
[49] Mike Baiocchi, Dylan S Small, Scott Lorch, and Paul R Rosenbaum. Building a stronger
instrument in an observational study of perinatal care for premature infants. Journal of the
American Statistical Association, 105(492):1285–1296, 2010.
[50] Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects
using instrumental variables. Journal of the American Statistical Association, 91(434):444–455,
1996.
[51] Zhiqiang Tan. A distributional approach for causal inference using propensity scores. Journal
of the American Statistical Association, 101(476):1619–1637, 2006.
[52] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observa-
tional studies for causal effects. Biometrika, 70(1):41–55, 1983.
[53] Paul R Rosenbaum. Attributing effects to treatment in matched observational studies. Journal
of the American Statistical Association, 97(457):183–192, 2002.
[54] Paul R Rosenbaum and Jeffrey H Silber. Amplification of sensitivity analysis in matched
observational studies. Journal of the American Statistical Association, 104(488), 2009.
[55] Paul R Rosenbaum. Replication and evidence factors in observational studies. Chapman and
Hall/CRC, 2021.
[56] Paul R Rosenbaum. Sensitivity analyses informed by tests for bias in observational studies.
Biometrics, 2021.
[57] Paul R Rosenbaum. A new u-statistic with superior design sensitivity in matched observational
studies. Biometrics, 67(3):1017–1027, 2011.
[58] Raghu Raj Bahadur. Stochastic comparison of tests. The Annals of Mathematical Statistics,
31(2):276–295, 1960.
[59] Paul R Rosenbaum. Bahadur efficiency of sensitivity analyses in observational studies. Journal
of the American Statistical Association, 110(509):205–217, 2015.
[60] Paul R Rosenbaum. Observation and experiment: An introduction to causal inference. Harvard
University Press, 2017.
[61] Paul R Rosenbaum. Sensitivity analysis for stratified comparisons in an observational study of
the effect of smoking on homocysteine levels. The Annals of Applied Statistics, 12(4):2312–
2334, 2018.
[62] Colin B Fogarty, Pixu Shi, Mark E Mikkelsen, and Dylan S Small. Randomization infer-
ence and sensitivity analysis for composite null hypotheses with binary outcomes in matched
observational studies. Journal of the American Statistical Association, 112(517):321–331,
2017.
26
Evidence Factors
Bikram Karmakar
Department of Statistics, University of Florida
CONTENTS
26.1 Introduction 583
  26.1.1 Evidence factors 585
26.2 Evidence Factors in Different Study Designs 585
  26.2.1 Treatments with doses 585
  26.2.2 Case–control studies 588
  26.2.3 Nonreactive exposure and its reactive dose 589
  26.2.4 Studies with possibly invalid instrument(s) 590
26.3 Structure of Evidence Factors 593
  26.3.1 A simple construction of exactly independent comparisons 594
  26.3.2 Brief introduction to sensitivity analysis for unmeasured confounders 595
  26.3.3 Evidence factors: definition and integrating evidence using joint sensitivity analyses 596
26.4 Planned Analyses of Evidence Factors 598
  26.4.1 Partial conjunctions with several factors 598
  26.4.2 Incorporating comparisons that are not factors 600
26.5 Algorithmic Tools for Designs of Evidence Factors 601
  26.5.1 Matching with three or more groups 602
  26.5.2 Stratification 603
  26.5.3 Balanced blocking 604
26.6 Supplemental Notes 604
  26.6.1 Absolute standardized difference for stratified designs 604
  26.6.2 Guided choice of test statistics 605
  26.6.3 Evidence factors and triangulation 605
26.7 Acknowledgment 606
References 606
26.1 Introduction
There is nothing thought to be known more precisely in the recorded human knowledge than the
charge of a single electron. The National Institute of Standards and Technology reports the numerical
absolute value of the elementary charge e as 1.602 176 634 ×10−19 coulomb (C), where in the
standard error1 field it reports (exact) [1]. This exact determination was made official in 2018, and
before that, there was still a standard error of the order of 10−27 C associated with its determination.
1 In metrology, conventionally, this is called standard uncertainty.
It is quite remarkable that in the short period from a Friday in 1897 when J. J. Thomson announced
his discovery of the electron to the audience at the Royal Institution of Great Britain, to 2018, we
have been able to determine this minuscule number accurately. It is helpful to look at some history
regarding the efforts in determining e. Robert A. Millikan designed the famous oil-drop experiment
and published his determination of e in 1913 in a seminal paper [2]. He reported his estimate of
e with a standard error that was several times smaller than anything known before. Millikan was
awarded the 1923 Nobel prize in Physics “for his work on the elementary charge of electricity and
on the photoelectric effect.”
Using his newly developed apparatus Millikan had been able to calculate e as (1.5924 ± 0.003) × 10⁻¹⁹ C.² This determination was corroborated independently in several replications between 1913
and 1923 using the oil drop experiment. During this time, Millikan further refined the apparatus and
his calculations and collected more data points to improve this estimate. In his Nobel lecture in 1924,
he reported a more than 40% improved standard error of 0.0017 × 10−19 C [3]. During the lecture
he notes:
After ten years of work in other laboratories in checking the methods and the results obtained
in connection with the oil-drop investigation published ... there is practically universal
concurrence upon their correctness...
If we look closely, though, we see that the true value of e is more than 3.25 standard errors larger than Millikan's first estimate and more than 5.75 standard errors larger than Millikan's improved
estimate. Thus, in statistical terms, the estimate was biased. It was biased the first time it was
published, and it was biased in every replication of the experiment during that ten years. The
replicated results using the oil-drop experiment only provided a higher certainty about this biased
estimate.
How did we correct this bias? Did future oil-drop investigations suddenly remove the bias in the
estimate? No. Did much larger sample sizes remove the bias in the estimate? No. Did refinements of
the statistical methods remove the bias in the estimate? No.
In 1924 Karl Manne Siegbahn was awarded the Nobel prize in Physics for his work in the field
of X-ray spectroscopy. Using this technology, in his thesis, Erik Bäcklin (1928) was able to find an
alternate method to calculate e. This value was higher than Millikan’s determination. But this was
not a confirmation that Millikan’s determination was biased. Far from that, this would be considered
an extreme determination among several determinations of the value of e that provided surprising
concurrence.
That same year, Arthur S. Eddington published a purely theoretical determination of e based on
the newly developed Pauli’s exclusion principle and General Relativity [4]. His calculation also gave
a higher estimate than Millikan’s determination. In the paper Eddington writes:
the discrepancy is about three times the probable error attributed to [Millikan’s] value, I
cannot persuade myself that the fault lies with the theory.
He then writes:
I have learnt of a new determination of e by Siegbahn .... I understand that a higher value is
obtained which would closely confirm the present theory.
Today we know that Millikan’s determination was biased because of an erroneous value for the
coefficient of the air’s inner friction in the oil-drop experiment. Replications of the experiment or
refinement of the calculations could not correct this bias. Only when different experiments were
conducted did the bias start to become clear. Logically, it was also possible that the determination from
the oil-drop experiments was correct, and the determinations by Siegbahn and Eddington were biased.
2 Millikan reported in statC; 1 C = 2997924580 statC.
After all, each of these methods used fairly new technologies that would be open to criticism at the
time. It is not the many replications or variations of one method but the corroboration of the results
from a multitude of independent methods that are susceptible to different biases that strengthened
our belief regarding the value of e.
3 Data were obtained from the Radiation Effects Research Foundation; a public interest foundation funded by the Japanese
Ministry of Health, Labour and Welfare and the U.S. Department of Energy. The views of the author do not necessarily reflect
those of the two governments.
FIGURE 26.1
Radiation exposure and solid cancer incidence adjusted for age and sex. Panel (b) only shows in-city survivors. Dose categories are 2 for 0–5 mGy, 3 for 5–20 mGy, 4 for 20–40 mGy, ..., 20 for 1750–2000 mGy, 21 for 2000–2500 mGy, and 22 for 2500–3000 mGy colon dose; see www.rerf.or.jp.
Figure 26.1 shows two plots based on this study. Individuals are stratified into 30 strata based on
age and sex; see details in [5]. Panel (a) of Figure 26.1 plots the risks of solid cancer for the NIC
residents and survivors. The risk tends to be higher for the survivors relative to NIC residents. The
Mantel–Haenszel test rejects the null hypothesis of no effect of radiation exposure on solid cancer in favor of a carcinogenic effect of radiation exposure; one-sided p-value $2.35 \times 10^{-10}$. Note that
this comparison is valid under the assumption that the survivors and NIC residents are similar in all
characteristics other than their radiation exposure. This assumption could be violated, for example, if
not-in-city residents were better educated or employed.
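As a hedged illustration of how this first-factor test can be computed, the following R sketch applies the one-sided Mantel–Haenszel test to stratified 2 × 2 tables; the counts below are hypothetical placeholders, not the Life Span Study data.

    # Hypothetical 2 x 2 x S tables: exposure (survivor vs NIC) by solid
    # cancer (yes/no), within strata defined by age and sex.
    tabs <- array(c(30, 15, 170, 185,   # stratum 1 (made-up counts)
                    22, 10, 178, 190,   # stratum 2
                    12,  5, 188, 195),  # stratum 3
                  dim = c(2, 2, 3),
                  dimnames = list(exposure = c("survivor", "NIC"),
                                  cancer   = c("yes", "no"),
                                  stratum  = 1:3))
    # One-sided test of no effect against a carcinogenic effect.
    mantelhaen.test(tabs, alternative = "greater")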
We could redo this test with a different test statistic that provides increased power for one-sided
alternatives in contingency tables; see examples in [6]. But if in fact there is no carcinogenic effect of
radiation and the first analysis is biased, this second test does nothing to weaken the effect of that
bias on our evidence. In this section we are interested in the structure of evidence factor analyses;
hence we do not focus on the choice of test statistics. Section 26.6 provides some discussion of
appropriate choices of test statistics.
Figure 26.1(b) looks at a different aspect of the effect of radiation. This time, the plot is of
estimated odds ratios of cancer for the survivors versus NIC residents, plotted against the dose
categories of the survivors. The apparent pattern is that the odds ratio increases with a higher
radiation dose. There are a few important facts about this pattern. First, this pattern would appear
if the null hypothesis of no effect were false and the carcinogenic effect worsened with higher
radiation dose; Kendall's test for correlation gives a one-sided p-value of 5.26 × 10⁻¹⁴. Second, we
are not entitled to this p-value without assumptions. This analysis could also be biased: the
hypocentre was close to the urban center, so the survivors exposed to higher doses tended to be
located in more urban areas; also, high-dose survivors might have had to be comparatively healthy
to survive a high dose. But this bias acts differently from the one that worried us in the first
analysis. Third, Figure 26.1(a) gives no indication that such a pattern would exist in Figure
26.1(b). In a sense, these two pieces of evidence complement each other. This is in contrast to two
tests on the same Figure 26.1(b), say Kendall's test and Pearson's correlation test, which would
be highly correlated and similarly biased. Stated a bit more formally, one could derive the asymptotic
joint distribution of the two factors' test statistics using a bivariate central limit theorem; their
asymptotic correlation is 0, indicating that they are asymptotically independent.
Figures 26.1(a) and (b) thus illustrate two factors in an evidence factor analysis of the carcinogenic
effect of radiation with the Life Span Study data.
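The dose–response factor in Figure 26.1(b) can be tested with Kendall's correlation, as in the following hedged R sketch; the dose categories and log odds ratios are made-up placeholders, not the study's estimates.

    # Hypothetical dose categories (2, ..., 22) and estimated log odds
    # ratios of cancer for survivors in each category vs NIC residents.
    set.seed(1)
    dose  <- 2:22
    logor <- 0.02 * (dose - 2) + rnorm(length(dose), sd = 0.05)
    # One-sided Kendall test of no correlation against an increasing trend.
    cor.test(dose, logor, method = "kendall", alternative = "greater")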
In the above example, the doses of the treatment were easily conceived, and the p-values were very
small. We discuss in Section 26.4 how to leverage the properties of evidence factor analyses to make
stronger conclusions. The example below differs from the previous one: it does not come with an
obvious treatment with doses, and the comparisons do not unambiguously agree with each other.
How do we create the factors? What do we learn from their evidence factor analysis?
Example 26.2. Adoption of new safety technology in cars happens gradually. There is an earlier
period in which cars do not have the technology and a later period in which most cars have it by
default. In between, there is often a transition period in which the technology is available as an
optional purchase. There are many car manufacturers, and some may adopt the technology faster
than others; the timeline may also vary across vehicle models from the same manufacturer.
Consider, in this context, studying the effectiveness of side-airbags during a crash. A simple
comparison of injuries to drivers of cars with and without side-airbags is not appealing, because
side-airbags are only one difference between these vehicles and
their drivers. Perhaps Volvos (which tended to have side-airbags earlier than other makes) attract
drivers concerned with safety, with the possible consequence that Volvos are driven differently from,
say, Dodge Chargers.
A more attractive comparison can be made that avoids this bias: we can compare crashes of the
same makes and models of cars across eras that differ in side-airbag availability. This comparison
was made in [7] using the U.S. NHTSA's Fatality Analysis Reporting System (FARS) records.
We omit discussion of a possible selection bias in this data set, which could arise because FARS
only has records of fatal crashes, and of its resolution; see Section 2.2 of [7]. The analysis created
matched sets of three types of vehicles, each set consisting of cars of the same make and model:
one from the “none” period, when the car did not have side-airbags; either one or three from the
“optional” period, when side-airbags were available only as an optional purchase; and one from the
“all” period. The matching adjusted for measured characteristics of the driver, e.g., age, gender,
and whether belted, and characteristics of the crash, e.g., the direction of impact. The outcome is a
measure of the severity of the injury, from 0 for uninjured to 4 for death.
This type of matching of three groups is not standard in observational study designs; we discuss
the algorithm used to create this matched design in Section 26.5.1. The matched design in this
study had 2,375 matched sets and a total of 9,084 cars.
FIGURE 26.2
Injury severity in matched crashes across three eras of side-airbag availability: None, Optional, and All. The star in the boxplot and the vertical line in the barplot represent the corresponding mean. This is Figure 3 of [7].
This study has two evidence factors: (i) all-and-optional versus none era, and (ii) all versus
optional era. Note that only 18% of the studied vehicles in the “optional” era had side-airbags.
Figure 26.2 reproduces Figure 3 of [7]; it shows that the “none” period had higher injuries. This is
also seen in a stratified test for the first factor using the senm function of the sensitivitymult
package in R, which gives a one-sided p-value of 1.26 × 10⁻⁶. But the second factor gives a one-sided
p-value of 0.5801. This indicates that the evidence from the first factor is not plausibly an effect
caused by side-airbags, because there is no significant difference between the “all” and “optional”
eras, even though only 18% of owners in the “optional” era had purchased vehicles with side-airbags.
If we had not looked at two factors, the comparison of the “none” era to the other eras might
mistakenly have been taken as evidence for effects caused by side-airbags.
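As a hedged sketch, the first-factor test can be computed with senm roughly as follows. The data are simulated stand-ins for the matched sets, and the arguments shown (y, z, mset, gamma) reflect our reading of the sensitivitymult documentation; consult ?senm before use.

    library(sensitivitymult)
    # Simulated stand-in: matched triples of cars of the same make/model,
    # z = 1 for the "none" era car, z = 0 for the optional/all era cars.
    set.seed(2)
    mset   <- rep(1:50, each = 3)
    z      <- rep(c(1, 0, 0), 50)
    injury <- rbinom(150, 4, prob = ifelse(z == 1, 0.45, 0.35))
    # Randomization p-value at gamma = 1; raise gamma for a sensitivity analysis.
    senm(y = injury, z = z, mset = mset, gamma = 1, alternative = "greater")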
Other comparisons could also be of interest in this design. A comparison of the “all” era to the
“none” era also provides a signal of reduced injury when side-airbags were standard in cars. But this
comparison does not complement the first analysis in the sense of our previous discussion: the two
analyses are not independent, and thus they do not form evidence factors. We continue this
discussion in Section 26.4.2.
FIGURE 26.3
Covariate-adjusted periodontal disease for smokers and nonsmokers (a). Smokers' periodontal disease relative to nonsmokers in the same stratum, by groups of relative cotinine level: ≤ 33rd percentile, (33, 66]th percentile, and > 66th percentile (b). Smokers' cotinine level relative to nonsmokers in the stratum (c).
Definition 26.1 ([14]). The ordered partial exclusion restriction holds for 𝒦 if, with the values of
the first k − 1 instruments fixed by conditioning, the potential outcomes of the units are specified by
the exposure status of the units.
Definition 26.2 ([15]). The unordered partial exclusion restriction holds for 𝒦 if, with the values of
the instruments not in 𝒦 fixed by conditioning, the potential outcomes of the units are specified by
the exposure status of the units.
The unordered partial exclusion restriction is less restrictive than its ordered counterpart. For
example, with K = 3 instruments and 𝒦 = {2}, the unordered partial exclusion restriction holds for
𝒦 even if the third instrument directly affects the outcome, but the ordered partial exclusion
restriction does not. We refer the reader to [15] for a detailed discussion of these two restrictions.
If the ordered partial exclusion restriction holds for 𝒦, and further there are no unmeasured
confounders in the instruments in 𝒦, then we can create |𝒦| (the cardinality of 𝒦) evidence factors
using the reinforced design developed in [14]. In this method, K comparisons are made, where the
kth step performs an instrumental variables analysis using the kth instrument after stratifying on the
previous k − 1 instruments. If 𝒦 were known, one could conduct only |𝒦| analyses, stratifying on the
first k_min − 1 instruments, where k_min is the smallest index in 𝒦. But in practice 𝒦 is unlikely to
be known. In that case, a planned analysis can synthesize the results from the K comparisons to
provide useful detail regarding the strength of the evidence in the presence of invalid instruments.
Planned analyses of evidence factors are discussed in Section 26.4.
If instead the unordered partial exclusion restriction holds for 𝒦, a reinforced design could fail to
give valid evidence factors. For example, suppose K = 3, 𝒦 = {2}, and the third instrument directly
affects the outcome; in this case the reinforced design does not work. Instead, we can construct
evidence factors for this situation using the balanced blocking method of [15]. This method creates
ex post facto strata in which the empirical distributions of the instruments are made jointly
independent within each stratum. For example, with K = 2 binary instruments, a stratum may look
like (a) or (b) in Table 26.1, but not like Table 26.1(c), where the second instrument tends to be
higher for higher values of the first instrument.
TABLE 26.1
Examples of balanced blocks (a) and (b), and an unbalanced block (c).
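For two binary instruments, empirical joint independence within a block is equivalent to the 2 × 2 cell counts having cross-product ratio 1, which is easy to check; a minimal sketch with hypothetical counts (not the counts of Table 26.1) follows.

    # Cell counts n[z1, z2] of the two binary instruments within one block.
    block_b <- matrix(c(4, 2, 2, 1), 2)  # 4*1 == 2*2: empirically independent
    block_c <- matrix(c(4, 1, 1, 4), 2)  # 4*4 != 1*1: dependent, as in (c)
    is_balanced <- function(n) n[1, 1] * n[2, 2] == n[1, 2] * n[2, 1]
    is_balanced(block_b)  # TRUE
    is_balanced(block_c)  # FALSE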
A few remarks on the differences between a reinforced design and balanced blocking follow. (i)
Balanced blocking requires the creation of a special design in which the distributions of the
instruments are balanced against each other; no such design is required in a reinforced design. If the
instruments are highly imbalanced, balanced blocking may require throwing away many data points,
or a balanced blocking may be impossible to create. The latter happens, for example, when the
instruments are nested; for nested instruments, [15] develop the mutual stratification method for
evidence factor analysis. (ii) The analysis in balanced blocking does not require the conditional
analyses on other instruments that are central to a reinforced design; instead, one can analyze each
instrument marginally. Marginal analyses often improve the interpretability of the inference; see
Section 3.5 of [15]. (iii) The factors from a reinforced design are nearly independent in finite
samples (technical details in Section 26.3.3), but the factors from the balanced blocking method are
only asymptotically independent.
Both reinforced design and balanced blocking call for novel design tools. It turns out that both of
these design problems are NP-hard; intuitively, this means it is not always possible to compute the
optimal designs quickly. Section 26.5 gives fast algorithms that create designs close to the optimal
designs.
Example 4. Does Catholic schooling have a positive impact on future earnings relative to public
schooling? [16] were the first to study this question, by directly comparing public high schools
to private Catholic high schools. Following the study, controversy arose, partly for methodological
reasons. It is easy to see that parents' education, income, and social status can confound this analysis,
so we should adjust for them. But parents' and children's commitment to education and ambitions
are not easy to measure accurately, and thus not easy to adjust for, which could bias the direct
comparison.
In the literature, at least two instrumental variables have been used to study this question. The
first uses the fact that geographic proximity to Catholic schools nudges a child to attend a Catholic
school; thus, if the answer to the above question were yes, we should see higher incomes for
individuals raised closer to Catholic schools relative to others who were otherwise comparable. The
second uses the fact that being raised in a Catholic family also has a strong influence on the decision
to attend a Catholic school. But the literature also has many arguments and counter-arguments
regarding the validity of these candidate instruments.
[14] present an evidence factor analysis in which, rather than choosing one comparison, there
are three factors: the two candidate binary instruments above (raised in an urban vs rural area, and
raised in a Catholic vs other household) and the direct comparison. They use a reinforced design in
which the comparisons are made in order: first, proximity to Catholic schools; then, raised in a
Catholic family; and finally, the direct comparison. This analysis provides nearly independent
inferences that are valid for the question under an ordered partial exclusion restriction.
Using the same data from the Wisconsin Longitudinal Study of students graduating from high school
in 1957 (see details in [14]), we present an evidence factor analysis using the balanced blocking
method. Our outcome measure is yearly wages in 1974. We adjust for students' IQ and parents'
education, income, occupational socioeconomic index (SEI) score, and occupational prestige score.
We create a balanced blocking that adjusts for these covariates, where every balanced block includes
5 individuals for each of the four possible pairs of values of the two instruments. We discuss the
algorithm by which this is achieved in Section 26.5.3.
Table 26.2 is our balance table, showing summaries of the covariates before and after balanced
blocking. An absolute standardized difference in the table has the same interpretation as in a matched
pairs design, but it requires a different definition for a stratified design, which is given in Section 26.6.
Pearson's χ² statistic for an association between the urban/rural instrument and attending Catholic
school is 27.65, and between the religion instrument and attending Catholic school it is 204.38. Both
instruments are therefore strongly associated with the treatment.
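These χ² statistics are ordinary Pearson statistics for 2 × 2 tables and can be computed as in this brief sketch, with hypothetical indicator vectors standing in for the Wisconsin data.

    # Hypothetical binary vectors: urban/rural instrument and Catholic schooling.
    set.seed(3)
    urban    <- rbinom(500, 1, 0.5)
    catholic <- rbinom(500, 1, plogis(-1 + 0.6 * urban))
    chisq.test(table(urban, catholic), correct = FALSE)$statistic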
There are several choices of test statistic for our three comparisons. We choose the stratified
Wilcoxon rank-sum statistic because it is robust to extreme values of the outcome, and the stratum
sizes are equal. We have three factors in this study. The first compares the outcomes for urban vs
rural using the strata created by the balanced blocks. The second compares the outcomes for raised
Catholic vs other using the strata created by the balanced blocks. The third compares the outcomes
for Catholic vs public schooling using strata created by the balanced blocks and the two binary
instruments.
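One way to compute such a stratified Wilcoxon comparison is with the coin package, as in this hedged sketch; the data frame is a simulated stand-in for the balanced-block data, and coin's default stratum weighting in the rank test is an implementation choice.

    library(coin)
    # Simulated stand-in for the balanced-block data.
    set.seed(4)
    wls <- data.frame(wage  = rlnorm(200, meanlog = 10, sdlog = 0.5),
                      urban = factor(rep(c("urban", "rural"), 100)),
                      block = factor(rep(1:20, each = 10)))
    # Factor 1: urban vs rural wages, stratified by balanced block.
    wilcox_test(wage ~ urban | block, data = wls, alternative = "greater")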
Table 26.3 reports the results of these analyses. The point estimates and confidence intervals
are calculated in the usual way by inverting the test ([17], Section 2.4) under the assumption of a
Tobit effect; see Section 2.4.5 of [18]. Our three factors agree in finding no significant effect of
Catholic vs public schooling. Since these comparisons are nearly independent, more can be said
from these results. For example, what is the evidence obtained by combining these factors? What
evidence remains if we believe one of these factors is biased? These questions are discussed in
Section 26.4.1.
TABLE 26.2
Covariate balance before and after balanced blocking for example 4. We report the means for each
covariate at the two levels of the instruments and then the absolute standardized difference of the
covariate between the levels, e.g., 0.107 is the absolute standardized difference in IQ score between
the urban and rural groups before balanced blocking.
TABLE 26.3
Evidence factors in the study of the effect of Catholic schooling on yearly wage ($) in 1974.

                           Urban/Rural        Raised Catholic/Other   Schooling Public/Catholic
  p-value                  .0640              .1398                   .1608
  Point estimate           4,946              1,145                   419
  95% Confidence interval  (−1,671, 25,833)   (−945, 3,586)           (−697, 1,759)
A brief remark: in our analysis we made two adjustments for the covariates. First, in balanced
blocking, we created blocks whose units are close in their covariate values. Then we used a
covariance adjustment method, in which the residuals from a robust linear regression of the outcome
on the covariates and block levels are the inputs to the stratified Wilcoxon tests; this method is
discussed in [19].
The above proposition shows that we can create exactly independent tests for a null hypothesis in
the same study. It is easy to see that we do not need to use Wilcoxon's rank-sum statistics for the
tests: any function of $(R^{\sigma}_{1}, \ldots, R^{\sigma}_{n_1})$ could be used for $T_1$, and any
function of $(\tilde{R}^{\sigma}_{1}, \ldots, \tilde{R}^{\sigma}_{n_2+n_3})$ could be used for $T_2$.
Alternatively, exactly independent comparisons can also be created by sample splitting. However,
performing independent comparisons on two or more split samples is not a good strategy for
creating evidence factors. First, it reduces the effective sample size of each analysis. Second, and
more importantly, two analyses using similar statistical methods on separate splits do not protect us
from a variety of unmeasured biases.
The representation of a treatment assignment as a random permutation comes in quite handy in
one kind of construction of evidence factors where, as we did here, the factors are created from a
factorization of the treatment assignment. A general construction along this line is given in [20],
which is further extended in Theorem 9 of [21].
Ranks are somewhat special in the sense that they can provide exactly independent analyses; many
common pairs of test statistics will not have this property. It turns out that an evidence factor
analysis does not need exact independence: a notion of near independence, formally defined in
Definition 26.5, is sufficient. Nearly independent comparisons are easier to achieve. Further, nearly
independent comparisons can be combined as if they were independent to report combined evidence
against the null, with a type-I error that can only be conservative. Thus, pieces of evidence from two
nearly independent comparisons do not merely repeat each other.
this way, a bias level of Γ = 2 could be created by an unmeasured confounder that increases the odds
of receiving the treatment fivefold and increases the odds of a positive treated-minus-control
difference in outcomes threefold. Again, these effects are not small.
What do we do after specifying a value of Γ? If Γ = 1, there are no unmeasured confounders;
thus, in a matched pairs design, which unit of a pair receives the treatment is determined by the
toss of a fair coin. We can use this randomized treatment assignment distribution to obtain the null
distribution of our test statistic, and in this way calculate an exact p-value for our test [23]. All the
p-values reported in Section 26.2 were calculated this way, each under its own randomization
distribution.
If we specify Γ > 1, we do not know the exact distribution of the treatment assignment; rather,
a family of treatment assignment distributions is possible. A sensitivity analysis p-value is then
calculated as the upper bound over all p-values possible under this family.
In example 3, when Γ = 1, the comparison of smokers to nonsmokers on periodontal disease
gave a p-value below 1.0 × 10⁻¹⁶ for van Elteren's test. To assess the sensitivity of this inference, let
Γ = 2. The sensitivity analysis p-value for the test (this can be done using the R package
senstrat) is 0.0003. This number is larger than the p-value at Γ = 1, as expected, but it is still
smaller than the nominal level 0.05. Thus, a bias of Γ = 2 is not enough to refute the rejection of
the null hypothesis. A large enough bias level can invalidate any inference made under the
assumption of no unmeasured confounding: at Γ = 2.75 the sensitivity analysis p-value is .1707,
and at Γ = 3 it is .3937.
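A hedged sketch of such a sensitivity analysis with senstrat follows. The data are simulated, and the argument names (sc for scored outcomes, z for exposure, st for stratum, gamma for Γ) are our reading of the package interface, which should be checked against ?senstrat before use.

    library(senstrat)
    # Simulated stratified data: outcome scores, exposure, and strata.
    # (Argument names below are assumptions; consult the documentation.)
    set.seed(5)
    st <- rep(1:30, each = 10)
    z  <- rbinom(300, 1, 0.4)
    y  <- rnorm(300, mean = 0.3 * z)
    # p-value upper bound at bias Gamma = 2.
    senstrat(sc = y, z = z, st = st, gamma = 2)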
26.3.3 Evidence factors: definition and integrating evidence using joint sensitivity analyses
Now consider K factors and their corresponding sensitivity parameters Γ1, . . . , ΓK. Let $\bar{P}_{k,\Gamma_k}$
denote the p-value upper bound from comparison k at bias Γk ≥ 1. Definition 26.5 below formalizes
evidence factors; it relies on Definition 26.4.

Definition 26.4. A vector of p-values (P1, . . . , PK) ∈ [0, 1]^K is stochastically larger than the uniform
when, for all coordinatewise non-decreasing bounded functions g : [0, 1]^K → ℝ, we have, under the
null,

$$E\{g(P_1, \ldots, P_K)\} \ge E\{g(U_1, \ldots, U_K)\}, \qquad (26.1)$$

where U1, . . . , UK are i.i.d. uniform[0, 1] random variables.

Definition 26.5. Multiple analyses are evidence factors when (i) bias that invalidates one analysis
does not necessarily bias the other analyses; and (ii) for any (Γ1, . . . , ΓK), the p-value upper bounds
($\bar{P}_{1,\Gamma_1}, \ldots, \bar{P}_{K,\Gamma_K}$) are stochastically larger than the uniform.
Consider a simple implication of two factors satisfying the definition. Define a non-decreasing
bounded function g on [0, 1]² by g(p1, p2) = 1 if p1 ≥ α1 or p2 ≥ α2, and g(p1, p2) = 0 otherwise,
for some specified values α1, α2 ∈ (0, 1]. Then the definition applied to two factors with Γ1 = Γ2 = 1
implies

$$\Pr(\bar{P}_{1,1} < \alpha_1 \text{ and } \bar{P}_{2,1} < \alpha_2) \le \Pr(U_1 < \alpha_1 \text{ and } U_2 < \alpha_2) = \alpha_1 \alpha_2.$$

Assume $\bar{P}_{1,1}$ has a uniform distribution, as happens in many common situations in large
samples. Then the above implies that, under the null, $\Pr(\bar{P}_{2,1} < \alpha_2 \mid \bar{P}_{1,1} < \alpha_1) \le \alpha_2$.
Thus, if we had a testing procedure that rejected the null when $\bar{P}_{1,1} < \alpha_1$, a decision to
reject the null using the first factor would not inflate the type-I error of the second factor. In this
sense the comparisons are nearly independent, as required by (ii) of Definition 26.5.
Another important implication of the definition is the following result. Consider any method
that combines K independent p-values into a valid p-value. If the method is coordinatewise
non-decreasing, it can be applied to ($\bar{P}_{1,\Gamma_1}, \ldots, \bar{P}_{K,\Gamma_K}$) to calculate a p-value that is valid for
testing the null hypothesis when the biases in the factors are at most Γ1, . . . , ΓK. Hence, we can
combine the factors as if they were independent.
Many methods are available for combining independent p-values; Fisher's method is a popular
choice [31]. This method calculates a new test statistic as −2 times the sum of the logarithms of the
K p-values. If the p-values were independent and uniformly distributed on [0, 1], this statistic would
have a χ² distribution with 2K degrees of freedom, and Fisher's combination method calculates the
combined p-value from this null distribution. Specifically, Fisher's combination of the $\bar{P}_{k,\Gamma_k}$'s is

$$\Pr\Big(\chi^2_{2K} > -2 \sum_{k=1}^{K} \log \bar{P}_{k,\Gamma_k}\Big).$$

By the near-independence property of the factors, this is a valid p-value when the biases are at
most the Γk's. [32] gives a comprehensive, although slightly outdated, summary of various combination
methods. Using such methods we can produce joint sensitivity analyses with several factors.
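Fisher's combination of the p-value upper bounds is a one-line computation; a minimal sketch with illustrative values (not those of the examples) follows.

    # Combine p-value upper bounds as if independent (valid by near independence).
    fisher_combine <- function(p)
      pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
    fisher_combine(c(0.04, 0.20, 0.12))  # illustrative values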
Two questions arise: Which combination method should we use? And what do we gain from joint
sensitivity analyses of the factors, compared with looking at the individual factors?
Consider the first question. Some combination methods are better suited to joint sensitivity analysis.
An attractive choice is the truncated product method of [33]. It uses a threshold κ, 0 < κ ≤ 1, to
define the statistic

$$-2 \sum_{k :\, \bar{P}_{k,\Gamma_k} \le \kappa} \log \bar{P}_{k,\Gamma_k}.$$

The combined p-value is calculated by comparing the value of this statistic to its null distribution,
which has an analytic form [33]. The truncated product method de-emphasizes larger (> κ) p-values
by treating them all as 1; when κ = 1 we recover Fisher's combination test. Since the $\bar{P}_{k,\Gamma_k}$'s are
p-value upper bounds, their exact numeric values are of little interest beyond a certain point. This
fact makes the truncated product method appealing in joint sensitivity analyses.
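The analytic null distribution of the truncated product is given in [33]; the following minimal sketch instead approximates the combined p-value by Monte Carlo, which avoids transcribing that formula.

    # Truncated product statistic: -2 * sum of logs of p-values <= kappa.
    trunc_stat <- function(p, kappa) -2 * sum(log(p[p <= kappa]))
    truncated_product_p <- function(p, kappa = 0.2, nsim = 1e5) {
      obs  <- trunc_stat(p, kappa)
      null <- replicate(nsim, trunc_stat(runif(length(p)), kappa))
      mean(null >= obs)  # Monte Carlo p-value under independent uniforms
    }
    truncated_product_p(c(0.04, 0.20, 0.12))  # illustrative values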
Although Fisher's method and the truncated product method have equivalent asymptotic power
performance (see Proposition 7 of [34] with k = 1), in practical situations the truncated product
method often performs better in sensitivity analyses to unmeasured confounders [35]. Typically, the
suggested value of κ is .1 or .2.
For the second question, consider example 3 again.
Example 3 (continued). Recall the three factors from Figure 26.3. Notice that not everything about
Figure 26.3(a) is independent of Figure 26.3(b): the smokers in the second boxplot of 26.3(a) have
their periodontal disease on the vertical axis, and in 26.3(b) we again have the periodontal disease
of the smokers on the vertical axis. Yet the three comparisons in Figure 26.3(a) and (b) are nearly
independent.
Assuming no unmeasured biases, i.e., Γ1 = Γ2 = Γ3 = 1, the p-values from the three comparisons
were < 1.0 × 10⁻¹⁶, .0024 and .066, respectively. Considered separately, they give different
impressions of the strength of evidence from the factors: factor 3 is not significant at the 5%
significance level. But taken together with factor 2, the combined p-value of the two is significant;
the combined p-value is .0013 using the truncated product method with κ = .2.
The factors can also be susceptible to different levels of bias. In Table 26.4 we report joint
sensitivity analyses of the factors for varied values of Γ1, Γ2, and Γ3. [5] show that this table,
despite containing many p-values, does not require a multiplicity correction. Thus we can read off
the table, all at once, that at the 5% significance level we have evidence to reject the null when
Γ1 = 2 and Γ2 = Γ3 = 1.5, and that the inference is sensitive to bias at Γ1 = 2.25 and Γ2 = Γ3 = 1.75.
We can also see that if factor 1 is infinitely biased, the evidence remains as long as the other factors
are free of biases; and if factors 2 and 3 are infinitely biased, we still have evidence to reject the
null for Γ1 ≤ 2.
The analysis in Table 26.4 combines all three factors. After reading this table, can we also look at
the table for factors 1 and 2, or at the individual factors? Can we also consider the comparison of
high vs low relative cotinine levels in Figure 26.3(b)? These questions are answered in the following
section.
TABLE 26.4
Joint sensitivity analyses of the evidence factors for smoking and periodontal disease. The p-value
upper bounds of the factors are combined using the truncated product method with κ = .2.
Γ2 = Γ3
Γ1 1 1.25 1.5 1.75 2 2.25 2.5 ∞
1 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000
1.25 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000
1.5 .0000 .0000 .0000 .0000 .0000 .0000 .0000 .0000
1.75 .0000 .0000 .0000 .0002 .0002 .0002 .0002 .0002
2 .0000 .0004 .0012 .0071 .0071 .0071 .0071 .0071
2.25 .0001 .0044 .0128 .0593 .0593 .0593 .0593 .0593
2.5 .0004 .0200 .0522 .1896 .1896 .1896 .1896 .1896
2.75 .0012 .0494 .1200 .4317 .4317 .4317 .4317 .4317
∞ .0047 .1448 .3059 1.0000 1.0000 1.0000 1.0000 1.0000
for a kth order partial conjunction hypothesis among the K factors. In other words,
$H^{k|K}_{0;\Gamma_1,\ldots,\Gamma_K}$ says that at most k − 1 of the K factors are valid at their
specified bias levels. A partial conjunction can be defined for each order k = 1, . . . , K; the Kth order
partial conjunction is the global intersection hypothesis from before. Thus, testing several of these
partial conjunctions can provide more comprehensive evidence in a causal inference method.
A primary reason to focus on the K partial conjunctions rather than the 2^K − 1 combinations is
that we can guarantee family-wise error rate control without any multiplicity correction. To see
this, first define $\bar{P}^{k|K}_{\Gamma_1,\ldots,\Gamma_K}$ as the combination of the largest K − k + 1 of the p-value upper
bounds ($\bar{P}_{1,\Gamma_1}, \ldots, \bar{P}_{K,\Gamma_K}$).
We have the following results for type-I error control; these results are proved in [34].
1. For fixed k and Γ1, . . . , ΓK, $\bar{P}^{k|K}_{\Gamma_1,\ldots,\Gamma_K}$ is a valid p-value for testing $H^{k|K}_{0;\Gamma_1,\ldots,\Gamma_K}$ for any
coordinatewise nondecreasing combination method.
2. Fix k and any set J of values for (Γ1, . . . , ΓK). The testing procedure that rejects $H^{k|K}_{0;\Gamma_1,\ldots,\Gamma_K}$
if $\bar{P}^{k|K}_{\Gamma_1,\ldots,\Gamma_K} < \alpha$ has a familywise error rate of at most α for the set of hypotheses
$\{H^{k|K}_{0;\Gamma_1,\ldots,\Gamma_K} : (\Gamma_1,\ldots,\Gamma_K) \in J\}$.
3. The same testing procedure also provides a familywise error rate of at most α for the set of
hypotheses $\{H^{k|K}_{0;\Gamma_1,\ldots,\Gamma_K} : 1 \le k \le K; (\Gamma_1,\ldots,\Gamma_K) \in J\}$ under an additional mild ordering
condition on the combination methods; see Proposition 3 of [34].
These results allow us to look at sensitivity analyses of partial conjunctions of all orders in an
evidence factor analysis without paying an inflated type-I error due to multiplicity. A minimal
computational sketch follows.
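The sketch below computes the kth order partial conjunction p-value by combining the largest K − k + 1 bounds with Fisher's method; the input p-values are illustrative, not those of the examples.

    # kth order partial conjunction: combine the largest K - k + 1 bounds.
    partial_conjunction <- function(p, k) {
      q <- sort(p, decreasing = TRUE)[seq_len(length(p) - k + 1)]
      pchisq(-2 * sum(log(q)), df = 2 * length(q), lower.tail = FALSE)
    }
    p_bounds <- c(0.01, 0.04, 0.20)       # illustrative p-value upper bounds
    partial_conjunction(p_bounds, k = 1)  # global intersection hypothesis
    partial_conjunction(p_bounds, k = 2)  # evidence if one factor is invalid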
After rejecting a partial conjunction hypothesis, it can be of interest to test the individual
hypotheses. That is, if $H^{k|K}_{0;\Gamma_1,\ldots,\Gamma_K}$ is rejected, so that at least k of the K hypotheses are judged
false, which of the individual hypotheses are false? This question can be answered by comparing
each factor at an adjusted level α/(K − k), or equivalently by adjusting the corresponding p-value
upper bound to (K − k)$\bar{P}_{k,\Gamma_k}$.
A special case of the above deserves attention. Suppose K = 2 and, by comparing the combination
of the two factors to level α, we reject the hypothesis and conclude that at least one of the two is
false. Since K = 2 and k = 1, we have K − k = 1, and once $H^{1|K}$ has been rejected the individual
p-value upper bounds can be considered without further adjustment. This is equivalent to closed
testing of two evidence factors [5, 36]. Closed testing of more than two factors is also possible, but
with 2^K − 1 conjunctions it can get complicated; the partial conjunction method has a universal
structure in this sense.
There is a family of combination methods that is optimal in the sense of having the largest
Bahadur slopes [37]. Fisher’s combination and the truncated product combination for p-values belong
to this family.
Example 4 (continued). In Table 26.3 the p-values were calculated assuming no bias from unmeasured
confounding. Since the evidence is separated into three nearly independent analyses, combining
them strengthens our evidence. Fisher's method for p-value combination applied to the three
factors gives a value of 0.0237. Although the individual factors are not statistically significant at level
0.05, this small combined p-value tells us that there is only a 2.37% probability that we would have
seen three relatively small p-values like these, .0640, .1398, and .1608, if the null were true. We can
also obtain an estimate and a confidence interval for the effect by inverting this combined p-value:
the effect estimate is $1,265 with 95% confidence interval ($120, $5,303). Thus, the combined
p-value supports the hypothesis of a positive effect when all the factors are free of any bias.
What evidence remains if at most one of the factors is invalid? Using the partial conjunction
analysis, this may be calculated as Fisher's combination of the two largest p-values, i.e., the
combination of .1398 and .1608, which is .0714.
If we assume small bias levels Γ1 = Γ2 = Γ3 = 1.1, the p-values from the three factors become
.2099, .3572 and .2895, respectively, and their combination is no longer significant. Thus, our
evidence factor analysis gives very weak evidence of a positive effect of Catholic vs public
schooling: the evidence does not hold up if the factor that most favors this hypothesis is invalid, or
if there are small biases.
The technical results in this section tell us that we do not have to pick just one of these results to
report for fear of incurring additional type-I error; we can report all of them at the same time.
Here is a note regarding possible violations of the exclusion restriction and bias from unmeasured
confounding in an instrument: it is not always possible to separate the two. For instance, higher
ambition might be interpreted as a confounder, so that conditioning on it renders the urban/rural
instrument valid; without measurements of ambition, the instrument is invalid because the
no-unmeasured-confounders assumption fails. Or one might argue differently, that living in urban
areas comes along with ambition, resulting in a direct effect on earnings when ambition affects
earnings. It is not possible, within the study, to separate the two.
where u, u′ and u″ are from different groups. We are interested in matched sets of these three groups.
Specifically, given positive integers κ1 and κ2, we want to create |I1| disjoint sets (here indexed
by i) of the form (u_i, u′_{i1}, . . . , u′_{iκ1}, u″_{i1}, . . . , u″_{iκ2}), where u is from I1, the u′'s are
from I2, and the u″'s are from I3, that attempt to minimize

$$\frac{1}{\kappa_2} \sum_{i'=1}^{\kappa_1} d_{u_i,\,u'_{ii'}} + \frac{1}{\kappa_1} \sum_{i''=1}^{\kappa_2} d_{u_i,\,u''_{ii''}} + \frac{1}{\kappa_1 \kappa_2} \sum_{i'=1}^{\kappa_1} \sum_{i''=1}^{\kappa_2} d_{u'_{ii'},\,u''_{ii''}}. \qquad (26.3)$$
In the side-airbag example (example 2), the three groups were the cars from three eras of
side-airbag availability. Conceptually, we solved this problem many times, once for each make
and model; operationally, it is done as one large problem with all cars that imposes exact matching
on make and model. We defined the distances based on the pre-treatment covariates mentioned in
our earlier discussion. We skip some details of the calculations and give only the key ideas. Let x_u
denote the vector of covariates for unit u. Then one can define
$d_{u,u'} = \sqrt{(x_u - x_{u'})^{\top} \Sigma^{-1} (x_u - x_{u'})}$, where Σ is the covariance matrix
combining all the groups. This is the Mahalanobis distance of the covariates; the square root is
needed in defining the d's to satisfy (26.2). In our matching we used a robust version of the
Mahalanobis distance that uses ranks of the covariates instead of their actual values; it is less
sensitive to extreme covariate values, and it also satisfies (26.2); see [7] and Chapter 8 of [18].
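A minimal sketch of the square root of a rank-based Mahalanobis distance follows; this is our own illustration, not the exact implementation used in [7] or [18].

    # Square root of a rank-based Mahalanobis distance between all unit pairs.
    rank_mahal_sqrt <- function(X) {
      R    <- apply(as.matrix(X), 2, rank)  # replace covariates by their ranks
      Sinv <- solve(cov(R))                 # covariance of ranks over all groups
      n <- nrow(R)
      D <- matrix(0, n, n)
      for (i in seq_len(n - 1)) for (j in (i + 1):n) {
        v <- R[i, ] - R[j, ]
        D[i, j] <- D[j, i] <- sqrt(drop(t(v) %*% Sinv %*% v))
      }
      D
    }
    D <- rank_mahal_sqrt(mtcars[1:20, c("mpg", "wt", "hp")])  # toy example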
It turns out that even a simpler version of problem (26.3), in which all three groups have equal size
and κ1 = κ2 = 1, is NP-hard [41]. But in our design problems we are unlikely to have equal-sized
groups, and since some groups can be very large, it is natural to ask whether we can match more
than one unit from a group into each matched set. Obviously, we cannot take κ1 and κ2 too large;
otherwise there will be no feasible design.
[7] give an approximation algorithm for this problem with approximation factor a = 2. Given κ1
and κ2, the algorithm has a worst-case computational complexity of the order O(n³). It is
implemented in the R package approxmatch: the function tripletmatching takes as input the
distances and the group labels of the units and creates an approximately optimal design, and the
function multigrp_dist_struc provides several ways to calculate distances from the covariates
that satisfy (26.2).
The R function for three-group matching can do more than just minimize (26.3). If there are
categorical variables, it can be asked to solve the design problem approximately while also
maintaining a near-fine balance of the categorical variables; details are in [7]. Intuitively, near-fine
balance requires making the marginal distributions of the categorical variables as close to equal as
possible across the groups in the matched design.
The algorithm has two steps that are relatively easy to understand, although we skip the details. It
starts by matching the first group to the second group, minimizing
$\frac{1}{\kappa_2} \sum_{i'=1}^{\kappa_1} d_{u_i,\,u'_{ii'}}$; this creates partial matched sets. In the
next step, it completes the matched sets by choosing the remaining units to minimize
$\frac{1}{\kappa_1} \sum_{i''=1}^{\kappa_2} d_{u_i,\,u''_{ii''}} + \frac{1}{\kappa_1 \kappa_2} \sum_{i'=1}^{\kappa_1} \sum_{i''=1}^{\kappa_2} d_{u'_{ii'},\,u''_{ii''}}$.
Each of these two steps can be solved optimally on its own, as each is a matching problem between
two groups. Taken together, however, the resulting three-group matched design is not optimal,
because we might regret some of the partial matches made between the first two groups had we
seen the third group. But it can be shown that, because of (26.2), this regret is not large, and the
algorithm has an approximation factor of 2.
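For intuition only, here is a toy sketch of the two-step idea in the simplest case of three equal-sized groups and κ1 = κ2 = 1, using the Hungarian algorithm from the clue package; this is our own illustration, not the approxmatch implementation.

    library(clue)  # solve_LSAP: optimal assignment (Hungarian algorithm)
    set.seed(6)
    n  <- 30
    X1 <- matrix(rnorm(n * 2), n); X2 <- matrix(rnorm(n * 2), n)
    X3 <- matrix(rnorm(n * 2), n)
    d  <- function(A, B) {  # Euclidean distances; they satisfy (26.2)
      M <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
      sqrt(pmax(M, 0))
    }
    # Step 1: optimally pair group 1 with group 2.
    m12 <- solve_LSAP(d(X1, X2))
    # Step 2: complete each pair with a group-3 unit, minimizing the sum of
    # its distances to both members of the pair.
    m3 <- solve_LSAP(d(X1, X3) + d(X2, X3)[as.integer(m12), ])
    triples <- cbind(g1 = 1:n, g2 = as.integer(m12), g3 = as.integer(m3))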
The same package approxmatch also has a function kwaymatching that can create matched
sets for more than three groups. All the features of tripletmatching extend to this function. For
K groups, we now have a (K − 1)-approximation algorithm; as the problem with more groups is
more difficult, the worst-case approximation guarantee is also worse. The computational complexity
of the algorithm does not change with K; it is O(n³).
26.5.2 Stratification
In some instances, we want to adjust for covariates without disturbing the distributions of the
treatments, because those distributions might themselves be informative. Such is the case in the
reinforced design with multiple instruments [14]. The design problem here is different from the
matching problem of Section 26.5.1: this time we want to stratify ungrouped, unlabeled data based
on the covariate values. In example 3 we had a few categorical variables, so we stratified on the
joint categories of those variables; e.g., female, aged between 50 and 59, white, high school graduate
defines a stratum. In the Catholic schooling example we had many variables, both continuous and
categorical, so we cannot stratify in this way.
Consider n units and distances d_ij between units. We want to divide these n units into
non-overlapping sets of size k each so that the within-set distances are minimized. This is an attempt
to stratify the data into strata of size k that are homogeneous in the covariates. The problem is
important even outside the context of observational studies: in experimental designs with multiple
treatments, or one treatment with multiple factors, this stratification can be used to create a blocked
design.
This problem is also NP-hard for any k ≥ 3. For k = 2, it is equivalent to a non-bipartite
matching problem that can be solved efficiently [42]; this algorithm is implemented in the R package
nbpMatching [43].
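A hedged sketch of the k = 2 case with nbpMatching follows; distancematrix and nonbimatch are the package functions as we recall them, and the package expects integer-valued distances, so consult its documentation before use.

    library(nbpMatching)
    set.seed(7)
    X  <- matrix(rnorm(40 * 3), 40)       # 40 units, 3 covariates
    D  <- as.matrix(dist(X))              # pairwise distances
    dm <- distancematrix(round(100 * D))  # integer-valued distance object
    pairs <- nonbimatch(dm)$halves        # optimal non-bipartite pairing
    head(pairs)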
For general k, there have been some heuristic algorithms for this problem. Moore [44] proposed a
heuristic algorithm that works as follows. Let n = mk. The algorithm first creates m pairs greedily,
finding one pair at a time, each the best among the available units. Then, if k = 3, it greedily
matches one unit to each pair; for larger k, the method proceeds similarly until a stratification is
created. [45] proposed a randomized heuristic algorithm for the problem (implemented in the R
package blockingChallenge; see Footnote 4). It first randomly chooses m units as template units for
the m strata; the remaining m(k − 1) units are then optimally assigned to these m templates at a ratio
of (k − 1) : 1 to create the stratification. This method attempts many sets of random templates and
keeps the best of the stratifications created.
An approximation algorithm for this design problem is developed in [46] (see Footnote 5). The
objective function is the maximum within-stratum distance. (There are certain advantages to using
this objective instead of a sum of within-stratum distances: briefly, it guarantees that the average
imbalance for any treatment assignment cannot exceed this maximum, whereas minimizing sums or
averages provides no such guarantee.) For the special case where k = 2^J for some positive integer J,
this algorithm calls a non-bipartite matching J times. The first call creates m2^{J−1} strata of size 2,
i.e., pairs, such that the maximum paired distance is minimized. Each subsequent call halves the
number of strata by pairing the previous set of blocks so that the maximum of the paired
within-block distances is minimized in this local problem. The procedure needs some modification
when k is not a power of 2.
[46] gives a theoretical study of this algorithm. Theorem 3 of that paper shows that when the
d_ij's satisfy the triangle inequality, this is an approximation algorithm with approximation factor
a = (k − 1). The algorithm has a computational complexity of O(n³). Further, Proposition 4 of the
paper shows that no other approximation algorithm can have a better constant in general.

Footnote 4: available from https://github.com/bikram12345k/blockingChallenge
Footnote 5: An implementation of this algorithm is available from https://github.com/bikram12345k/BlockingAlgo
The general balanced blocking problem is therefore to find non-overlapping blocks from the
n units by optimizing over the $n_{b(z_1,z_2)}$ values under the constraint (26.4), while minimizing
the total or maximum within-block average distance. No algorithm, approximation or heuristic, has
yet been developed that solves this design problem; thus a careful study of the design problem for
balanced blocking remains incomplete.
In our example 4, we solved a slightly easier problem. We pre-specified the numbers $n_{b(z_1,z_2)}$
to be 5 for all blocks b and all (z1, z2) values; hence (26.4) is satisfied by this choice. We then used a
combination of the algorithms from Sections 26.5.1 and 26.5.2 to create our design heuristically.
We first defined four groups based on the values of (Z1, Z2) ∈ {0, 1}². Next, using our
approximation algorithm for matching with multiple groups, we created an intermediate design of
quadruples: matched sets in which the four groups are represented 1:1:1:1. We used the square root
of a rank-based Mahalanobis distance based on the covariates in Table 26.2, and additionally a
near-fine balance constraint on indicators for missing values of the covariates. Next, we used this
quadruples structure as input to a stratification algorithm from Section 26.5.2, this time defining the
distance between two quadruples as the total of the 4 × 4 distances between the units of the
quadruples. The stratification problem was solved with k = 5 to create our balanced blocked design.
The number 5 was chosen by trying a few choices of k with an eye on the covariate balance after
balanced blocking.
in our experiments (which are not reported here) this proposed definition captures imbalances in the
data similar to those captured by the usual standardized mean difference. After stratification, we
calculate our absolute standardized differences in two steps. First, perform a two-way ANOVA with
the groups as the first factor and the stratum levels as the second factor. Take the ratio of the sum of
squares for the groups from this ANOVA to the sum of squares of the residuals from the same
ANOVA. Finally, calculate the absolute standardized difference as the square root of this ratio
multiplied by the ratio of the sample size before stratification to the sample size after stratification.
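A hedged sketch of this two-step calculation follows; it reflects our reading of the recipe (in particular, we take the residual sum of squares from the same two-way ANOVA).

    # Absolute standardized difference of covariate x after stratification.
    std_diff_strat <- function(x, group, stratum, n_before) {
      ss <- summary(aov(x ~ factor(group) + factor(stratum)))[[1]][["Sum Sq"]]
      ratio <- ss[1] / ss[3]  # group sum of squares over residual sum of squares
      sqrt(ratio * n_before / length(x))
    }
    # Toy usage with simulated data:
    set.seed(8)
    g <- rep(0:1, 50); s <- rep(1:10, each = 10)
    std_diff_strat(rnorm(100, mean = 0.2 * g), g, s, n_before = 150)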
The standardized differences in Table 26.2 were calculated using this process. The sample size
before stratification was 4,449; after stratification it was 2,980, in the form of 149 strata (balanced
blocks) of 20 units each.
26.7 Acknowledgment
Bikram Karmakar was partly supported by NSF Grant DMS-2015250.
References
[1] Eite Tiesinga, Peter J Mohr, David B Newell, and Barry N Taylor. CODATA recommended values
of the fundamental physical constants: 2018. Journal of Physical and Chemical Reference
Data, 50(3):033105, 2021.
[2] Robert Andrews Millikan. On the elementary electrical charge and the Avogadro constant.
Physical Review, 2(2):109, 1913.
[3] Robert Andrews Millikan. The electron and the light-quant from the experimental point of view.
Stockholm: Imprimerie Royale. P. A. Norstedt & Fils, 1925, Nobel Lecture delivered on May
23 1924. Available from https://www.nobelprize.org/uploads/2018/06/millikan-lecture.pdf.
[4] Arthur Stanley Eddington. The charge of an electron. Proceedings of the Royal Society of
London. Series A, Containing Papers of a Mathematical and Physical Character, 122(789):358–
369, 1929.
[5] Bikram Karmakar, Benjamin French, and Dylan S Small. Integrating the evidence from
evidence factors in observational studies. Biometrika, 106(2):353–367, 2019.
[6] Vance Berger and Harold Sackrowitz. Improving tests for superior treatment in contingency
tables. Journal of the American Statistical Association, 92(438):700–705, 1997.
[7] Bikram Karmakar, Dylan S Small, and Paul R Rosenbaum. Using approximation algorithms to
build evidence factors and related designs for observational studies. Journal of Computational
and Graphical Statistics, 28(3):698–709, 2019.
[8] Bikram Karmakar, Chyke A Doubeni, and Dylan S Small. Evidence factors in a case-control
study with application to the effect of flexible sigmoidoscopy screening on colorectal cancer.
The Annals of Applied Statistics, 14(2):829–849, 2020.
[9] David A Savitz and Gregory A Wellenius. Invited commentary: exposure biomarkers indicate
more than just exposure. American journal of epidemiology, 187(4):803–805, 2018.
[10] Marc G Weisskopf and Thomas F Webster. Trade-offs of personal vs. more proxy exposure
measures in environmental epidemiology. Epidemiology (Cambridge, Mass.), 28(5):635, 2017.
[11] Bikram Karmakar, Dylan S Small, and Paul R Rosenbaum. Using evidence factors to clarify
exposure biomarkers. American Journal of Epidemiology, 189(3):243–249, 2020.
[12] Scott L Tomar and Samira Asma. Smoking-attributable periodontitis in the United States:
findings from NHANES III. Journal of Periodontology, 71(5):743–751, 2000.
[13] P. H. van Elteren. On the combination of independent two sample tests of Wilcoxon. Bulletin
of the International Statistical Institute, 37:351–361, 1960.
[14] Bikram Karmakar, Dylan S Small, and Paul R Rosenbaum. Reinforced designs: Multiple
instruments plus control groups as evidence factors in an observational study of the effectiveness
of catholic schools. Journal of the American Statistical Association, 116(533):82–92, 2021.
[15] Anqi Zhao, Youjin Lee, Dylan S Small, and Bikram Karmakar. Evidence factors from multiple,
possibly invalid, instrumental variables. The Annals of Statistics, 50(3):1266–1296, 2022.
[16] James Coleman, Thomas Hoffer, and Sally Kilgore. Cognitive outcomes in public and private
schools. Sociology of education, pages 65–76, 1982.
[17] Erich Leo Lehmann and Howard J D’Abrera. Nonparametrics: Statistical methods based on
ranks. Holden-day, 1975.
[18] Paul R Rosenbaum. Design of observational studies, volume 10. Springer, 2010.
[31] Ronald Aylmer Fisher. Statistical methods for research workers. In Breakthroughs in statistics,
pages 66–70. Springer, 1992.
[32] Betsy Jane Becker. Combining significance levels. The handbook of research synthesis, pages
215–230, 1994.
[33] Dmitri V Zaykin, Lev A Zhivotovsky, Peter H Westfall, and Bruce S Weir. Truncated prod-
uct method for combining p-values. Genetic Epidemiology: The Official Publication of the
International Genetic Epidemiology Society, 22(2):170–185, 2002.
[34] Bikram Karmakar and Dylan S Small. Assessment of the extent of corroboration of an elaborate
theory of a causal hypothesis using partial conjunctions of evidence factors. The Annals of
Statistics, 48(6):3283–3311, 2020.
[35] Jesse Y Hsu, Dylan S Small, and Paul R Rosenbaum. Effect modification and design sensitivity
in observational studies. Journal of the American Statistical Association, 108(501):135–148,
2013.
[36] Ruth Marcus, Eric Peritz, and K Ruben Gabriel. On closed testing procedures with special
reference to ordered analysis of variance. Biometrika, 63(3):655–660, 1976.
[37] Bikram Karmakar. Improved power of multiple sensitivity analyses in observational studies
using smoothed truncated product method. 2021. Unpublished manuscript.
[38] Donald B Rubin. The design versus the analysis of observational studies for causal effects:
parallels with the design of randomized trials. Statistics in Medicine, 26(1):20–36, 2007.
[39] Donald B Rubin. For objective causal inference, design trumps analysis. The Annals of Applied
Statistics, 2(3):808–840, 2008.
[40] David P Williamson and David B Shmoys. The design of approximation algorithms. Cambridge
University Press, 2011.
[41] Yves Crama and Frits CR Spieksma. Approximation algorithms for three-dimensional as-
signment problems with triangle inequalities. European Journal of Operational Research,
60(3):273–279, 1992.
[42] Robert Greevy, Bo Lu, Jeffrey H Silber, and Paul Rosenbaum. Optimal multivariate matching
before randomization. Biostatistics, 5(2):263–275, 2004.
[43] Cole Beck, Bo Lu, and Robert Greevy. nbpMatching: Functions for optimal non-bipartite
matching, 2016. R package version 1.5.1.
[44] Ryan T Moore. Multivariate continuous blocking to improve political science experiments.
Political Analysis, 20(4):460–479, 2012.
[45] Bikram Karmakar. blockingChallenge: Create blocks or strata which are similar within, 2018.
R package version 1.0.
[46] Bikram Karmakar. An approximation algorithm for blocking of an experimental design. Journal
of the Royal Statistical Society - Series B, 2022. doi: 10.1111/rssb.12545.
[47] Steven R Howard and Samuel D Pimentel. The uniform general signed rank test and its design
sensitivity. Biometrika, 108(2):381–396, 2021.
[48] Paul R Rosenbaum. Design sensitivity and efficiency in observational studies. Journal of the
American Statistical Association, 105(490):692–702, 2010.
[49] Paul R Rosenbaum. A new u-statistic with superior design sensitivity in matched observational
studies. Biometrics, 67(3):1017–1027, 2011.
[50] Paul R Rosenbaum. An exact adaptive test with superior design sensitivity in an observational
study of treatments for ovarian cancer. The Annals of Applied Statistics, 6(1):83–105, 2012.
[51] Paul R Rosenbaum. Weighted m-statistics with superior design sensitivity in matched ob-
servational studies with multiple controls. Journal of the American Statistical Association,
109(507):1145–1158, 2014.
[52] Debbie A Lawlor, Kate Tilling, and George Davey Smith. Triangulation in aetiological epi-
demiology. International Journal of Epidemiology, 45(6):1866–1886, 2016.
[53] Neil Pearce, Jan Vandenbroucke, and Deborah A Lawlor. Causal inference in environmental
epidemiology: old and new. Epidemiology (Cambridge, Mass.), 30(3):311, 2019.