Causal Probabilistic Programming Without Tears
they collectively span Pearl's causal hierarchy [Pearl 2001], and most are broadly applicable, empirically validated, have an unconventional or limited identification result, and make use of modern probabilistic machine learning tools, like neural networks or stochastic variational inference.
Our descriptions demonstrate how diverse real-world causal estimands and causal assumptions can be expressed in declarative code free of unnecessary jargon and compatible with any inference method implemented in the underlying PPL, especially scalable gradient-based approximations.
2 EXAMPLES
2.1 Backdoor adjustment
Our first example is more pedagogical than those that follow. Suppose we wish to estimate how effective a particular treatment T is in producing a desired outcome Y. We assume we have observations of patients who have and have not received treatment, but that we have not conducted a randomized controlled trial. Estimating P(Y | T) would not accurately characterize the effect of the treatment, because this conditional distribution may reflect confounding from other patient attributes X that affect both the treatment the patient receives and the effectiveness of the treatment on the outcome. For example, P(Y | T = 1) may place high mass on positive health outcomes, but only because younger patients are both more likely to receive treatment and less likely to experience negative health outcomes.
Specifying a causal model. The causal assumptions at work can be encoded as a probabilistic program (the first program below). From this program, an intervened version can be derived (the second program below), representing experimental conditions for a patient assigned a certain treatment T := t.

def causal_model(theta):
    X ~ bernoulli(theta[0])
    T ~ bernoulli(theta[X+1])
    Y ~ bernoulli(theta[T+2*X+3])
    return Y, T, X

def intervened_causal_model(theta, t):
    X ~ bernoulli(theta[0])
    T = t
    Y ~ bernoulli(theta[T+2*X+3])
    return Y
Under these assumptions, we wish to compute the average treatment effect, ATE = E[Y | do(T = 1)] − E[Y | do(T = 0)]. The do notation indicates that the expectations are taken according to intervened versions of the model, with T set to a particular value. Note that this is different from conditioning on T in the original causal_model, which assumes X and T are dependent.
Estimating the treatment effect. As in all of our examples in this paper, the estimand of interest (in this case, the ATE) can be expressed as a posterior expectation in an expanded probabilistic program. (For the simple example in this section, this is a standard result; see e.g. Pearl [2009] and Lattimore and Rohde [2019].) Suppose we have a set of measurements Y(i), T(i), X(i) from an observational population distribution P(Y, T, X). The following probabilistic program encodes a joint distribution over θ, samples from the causal model, and hypothetical data corresponding to possible experimental outcomes if random patients were assigned treatment (1) or no treatment (0):
def joint_model():
    theta ~ ThetaPrior()
    # observational samples from the causal model
    for i in range(N):
        Y[i], T[i], X[i] ~ causal_model(theta)
    # hypothetical experimental outcomes under treatment and no treatment
    Y_treated ~ intervened_causal_model(theta, t=1)
    Y_untreated ~ intervened_causal_model(theta, t=0)
    return Y_treated - Y_untreated
The ATE is the expected return value of the program after conditioning our uncertainty about the true model parameters θ on the measured observational data.
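To make the estimation step concrete, here is a minimal Monte Carlo sketch in Python (not part of the original program): given samples of theta drawn from the posterior obtained by conditioning joint_model on the observational data, the ATE is approximated by averaging the difference between the two intervened simulations. The simulator and the example theta values below are hypothetical stand-ins for a real PPL backend.

import numpy as np

rng = np.random.default_rng(0)

def simulate_intervened(theta, t):
    # forward-simulate intervened_causal_model:
    # X ~ bernoulli(theta[0]), T := t, Y ~ bernoulli(theta[T + 2*X + 3])
    X = int(rng.random() < theta[0])
    return int(rng.random() < theta[t + 2 * X + 3])

def ate_estimate(theta_samples, n_sim=10_000):
    diffs = []
    for theta in theta_samples:        # average over posterior uncertainty in theta
        for _ in range(n_sim):         # and over simulation randomness
            diffs.append(simulate_intervened(theta, 1) - simulate_intervened(theta, 0))
    return float(np.mean(diffs))

theta_hat = np.array([0.3, 0.2, 0.7, 0.4, 0.6, 0.5, 0.8])   # a single illustrative draw
print(ate_estimate([theta_hat]))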
2.2 Causal effect variational autoencoder
The previous example assumed that it was always possible to measure all potential confounders X, but when this is not the case, additional assumptions are necessary to perform causal inference. This example, derived from Louizos et al. [2017], considers a setting where parametric assumptions are necessary for a causal model to be fully identifiable from observed data.
Suppose we observe a population of individuals with features X_i undergoing treatment t_i ∈ {0, 1} with outcome y_i. The treatment variable might represent a medication or an educational strategy, for example, for populations of patients or students, respectively. The task is to estimate the conditional average treatment effect: for a new individual with features X*, what difference in outcome y* should we expect if we assign treatment t* = 1 vs. t* = 0? One cannot simply estimate the conditional probabilities p(y* | X = X*, t = 0) and p(y* | X = X*, t = 1), because there may be hidden confounders: latent factors z that induce non-causal correlations between t and y even controlling for the observed covariates X. (For example, a student's socio-economic status might influence both their outcome y and the educational strategy t they are exposed to, and the observed covariates X may not fully characterize the student's SES. As a result, conditioning on t may alter the distribution over SES, changing the reported outcome.)
Specifying a causal model. We begin with a causal model of the data, with unknown parameters θ:

def causal_model(theta):
    z ~ normal(0, I)
    ...
    mu_y, sigma_y = h(theta[t * 3 + (1-t) * 4], z)
    y ~ normal(mu_y, sigma_y)

def population_model(theta):
    for i in range(N):
        ...
Like all counterfactuals, this estimand is not identified in general without further assumptions: learning parameters θ that match observed data does not guarantee that the counterfactual distribution will match that of the true causal model. However, as discussed in the original paper [Pawlowski et al. 2020] in the context of modeling MRI images, there are a number of valid practical reasons one might wish to compute it anyway, such as explanation or expert evaluation.
2.4 Structured Latent Confounders
In the previous examples, we have demonstrated how probabilistic programs can be used to model causal relationships between attributes of individual entities. However, it is often useful to model relationships between multiple kinds of entities explicitly. For example, a student's educational outcome may depend on her own attributes, as well as the attributes of her school. In this hierarchical setting, where multiple students belong to the same school, we can often estimate causal effects even if these potentially confounding school-level attributes are latent.
Hierarchical structure is a common motif in social science and econometric applications of causal inference, appearing in multilevel models [Gelman and Hill 2006], difference-in-differences designs [Shadish et al. 2002], and within-subjects designs [Loftus and Masson 1994], all of which are out of scope for graph-based identification methods. Nonetheless, even flexible Gaussian process versions of these kinds of causal designs can be implemented in a causal probabilistic programming language [Witty et al. 2021]. Moving beyond simple linear models, recent work has introduced Gaussian Processes with Structured Latent Confounders (GP-SLC) [Witty et al. 2020b], using flexible Gaussian process priors for causal inference in hierarchical settings. The following generative program is a slightly simplified variant of GP-SLC.
def instance_causal_model(f_x, f_t, f_y, U, theta):
    mu_X = f_x(U)
    X ~ normal(mu_X, theta[0])    # Generate observed covariates X, based on u

    mu_T = f_t(U, X)
    T ~ normal(mu_T, theta[1])    # Generate observed treatment, based on u and x

    mu_Y = f_y(U, X, T)
    Y ~ normal(mu_Y, theta[2])    # Generate outcome as a function of u, x, and t
    return X, T, Y
This causal model allows estimation of individual treatment effects, ITE^(o,i) = f_y(Y^(o,i)_{do(T=1)}) − f_y(Y^(o,i)_{do(T=0)}), e.g. the increase in a particular student's educational outcome with or without a particular intervention. Following the same informal script as in the previous examples gives an expanded generative program defining a joint distribution over object-level latent confounders U and observed instance-level covariates X, treatment T, and outcomes Y, thereby inducing a distribution on the individual treatment effects for each instance. Note that here we are able to estimate the individual treatment effect because we assumed that exogenous noise is additive. Here, the hierarchical structure is compactly expressed as a nested loop over objects o and instances i.
def joint_model(n_o, n_i, doT, theta):
    # Generate causal functions from a Gaussian process
    f_x ~ GP(m_x, k_x)
    f_t ~ GP(m_t, k_t)
    f_y ~ GP(m_y, k_y)

    for o in range(n_o):
        U[o] ~ normal(0, I)    # Generate an object-level latent confounder
        for i in range(n_i):
            X[o,i], T[o,i], Y[o,i] ~ instance_causal_model(f_x, f_t, f_y, U[o], theta)
            ITE[o,i] = f_y(U[o], X[o,i], T[o,i]) - f_y(U[o], X[o,i], doT[o,i])

    return ITE    # return array of all instance ITE values
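As a concrete illustration of how a single GP draw induces correlated factual and counterfactual outcomes, here is a minimal, self-contained forward simulation in Python. It is not the GP-SLC implementation: it assumes an RBF kernel, a zero mean function, one object with a scalar confounder, and made-up noise scales, and it evaluates f_y jointly at the do(T=1) and do(T=0) versions of each instance, so that each ITE is a difference of the same sampled function.

import numpy as np

def rbf(A, B, lengthscale=1.0):
    # squared-exponential kernel between rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
n_i = 4
U = rng.normal()                                      # object-level latent confounder
X = U + 0.3 * rng.normal(size=n_i)                    # instance-level covariates

# Evaluate f_y ~ GP(0, k) jointly at treated and untreated versions of every instance.
inputs = np.array([[U, x, t] for x in X for t in (1.0, 0.0)])
K = rbf(inputs, inputs) + 1e-6 * np.eye(len(inputs))
f_y = rng.multivariate_normal(np.zeros(len(inputs)), K)

ITE = f_y[0::2] - f_y[1::2]                           # f_y(U, X, do(T=1)) - f_y(U, X, do(T=0))
print(ITE)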
3 DISCUSSION
This paper has examined a diverse set of examples in which it was natural to express causal assumptions as a generative probabilistic program, and to express a causal query as probabilistic inference in another program derived mechanically from the model. This initial success suggests a research program centered on three basic questions:
(1) What program transformations and analyses might be necessary to cover a much larger fraction of the causal inference literature?
(2) Can these transformations be formalized with efficient, model-agnostic implementations?
(3) Can they be distilled into a core calculus of a small number of composable primitives?
Answering the first question is challenging given the sheer scale and diversity of causal inference research and practice. If one were to build on the example-driven approach in this paper, how might one expand this collection? Areas of particular interest include models of multiple interacting causes [Wang and Blei 2019; Zheng et al. 2021], as well as generalizations to sequential experimentation [Moodie et al. 2007] and control [Bouneffouf and Rish 2019; Levine 2018]. The GP-SLC case study in §2.4 suggests closer study of models and workflows that are widely used in practice but may not fit neatly into a classical theoretical framework, like hierarchical models [Feller and Gelman 2015] or the synthetic control method and its generalizations [Doudchenko and Imbens 2016].
Causal inference theorists have thought extensively about the second question and proposed general methods for large and important problem classes, most notably Pearl's do-calculus [Bareinboim and Pearl 2016; Pearl 2009] and others [Ding et al. 2018; Jensen et al. 2020; Richardson and Robins 2013] for nonparametric identification. Perhaps surprisingly, however, relatively little effort has been devoted to turning these powerful theoretical ideas into scalable general-purpose software tools for end-to-end causal inference workflows [Wong 2020], a gap PPLs are ideally suited to fill.
The third question is more difficult and open-ended, but we close by noting the contours of an answer that are already starting to emerge from the examples surveyed here and in other recent PPL papers and software (e.g. [Bingham et al. 2018; Brulé 2018; Laurent et al. 2018; Perov et al. 2019; Winn 2012]).
First, representing causal models as probabilistic programs and defining interventions as program transformations independent of particular choices of model and data representation might point the way to unified strategies for identification, estimation and even causal discovery (discussed in §A.2) and reduce the need for extensive theoretical analysis for special cases (e.g. the "soft" or "path-specific" interventions considered in [Correa and Bareinboim 2020; Malinsky et al. 2019]).
Second, building a semantics for counterfactuals as program transformations on top of a general definition of intervention, as in Omega [Tavares et al. 2020], is known to recover as special cases many existing identification and estimation results as standard probabilistic inference and conditional independence queries (see e.g. Chapter 7 of [Pearl 2009] or [Shpitser and Pearl 2012]). However, the subtle difference in estimands in §2.2 and §2.3 and the examples in [Lattimore and Rohde 2019] suggest a need for greater control over whether randomness is reused or duplicated to distinguish between CATEs and ITEs, and the examples in [Feller and Gelman 2015; Jensen et al. 2020] suggest that practitioners would prefer tools that allow incremental buy-in, e.g. the ability to manually define joint distributions over factual and counterfactual variables.
Finally, as discussed in Appendix §B, the concept of identification itself is perhaps better understood as a spectrum of results constraining what causal conclusions may be safely drawn from observational data [Balke and Pearl 1994; Zheng et al. 2021], with nonparametric identification an ideal case lying at one extreme [Witty et al. 2021]. We conjecture that many techniques on this spectrum may be understood in a unified way as standard probabilistic computations of uncertainty in the expanded programs derived mechanically from a given causal model and intervention.
REFERENCES
Alexander Balke and Judea Pearl. 1994. Counterfactual probabilities: computational methods, bounds and applications. In Uncertainty Proceedings 1994. Elsevier, 46–54. https://doi.org/10.1016/B978-1-55860-332-5.50011-0
Elias Bareinboim and Judea Pearl. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences of the United States of America 113, 27 (Jul 2016), 7345–7352. https://doi.org/10.1073/pnas.1510507113
Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. 2018. Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research (2018).
Djallel Bouneffouf and Irina Rish. 2019. A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040 (2019).
Joshua Brulé. 2018. Causal programming: inference with structural causal models as finding instances of a relation. arXiv (May 2018). https://arxiv.org/abs/1805.01960
J. Correa and E. Bareinboim. 2020. A Calculus For Stochastic Interventions: Causal Effect Identification and Surrogate Experiments. In Proceedings of the 34th AAAI Conference on Artificial Intelligence. AAAI Press, New York, NY.
Peng Ding, Fan Li, et al. 2018. Causal inference: A missing data perspective. Statist. Sci. 33, 2 (2018), 214–237.
Nikolay Doudchenko and Guido W Imbens. 2016. Balancing, regression, difference-in-differences and synthetic control methods: A synthesis. Technical Report. National Bureau of Economic Research.
Avi Feller and Andrew Gelman. 2015. Hierarchical models for causal effects. Emerging Trends in the Social and Behavioral Sciences: An interdisciplinary, searchable, and linkable resource (2015), 1–16.
Alexander M. Franks, Alexander D'Amour, and Avi Feller. 2019. Flexible sensitivity analysis for observational studies without observable implications. J. Amer. Statist. Assoc. (2019).
Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016).
David Jensen, Javier Burroni, and Matthew Rattigan. 2020. Object conditioning for causal inference. In Uncertainty in Artificial Intelligence. PMLR, 1072–1082.
Nathan Kallus, Xiaojie Mao, and Angela Zhou. 2019. Interval estimation of individual-level causal effects under unobserved confounding. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2281–2290.
Diederik P Kingma, Tim Salimans, and Max Welling. 2015. Variational dropout and the local reparameterization trick. arXiv preprint arXiv:1506.02557 (2015).
Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
James Koppel and Daniel Jackson. 2020. Demystifying dependence. In Proceedings of the 2020 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. ACM, New York, NY, USA, 48–64. https://doi.org/10.1145/3426428.3426916
Finnian Lattimore and David Rohde. 2019. Replacing the do-calculus with Bayes rule. arXiv (Jun 2019). https://arxiv.org/abs/1906.07125
Jonathan Laurent, Jean Yang, and Walter Fontana. 2018. Counterfactual resimulation for causal analysis of rule-based models. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. https://dl.acm.org/doi/abs/10.5555/3304889.3304920
Sergey Levine. 2018. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909 (2018).
Geoffrey Loftus and Michael Masson. 1994. Using confidence intervals in within-subject designs. Psychonomic Bulletin & Review 1, 4 (1994), 476–490.
Christos Louizos, Uri Shalit, Joris Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. Causal effect inference with deep latent-variable models. arXiv preprint arXiv:1705.08821 (2017).
Daniel Malinsky, Ilya Shpitser, and Thomas Richardson. 2019. A potential outcomes calculus for identifying conditional path-specific effects. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 3080–3088.
Erica EM Moodie, Thomas S Richardson, and David A Stephens. 2007. Demystifying optimal dynamic treatment regimes. Biometrics 63, 2 (2007), 447–455.
Robert Osazuwa Ness, Kaushal Paneri, and Olga Vitek. 2019. Integrating Markov processes with structural causal modeling enables counterfactual inference in complex systems. arXiv preprint arXiv:1911.02175 (2019).
Michael Oberst and David Sontag. 2019. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In International Conference on Machine Learning. PMLR, 4881–4890.
Nick Pawlowski, Daniel C Castro, and Ben Glocker. 2020. Deep structural causal models for tractable counterfactual inference. arXiv preprint arXiv:2006.06485 (2020).
Judea Pearl. 2001. Bayesianism and Causality, or, Why I am Only a Half-Bayesian. 24 (2001). https://doi.org/10.1007/978-94-017-1586-7_2
Judea Pearl. 2009. Causality: Models, Reasoning and Inference (2nd ed.). Cambridge University Press, USA.
Judea Pearl. 2011. The algorithmization of counterfactuals. Annals of Mathematics and Artificial Intelligence 61, 1 (2011), 29–39.
Yura Perov, Logan Graham, Kostis Gourgoulias, Jonathan G. Richens, Ciarán M. Lee, Adam Baker, and Saurabh Johri. 2019. MultiVerse: Causal Reasoning using Importance Sampling in Probabilistic Programming. arXiv (Oct 2019). https://arxiv.org/abs/1910.08091
Thomas S Richardson and James M Robins. 2013. Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Series. Working Paper 128, 30 (2013).
William Shadish, Thomas Cook, and Donald Campbell. 2002. Experimental and Quasi-experimental Designs for Generalized Causal Inference. Houghton Mifflin, Boston.
Uri Shalit, Fredrik D. Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. arXiv:stat.ML/1606.03976.
Ilya Shpitser and Judea Pearl. 2012. What Counterfactuals Can be Tested. arXiv (2012). https://arxiv.org/abs/1206.5294
Zenna Tavares, James Koppel, Xin Zhang, and Armando Solar-Lezama. 2020. A Language for Counterfactual Generative Models. Technical Report. http://www.jameskoppel.com/publication/omega/
Yixin Wang and David M Blei. 2019. The blessings of multiple causes. J. Amer. Statist. Assoc. 114, 528 (2019), 1574–1596.
John Winn. 2012. Causality with gates. In Artificial Intelligence and Statistics. PMLR, 1314–1322.
Sam Witty, David Jensen, and Vikash Mansinghka. 2021. A Simulation-Based Test of Identifiability for Bayesian Causal Inference. arXiv (Feb 2021). https://arxiv.org/abs/2102.11761
Sam Witty, Alexander Lew, David Jensen, and Vikash Mansinghka. 2020a. Bayesian Causal Inference via Probabilistic Program Synthesis. In Proceedings of the Second Conference on Probabilistic Programming.
Sam Witty, Kenta Takatsu, David Jensen, and Vikash Mansinghka. 2020b. Causal inference using Gaussian processes with structured latent confounders. In International Conference on Machine Learning. PMLR, 10313–10323.
Jeffrey C Wong. 2020. Computational causal inference. arXiv preprint arXiv:2007.10979 (2020).
Jiajing Zheng, Alexander D'Amour, and Alexander Franks. 2021. Copula-based Sensitivity Analysis for Multi-Treatment Causal Inference with Unobserved Confounding. arXiv preprint arXiv:2102.09412 (2021).
A ADDITIONAL EXAMPLES
A.1 Inductive Biases for Identifiable Twin-World Counterfactuals
Consider two types of counterfactuals: one of the form "Given X was observed as x, what would Y have been had X been x′?" and another of the form "Given X was observed as x and Y was observed as y, what would Y have been had X been x′?". The former type is potentially identifiable without an SCM using observational or interventional data [Pearl 2009; Richardson and Robins 2013]; the effect of treatment on the treated is a widely useful example. The second type of counterfactual is called a "twin-world counterfactual" because it predicts Y in a world where we do X = x′ conditional on information from a world where X = x already caused Y = y [Pearl 2009]. Structural counterfactuals enable interesting counterfactual quantities such as the probability of necessity and sufficiency, and have interesting applications in generative explanations as well as in quantifying regret, blame, and responsibility in decision theory and agent modeling. However, inferring twin-world counterfactuals requires an explicit SCM. Further, if the SCM is misspecified, it can produce incorrect counterfactual inferences even if it is a perfect statistical fit for observational and interventional data. How do we select the "right" SCM?
A.1.1 Reparameterization Tricks. A tempting approach is to simply convert a directed generative model into an SCM using "reparameterization tricks" [Jang et al. 2016; Kingma et al. 2015], methods that shunt randomness to exogenous variables to facilitate back-propagation through endogenous variables.
The problem is that, in general, a causal generative model can yield different SCMs depending on how it is reparameterized, and the different SCMs might yield different counterfactual inferences. To illustrate, consider the following causal generative model.
def cgmodel():
    x1 ~ Bernoulli(0.5)
    x2 ~ Bernoulli(0.5)
    y ~ Categorical(p_{x1,x2} = g(x1, x2))
Suppose we wished to "reparameterize" this into an SCM. To accomplish this, we shunt the randomness in the Bernoulli and categorical distributions to exogenous variables N_X1, N_X2, and N_Y through deterministic transformations f_X1, f_X2, and f_Y.
def true_dgp():
    n_x1 ~ Bernoulli(0.5)
    n_x2 ~ Bernoulli(0.5)
    n_y ~ UniformDiscrete([0, 1, 2])
    x1 = f_X1(n_x1)
    x2 = f_X2(n_x2)
    y = f_Y(x1, x2, n_y)
In this case, f_Y,a and f_Y,b are two different alternatives for f_Y above that would each yield the same observational and interventional distributions:
def f_Y,a(x1, x2, n_y):
    if (x1 != x2):
        if (n_y == 0):
            return x1
        else:
            return x2
    else:
        return n_y

def f_Y,b(x1, x2, n_y):
    if (x1 != x2):
        if (n_y == 0):
            return x1
        else:
            return x2
    else:
        return 2 - n_y
However, they would produce different counterfactual inferences. Suppose we conditioned true_dgp on the observation {X1 = 1, X2 = 0, Y = 1} and we are interested in the counterfactual query "what would Y have been if X1 had been 0?" In this degenerate case, we would infer a point value N_Y = 0. When we re-execute the model after setting both N_Y and X1 to 0, f_Y,a would yield 0, and f_Y,b would yield 2.
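The disagreement can be checked mechanically with a few lines of Python (a sketch, restating f_Y,a and f_Y,b as ordinary functions): abduce the exogenous noise values consistent with the observation, set X1 to 0, and re-execute.

def f_Y_a(x1, x2, n_y):
    if x1 != x2:
        return x1 if n_y == 0 else x2
    return n_y

def f_Y_b(x1, x2, n_y):
    if x1 != x2:
        return x1 if n_y == 0 else x2
    return 2 - n_y

obs = {"x1": 1, "x2": 0, "y": 1}
for f_Y in (f_Y_a, f_Y_b):
    # Abduction: exogenous n_y values consistent with the observation
    n_y_post = [n for n in (0, 1, 2) if f_Y(obs["x1"], obs["x2"], n) == obs["y"]]
    # Action and prediction: set X1 := 0 and re-execute
    cf = [f_Y(0, obs["x2"], n) for n in n_y_post]
    print(f_Y.__name__, "abduced n_y:", n_y_post, "counterfactual Y:", cf)
# Both abduce n_y = 0, but f_Y_a predicts Y = 0 while f_Y_b predicts Y = 2.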
A.1.2 Monotonicity as Inductive Bias. One solution is to limit ourselves to reparameterizations where key counterfactual queries are identifiable from observational and interventional data. [Pearl 2009] named this constraint monotonicity and defined it for the binary outcome case. [Oberst and Sontag 2019] extended the definition to categorical variables and showed that the Gumbel-softmax reparameterization trick [Jang et al. 2016] produced a monotonic SCM. [Ness et al. 2019] extended the definition to binomial and Poisson outcomes and provided a probabilistic programming implementation.
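For intuition, here is a minimal Python sketch of the Gumbel-max construction analyzed by Oberst and Sontag [2019] (the category probabilities below are invented): the categorical outcome is written as an argmax over fixed Gumbel noise, and a counterfactual reuses the same noise under the intervened distribution. Abducing the noise from an observed outcome would additionally require posterior sampling of g, which is omitted here.

import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_outcome(log_probs, g):
    # y = argmax_k (log p_k + g_k), with exogenous Gumbel noise g
    return int(np.argmax(log_probs + g))

logp_factual = np.log([0.6, 0.3, 0.1])         # P(Y | X1 = 1, X2 = 0), illustrative
logp_counter = np.log([0.2, 0.5, 0.3])         # P(Y | X1 = 0, X2 = 0), illustrative

g = rng.gumbel(size=3)                         # shared exogenous noise across worlds
y_factual = gumbel_max_outcome(logp_factual, g)
y_counterfactual = gumbel_max_outcome(logp_counter, g)
print(y_factual, y_counterfactual)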
In machine learning, we often talk about the inductive bias of a model, such as how convolutional neural networks with max-pooling favor translation invariance. In contrast to inductive biases implicit in architecture, a strength of the probabilistic programming community is that we favor explicit inductive biases, i.e. constraining inference with domain knowledge built into the model.
Monotonicity is an example of an explicit inductive bias. To illustrate, suppose Anne has the flu but still goes to work. Jon is exposed to Anne (X1 = 1) and a few days later, Jon gets the flu (Y = 1). Jon may or may not have been exposed to the flu on the bus (X2), which is unknown. Given knowledge that Jon was exposed to Anne and he got the flu, what are the chances he wouldn't have gotten the flu if Anne had stayed home (P(Y_{X1=0} = 0 | X1 = 1, Y = 1))?
Given sufficient data, we could build a good probabilistic model of P(X1, X2, Y). Theoretically, we know that if we were to apply a monotonic reparameterization (specifically with respect to X1 and Y) to obtain an SCM, then we could use that model to infer the above counterfactual. How would we know if monotonicity is a valid counterfactual inductive bias in this case?
We can answer with a simple thought experiment. Is it conceivable that some strange group of coworkers could have the flu, but then be cured by exposure to Anne? That would be a case of non-monotonicity. In this case that is implausible, and thus monotonicity is a safe assumption.
Suppose, however, that X1 were an email promotion and Y were sales, and we were interested in whether Jon would have bought a product if he hadn't seen an email promotion. In our thought experiment we would ask whether it is plausible for some people who intended to buy a product to be annoyed enough by an email promotion that they then decided not to buy. In that case, monotonicity would not be a safe assumption, since non-monotonicity is plausible.
A.2 Bayesian Causal Discovery with Observational and Interventional Data
Consider the task of learning a causal model from some class of models M, based on observational data y_{1,...,n} and experimental data from E different experimental settings, y^{E_i}_{1,...,n_i}. Here, y may be multivariate, and models m ∈ M may or may not posit additional latent variables x^m_i for each subject. We write E_i(m) for the causal model obtained by applying an intervention modeling experiment E_i to the observational causal model m.
In the Bayesian setting, the practitioner needs to place a prior over causal models, p(m). The likelihood is then

    p(y_{1...n}, y^{E_i}_{1...n_i} | m) = ∏_{i=1}^{n} ∫ m(x^m_i, y_i) dx^m_i · ∏_{j=1}^{E} ∏_{i=1}^{n_j} ∫ E_j(m)(x^{E_j(m)}_i, y^{E_j}_i) dx^{E_j(m)}_i.

Witty et al. [2020a] show that both the prior over causal models and the likelihoods can be represented in a suitably expressive probabilistic programming language.
A.2.1 Embedded causal language for models m ∈ M. To represent the prior over causal models, Witty et al. [2020a] introduce a restricted causal probabilistic programming language, MiniStan; the prior over causal models is then an ordinary (Gen) probabilistic program prior that generates MiniStan syntax trees. They further develop a Gen probabilistic program interpret that interprets the syntax of a MiniStan program, sampling the variables it defines, as well as a function intervene that applies a program transformation to the syntax of a MiniStan program m to yield an experimental model program E_i(m).
A.2.2 Causal discovery as Bayesian inference. Having defined these helper functions, Witty et al. [2020a] frame the entire causal discovery problem as inference in the following program:

def causal_discovery():
    # Generate a possible true model from the prior
    m ~ prior()

    # Generate observational data
    for i in range(n):
        y[i] ~ interpret(m)

    # For each experiment, generate experimental data
    for j in range(E):
        m_intervened = intervene(m, interventions[j])
        for i in range(n_experimental[j]):
            y_experimental[j][i] ~ interpret(m_intervened)

Using Gen's programmable inference, Witty et al. [2020a] develop a sequential Monte Carlo algorithm that incorporates one observation from each experiment at each time step, inferring any latent variables posited by the model m or its intervened versions. Other inference algorithms could also be applied. The result is a posterior over models.
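To see the idea end to end, here is a self-contained toy instance in Python. Everything about it, including the two candidate models, their parameters, and the data sizes, is made up for illustration and is not taken from Witty et al.: two observationally equivalent binary models, "x->y" and "y->x", are scored against observational data plus data collected under do(x=1), and it is the interventional data that separates them.

import numpy as np

def p_joint(m, x, y):
    # observational joint P(x, y); both candidate models induce the same joint
    return 0.5 * (0.8 if x == y else 0.2)

def p_do_x1(m, y):
    # P(y | do(x = 1)): the intervention only matters if x is the cause of y
    return (0.8 if y == 1 else 0.2) if m == "x->y" else 0.5

rng = np.random.default_rng(0)
# observational data from the true model, x -> y
x_obs = rng.random(200) < 0.5
y_obs = np.where(rng.random(200) < 0.8, x_obs, ~x_obs)
# experimental data gathered under do(x = 1)
y_exp = rng.random(50) < 0.8

log_post = {}
for m in ("x->y", "y->x"):
    ll = sum(np.log(p_joint(m, int(x), int(y))) for x, y in zip(x_obs, y_obs))
    ll += sum(np.log(p_do_x1(m, int(y))) for y in y_exp)
    log_post[m] = np.log(0.5) + ll                 # uniform prior over the two models

z = np.logaddexp(*log_post.values())
print({m: float(np.exp(v - z)) for m, v in log_post.items()})   # posterior over models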
B ADDITIONAL MOTIVATION
Observation 1: Generative source code can naturally be interpreted as defining a causal model.
Probabilistic programmers typically think of their code as defining a probability distribution over a set of variables, but programs contain more information than just the joint distributions they induce. In this way, programs are similar to Bayesian networks, which encode not just a joint distribution but also a generative process that we can imagine unfolding in time. The language we use to describe Bayesian networks – parents and children, ancestors and descendants – reflects this understanding: some variables are generated before other variables and, intuitively, have a causal effect on their immediate children.
Formally, a causal model specifies a family of probability distributions, indexed by a set of interventions. An intervention represents a hypothetical experimental condition, under which we'd expect the joint distribution over the variables of interest to change. For example, in a model over the variables smokes and cancer, the joint distribution would change under the experimental condition that randomly assigns each participant to either smoke or not smoke. (For one thing, the marginal probability of smokes would be changed to 50%.)
In probabilistic programs, we can understand interventions as program transformations. For example, in the smoking/cancer model, the experiment we considered above might be encoded as a program transformation that replaces assignments to the smokes variable in the true causal program with the line smokes ~ bernoulli(0.5).
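A minimal Python sketch of this idea (the model and its probabilities are invented for illustration; PPLs such as Pyro implement the transformation generically rather than per-model): the "intervened" program is the same source with the assignment to smokes replaced.

import random

def model():
    smokes = random.random() < 0.3                      # observational smoking rate
    cancer = random.random() < (0.15 if smokes else 0.03)
    return smokes, cancer

def intervened_model():
    # same program, with the assignment to `smokes` replaced by a coin flip
    smokes = random.random() < 0.5                      # experimenter assigns smoking
    cancer = random.random() < (0.15 if smokes else 0.03)
    return smokes, cancer

random.seed(0)
n = 100_000
print(sum(model()[0] for _ in range(n)) / n)            # ~0.3 under the observational program
print(sum(intervened_model()[0] for _ in range(n)) / n) # ~0.5 under the intervention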
A probabilistic program specifies a causal model in that it (1) specifies a "default" or "observational" joint distribution over the variables of interest (according to the usual semantics of probabilistic programming languages), and (2) encodes the necessary information to determine the new joint distribution under an arbitrary intervention (program transformation): apply the transformation and derive the new joint distribution.
Observation 2: Causal discovery, parameter estimation, causal effect estimation, and counterfactual prediction can be framed as Bayesian inference in appropriately specified generative models. Automated tooling could be developed to synthesize such models as probabilistic programs, given causal models (also specified as probabilistic programs) as input. Then, existing PPL inference engines can automate aspects of inference.
Once we have a causal model, what can we use it for? We briefly describe several problem types that practitioners of causal inference may be interested in solving (but do not claim that this is an exhaustive list):
• Causal discovery. Given data (either observational, or collected under experimental conditions, or both), infer the underlying causal model, from a class of possible models.
• Parameter estimation. Given data (either observational, or collected under experimental conditions, or both), and a causal model with unknown parameters θ, infer plausible values of θ.
• Causal effect estimation. Given data (either observational, or collected under experimental conditions, or both), and a causal model (possibly with unknown structure or parameters), estimate a causal effect, e.g. the Average Treatment Effect or the Individual Average Treatment Effect. Such queries are designed to answer questions like, "On average, how much better would a patient fare if they were given one medication vs. another?"
• Counterfactual prediction. Given observed data, and a causal model (possibly with unknown structure or parameters), estimate a counterfactual query, designed to answer questions like, "Given what we know about this patient (including their observed health outcome), how would their outcome have differed had we treated them differently?"
All of these questions can be posed in a Bayesian framework. The quantities over which we have
565
uncertainty are:
566 • the structure of the true causal model,
567 • the parameters of the true causal model, and
568 • the values of any latent variables posited by the true causal model, for each subject in
569 our dataset. (In the presence of experimental data, we are also uncertain about the latent
570 variables posited by the intervened version of the true causal model, for each subject in the
571 experimental dataset.)
572 We can express priors over these quantities, and likelihoods that relate them to the observations.
573 For example, suppose we are uncertain about the true model structure 𝑚, and its unknown parame-
574 ters 𝜃 , as well as the values of latent variables 𝑥, but we have observed 𝑦 for a number of subjects,
575 indexed 𝑗 = 1, . . . , 𝑁 . Then the likelihood for 𝑦 𝑗 is 𝑝 (𝑦 𝑗 | 𝑚, 𝜃, 𝑥 𝑗 ) = 𝑚𝜃 (𝑦 𝑗 | 𝑥 𝑗 ). If we also have
576 observations 𝑦 ′ from an experimental setting modeled by intervention 𝑖, then the likelihood is
577 𝑝 (𝑦 ′𝑗 | 𝑚, 𝜃, 𝑥 𝑗 ) = intervene(𝑚𝜃 , 𝑖) (𝑦 ′𝑗 | 𝑥 𝑗 ).
578 Having expressed a prior and a likelihood, posterior inference can recover causal structures 𝑚 and
579 parameters 𝜃 . Causal effects and counterfactuals can be estimated by introducing additional variables
580 representing hypothetical potential outcomes. Such constructions might usefully be automated by
581 probabilistic programming languages, at which point existing PPL inference machinery could be
582 applied to estimating the posterior.
Observation 3: Bayesian causal inference places identifiability on a principled continuum of irreducible causal uncertainty
On the surface, to claim that causal reasoning can be encapsulated by probabilistic computation appears to be in direct conflict with Pearl's insistence that causal and statistical concepts be kept separate [Pearl 2001]. As Pearl describes them, statistical concepts are those that summarize the distribution over observed variables. The probabilistic computations that we discuss in this extended abstract are different in kind from these assumption-free summaries of data, in that we aim to compute probabilities of latent causal structure, effects, and counterfactuals. In our proposed approach, causal probabilistic programs play the role of causal assumptions, relating observations to the latent causal quantities of interest.
Casting causal inference as a particular instantiation of probabilistic inference does not change the reality that many causal conclusions cannot be unambiguously identified from data, regardless of sample size. How much of the mutual information between treatment and outcome is attributable to latent confounding? Does A cause B, or does B cause A? If C were c, what would have happened to D? Answers to all of these questions are often ambiguous. Surprisingly, most existing formulations of causal inference avoid quantifying these uncertainties, instead abandoning problems in which latent causal quantities cannot be uniquely inferred from data.¹ Instead, the probabilistic programming approach we espouse here enables users to express their assumptions, compute the resulting uncertainty, be it irreducible or not, and then make decisions accordingly.
¹An exception can be found in the small but growing literature on sensitivity analysis, which aims to place bounds on nonidentified causal effects [Franks et al. 2019; Kallus et al. 2019].
C ADDITIONAL BACKGROUND
C.1 Structural causal models
Underlying Pearl's causal hierarchy is a mathematical object known as a structural causal model (SCM). We refer the reader to Chapter 7 of Causality [Pearl 2009] for complete mathematical details, including the notation we introduce in this section. We show how SCMs can be represented in deterministic and probabilistic programming languages.
M = ⟨U, V, F⟩ denotes a fully-specified deterministic structural causal model (SCM). U = {U1, U2, ..., Um} represents a set of exogenous (unobserved) variables that are determined by factors outside the model. V = {V1, V2, ..., Vn} denotes a set of endogenous variables that are determined by other variables U ∪ V in the model. Associated with each endogenous variable Vi is a function fi : Ui ∪ Pai → Vi that assigns a value vi ← fi(pai, ui) to Vi, depending on the values of the deterministic parents Pai ⊂ V \ Vi and a set of exogenous variables Ui ⊂ U. Deterministic programming languages are capable of representing deterministic SCMs.
The entire set of functions F = {f1, f2, ..., fn} forms a mapping from U to V. That is, the values of the exogenous variables uniquely determine the values of the endogenous variables.
Every SCM M can be associated with a directed graph, G(M), in which each node corresponds to a variable and the directed edges point from members of the parents Pai and Ui toward Vi. Static analysis can be used to derive the dependency graph of deterministic programs [Koppel and Jackson 2020].
Let X be a variable in V, and x a particular value of X. We define the effect of an intervention do(X = x) on an SCM M as a submodel M_{do(X=x)} = ⟨U, V, F_{do(X=x)}⟩, where F_{do(X=x)} is formed by retracting from F the function f_X corresponding to X and assigning X the constant value X = x. Intervention in programming languages can be represented as a program transformation, as implemented in Pyro [Bingham et al. 2018].
Let X and Y be two variables in V. The potential outcome of Y to action do(X = x), denoted Y_{do(X=x)}, is the solution for Y from the set of equations F_{do(X=x)}. That is, Y_{do(X=x)} = Y_{M_{do(X=x)}}.
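As a small illustration of these definitions (the functions and values below are invented), a deterministic SCM, its submodel under do(X = x), and the corresponding potential outcome can be written directly as Python functions:

def f_X(u1):
    return u1            # X := f_X(U1)

def f_Y(x, u2):
    return x + u2        # Y := f_Y(X, U2)

def M(u1, u2):
    x = f_X(u1)
    return {"X": x, "Y": f_Y(x, u2)}

def M_do_X(x):
    # submodel M_{do(X=x)}: drop f_X and hold X fixed at the given constant
    def submodel(u1, u2):
        return {"X": x, "Y": f_Y(x, u2)}
    return submodel

u = (0.7, 0.1)
print(M(*u)["Y"])              # factual Y
print(M_do_X(0.0)(*u)["Y"])    # potential outcome Y_{do(X=0)}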
Higher-order interventions, such as do(x = g(z)), can be represented by replacing equations with functions instead of constants, as implemented in Omega [Tavares et al. 2020].
One can compute a counterfactual using a graphical approach known as the twin network method [Balke and Pearl 1994]. It uses two graphs, one to represent the factual world and one to represent the counterfactual world. Bayesian implementations of twin-world networks are described in [Lattimore and Rohde 2019].
A fully-specified probabilistic structural causal model is a pair ⟨M, P(U)⟩, where M is a fully-specified deterministic structural causal model and P(U) is a probability function defined over the domain of U. Probabilistic structural causal models can be represented using probabilistic programming languages.
Given a probabilistic SCM ⟨M, P(U)⟩, the conditional probability of a counterfactual sentence can be evaluated using the following three steps (a runnable sketch follows the list):
Abduction: Update P(U) by the evidence E = e to obtain P(U | E = e).
Action: Modify M by the intervention do(X = x) to obtain the submodel M_{do(X=x)}.
Prediction: Use the modified model ⟨M_{do(X=x)}, P(U | E = e)⟩ to compute the probability of Y_{do(X=x)}, the potential outcome of the counterfactual.
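A minimal, self-contained Python version of these three steps for a toy probabilistic SCM (its structure and probabilities are invented for illustration): the exogenous noise is abduced by rejection sampling given the evidence and then reused under the intervention.

import random

def scm(u_x, u_y, do_x=None):
    x = u_x if do_x is None else do_x
    y = (x and u_y < 0.9) or (not x and u_y < 0.2)
    return int(x), int(y)

random.seed(0)
evidence = (1, 1)                                  # observed X = 1, Y = 1

# Abduction: approximate P(U | E = e) by rejection sampling
posterior_u = []
while len(posterior_u) < 5000:
    u = (random.random() < 0.5, random.random())
    if scm(*u) == evidence:
        posterior_u.append(u)

# Action + Prediction: P(Y_{do(X=0)} = 1 | X = 1, Y = 1)
cf = [scm(u_x, u_y, do_x=0)[1] for u_x, u_y in posterior_u]
print(sum(cf) / len(cf))                           # approaches 0.2 / 0.9, about 0.22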
C.2 Classical causal inference in a PPL
The do-calculus consists of 3 rules, and the second one applies in the backdoor setting of §2.1. It says that if one can stratify the observational distribution by all these confounding factors, then what remains is the true causal effect. That is,

    Σ_x P(Y = y | T = t, X = x) P(X = x) = Σ_x P(Y = y | do(T = t), X = x) P(X = x)
                                         = P(Y = y | do(T = t)).

The replacement of T = t in the first expression with do(T = t) in the second expression is licensed by Rule 2 of the do-calculus.
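As a quick numerical sanity check (with made-up theta values for the model below), the adjusted sum matches a Monte Carlo estimate of P(Y = 1 | do(T = 1)) obtained by simulating the intervened program, while the naive conditional P(Y = 1 | T = 1) does not:

import numpy as np

theta = np.array([0.3, 0.2, 0.7, 0.4, 0.6, 0.5, 0.8])    # illustrative parameters
pX = lambda x: theta[0] if x == 1 else 1 - theta[0]
pT = lambda t, x: theta[x + 1] if t == 1 else 1 - theta[x + 1]
pY1 = lambda t, x: theta[t + 2 * x + 3]                   # P(Y = 1 | T = t, X = x)

t = 1
adjusted = sum(pY1(t, x) * pX(x) for x in (0, 1))         # sum_x P(Y=1|T=t,X=x) P(X=x)
naive = sum(pY1(t, x) * pT(t, x) * pX(x) for x in (0, 1)) / \
        sum(pT(t, x) * pX(x) for x in (0, 1))             # P(Y=1|T=t)

rng = np.random.default_rng(0)
def intervened(theta, t):                                 # simulate do(T = t)
    x = int(rng.random() < theta[0])
    return int(rng.random() < theta[t + 2 * x + 3])
mc = np.mean([intervened(theta, t) for _ in range(100_000)])

print(adjusted, mc, naive)                                # adjusted ~ mc, naive differs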
C.2.1 Specifying and estimating the do-calculus problem. For simplicity, let's consider only one confounder, say political affiliation. Let theta be an array of seven learnable parameters. A program that represents this situation could be as follows:

def causal_model(theta):
    X ~ bernoulli(theta[0])
    T ~ bernoulli(theta[X+1])
    Y ~ bernoulli(theta[T+2*X+3])
    return Y
The classical causal inference approach would be to extract the causal diagram from the model, identify the causal effect, and then estimate it from data:

>>> causal_graph = extract_dependencies(causal_model)
>>> causal_graph.draw()
[causal diagram with edges X → T, X → Y, and T → Y]
>>> estimand = identify(P(Y|do(T)), causal_graph)
>>> estimand
    P(Y | do(T)) = Σ_X P(Y | T, X) P(X)

>>> P_of_covid_given_do_vaccine = estimate(causal_model, estimand, data)
>>> P_of_covid_given_do_vaccine
{'covid-positive': 0.00534, 'covid-negative': 0.99466}
The extract_dependencies function takes a model as input. It then performs a static analysis of the dependency structure to generate a causal diagram. The identify() function takes as input a causal diagram and a causal query, represented as a symbolic probabilistic expression. It then applies the do-calculus to the diagram to identify the query. If the causal query is identified, it will return an estimand, represented as a symbolic probabilistic expression composed of nested conditionals and marginals. If the query is not identified, it will raise an exception. The estimate() procedure takes a causal model, an estimand, and a dataframe containing measurements of the observed variables as input. It then applies the estimand to the dataset to generate an estimate of the original causal query.