Introduction to Causal Inference
Brady Neal
Prerequisites There is one main prerequisite: basic probability. This course assumes
you’ve taken an introduction to probability course or have had equivalent experience.
Topics from statistics and machine learning will pop up in the course from time to
time, so some familiarity with those will be helpful but is not necessary. For example, if
cross-validation is a new concept to you, you can learn it relatively quickly at the point in
the book that it pops up. And we give a primer on some statistics terminology that we’ll
use in Section 2.4.
Active Reading Exercises Research shows that one of the best techniques to remember
material is to actively try to recall information that you recently learned. You will see
“active reading exercises” throughout the book to help you do this. They’ll be marked by
the Active reading exercise: heading.
Many Figures in This Book As you will see, there are a ridiculous number of figures in
this book. This is on purpose, to help give you as much visual intuition as possible.
We will sometimes copy the same figures, equations, etc. that you might have seen in
preceding chapters so that we can make sure the figures are always right next to the text
that references them.
Sending Me Feedback This is a book draft, so I greatly appreciate any feedback you’re
willing to send my way. If you’re unsure whether I’ll be receptive to it or not, don’t be.
Please send any feedback to me at [email protected] with “[Causal Book]” in the
beginning of your email subject. Feedback can be at the word level, sentence level, section
level, chapter level, etc. Here’s a non-exhaustive list of useful kinds of feedback:
▶ Typos.
▶ Some part is confusing.
▶ You notice your mind starts to wander, or you don't feel motivated to read some part.
▶ Some part seems like it can be cut.
▶ You feel strongly that some part absolutely should not be cut.
▶ Some parts are not connected well. Moving from one part to the next, you notice that there isn't a natural flow.
▶ A new active reading exercise you thought of.
Bibliographic Notes Although we do our best to cite relevant results, we don’t want to
disrupt the flow of the material by digging into exactly where each concept came from.
There will be complete sections of bibliographic notes in the final version of this book,
but they won’t come until after the course has finished.
Contents

Preface

1 Motivation: Why You Might Care
   1.1 Simpson's Paradox
   1.2 Applications of Causal Inference
   1.3 Correlation Does Not Imply Causation
   1.4 Main Themes

2 Potential Outcomes
   2.1 Potential Outcomes and Individual Treatment Effects
   2.2 The Fundamental Problem of Causal Inference
   2.3 Getting Around the Fundamental Problem
      2.3.1 Average Treatment Effects and Missing Data Interpretation
      2.3.2 Ignorability and Exchangeability
      2.3.3 Conditional Exchangeability and Unconfoundedness
      2.3.4 Positivity/Overlap and Extrapolation
      2.3.5 No Interference, Consistency, and SUTVA
      2.3.6 Tying It All Together
   2.4 Fancy Statistics Terminology Defancified
   2.5 A Complete Example with Estimation

3 The Flow of Association and Causation in Graphs

4 Causal Models
   4.1 The do-operator and Interventional Distributions
   4.2 The Main Assumption: Modularity
   4.3 Truncated Factorization
      4.3.1 Example Application and Revisiting "Association is Not Causation"
   4.4 The Backdoor Adjustment
      4.4.1 Relation to Potential Outcomes
   4.5 Structural Causal Models (SCMs)
      4.5.1 Structural Equations
      4.5.2 Interventions
      4.5.3 Collider Bias and Why to Not Condition on Descendants of Treatment
   4.6 Example Applications of the Backdoor Adjustment
      4.6.1 Association vs. Causation in a Toy Example
      4.6.2 A Complete Example with Estimation
   4.7 Assumptions Revisited

5 Randomized Experiments
   5.1 Comparability and Covariate Balance
   5.2 Exchangeability
   5.3 No Backdoor Paths

6 General Identification
   6.1 Coming Soon

7 Estimation
   7.1 Coming Soon

8 Counterfactuals
   8.1 Coming Soon

Bibliography

Alphabetical Index

List of Figures

List of Tables

Listings
   4.1 Python code for estimating the ATE, without adjusting for the collider
1 Motivation: Why You Might Care

1.1 Simpson's Paradox

Consider a purely hypothetical future where there is a new disease known as COVID-27 that is prevalent in the human population. In this purely hypothetical future, there are two treatments that have been developed: treatment A and treatment B. Treatment B is more scarce than treatment A, so the split of those currently receiving treatment A vs. treatment B is roughly 73%/27%. You are in charge of choosing which treatment your country will exclusively use, in a country that only cares about minimizing loss of life.
You have data on the percentage of people who die from COVID-27,
given the treatment they were assigned and given their condition at the
time treatment was decided. Their condition is a binary variable: either
mild or severe. In this data, 16% of those who receive A die, whereas
19% of those who receive B die. However, when we examine the people
with mild condition separately from the people with severe condition,
the numbers reverse order. In the mild subpopulation, 15% of those who
receive A die, whereas 10% of those who receive B die. In the severe
subpopulation, 30% of those who receive A die, whereas 20% of those
who receive B die. We depict these percentages and the corresponding
counts in Table 1.1.
Table 1.1: Simpson's paradox in COVID-27 data. Percentages are mortality rates (lower is better), with the corresponding counts in parentheses. Treatment A appears better in total, whereas treatment B appears better in all subpopulations.

              Total             Mild              Severe
Treatment A   16% (240/1500)    15% (210/1400)    30% (30/100)
Treatment B   19% (105/550)     10% (5/50)        20% (100/500)
The apparent paradox stems from the fact that, in Table 1.1, the "Total" column could be interpreted to mean that we should prefer treatment A, whereas the "Mild" and "Severe" columns could both be interpreted to mean that we should prefer treatment B.¹ In fact, the answer is that if we know someone's condition, we should give them treatment B, and if we do not know their condition, we should give them treatment A. Just kidding... that doesn't make any sense. So really, what treatment should you choose for your country?

Either treatment A or treatment B could be the right answer, depending on the causal structure of the data. In other words, causality is essential to solve Simpson's paradox. For now, we will just give the intuition for when you should prefer treatment A vs. when you should prefer treatment B, but it will be made more formal in Chapter 4.

¹ A key ingredient necessary to find Simpson's paradox is the non-uniformity of allocation of people to the groups. 1400 of the 1500 people who received treatment A had mild condition, whereas 500 of the 550 people who received treatment B had severe condition. Because people with mild condition are less likely to die, this means that the total mortality rate for those with treatment A is lower than what it would have been if mild and severe conditions were equally split among them. The opposite bias is true for treatment B.
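To see the reversal concretely, here is a small Python sketch (ours, not from the book) that recomputes the rates from the counts in Table 1.1; the counts for treatment A are filled in from the percentages and footnote 1.

# Counts from Table 1.1: (deaths, total) per treatment and condition
counts = {
    ('A', 'mild'): (210, 1400), ('A', 'severe'): (30, 100),
    ('B', 'mild'): (5, 50),     ('B', 'severe'): (100, 500),
}

def mortality(deaths, total):
    return deaths / total

for treatment in ['A', 'B']:
    mild_d, mild_n = counts[(treatment, 'mild')]
    sev_d, sev_n = counts[(treatment, 'severe')]
    total_rate = mortality(mild_d + sev_d, mild_n + sev_n)
    print(treatment,
          'total: %.0f%%' % (100 * total_rate),
          'mild: %.0f%%' % (100 * mortality(mild_d, mild_n)),
          'severe: %.0f%%' % (100 * mortality(sev_d, sev_n)))
# A total: 16%  mild: 15%  severe: 30%
# B total: 19%  mild: 10%  severe: 20%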
Scenario 2 If the prescription² of treatment 𝑇 is a cause of the condition 𝐶 (Figure 1.2), treatment A is more effective. An example scenario is where treatment B is so scarce that it requires patients to wait a long time after they were prescribed the treatment before they can receive the treatment. Treatment A does not have this problem. Because the condition of a patient with COVID-27 worsens over time, the prescription of treatment B actually causes patients with mild conditions to develop severe conditions, causing a higher mortality rate. Therefore, even if treatment B is more effective than treatment A once administered (positive effect along 𝑇 → 𝑌 in Figure 1.2), because prescription of treatment B causes worse conditions (negative effect along 𝑇 → 𝐶 → 𝑌 in Figure 1.2), treatment B is less effective in total. Note: Because treatment B is more expensive, treatment B is prescribed with 0.27 probability, while treatment A is prescribed with 0.73 probability; importantly, treatment prescription is independent of condition in this scenario.

² 𝑇 refers to the prescription of the treatment, rather than the subsequent reception of the treatment.

Figure 1.2: Causal structure of scenario 2, where treatment 𝑇 is a cause of condition 𝐶. Given this causal structure, treatment A is preferable.
In sum, the more effective treatment is completely dependent on the
causal structure of the problem. In Scenario 1, where 𝐶 was a cause of
𝑇 (Figure 1.1), treatment B was more effective. In Scenario 2, where 𝑇
was a cause of 𝐶 (Figure 1.2), treatment A was more effective. Without
causality, Simpson’s paradox cannot be resolved. With causality, it is not
a paradox at all.
1.3 Correlation Does Not Imply Causation

Many of you will have heard the mantra "correlation does not imply causation." In this section, we will quickly review that and provide you with a bit more intuition about why this is the case.
It turns out that the yearly number of people who drown by falling into swimming pools has a high degree of correlation with the yearly number of films that Nicolas Cage appears in [1]. See Figure 1.3 for a graph of this data. Does this mean that Nicolas Cage encourages bad swimmers to hop in the pool in his films? Or does Nicolas Cage feel more motivated to act in more films when he sees how many drownings are happening that year, perhaps to try to prevent more drownings? Or is there some other explanation? For example, maybe Nicolas Cage is interested in increasing his popularity among causal inference practitioners, so he travels back in time to convince his past self to do just the right number of movies for us to see this correlation, but not too close of a match as that would arouse suspicion and potentially cause someone to prevent him from rigging the data this way. We may never know for sure.

[1]: Vigen (2015), Spurious correlations

Figure 1.3: The yearly number of movies Nicolas Cage appears in correlates with the yearly number of pool drownings [1].
Next, we move on to a more illustrative example that will help clarify how spurious correlations can arise.
Before moving to the next example, let’s be a bit more precise about
terminology. “Correlation” is often colloquially used as a synonym
for statistical dependence. However, “correlation” is technically only a
measure of linear statistical dependence. We will largely be using the
term association to refer to statistical dependence from now on.
Causation is not binary. For any given amount of association, it does not
need to be “all the association is causation” or “no causation.” It is possible
to have some causation while having a large amount of association. The
phrase “association is not causation” simply means that the amount of
association and the amount of causation can be different. Some amount
of association and zero causation is a special case of “association is not
causation.”
Say you happen upon some data that relates wearing shoes to bed and
waking up with a headache, as one does. It turns out that most times
that someone wears shoes to bed, that person wakes up with a headache.
And most times someone doesn’t wear shoes to bed, that person doesn’t
wake up with a headache. It is not uncommon for people to interpret
data like this (with associations) as meaning that wearing shoes to bed
causes people to wake up with headaches, especially if they are looking
for a reason to justify not wearing shoes to bed. A careful journalist might
make claims like “wearing shoes to bed is associated with headaches”
or “people who wear shoes to bed are at higher risk of waking up with
headaches.” However, the main reason to make claims like that is that
most people will internalize claims like that as “if I wear shoes to bed,
I’ll probably wake up with a headache.”
We can explain how wearing shoes to bed and headaches are associated
without either being a cause of the other. It turns out that they are
both caused by a common cause: drinking the night before. We depict
this in Figure 1.4. You might also hear this kind of variable referred
to as a “confounder” or a “lurking variable.” We will call this kind of
association confounding association since the association is facilitated by a
confounder.
The total association observed can be made up of both confounding association and causal association. It could be the case that wearing shoes to bed does have some small causal effect on waking up with a headache. Then, the total association would not be solely confounding association nor solely causal association. It would be a mixture of both. For example, in Figure 1.4, causal association flows along the arrow from shoe-sleeping to waking up with a headache. And confounding association flows along the path from shoe-sleeping to drinking to headachening (waking up with a headache). We will make the graphical interpretation of these different kinds of association clear in Chapter 3.

Figure 1.4: Causal structure, where drinking the night before is a common cause of sleeping with shoes on and of waking up with a headache.
1.4 Main Themes

There are several overarching themes that will keep coming up throughout this book. These themes will largely be comparisons of two different
categories. As you are reading, it is important that you understand which
categories different sections of the book fit into and which categories
they do not fit into.
Statistical vs. Causal Even with an infinite amount of data, we some-
times cannot compute some causal quantities. In contrast, much of
statistics is about addressing uncertainty in finite samples. When given
infinite data, there is no uncertainty. However, association, a statistical
concept, is not causation. There is more work to be done in causal infer-
ence, even after starting with infinite data. This is the main distinction
motivating causal inference. We have already made this distinction in
this chapter and will continue to make this distinction throughout the
book.
Identification vs. Estimation Identification of causal effects is unique
to causal inference. It is the problem that remains to solve, even when we
have infinite data. However, causal inference also shares estimation with
traditional statistics and machine learning. We will largely begin with
identification of causal effects (in Chapters 2, 4 and 6) before moving to
estimation of causal effects (in Chapter 7). The exceptions are Section 2.5
and Section 4.6.2, where we carry out complete examples with estimation
to give you an idea of what the whole process looks like early on.
Interventional vs. Observational If we can intervene/experiment,
identification of causal effects is relatively easy. This is simply because
we can actually take the action that we want to measure the causal effect
of and simply measure the effect after we take that action. Observational
data is where it gets more complicated because confounding is almost
always introduced into the data.
Assumptions There will be a large focus on what assumptions we are
using to get the results that we get. Each assumption will have its own
box to help make it difficult to not notice. Clear assumptions should make
it easy to see where critiques of a given causal analysis or causal model
will be. The hope is that presenting assumptions clearly will lead to more
lucid discussions about causality.
2 Potential Outcomes
In this chapter, we will ease into the world of causality. We will see that new concepts and corresponding notations need to be introduced to clearly describe causal concepts. These concepts are "new" in the sense that they may not exist in traditional statistics or math, but they should be familiar in that we use them in our thinking and describe them with natural language all the time.

Familiar statistical notation We will use 𝑇 to denote the random variable for treatment, 𝑌 to denote the random variable for the outcome of interest, and 𝑋 to denote covariates. In general, we will use uppercase letters to denote random variables (except in maybe one case) and lowercase letters to denote values that random variables take on. Much of what we consider will be settings where 𝑇 is binary. Know that, in general, we can extend things to work in settings where 𝑇 can take on more than two values or where 𝑇 is continuous.

2.1 Potential Outcomes and Individual Treatment Effects
We will now introduce the first causal concept to appear in this book. These concepts are sometimes characterized as being unique to the Neyman-Rubin [2–4] causal model (or potential outcomes framework), but they are not. For example, these same concepts are still present (just under different notation) in the framework that uses causal graphs (Chapters 3 and 4). It is important that you spend some time ensuring that you understand these initial causal concepts. If you have not studied causal inference before, they will be unfamiliar to see in mathematical contexts, though they may be quite familiar intuitively because we commonly think and communicate in causal language.

[2]: Splawa-Neyman (1923 [1990]), 'On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.'
[3]: Rubin (1974), 'Estimating causal effects of treatments in randomized and nonrandomized studies.'
[4]: Sekhon (2008), 'The Neyman-Rubin Model of Causal Inference and Estimation via Matching Methods'
Scenario 1 Consider the scenario where you are unhappy. And you are
considering whether or not to get a dog to help make you happy. If you
become happy after you get the dog, does this mean the dog caused you
to be happy? Well, what if you would have also become happy had you
not gotten the dog? In that case, the dog was not necessary to make you
happy, so its claim to a causal effect on your happiness is weak.
Scenario 2 Let’s switch things up a bit. Consider that you will still be
happy if you get a dog, but now, if you don’t get a dog, you will remain
unhappy. In this scenario, the dog has a pretty strong claim to a causal
effect on your happiness.
In both the above scenarios, we have used the causal concept known as
potential outcomes. Your outcome 𝑌 is happiness: 𝑌 = 1 corresponds to
happy while 𝑌 = 0 corresponds to unhappy. Your treatment 𝑇 is whether
or not you get a dog: 𝑇 = 1 corresponds to you getting a dog while 𝑇 = 0 corresponds to you not getting a dog. The potential outcome 𝑌(1) denotes what your happiness would be if you were to get a dog, and the potential outcome 𝑌(0) denotes what your happiness would be if you were not to get a dog. We can then write down the individual treatment effect (ITE)² for individual 𝑖:

𝜏𝑖 ≜ 𝑌𝑖(1) − 𝑌𝑖(0)   (2.1)

² The ITE is also known as the individual causal effect, unit-level causal effect, or unit-level treatment effect.
You could observe 𝑌(1) by getting a dog and observing your happiness, or you could observe 𝑌(0) by not getting a dog and observing your happiness. However, you cannot observe both 𝑌(1) and 𝑌(0), unless you have a time machine that would allow you to go back in time and choose the version of treatment that you didn't take the first time. You cannot simply get a dog, observe 𝑌(1), give the dog away, and then observe 𝑌(0) because the second observation will be influenced by all the actions you took between the two observations and anything else that changed since the first observation.
2.2 The Fundamental Problem of Causal Inference
This is known as the fundamental problem of causal inference [5]. It is fundamental because if we cannot observe both 𝑌𝑖(1) and 𝑌𝑖(0), then we cannot observe the causal effect 𝑌𝑖(1) − 𝑌𝑖(0). This problem is unique to causal inference because, in causal inference, we care about making causal claims, which are defined in terms of potential outcomes. For contrast, consider machine learning. In machine learning, we often only care about predicting the observed outcome 𝑌, so there is no need for potential outcomes, which means machine learning does not have to deal with this fundamental problem that we must deal with in causal inference.

[5]: Holland (1986), 'Statistics and Causal Inference'
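As a small illustration of this missing data view (the individuals and numbers below are made up), here is a sketch of a potential outcomes table in which, for each individual, only the potential outcome corresponding to the treatment actually taken is observed:

import numpy as np
import pandas as pd

# Hypothetical individuals: treatment taken and both potential outcomes
# (in reality, we could never have both the Y(1) and Y(0) columns filled in)
df = pd.DataFrame({'T': [1, 0, 1, 0],
                   'Y1': [1, 1, 0, 1],   # Y(1): outcome if treated
                   'Y0': [0, 1, 0, 0]})  # Y(0): outcome if not treated

# What we actually get to observe: Y = Y(T), with the other outcome missing
df['Y_obs'] = np.where(df['T'] == 1, df['Y1'], df['Y0'])
df['Y1_obs'] = df['Y1'].where(df['T'] == 1)  # NaN when T = 0
df['Y0_obs'] = df['Y0'].where(df['T'] == 0)  # NaN when T = 1
print(df[['T', 'Y1_obs', 'Y0_obs', 'Y_obs']])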
The potential outcomes that you do not (and cannot) observe are known
as counterfactuals because they are counter to fact (reality). “Potential
outcomes” are sometimes referred to as “counterfactual outcomes,” but
we will never do that in this book because a potential outcome 𝑌(𝑡)
does not become counter to fact until another potential outcome 𝑌(𝑡′) is
observed. The potential outcome that is observed is sometimes referred
to as a factual. Note that there are no counterfactuals or factuals until the
outcome is observed. Before that, there are only potential outcomes.
I suspect this section is where this chapter might start to get a bit unclear.
If that is the case for you, don’t worry too much, and just continue to the
next chapter, as it will build up parallel concepts in a hopefully more
intuitive way.
The assumption that makes it possible to get around this problem (of missing data) is known as ignorability. Assuming ignorability is like ignoring how people ended up selecting the treatment they selected and just assuming they were randomly assigned their treatment; we depict this graphically in Figure 2.2 by the lack of a causal arrow from 𝑋 to 𝑇. We will now state this assumption formally.
Assumption 2.1 (Ignorability / Exchangeability)

(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇

This assumption is key to causal inference because it allows us to reduce the ATE to the associational difference:

𝔼[𝑌(1)] − 𝔼[𝑌(0)] = 𝔼[𝑌(1) | 𝑇 = 1] − 𝔼[𝑌(0) | 𝑇 = 0]   (2.3)
                 = 𝔼[𝑌 | 𝑇 = 1] − 𝔼[𝑌 | 𝑇 = 0]

Figure 2.2: Causal structure when the treatment assignment mechanism is ignorable. Notably, this means there's no arrow from 𝑋 to 𝑇, which means there is no confounding.
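Here is a small simulation sketch of why this reduction works (the data generating process and numbers are made up for illustration): when treatment is randomly assigned, ignorability holds by construction, and the associational difference recovers the ATE.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential outcomes for each individual (true ATE is 2 by construction)
y0 = rng.normal(0, 1, n)
y1 = y0 + 2

# Randomized treatment: independent of (Y(1), Y(0)), so ignorability holds
t = rng.binomial(1, 0.5, n)
y = np.where(t == 1, y1, y0)  # observed outcome

print('true ATE:', (y1 - y0).mean())                                      # 2.0
print('associational difference:', y[t == 1].mean() - y[t == 0].mean())   # ~2.0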
Assumption 2.2 (Conditional Exchangeability / Unconfoundedness)

(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑋
The idea is that although the treatment and potential outcomes may
be unconditionally associated (due to confounding), within levels of 𝑋 ,
they are not associated. In other words, there is no confounding within
levels of 𝑋 because controlling for 𝑋 has made the treatment groups
comparable. We’ll now give a bit of graphical intuition for the above. We
will not draw the rigorous connection between the graphical intuition
and Assumption 2.2 until Chapter 3; for now, it is just meant to aid
intuition.
This marks an important result for causal inference, so we’ll give it its
own proposition box. The proof we give above leaves out some details.
Read through to Section 2.3.6 (where we redo the proof with all details
specified) to get the rest of the details. We will call this result the adjustment
formula.
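To make the adjustment formula 𝔼[𝑌(1)] − 𝔼[𝑌(0)] = 𝔼𝑋[𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]] concrete, here is a small sketch (with a made-up, discretely confounded data generating process) that computes it by stratifying on 𝑋 and contrasts it with the naive associational difference:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000

x = rng.binomial(1, 0.5, n)                       # binary confounder
t = rng.binomial(1, np.where(x == 1, 0.8, 0.2))   # treatment depends on X
y = 1.0 * t + 3.0 * x + rng.normal(0, 1, n)       # true ATE is 1.0
df = pd.DataFrame({'x': x, 't': t, 'y': y})

# Naive associational difference (biased by confounding)
naive = df.loc[df.t == 1, 'y'].mean() - df.loc[df.t == 0, 'y'].mean()

# Adjustment formula: E_X[ E[Y | T=1, X] - E[Y | T=0, X] ]
means = df.groupby(['x', 't'])['y'].mean().unstack()      # E[Y | T=t, X=x]
adjusted = ((means[1] - means[0]) * df['x'].value_counts(normalize=True)).sum()

print('naive difference:', naive)       # roughly 2.8
print('adjusted estimate:', adjusted)   # roughly 1.0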
To see why positivity is important, let’s take a closer look at Equation 2.9:
Intuition That’s the math for why we need the positivity assumption,
but what’s the intuition? Well, if we have a positivity violation, that
means that within some subgroup of the data, everyone always receives
treatment or everyone always receives the control. It wouldn’t make
sense to be able to estimate a causal effect of treatment vs. control in that
subgroup since we see only treatment or only control. We never see the
alternative in that subgroup.
Another name for positivity is overlap. The intuition for this name is that we want the covariate distribution of the treatment group to overlap with the covariate distribution of the control group. More specifically, we want 𝑃(𝑋 | 𝑇 = 1)⁹ to have the same support as 𝑃(𝑋 | 𝑇 = 0).¹⁰ This is why another common alias for positivity is common support.

⁹ Whenever we use a random variable (denoted by a capital letter) as the argument for 𝑃, we are referring to the whole distribution, rather than just the scalar that something like 𝑃(𝑥 | 𝑇 = 1) refers to.
¹⁰ Active reading exercise: convince yourself that this formulation of overlap/positivity is equivalent to the formulation in Assumption 2.3.
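Here is a small sketch of one way to probe overlap in practice (the data and covariate levels are made up): count treated and control units within each level of a discrete covariate and flag levels where one of the counts is zero.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.integers(0, 3, 1000)              # a discrete covariate with levels 0, 1, 2
p_treat = np.array([0.5, 0.9, 0.0])[x]    # P(T = 1 | X = x); level 2 never gets treated
t = rng.binomial(1, p_treat)
df = pd.DataFrame({'x': x, 't': t})

# Count treated/control per covariate level; a zero flags a positivity violation
counts = df.groupby('x')['t'].agg(treated='sum', control=lambda s: (s == 0).sum())
print(counts)
print('positivity violations at x =', list(counts.index[(counts == 0).any(axis=1)]))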
The Positivity-Unconfoundedness Tradeoff Although conditioning on more covariates could lead to a higher chance of satisfying unconfoundedness, it can lead to a higher chance of violating positivity. As we increase the dimension of the covariates, we make the subgroups for any level 𝑥 of the covariates smaller.¹¹ As each subgroup gets smaller, there is a higher and higher chance that either the whole subgroup will have treatment or the whole subgroup will have control. For example, once a subgroup shrinks to a single individual, we necessarily see either only treatment or only control in that subgroup.

¹¹ This is related to the curse of dimensionality.
𝑇 = 𝑡 =⇒ 𝑌 = 𝑌(𝑡) (2.12)
𝑌 = 𝑌(𝑇) (2.13)
𝔼[𝑌(1) − 𝑌(0)] is the causal estimand that we are interested in. In order to actually estimate this causal estimand, we must translate it into a statistical estimand: 𝔼𝑋[𝔼[𝑌 | 𝑇 = 1, 𝑋] − 𝔼[𝑌 | 𝑇 = 0, 𝑋]].¹⁴

¹⁴ Active reading exercise: Why can't we directly estimate a causal estimand without first translating it to a statistical estimand?
When we say “identification” in this book, we are referring to the process first translating it to a statistical estimand?
of moving from a causal estimand to an equivalent statistical estimand.
When we say “estimation,” we are referring to the process of moving from
a statistical estimand to an estimate. We illustrate this in the flowchart in
Figure 2.5.
Causal Estimand → (Identification) → Statistical Estimand → (Estimation) → Estimate

Figure 2.5: The Identification-Estimation Flowchart – a flowchart that illustrates the process of moving from a target causal estimand to a corresponding estimate, through identification and estimation.
2.5 A Complete Example with Estimation

Theorem 2.1 and the corresponding recent copy in Equation 2.14 give
us identification. However, we haven’t discussed estimation at all. In
this section, we will give a short example complete with estimation. We
will cover the topic of estimation of causal effects more completely in
Chapter 7.
We use Luque-Fernandez et al. [8]'s example from epidemiology. The outcome 𝑌 of interest is (systolic) blood pressure. This is an important outcome because roughly 46% of Americans have high blood pressure, and high blood pressure is associated with increased risk of mortality [9]. The "treatment" 𝑇 of interest is sodium intake. Sodium intake is a continuous variable; in order to easily apply Equation 2.14, which is specified for binary treatment, we will binarize 𝑇 by letting 𝑇 = 1 denote daily sodium intake above 3.5 grams and letting 𝑇 = 0 denote daily sodium intake below 3.5 grams.¹⁵ We will be estimating the causal effect of sodium intake on blood pressure. In our data, we also have the age of the individuals and amount of protein in their urine as covariates 𝑋.

[8]: Luque-Fernandez et al. (2018), 'Educational Note: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data: a reproducible illustration and web application'
[9]: Virani et al. (2020), 'Heart Disease and Stroke Statistics—2020 Update: A Report From the American Heart Association'
¹⁵ As we will see, this binarization is purely pedagogical and does not reflect any limitations of adjusting for confounders.
Our estimator replaces the outer expectation over 𝑋 in the statistical estimand with an empirical mean over the 𝑛 data points:

(1/𝑛) Σ𝑖 [𝔼[𝑌 | 𝑇 = 1, 𝑥𝑖] − 𝔼[𝑌 | 𝑇 = 0, 𝑥𝑖]]   (2.15)
All of the above is done using the adjustment formula with model-assisted
estimation, where we first fit a model for the conditional expectation
𝔼[𝑌 | 𝑡, 𝑥], and then we take an empirical mean over 𝑋 , using that model.
However, because we are using a linear model, this is equivalent to just
taking the coefficient in front of 𝑇 in the linear regression as the ATE
estimate. This is what we do in the following code (which gives the exact
same ATE estimate):
from sklearn.linear_model import LinearRegression

# Fit a model for E[Y | T, X]; Xt holds the treatment column first, then the covariates
y = df['blood_pressure']
model = LinearRegression()
model.fit(Xt, y)

# Because the model is linear, the coefficient on the treatment column is the ATE estimate
ate_est = model.coef_[0]
print('ATE estimate:', ate_est)
It turns out that much of the work for causal graphical models was done
in the field of probabilistic graphical models. Probabilistic graphical
models are statistical models while causal graphical models are causal
models. Bayesian networks are the main probabilistic graphical model
that causal graphical models (causal Bayesian networks) inherit most of
their properties from.
Imagine that we only cared about modeling association, without any
causal modeling. We would want to model the data distribution 𝑃(𝑥 1 , 𝑥 2 , . . . , 𝑥 𝑛 ).
In general, we can use the chain rule of probability to factorize any distri-
bution:
𝑃(𝑥1, 𝑥2, . . . , 𝑥𝑛) = 𝑃(𝑥1) ∏𝑖 𝑃(𝑥𝑖 | 𝑥𝑖−1, . . . , 𝑥1)   (3.1)
However, if we were to model these factors with tables, it would take an exponential number of parameters. To see this, take each 𝑥𝑖 to be binary and consider how we would model the factor 𝑃(𝑥𝑛 | 𝑥𝑛−1, . . . , 𝑥1). Since 𝑥𝑛 is binary, we only need to model 𝑃(𝑋𝑛 = 1 | 𝑥𝑛−1, . . . , 𝑥1) because 𝑃(𝑋𝑛 = 0 | 𝑥𝑛−1, . . . , 𝑥1) is simply 1 − 𝑃(𝑋𝑛 = 1 | 𝑥𝑛−1, . . . , 𝑥1). Well, we would need 2^(𝑛−1) parameters to model this. As a specific example, let 𝑛 = 4. As we can see in Table 3.1, this would require 2^(4−1) = 8 parameters: 𝛼1, . . . , 𝛼8. This brute-force parametrization quickly becomes intractable as 𝑛 increases.

Table 3.1: Table required to model the single factor 𝑃(𝑥𝑛 | 𝑥𝑛−1, . . . , 𝑥1) where 𝑛 = 4 and the variables are binary. The number of parameters necessary is exponential in 𝑛.

𝑥1  𝑥2  𝑥3    𝑃(𝑥4 | 𝑥3, 𝑥2, 𝑥1)
0   0   0     𝛼1
0   0   1     𝛼2
0   1   0     𝛼3
0   1   1     𝛼4
1   0   0     𝛼5
1   0   1     𝛼6
1   1   0     𝛼7
1   1   1     𝛼8

An intuitive way to more efficiently model many variables together in a joint distribution is to only model local dependencies. For example, rather than modeling the 𝑋4 factor as 𝑃(𝑥4 | 𝑥3, 𝑥2, 𝑥1), we could model it as 𝑃(𝑥4 | 𝑥3) if we have reason to believe that 𝑋4 only locally depends on 𝑋3. In fact, in the corresponding graph in Figure 3.4, the only node that feeds into 𝑋4 is 𝑋3. This is meant to signify that 𝑋4 only locally depends on 𝑋3. Whenever we use a graph 𝐺 in relation to a probability distribution 𝑃, there will always be a one-to-one mapping between the nodes in 𝐺 and the random variables in 𝑃, so when we talk about nodes being independent, we mean the corresponding random variables are independent.

Figure 3.4: Four node DAG where 𝑋4 locally depends on only 𝑋3.

Given a probability distribution and a corresponding directed acyclic graph (DAG), we can formalize the specification of independencies with the local Markov assumption:
Assumption 3.1 (Local Markov Assumption) Given its parents in the
DAG, a node 𝑋 is independent of all of its non-descendants.
If 𝑃 is Markov with respect to the graph¹ in Figure 3.4, then we can use the local Markov assumption to simplify its factorization.

¹ A probability distribution is said to be Markov with respect to a graph if it satisfies the local Markov assumption with respect to that graph.
Hopefully you see the resemblance between the move from Equation 3.2
to Equation 3.3 or the move to Equation 3.4 and the generalization of this
that is presented in Definition 3.1.
The Bayesian network factorization is also known as the chain rule for
Bayesian networks or Markov compatibility. For example, if 𝑃 factorizes
according to 𝐺 , then 𝑃 and 𝐺 are Markov compatible.
We have given the intuition of how the local Markov assumption implies
the Bayesian network factorization, and it turns out that the two are
actually equivalent. In other words, we could have started with the
Bayesian network factorization as the main assumption (and labeled it as
an assumption) and shown that it implies the local Markov assumption.
See Koller and Friedman [12, Chapter 3] for these proofs and more information on this topic.

[12]: Koller and Friedman (2009), Probabilistic Graphical Models: Principles and Techniques
As important as the local Markov assumption is, it only gives us infor-
mation about the independencies in 𝑃 that a DAG implies. It does not
even tell us that if 𝑋 and 𝑌 are adjacent in the DAG, then 𝑋 and 𝑌 are
dependent. And this additional information is very commonly assumed
in causal DAGs. To get this guaranteed dependence between adjacent
nodes, we will generally assume a slightly stronger assumption than the
local Markov assumption: minimality.
The previous section was all about statistical models and modeling
association. In this section, we will augment these models with causal
assumptions, turning them into causal models and allowing us to study
causation. In order to introduce causal assumptions, we must first have
an understanding of what it means for 𝑋 to be a cause of 𝑌 .
The strict causal edges assumption guarantees that every edge is “active,” just like in DAGs that satisfy minimality. In
other words, because the definition of a cause (Definition 3.2) implies
that a cause and its effect are dependent and because we are assuming
all parents are causes of their children, we are assuming that parents
and their children are dependent. So the second part of minimality
(Assumption 3.2) is baked into the strict causal edges assumption.
In contrast, the non-strict causal edges assumption would allow for
some parents to not be causes of their children. It would just assume
that children are not causes of their parents. This allows us to draw
graphs with extra edges to make fewer assumptions, just like we would
in Bayesian networks, where more edges means fewer independence
assumptions. Causal graphs are sometimes drawn with this kind of
non-minimal meaning, but the vast majority of the time, when someone
draws a causal graph, they mean that parents are causes of their children.
Therefore, unless we specify otherwise, throughout this book, we will
use “causal graph” to refer to a DAG that satisfies the strict causal edges
assumption. And we will often omit the word “strict” when we refer to
this assumption.
When we add the causal edges assumption, directed paths in the DAG
take on a very special meaning; they correspond to causation. This is in
contrast to other paths in the graph, which association may flow along,
but causation certainly may not. This will become more clear when we
go into detail on these other kinds of paths in Sections 3.5 and 3.6.
Moving forward, we will now think of the edges of graphs as causal, in
order to describe concepts intuitively with causal language. However,
all of the associational claims about statistical independence will still
hold, even when the edges do not have causal meaning like in the vanilla
Bayesian networks of Section 3.2.
As we will see in the next few sections, the main assumptions that we
need for our causal graphical models to tell us how association and
causation flow between variables are the following two:
1. Local Markov Assumption (Assumption 3.1)
2. Causal Edges Assumption (Assumption 3.3)
We will discuss these assumptions throughout the next few sections and
come back to discuss them more fully again in Section 3.8 after we’ve
established the necessary preliminaries.
Now that we’ve gotten the basic assumptions and definitions out of the
way, we can get to the core of this chapter: the flow of association and
causation in DAGs. We can understand this flow in general DAGs by
understanding the flow in the minimal building blocks of graphs. These
minimal building blocks consist of chains (Figure 3.7a), forks (Figure 3.7b),
immoralities (Figure 3.7c), two unconnected nodes (Figure 3.8), and two
connected nodes (Figure 3.9).
Figure 3.7: The basic graphical building blocks: (a) chain 𝑋1 → 𝑋2 → 𝑋3; (b) fork 𝑋1 ← 𝑋2 → 𝑋3; (c) immorality 𝑋1 → 𝑋2 ← 𝑋3.
Chains (Figure 3.10) and forks (Figure 3.11) share the same set of dependencies. In both structures, 𝑋1 and 𝑋2 are dependent, and 𝑋2 and 𝑋3 are dependent for the same reason that we discussed toward the end of Section 3.4. Adjacent nodes are always dependent when we make the causal edges assumption (Assumption 3.3). What about 𝑋1 and 𝑋3?

Figure 3.10: Chain with flow of association drawn as a dashed red arc.
𝑃(𝑥1, 𝑥3 | 𝑥2) = (𝑃(𝑥1, 𝑥2) / 𝑃(𝑥2)) 𝑃(𝑥3 | 𝑥2)   (3.8)
             = 𝑃(𝑥1 | 𝑥2) 𝑃(𝑥3 | 𝑥2)   (3.9)
Recall from Section 3.1 that we have an immorality when we have a child whose two parents do not have an edge connecting them (Figure 3.14). And in this graph structure, the child is known as a bastard. No, just kidding; it's called a collider.

In contrast to chains and forks, in an immorality, 𝑋1 ⊥⊥ 𝑋3. Look at the graph structure and think about it a bit. Why would 𝑋1 and 𝑋3 be associated? One isn't the descendant of the other like in chains, and they don't share a common cause like in forks. Rather, we can think of 𝑋1 and 𝑋3 simply as unrelated events that happen, which happen to both contribute to some common effect (𝑋2). To show this, we apply the Bayesian network factorization and marginalize out 𝑥2:

𝑃(𝑥1, 𝑥3) = Σ𝑥2 𝑃(𝑥1, 𝑥2, 𝑥3)   (3.10)
         = Σ𝑥2 𝑃(𝑥1) 𝑃(𝑥3) 𝑃(𝑥2 | 𝑥1, 𝑥3)   (3.11)
         = 𝑃(𝑥1) 𝑃(𝑥3) Σ𝑥2 𝑃(𝑥2 | 𝑥1, 𝑥3)   (3.12)
         = 𝑃(𝑥1) 𝑃(𝑥3)   (3.13)

Figure 3.14: Immorality with association blocked by a collider.
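A quick simulation sketch of this (the generating process and numbers are made up): in an immorality, the parents are marginally independent, but conditioning on the collider 𝑋2 (here, by restricting to a slice of its values) induces an association between them.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
x2 = x1 + x3 + 0.1 * rng.normal(size=n)   # collider: common effect of x1 and x3

print('corr(x1, x3):', np.corrcoef(x1, x3)[0, 1])   # ~0
mask = np.abs(x2) < 0.1                             # condition on the collider
print('corr(x1, x3 | x2 near 0):', np.corrcoef(x1[mask], x3[mask])[0, 1])   # strongly negative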
Definition 3.4 (d-separation) Two (sets of) nodes 𝑋 and 𝑌 are d-separated by a set of nodes 𝑍 if all of the paths between (any node in) 𝑋 and (any node in) 𝑌 are blocked by 𝑍 [15].

[15]: Pearl (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
If all the paths between two nodes 𝑋 and 𝑌 are blocked, then we say that
𝑋 and 𝑌 are d-separated. Similarly, if there exists at least one path between
𝑋 and 𝑌 that is unblocked, then we say that 𝑋 and 𝑌 are d-connected.
Theorem 3.1 Given that 𝑃 is Markov with respect to 𝐺 (satisfies the local Markov assumption, Assumption 3.1), if 𝑋 and 𝑌 are d-separated in 𝐺 conditioned on 𝑍, then 𝑋 and 𝑌 are independent in 𝑃 conditioned on 𝑍. We can write this succinctly as follows:

𝑋 ⊥⊥𝐺 𝑌 | 𝑍 ⟹ 𝑋 ⊥⊥𝑃 𝑌 | 𝑍   (3.20)
Because this is so important, we will give Equation 3.20 a name: the global
Markov assumption. Theorem 3.1 tells us that the local Markov assumption
implies the global Markov assumption.
Just as we built up the intuition that suggested that the
local Markov assumption (Assumption 3.1) implies the Bayesian network
factorization (Definition 3.1) and alerted you to the fact that the Bayesian
network factorization also implies the local Markov assumption (the
two are equivalent), it turns out that the global Markov assumption also
implies the local Markov assumption. In other words, the local Markov
assumption, global Markov assumption, and the Bayesian network fac-
torization are equivalent all [see, e.g., 12, Chapter 3]. Therefore, we will [12]: Koller and Friedman (2009), Proba-
use the slightly shortened phrase Markov assumption to refer to these bilistic Graphical Models: Principles and Tech-
niques
concepts as a group, or we will simply write “𝑃 is Markov with respect
to 𝐺 ” to convey the same meaning.
When we add the causal edges assumption, one type of path takes on a whole new meaning: directed paths. This
assumption endows directed paths with the unique role of carrying
causation along them. Additionally, this assumption is asymmetric; “𝑋
is a cause of 𝑌 ” is not the same as saying “𝑌 is a cause of 𝑋 .” This means
that there is an important difference between association and causation:
association is symmetric, whereas causation is asymmetric.
Given that we have tools to measure association, how can we isolate
causation? In other words, how can we ensure that the association we
measure is causation, say, for measuring the causal effect of 𝑋 on 𝑌 ?
Well, we can do that by ensuring that there is no non-causal association
flowing between 𝑋 and 𝑌 . This is true if 𝑋 and 𝑌 are d-separated in
the augmented graph where we remove outgoing edges from 𝑋. This is because all of 𝑋's causal effect on 𝑌 would flow through its outgoing edges; once those are removed, the only association that remains is purely non-causal association.
In Figure 3.16, we illustrate what each of the important assumptions
gives us in terms of interpreting this flow of association. First, we have
the (local/global) Markov assumption (Assumption 3.1). As we saw
in Section 3.7, this assumption allows us to know which nodes are
unassociated. In other words, the Markov assumption tells along which
paths the association does not flow. When we slightly strengthen the
Markov assumption to the minimality assumption (Assumption 3.2),
we get which paths association does flow along (except in intransitive
edges cases). When we further add in the causal edges assumption
(Assumption 3.3), we get that causation flows along directed paths.
Therefore, the following two⁸ assumptions are essential for graphical causal models:

1. Markov Assumption (Assumption 3.1)
2. Causal Edges Assumption (Assumption 3.3)

⁸ Recall that the first part of the minimality assumption is just the local Markov assumption and that the second part is contained in the causal edges assumption.
Figure 3.16: A flowchart that illustrates what kind of claims we can make about our data as we add each additional important assumption.
4 Causal Models

Causal models are essential for identification of causal quantities. When we presented the Identification-Estimation Flowchart (Figure 2.5) back in Section 2.4, we described identification as the process of moving from a causal estimand to a statistical estimand. However, to do that, we must have a causal model. We depict this fuller version of the Identification-Estimation Flowchart in Figure 4.1.

Causal Estimand + Causal Model → Statistical Estimand; Statistical Estimand + Data → Estimate

Figure 4.1: The Identification-Estimation Flowchart – a flowchart that illustrates the process of moving from a target causal estimand to a corresponding estimate, through identification and estimation. In contrast to Figure 2.5, this version is augmented with a causal model and data.

The previous chapter gives graphical intuition for causal models, but it doesn't explain how to identify causal quantities and formalize causal models. We will do that in this chapter.
Note that we shorten do(𝑇 = 𝑡) to just do(𝑡) in the last option in Equation
4.1. We will use this shorthand throughout the book. We can similarly
write the ATE (average treatment effect) when the treatment is binary as
follows:
𝔼[𝑌 | do(𝑇 = 1)] − 𝔼[𝑌 | do(𝑇 = 0)] (4.2)
We will often work with full distributions like 𝑃(𝑌 | do(𝑡)), rather than
their means, as this is more general; if we characterize 𝑃(𝑌 | do(𝑡)), then
we’ve characterized 𝔼[𝑌 | do(𝑡)]. We will commonly refer to 𝑃(𝑌 | do(𝑇 =
𝑡)) and other expressions with the do-operator in them as interventional
distributions.
Interventional distributions such as 𝑃(𝑌 | do(𝑇 = 𝑡)) are conceptually
quite different from the observational distribution 𝑃(𝑌). Observational
distributions such as 𝑃(𝑌) or 𝑃(𝑌, 𝑇, 𝑋) do not have the do-operator in
them. Because they don’t have the do-operator, we can observe data from
them without needing to carry out any experiment. This is why we call
data from 𝑃(𝑌, 𝑇, 𝑋) observational data. If we can reduce an expression
𝑄 with do in it (an interventional expression) to one without do in it (an
observational expression), then 𝑄 is said to be identifiable. An expression
with a do in it is fundamentally different from an expression without a
do in it, despite the fact that in do-notation, do appears after a regular
conditioning bar. As we discussed in Section 2.4, we will refer to an
estimand as a causal estimand when it contains a do-operator, and we
refer to an estimand as a statistical estimand when it doesn’t contain a
do-operator.
Whenever do(𝑡) appears after the conditioning bar, it means that ev-
erything in that expression is in the post-intervention world where the
intervention do(𝑡) occurs. For example, 𝔼[𝑌 | do(𝑡), 𝑍 = 𝑧] refers to the
expected outcome in the subpopulation where 𝑍 = 𝑧 after the whole
subpopulation has taken treatment 𝑡 . In contrast, 𝔼[𝑌 | 𝑍 = 𝑧] simply
refers to the expected value in the (pre-intervention) population where
individuals take whatever treatment they would normally take (𝑇 ). This
distinction will become important when we get to counterfactuals in
Chapter 8.
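To make the distinction concrete, here is a simulation sketch (with a made-up confounded generating process): conditioning on 𝑇 = 𝑡 filters to the subpopulation that happened to take 𝑡, whereas do(𝑇 = 𝑡) sets 𝑡 for the whole population and regenerates the outcome.

import numpy as np

rng = np.random.default_rng(0)
n = 500_000

def gen_outcome(t, x, rng):
    return 2.0 * t + 3.0 * x + rng.normal(0, 1, len(x))

# Observational world: X confounds T and Y
x = rng.normal(0, 1, n)
t = rng.binomial(1, 1 / (1 + np.exp(-3 * x)))   # treatment depends on X
y = gen_outcome(t, x, rng)

# Conditioning: look at the subpopulation that happened to have T = 1 / T = 0
print('E[Y | T=1] - E[Y | T=0]:', y[t == 1].mean() - y[t == 0].mean())

# Intervening: set T for everyone, keep X as is, regenerate Y
y_do1 = gen_outcome(np.ones(n), x, rng)
y_do0 = gen_outcome(np.zeros(n), x, rng)
print('E[Y | do(T=1)] - E[Y | do(T=0)]:', y_do1.mean() - y_do0.mean())   # ~2.0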
Causal graphs for three settings: (a) the observational distribution; (b) after intervention on 𝑇 (interventional distribution); (c) after intervention on 𝑇2 (interventional distribution).
The key thing that changed when we moved from the regular factorization
in Equation 4.3 to the truncated factorization in Equation 4.4 is that the
latter’s product is only over 𝑖 ∉ 𝑆 rather than all 𝑖 . In other words, the
factors for 𝑖 ∈ 𝑆 have been truncated.
To see the power that the truncated factorization gives us, let's apply it to identify the causal effect of treatment on outcome in a simple graph. Specifically, we will identify the causal quantity 𝑃(𝑦 | do(𝑡)). In this example, the distribution 𝑃 is Markov with respect to the graph in Figure 4.5.

Figure 4.5: Simple causal structure where 𝑋 confounds the effect of 𝑇 on 𝑌 and where 𝑋 is the only confounder.

The Bayesian network factorization (from the Markov assumption) gives us the following:

𝑃(𝑦, 𝑡, 𝑥) = 𝑃(𝑥) 𝑃(𝑡 | 𝑥) 𝑃(𝑦 | 𝑡, 𝑥)   (4.5)

When we intervene on the treatment, the truncated factorization (from adding the modularity assumption) gives us the following:

𝑃(𝑦, 𝑥 | do(𝑡)) = 𝑃(𝑥) 𝑃(𝑦 | 𝑡, 𝑥)   (4.6)

Marginalizing out 𝑥 then identifies the interventional distribution:

𝑃(𝑦 | do(𝑡)) = Σ𝑥 𝑃(𝑥) 𝑃(𝑦 | 𝑡, 𝑥)   (4.7)
If we then plug in Equation 4.7 for 𝑃(𝑦 | do(𝑇 = 1)) and 𝑃(𝑦 | do(𝑇 = 0)),
we have a fully identified ATE. Given the simple graph in Figure 4.5, we
have shown how we can use the truncated factorization to identify causal
effects in Equations 4.5 to 4.7. We will now generalize this identification
process to a more general formula.
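Before doing so, here is a small numerical sketch of Equations 4.5 to 4.7 (the probability tables below are made up for illustration) that contrasts 𝑃(𝑦 | do(𝑡)) with ordinary conditioning 𝑃(𝑦 | 𝑡):

import numpy as np

# Made-up distributions for binary X, T, Y, consistent with the factorization
# P(y, t, x) = P(x) P(t | x) P(y | t, x) for the graph in Figure 4.5
P_x = np.array([0.7, 0.3])                      # P(X = x)
P_t_given_x = np.array([[0.8, 0.2],             # P(T = t | X = x), rows indexed by x
                        [0.3, 0.7]])
P_y_given_tx = np.array([[[0.9, 0.1],           # P(Y = y | T = t, X = x), indexed [x][t][y]
                          [0.6, 0.4]],
                         [[0.7, 0.3],
                          [0.2, 0.8]]])

# Equation 4.7: P(y | do(t)) = sum_x P(x) P(y | t, x)
def p_y_do_t(y, t):
    return sum(P_x[x] * P_y_given_tx[x][t][y] for x in (0, 1))

# Ordinary conditioning: P(y | t) = sum_x P(x) P(t | x) P(y | t, x) / P(t)
def p_y_given_t(y, t):
    p_t = sum(P_x[x] * P_t_given_x[x][t] for x in (0, 1))
    return sum(P_x[x] * P_t_given_x[x][t] * P_y_given_tx[x][t][y] for x in (0, 1)) / p_t

print('P(Y=1 | do(T=1)):', p_y_do_t(1, 1))
print('P(Y=1 | T=1):    ', p_y_given_t(1, 1))   # different, due to confounding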
Definition 4.1 (Backdoor Criterion) A set of variables 𝑊 satisfies the backdoor criterion relative to 𝑇 and 𝑌 if the following are true:

1. 𝑊 blocks all backdoor paths from 𝑇 to 𝑌.
2. 𝑊 does not contain any descendants of 𝑇.

Active reading exercise: which set of nodes related to 𝑇 will always be a sufficient adjustment set? Which set of nodes related to 𝑌 will always be a sufficient adjustment set?
Given that 𝑊 satisfies the backdoor criterion, we can write the following:

Σ𝑤 𝑃(𝑦 | do(𝑡), 𝑤) 𝑃(𝑤 | do(𝑡)) = Σ𝑤 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤 | do(𝑡))   (4.12)
Here's a concise recap of the proof (Equations 4.11 to 4.13) without all of the explanation/justification:

Proof.

𝑃(𝑦 | do(𝑡)) = Σ𝑤 𝑃(𝑦 | do(𝑡), 𝑤) 𝑃(𝑤 | do(𝑡))   (4.14)
            = Σ𝑤 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤 | do(𝑡))   (4.15)
            = Σ𝑤 𝑃(𝑦 | 𝑡, 𝑤) 𝑃(𝑤)   (4.16)
We can derive this from the more general backdoor adjustment in a few steps. First, we take an expectation over 𝑌:

𝔼[𝑌 | do(𝑡)] = Σ𝑤 𝔼[𝑌 | 𝑡, 𝑤] 𝑃(𝑤)   (4.18)

Then, we notice that the sum over 𝑤 and 𝑃(𝑤) is an expectation (for discrete 𝑊, but just replace the sum with an integral if not):
(𝑌(1), 𝑌(0)) ⊥⊥ 𝑇 | 𝑊   (4.21)
4.5 Structural Causal Models (SCMs)

4.5.1 Structural Equations

As Judea Pearl often says, the equals sign in mathematics does not convey
any causal information. Saying 𝐴 = 𝐵 is the same as saying 𝐵 = 𝐴.
Equality is symmetric. However, in order to talk about causation, we
must have something asymmetric. We need to be able to write that 𝐴
is a cause of 𝐵, meaning that changing 𝐴 results in changes in 𝐵, but
changing 𝐵 does not result in changes in 𝐴. This is what we get when we
write the following structural equation:
𝐵 := 𝑓 (𝐴) , (4.22)
where 𝑓 is some function that maps 𝐴 to 𝐵. While the usual “=” symbol
does not give us causal information, this new “:=” symbol does. This
is a major difference that we see when moving from statistical models
to causal models. Now, we have the asymmetry we need to describe
causal relations. However, the mapping between 𝐴 and 𝐵 is deterministic.
Ideally, we’d like to allow it to be probabilistic, which allows room for
some unknown causes of 𝐵 that factor into this mapping. Then, we can
write the following:
𝐵 := 𝑓 (𝐴, 𝑈) , (4.23)
where 𝑈 is some unobserved random variable. We depict this in Figure 4.7, where 𝑈 is drawn inside a dashed node to indicate that it is unobserved. The unobserved 𝑈 is analogous to the randomness that we would see by sampling units (individuals); it denotes all the relevant (noisy) background conditions that determine 𝐵. More concretely, there are analogs to every part of the potential outcome 𝑌𝑖(𝑡): 𝐵 is the analog of 𝑌, 𝐴 = 𝑎 is the analog of 𝑇 = 𝑡, and 𝑈 is the analog of 𝑖.

Figure 4.7: Graph for simple structural equation. The dashed node 𝑈 means that 𝑈 is unobserved.
The functional form of 𝑓 does not need to be specified, and when
left unspecified, we are in the nonparametric regime because we aren’t
making any assumptions about parametric form. Although the mapping
is deterministic, because it takes a random variable 𝑈 (a “noise” or
“background conditions” variable) as input, it can represent any stochastic
mapping, so structural equations generalize the probabilistic factors
𝑃(𝑥 𝑖 | pa𝑖 ) that we’ve been using throughout this chapter. Therefore, all
the results that we’ve seen such as the truncated factorization and the
backdoor adjustment still hold when we introduce structural equations.
Cause and Causal Mechanism Revisited We have now come to the
more precise definitions of what a cause is (Definition 3.2) and what a
causal mechanism is (introduced in Section 4.2). A causal mechanism
that generates a variable is the structural equation that corresponds to
that variable. For example, the causal mechanism for 𝐵 is Equation 4.23.
Consider, for example, a causal model 𝑀 given by the following three structural equations; its causal graph is shown in Figure 4.8:

𝑀 :   𝐵 := 𝑓𝐵(𝐴, 𝑈𝐵)
      𝐶 := 𝑓𝐶(𝐴, 𝐵, 𝑈𝐶)   (4.24)
      𝐷 := 𝑓𝐷(𝐴, 𝐶, 𝑈𝐷)

Figure 4.8: Graph for the structural equations in Equation 4.24.

In causal graphs, the noise variables are often implicit, rather than explicitly drawn. The variables that we write structural equations for are known as endogenous variables. These are the variables whose causal mechanisms we are modeling – the variables that have parents in the causal graph. In contrast, exogenous variables are variables that do not have any parents in the causal graph; these variables are external to our causal model in the sense that we choose not to model their causes. For example, in the causal model described by Figure 4.8 and Equation 4.24, the endogenous variables are {𝐵, 𝐶, 𝐷}, and the exogenous variables are {𝐴, 𝑈𝐵, 𝑈𝐶, 𝑈𝐷}.
4.5.2 Interventions

Interventions in SCMs are remarkably simple. The intervention do(𝑇 = 𝑡) simply corresponds to replacing the structural equation for 𝑇 with 𝑇 := 𝑡. For example, consider the following causal model 𝑀 with corresponding causal graph in Figure 4.9:

𝑀 :   𝑇 := 𝑓𝑇(𝑋, 𝑈𝑇)
      𝑌 := 𝑓𝑌(𝑋, 𝑇, 𝑈𝑌)   (4.25)

Figure 4.9: Basic causal graph.
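As a minimal code sketch of this idea (the particular functions and coefficients below are made up, not from the book): an SCM is just a collection of assignment functions, and do(𝑇 = 𝑡) swaps out the structural equation for 𝑇 with the constant 𝑡.

import numpy as np

rng = np.random.default_rng(0)

# Structural equations of M (made-up example functional forms)
M = {
    'X': lambda n: rng.normal(0, 1, n),                              # exogenous
    'T': lambda X, n: (X + rng.normal(0, 1, n) > 0).astype(float),   # T := f_T(X, U_T)
    'Y': lambda X, T, n: 2.0 * T + 3.0 * X + rng.normal(0, 1, n),    # Y := f_Y(X, T, U_Y)
}

def sample(model, n):
    X = model['X'](n)
    T = model['T'](X, n)
    Y = model['Y'](X, T, n)
    return X, T, Y

def do(model, t):
    """Return M_t: the same SCM, but with the structural equation for T replaced by T := t."""
    M_t = dict(model)
    M_t['T'] = lambda X, n: np.full(n, float(t))
    return M_t

X, T, Y = sample(do(M, 1), 100_000)          # sample from the interventional SCM M_1
print('E[Y | do(T=1)] estimate:', Y.mean())  # ~ 2*1 + 3*E[X] = 2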
The 𝑢 here plays the role that we described for the 𝑈 in Equation 4.23 and the paragraph that followed it. This is the notation that Pearl uses for SCMs as well [see, e.g., 16, Definition 4]. So 𝑌𝑡(𝑢) denotes the outcome that unit 𝑢 would observe if they take treatment 𝑡, given that the SCM is 𝑀. Similarly, we define 𝑌𝑀𝑡(𝑢) as the outcome that unit 𝑢 would observe if they take treatment 𝑡, given that the SCM is 𝑀𝑡 (remember that 𝑀𝑡 is the same SCM as 𝑀 but with the structural equation for 𝑇 changed to 𝑇 := 𝑡). Now, we are ready to present one of Pearl's two key principles from which all other causal results follow:⁸

[16]: Pearl (2009), 'Causal inference in statistics: An overview'
⁸ Active reading exercise: Can you recall which was the other key principle/assumption?

Definition 4.3 (The Law of Counterfactuals (and Interventions))

𝑌𝑡(𝑢) = 𝑌𝑀𝑡(𝑢)   (4.27)

This is called "The Law of Counterfactuals" because it gives us information about counterfactuals. Given an SCM with enough details about it specified, we can actually compute counterfactuals. This is a big deal because this is exactly what the fundamental problem of causal inference (Section 2.2) told us we cannot do. We won't say more about how to do this until we get to the dedicated chapter for counterfactuals: Chapter 8.

Active reading exercise: Take what you now know about structural equations, and relate it to other parts of this chapter. For example, how do interventions in structural equations relate to the modularity assumption? How does the modularity assumption for SCMs (Assumption 4.2) relate to the modularity assumption in causal Bayesian networks (Assumption 4.1)? Does this modularity assumption for SCMs still give us the backdoor adjustment?
Figure 4.16: Causal graph depicting M-bias.

4.6 Example Applications of the Backdoor Adjustment

4.6.1 Association vs. Causation in a Toy Example

In this section, we posit a toy generative process and derive the bias of the associational quantity 𝔼[𝑌 | 𝑡]. We compare this to the causal quantity 𝔼[𝑌 | do(𝑡)], which gives us exactly what we want. Note that both of these quantities are actually functions of 𝑡. If the treatment were binary, then we would just look at the difference between the quantities with 𝑇 = 1 and with 𝑇 = 0. However, because our generative processes will be linear, the derivatives d𝔼[𝑌 | 𝑡]/d𝑡 and d𝔼[𝑌 | do(𝑡)]/d𝑡 actually give us all the information about the treatment effect, regardless of whether treatment is continuous, binary, or multi-valued. We will assume infinite data so that we can work with expectations. This means this section has nothing to do with estimation; for estimation, see the next section.
The generative process that we consider has the causal graph in Figure 4.17 and the following structural equations:

𝑇 := 𝛼1𝑋   (4.28)
𝑌 := 𝛽𝑇 + 𝛼2𝑋   (4.29)

Figure 4.17: Causal graph for this generative process, where 𝑋 confounds 𝑇 and 𝑌 (edge 𝑋 → 𝑇 with coefficient 𝛼1 and edge 𝑋 → 𝑌 with coefficient 𝛼2).

Note that in the structural equation for 𝑌, 𝛽 is the coefficient in front of 𝑇. Using the backdoor adjustment with the adjustment set {𝑋}, the causal quantity we want is 𝔼𝑋 𝔼[𝑌 | 𝑡, 𝑋], which we can compute as follows:

𝔼𝑋 𝔼[𝑌 | 𝑡, 𝑋] = 𝔼𝑋 𝔼[𝛽𝑇 + 𝛼2𝑋 | 𝑇 = 𝑡, 𝑋]   (4.30)
             = 𝔼𝑋[𝛽𝑡 + 𝛼2𝑋]   (4.31)
             = 𝛽𝑡 + 𝛼2 𝔼[𝑋]   (4.32)
Importantly, we made use of the equality that the structural equation for
𝑌 (Equation 4.29) gives us in Equation 4.30. Now, we just have to take
the derivative to get the causal effect:
d/d𝑡 𝔼𝑋 𝔼[𝑌 | 𝑡, 𝑋] = 𝛽.   (4.33)
We got exactly what we were looking for. Now, let’s move to the associa-
tional quantity:
In Equation 4.36, we made use of the equality that the structural equation
for 𝑇 (Equation 4.28) gives us. If we then take the derivative, we see that
there is confounding bias:
d/d𝑡 𝔼[𝑌 | 𝑡] = 𝛽 + 𝛼2/𝛼1.   (4.37)
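A simulation sketch to confirm these two slopes (the concrete values for 𝛼1, 𝛼2, and 𝛽 are made up, and we add a tiny independent noise term to 𝑇, which is not in Equation 4.28, so that the regression on both 𝑇 and 𝑋 is well-posed): regressing 𝑌 on 𝑇 alone recovers approximately 𝛽 + 𝛼2/𝛼1, whereas adjusting for 𝑋 recovers 𝛽.

import numpy as np

rng = np.random.default_rng(0)
n = 300_000
alpha1, alpha2, beta = 2.0, 3.0, 1.0

x = rng.normal(0, 1, n)
t = alpha1 * x + 0.01 * rng.normal(0, 1, n)   # small extra noise, see note above
y = beta * t + alpha2 * x

# Slope of E[Y | t]: regress Y on T only
slope_assoc = np.polyfit(t, y, 1)[0]
# Slope of E[Y | do(t)]: regress Y on T and X, take the coefficient on T
coef = np.linalg.lstsq(np.column_stack([t, x, np.ones(n)]), y, rcond=None)[0]

print('associational slope:', slope_assoc, '~ beta + alpha2/alpha1 =', beta + alpha2 / alpha1)
print('adjusted slope:', coef[0], '~ beta =', beta)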
4.6.2 A Complete Example with Estimation

Recall that we estimated a concrete value for the causal effect of sodium
intake on blood pressure in Section 2.5. There, we used the potential
outcomes framework. Here, we will do the same thing, but using causal
graphs. The spoiler is that the 19% error that we saw in Section 2.5 was
due to conditioning on a collider.
First, we need to write down our causal assumptions in terms of a causal graph. Remember that in Luque-Fernandez et al. [8]'s example from epidemiology, the treatment 𝑇 is sodium intake, and the outcome 𝑌 is blood pressure. The covariates are age 𝑊 and amount of protein in urine (proteinuria) 𝑍. Age is a common cause of both blood pressure and the body's ability to self-regulate sodium levels. In contrast, high amounts of urinary protein are caused by high blood pressure and high sodium intake. This means that proteinuria is a collider. We depict this causal graph in Figure 4.18.

[8]: Luque-Fernandez et al. (2018), 'Educational Note: Paradoxical collider effect in the analysis of non-communicable disease epidemiological data: a reproducible illustration and web application'

Figure 4.18: Causal graph for the blood pressure example. 𝑇 is sodium intake. 𝑌 is blood pressure. 𝑊 is age. And, importantly, the amount of protein excreted in urine 𝑍 is a collider.

Because 𝑍 is a collider, conditioning on it induces bias. Because 𝑊 and 𝑍 were grouped together as "covariates" 𝑋 in Section 2.5, we conditioned on all of them. This is why we saw that our estimate was 19% off from the true causal effect 1.05. Now that we've made the causal relationships clear with a causal graph, the backdoor criterion (Definition 4.1) tells us to only adjust for 𝑊 and to not adjust for 𝑍. More precisely, we were doing the following adjustment in Section 2.5:

𝔼𝑊,𝑍 𝔼[𝑌 | 𝑡, 𝑊, 𝑍]   (4.38)
And now, we will use the backdoor adjustment (Theorem 4.2) to change
our statistical estimand to the following:
𝔼𝑊 𝔼[𝑌 | 𝑡, 𝑊] (4.39)
We have simply removed the collider 𝑍 from the variables we adjust for.
For estimation, just as we did in Section 2.5, we use a model-assisted
estimator. We replace the outer expectation over 𝑊 with an empirical
mean over 𝑊 and replace the conditional expectation 𝔼[𝑌 | 𝑡, 𝑊] with a
machine learning model (in this case, linear regression).
Just as writing down the graph has led us to simply not condition on 𝑍
in Equation 4.39, the code for estimation also barely changes. We need to
change just a single line of code in our previous program (Listing 2.1).
We display the full program with the fixed line of code below:
Listing 4.1: ATE estimation, adjusting for age but not for the collider proteinuria.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# df is the simulated sodium/blood pressure dataset
# (see the full code linked below for how it is generated)
Xt = df[['sodium', 'age']]
y = df['blood_pressure']
model = LinearRegression()
model.fit(Xt, y)

Xt1 = pd.DataFrame.copy(Xt)
Xt1['sodium'] = 1
Xt0 = pd.DataFrame.copy(Xt)
Xt0['sodium'] = 0
ate_est = np.mean(model.predict(Xt1) - model.predict(Xt0))
print('ATE estimate:', ate_est)

Full code, complete with simulation, is available at https://github.com/bradyneal/causal-book-code/blob/master/sodium_example.py.
Specifically, the only change is that the line defining Xt in Listing 2.1, which also included the proteinuria covariate 𝑍, becomes

Xt = df[['sodium', 'age']]

in Listing 4.1. When we run this revised code, we get an ATE estimate of 1.0502, which corresponds to 0.02% error (the true value is 1.05) when using a fairly large sample.⁹

⁹ Active reading exercise: Given that 𝑌 is generated as a linear function of 𝑇 and 𝑊, could we have just used the coefficient in front of 𝑇 in the linear regression as an estimate for the causal effect?
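The DataFrame df in Listings 2.1 and 4.1 comes from the simulation in the linked code. As a rough, self-contained sketch of the same story, here is a generating process consistent with the causal graph in Figure 4.18, with the true causal effect set to 1.05; the other coefficients and noise scales are illustrative assumptions of this sketch, not the values used by Luque-Fernandez et al. [8]. Adjusting for age alone recovers the causal effect, while additionally adjusting for the collider proteinuria does not.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1_000_000
true_ate = 1.05   # true causal effect of sodium on blood pressure

# Illustrative generating process consistent with Figure 4.18:
# age (W) -> sodium (T), age (W) -> blood pressure (Y), T -> Y,
# and proteinuria (Z) is a collider caused by both T and Y.
age = rng.normal(65, 5, size=n)
sodium = 0.05 * age + rng.normal(size=n)
blood_pressure = true_ate * sodium + 2.0 * age + rng.normal(size=n)
proteinuria = 0.9 * blood_pressure + 0.8 * sodium + rng.normal(size=n)
df = pd.DataFrame({'sodium': sodium, 'age': age,
                   'blood_pressure': blood_pressure,
                   'proteinuria': proteinuria})

def ate_estimate(covariates):
    # Model-assisted estimator, as in Listing 4.1, adjusting for `covariates`.
    Xt = df[['sodium'] + covariates]
    model = LinearRegression().fit(Xt, df['blood_pressure'])
    Xt1, Xt0 = Xt.copy(), Xt.copy()
    Xt1['sodium'] = 1
    Xt0['sodium'] = 0
    return np.mean(model.predict(Xt1) - model.predict(Xt0))

print('adjusting for age only:', ate_estimate(['age']))            # ~ 1.05
print('adjusting for age and proteinuria:',
      ate_estimate(['age', 'proteinuria']))                        # biased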
Progression of Reducing Bias  When looking at the total association between 𝑇 and 𝑌 by simply regressing 𝑌 on 𝑇, we got an estimate that was a staggering 407% off of the true causal effect, due largely to confounding bias (see Section 2.5). When we adjusted for all covariates in Section 2.5, we reduced the percent error all the way down to 19%. In this section, we saw that this remaining error is due to collider bias. When we removed the collider bias, by not conditioning on the collider 𝑍, the error became non-existent.

Potential Outcomes and M-Bias  In fairness to the general culture around the potential outcomes framework, it is common to only condition on pretreatment covariates. This would prevent a practitioner who adheres to this rule from conditioning on the collider 𝑍 in Figure 4.18. However, there is no reason that there can't be pretreatment colliders that induce M-bias (Section 4.5.3). In Figure 4.19, we depict an example.

Figure 4.19: Causal graph depicting M-bias that can only be avoided by not conditioning on the collider 𝑍2. This is due to the fact that the dashed nodes 𝑍1 and 𝑍3 are unobserved.
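Here is a small simulation sketch of the graph in Figure 4.19 (all coefficient values are illustrative assumptions). Because 𝑍2 is a collider on the path 𝑇 ← 𝑍1 → 𝑍2 ← 𝑍3 → 𝑌, not adjusting for anything gives an unbiased estimate, whereas adjusting for the pretreatment collider 𝑍2 opens that path and induces M-bias; since 𝑍1 and 𝑍3 are unobserved, the opened path cannot be blocked again.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1_000_000
beta = 1.0   # true causal effect of T on Y (illustrative)

# Structure of Figure 4.19 (illustrative coefficients):
# Z1 -> T, Z1 -> Z2 <- Z3, Z3 -> Y, and T -> Y; Z1 and Z3 are unobserved.
Z1 = rng.normal(size=n)
Z3 = rng.normal(size=n)
Z2 = 1.5 * Z1 + 1.5 * Z3 + rng.normal(size=n)
T = 2.0 * Z1 + rng.normal(size=n)
Y = beta * T + 2.0 * Z3 + rng.normal(size=n)

# There is no confounding, so regressing Y on T alone is unbiased.
unadjusted = LinearRegression().fit(T.reshape(-1, 1), Y)
print('no adjustment:', unadjusted.coef_[0])           # roughly 1.0

# Conditioning on the pretreatment collider Z2 opens
# T <- Z1 -> Z2 <- Z3 -> Y and induces M-bias.
adjusted = LinearRegression().fit(np.column_stack([T, Z2]), Y)
print('adjusting for Z2:', adjusted.coef_[0])          # biased away from 1.0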
The first main set of assumptions is encoded by the causal graph that we
write down. Exactly what this causal graph means is determined by two
main assumptions, each of which can take on several different forms:
1. The Modularity Assumption
Different forms:
I Modularity Assumption for Causal Bayesian Networks (Assumption 4.1)
I Modularity Assumption for SCMs (Assumption 4.2)
I The Law of Counterfactuals (Definition 4.3)
2. The Markov Assumption
Different equivalent forms:
I Local Markov assumption (Assumption 3.1)
I Bayesian network factorization (Definition 3.1)
I Global Markov assumption (Theorem 3.1)
Positivity (Assumption 2.3) is still a very important assumption that we must make, though it is sometimes neglected in the graphical models literature.
5 Randomized Experiments

5.1 Comparability and Covariate Balance . . . 47
5.2 Exchangeability . . . 48
5.3 No Backdoor Paths . . . 49

Randomized experiments are noticeably different from observational studies. In randomized experiments, the experimenter has complete control over the treatment assignment mechanism (how treatment is assigned). For example, in the most simple kind of randomized experiment, the experimenter randomly assigns (e.g. via coin toss) each participant to either the treatment group or the control group. This complete control over how treatment is chosen is what distinguishes randomized experiments from observational studies. In this simple experimental setup, the treatment isn't a function of covariates at all! In contrast, in observational studies, the treatment is almost always a function of some covariate(s). As we will see, this difference is key to whether or not confounding is present in our data.
In randomized experiments, association is causation. This is because
randomized experiments are special in that they guarantee that there
is no confounding. As a consequence, this allows us to measure the
causal effect 𝔼[𝑌(1)] − 𝔼[𝑌(0)] via the associational difference 𝔼[𝑌 | 𝑇 =
1]− 𝔼[𝑌 | 𝑇 = 0]. In the following sections, we explain why this is the case
from a variety of different perspectives. If any one of these explanations
clicks with you, that might be good enough. Definitely stick through to
the most visually appealing explanation in Section 5.3.
5.1 Comparability and Covariate Balance

Ideally, the treatment and control groups would be the same, in all
aspects, except for treatment. This would mean they only differ in the
treatment they receive (i.e. they are comparable). This would allow us to
attribute any difference in the outcomes of the treatment and control
groups to the treatment. Saying that these treatment groups are the same
in everything other than their treatment and outcomes is the same as
saying they have the same distribution of confounders. Because people
often check for this property on observed variables (often what people
mean by “covariates”), this concept is known as covariate balance.
Formally, covariate balance means that the distribution of covariates 𝑋 is the same across the treatment groups:

𝑃(𝑋 | 𝑇 = 1) =𝑑 𝑃(𝑋 | 𝑇 = 0)   (5.1)

where =𝑑 denotes equality in distribution. Randomization implies covariate balance, across all covariates, even unobserved ones, because 𝑇 is determined by a coin flip and is therefore independent of 𝑋. This independence means 𝑃(𝑋 | 𝑇 = 1) =𝑑 𝑃(𝑋). Similarly, it means 𝑃(𝑋 | 𝑇 = 0) =𝑑 𝑃(𝑋). Therefore, we have 𝑃(𝑋 | 𝑇 = 1) =𝑑 𝑃(𝑋 | 𝑇 = 0).
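Here is a quick simulation sketch of this point (purely illustrative): when 𝑇 is assigned by a coin flip, the covariate 𝑋 has (approximately, in a finite sample) the same distribution in the two groups, whereas when 𝑇 depends on 𝑋, the groups are imbalanced.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(size=n)    # a covariate

# Randomized experiment: T is a pure coin flip, independent of X.
T_rand = rng.integers(0, 2, size=n)
print('randomized, mean of X by group:',
      X[T_rand == 1].mean(), X[T_rand == 0].mean())   # both roughly 0

# Observational assignment: T depends on X, so the groups differ on X.
T_obs = (X + rng.normal(size=n) > 0).astype(int)
print('observational, mean of X by group:',
      X[T_obs == 1].mean(), X[T_obs == 0].mean())     # clearly different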
Although we have proven that randomization implies covariate balance, we have not proven that covariate balance implies identifiability. The intuition is that covariate balance means that everything is the same between the treatment groups, except for the treatment, so the treatment must be the explanation for the change in 𝑌. We'll now prove that 𝑃(𝑦 | do(𝑇 = 𝑡)) = 𝑃(𝑦 | 𝑇 = 𝑡). For the proof, the main property we utilize is that covariate balance implies 𝑋 and 𝑇 are independent.
Proof. First, let 𝑋 be a sufficient adjustment set. This is the case with randomization since we know that randomization balances everything, not just the observed covariates. Then, we have the following from the backdoor adjustment (Theorem 4.2):

𝑃(𝑦 | do(𝑇 = 𝑡)) = Σ𝑥 𝑃(𝑦 | 𝑡, 𝑥)𝑃(𝑥)   (5.2)

By multiplying by 𝑃(𝑡 | 𝑥)/𝑃(𝑡 | 𝑥), we get the joint distribution in the numerator:

= Σ𝑥 𝑃(𝑦 | 𝑡, 𝑥)𝑃(𝑡 | 𝑥)𝑃(𝑥) / 𝑃(𝑡 | 𝑥)   (5.3)
= Σ𝑥 𝑃(𝑦, 𝑡, 𝑥) / 𝑃(𝑡 | 𝑥)   (5.4)

Because covariate balance gives us that 𝑇 and 𝑋 are independent, 𝑃(𝑡 | 𝑥) = 𝑃(𝑡), so

= Σ𝑥 𝑃(𝑦, 𝑡, 𝑥) / 𝑃(𝑡)   (5.5)
= 𝑃(𝑦, 𝑡) / 𝑃(𝑡)   (5.6)
= 𝑃(𝑦 | 𝑡)   (5.7)
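As a small numeric check of this proof, the sketch below builds a joint distribution over binary 𝑋, 𝑇, and 𝑌 in which 𝑇 is independent of 𝑋 (as covariate balance under randomization guarantees) and verifies that the backdoor adjustment formula from Equation 5.2 equals 𝑃(𝑦 | 𝑡). The particular probabilities are arbitrary illustrative choices.

import numpy as np

# Arbitrary P(x), P(t), and P(y | t, x) over binary variables, with the
# joint constructed so that T is independent of X.
P_x = np.array([0.3, 0.7])                      # P(X = x)
P_t = np.array([0.5, 0.5])                      # P(T = t)
P_y_tx = np.array([[[0.9, 0.1], [0.6, 0.4]],    # P(Y = y | T = 0, X = x)
                   [[0.2, 0.8], [0.3, 0.7]]])   # P(Y = y | T = 1, X = x)
joint = np.einsum('x,t,txy->xty', P_x, P_t, P_y_tx)   # P(x, t, y)

# Backdoor adjustment: sum_x P(y | t, x) P(x)
backdoor = np.einsum('txy,x->ty', P_y_tx, P_x)

# Plain conditional: P(y | t) = P(t, y) / P(t)
conditional = joint.sum(axis=0) / P_t[:, None]

print(np.allclose(backdoor, conditional))   # True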
5.3 No Backdoor Paths

The final perspective that we'll look at to see why association is causation in randomized experiments is that of graphical causal models. In regular observational data, there is almost always confounding. For example, in Figure 5.1 we see that 𝑋 is a confounder of the effect of 𝑇 on 𝑌. Non-causal association flows along the backdoor path 𝑇 ← 𝑋 → 𝑌.

Figure 5.1: Causal structure of 𝑋 confounding the effect of 𝑇 on 𝑌.

However, if we randomize 𝑇, something magical happens: 𝑇 no longer has any causal parents, as we depict in Figure 5.2. This is because 𝑇 is purely random. It doesn't depend on anything other than the output of a coin toss (or a quantum random number generator, if you're into that kind of stuff). Because 𝑇 has no incoming edges, under randomization, there are no backdoor paths. So the empty set is a sufficient adjustment set. This means that all of the association that flows from 𝑇 to 𝑌 is causal. We can identify 𝑃(𝑌 | do(𝑇 = 𝑡)) by simply applying the backdoor adjustment (Theorem 4.2), adjusting for the empty set:

𝑃(𝑌 | do(𝑇 = 𝑡)) = 𝑃(𝑌 | 𝑇 = 𝑡)

Figure 5.2: Causal structure when we randomize treatment.