MIT18 05S14 Reading3
Class 3, 18.05
1 Learning Goals
1. Know the definitions of conditional probability and independence of events.
3. Be able to use the multiplication rule to compute the total probability of an event.
6. Be able to organize the computation of conditional probabilities using trees and tables.
2 Conditional Probability
Conditional probability answers the question ‘how does the probability of an event change
if we have extra information’. We’ll illustrate with an example.
Example 1. Toss a fair coin 3 times.
(a) What is the probability of 3 heads?
(b) Suppose we are told that the first toss was heads. Given this information how should
P (A|B)
This is read as
‘the conditional probability of A given B’
or
‘the probability of A conditioned on B’
18.05 class 3, Conditional Probability, Independence and Bayes’ Theorem, Spring 2014
or simply
‘the probability of A given B’.
We can visualize conditional probability as follows. Think of P (A) as the proportion of the
area of the whole sample space taken up by A. For P (A|B) we restrict our attention to B.
That is, P (A|B) is the proportion of area of B taken up by A, i.e. P (A ∩ B)/P (B).
[Figure: events A and B drawn as overlapping regions, with the overlap labeled A ∩ B.]
Let’s redo the coin tossing example using the definition in Equation (1). Recall A = ‘3 heads’
and B = ‘first toss is heads’. We have P(A) = 1/8 and P(B) = 1/2. Since A ∩ B = A, we
also have P(A ∩ B) = 1/8. Now according to (1), P(A|B) = (1/8)/(1/2) = 1/4, which agrees with
our answer in Example 1b.
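The same calculation can be checked by brute force. The following is a minimal sketch, assuming a fair coin so that all 8 toss sequences are equally likely; the helper name `prob` is ours and is purely illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 8 equally likely sequences of 3 tosses, e.g. ('H', 'T', 'H').
omega = list(product("HT", repeat=3))

def prob(event):
    """Probability of an event (a set of outcomes) when outcomes are equally likely."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w == ("H", "H", "H")}   # 3 heads
B = {w for w in omega if w[0] == "H"}            # first toss is heads

# Definition of conditional probability: P(A and B) / P(B).
p_A_given_B = prob(A & B) / prob(B)
print(prob(A), prob(B), p_A_given_B)             # 1/8 1/2 1/4
```

Restricting attention to B shrinks the sample space from 8 outcomes to 4, and A is 1 of those 4.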
3 Multiplication Rule
Now, let’s recompute this using formula (1). We have to compute P (S1 ), P (S2 ) and
P (S1 ∩ S2 ): We know that P (S1 ) = 1/4 because there are 52 equally likely ways to draw
the first card and 13 of them are spades. The same logic says that there are 52 equally
likely ways the second card can be drawn, so P (S2 ) = 1/4.
Aside: The probability P(S2) = 1/4 may seem surprising since the value of the first card
certainly affects the probabilities for the second card. However, if we look at all possible
two card sequences we will see that every card in the deck has equal probability of being
the second card. Since 13 of the 52 cards are spades we get P(S2) = 13/52 = 1/4. Another
way to say this is: if we are not given the value of the first card then we have to consider all
possibilities for the second card.
Continuing, we see that
$$P(S_1 \cap S_2) = \frac{13 \cdot 12}{52 \cdot 51} = 3/51.$$
(This was found by counting the number of ways to draw a spade followed by a second spade
and dividing by the number of ways to draw any card followed by any other card.) Now,
using (1) we get
$$P(S_2|S_1) = \frac{P(S_2 \cap S_1)}{P(S_1)} = \frac{3/51}{1/4} = 12/51.$$
Finally, we verify the multiplication rule by computing both sides of (2).
$$P(S_1 \cap S_2) = \frac{13 \cdot 12}{52 \cdot 51} = \frac{3}{51} \quad\text{and}\quad P(S_2|S_1) \cdot P(S_1) = \frac{12}{51} \cdot \frac{1}{4} = \frac{3}{51}. \quad \text{QED}$$
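As a cross-check on the card computation, one can enumerate every ordered two-card draw. This is a sketch only: the `(rank, suit)` encoding of the deck and the helper `prob` are illustrative choices, and S1, S2 are taken to mean ‘first card is a spade’ and ‘second card is a spade’.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical deck encoding: (rank, suit), with suit 'S' standing for spades.
deck = [(rank, suit) for suit in "SHDC" for rank in range(1, 14)]

# All ordered two-card draws without replacement: 52 * 51 equally likely pairs.
draws = list(permutations(deck, 2))

def prob(pred):
    """Fraction of ordered draws satisfying the predicate."""
    return Fraction(sum(1 for d in draws if pred(d)), len(draws))

p_S1 = prob(lambda d: d[0][1] == "S")
p_S2 = prob(lambda d: d[1][1] == "S")
p_S1_and_S2 = prob(lambda d: d[0][1] == "S" and d[1][1] == "S")
p_S2_given_S1 = p_S1_and_S2 / p_S1

print(p_S1, p_S2)        # 1/4 1/4
print(p_S1_and_S2)       # 1/17, which equals 3/51
print(p_S2_given_S1)     # 4/17, which equals 12/51
```

`Fraction` reduces to lowest terms, so 3/51 prints as 1/17 and 12/51 as 4/17. The enumeration also confirms the aside: P(S2) = 1/4 when nothing is known about the first card.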
4 Law of Total Probability
The law of total probability will allow us to use the multiplication rule to find probabilities
in more interesting examples. It involves a lot of notation, but the idea is fairly simple. We
state the law when the sample space is divided into 3 pieces. It is a simple matter to extend
the rule when there are more than 3 pieces.
Law of Total Probability
Suppose the sample space Ω is divided into 3 disjoint events B1 , B2 , B3 (see the figure
below). Then for any event A:
$$P(A) = P(A \cap B_1) + P(A \cap B_2) + P(A \cap B_3)$$
$$P(A) = P(A|B_1)\,P(B_1) + P(A|B_2)\,P(B_2) + P(A|B_3)\,P(B_3) \qquad (3)$$
The top equation says ‘if A is divided into 3 pieces then P (A) is the sum of the probabilities
of the pieces’. The bottom equation (3) is called the law of total probability. It is just a
rewriting of the top equation using the multiplication rule.
[Figure: the sample space Ω split into B1, B2, B3, with A split into A ∩ B1, A ∩ B2, A ∩ B3.]
The sample space Ω and the event A are each divided into 3 disjoint pieces.
The law holds if we divide Ω into any number of events, so long as they are disjoint and
together cover all of Ω.
Our first example will be one where we already know the answer and can verify the law.
Example 3. An urn contains 5 red balls and 2 green balls. Two balls are drawn one after
the other. What is the probability that the second ball is red?
Let R1 be the event ‘the first ball is red’, G1 = ‘first ball is green’, R2 = ‘second ball is
red’ and G2 = ‘second ball is green’. We are asked to find P(R2).
The fast way to compute this is just like P (S2 ) in the card example above. Every ball is
equally likely to be the second ball. Since 5 out of 7 balls are red, P (R2 ) = 5/7.
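The claim that every ball is equally likely to be the second ball can be checked by listing all ordered draws. The sketch below treats the 7 balls as distinguishable, which is an assumption made only for counting purposes.

```python
from fractions import Fraction
from itertools import permutations

urn = ["R"] * 5 + ["G"] * 2   # 5 red, 2 green; balls are indexed 0..6

# Every ordered pair of distinct balls is an equally likely (first, second) draw.
draws = list(permutations(range(len(urn)), 2))

second_red = [d for d in draws if urn[d[1]] == "R"]
p_R2 = Fraction(len(second_red), len(draws))
print(len(draws), len(second_red), p_R2)   # 42 30 5/7
```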
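Let’s compute this same value using the law of total probability (3). First, we’ll find the
conditional probabilities for the second draw given the first. If the first ball is red, 4 of
the 6 remaining balls are red; if it is green, 5 of the 6 remaining balls are red. So
P(R2|R1) = 4/6 and P(R2|G1) = 5/6. Since R1 and G1 partition Ω, with P(R1) = 5/7 and
P(G1) = 2/7, the law of total probability gives
$$P(R_2) = P(R_2|R_1)P(R_1) + P(R_2|G_1)P(G_1) = \frac{4}{6}\cdot\frac{5}{7} + \frac{5}{6}\cdot\frac{2}{7} = \frac{30}{42} = \frac{5}{7}. \qquad (4)$$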
Probability urns
The example above used probability urns. Their use goes back to the beginning of the
subject and we would be remiss not to introduce them. This toy model is very useful. We
quote from Wikipedia: http://en.wikipedia.org/wiki/Urn_problem
It doesn’t take much to make an example where (3) is really the best way to compute the
probability. Here is a game with slightly more complicated rules.
Example 4. An urn contains 5 red balls and 2 green balls. A ball is drawn. If it’s green
a red ball is added to the urn and if it’s red a green ball is added to the urn. (The original
ball is not returned to the urn.) Then a second ball is drawn. What is the probability the
second ball is red?
answer: The law of total probability says that P (R2 ) can be computed using the expression
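in Equation (4). Only the values for the probabilities will change. We have
P(R2|R1) = 4/7 (the urn then holds 4 red and 3 green balls) and P(R2|G1) = 6/7 (the urn
then holds 6 red and 1 green ball).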
Therefore,
$$P(R_2) = P(R_2|R_1)P(R_1) + P(R_2|G_1)P(G_1) = \frac{4}{7}\cdot\frac{5}{7} + \frac{6}{7}\cdot\frac{2}{7} = \frac{32}{49}.$$
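The arithmetic for Example 4 can be reproduced by deriving each conditional probability from the contents of the urn after the first draw. This is a sketch under the rules stated in the example; the function name `p_red` is hypothetical.

```python
from fractions import Fraction

def p_red(red, green):
    """Probability of drawing red from an urn with the given counts."""
    return Fraction(red, red + green)

red, green = 5, 2

# First draw.
p_R1 = p_red(red, green)          # 5/7
p_G1 = 1 - p_R1                   # 2/7

# Urn after the first draw: the drawn ball leaves, a ball of the other color enters.
p_R2_given_R1 = p_red(red - 1, green + 1)   # urn is 4 red, 3 green -> 4/7
p_R2_given_G1 = p_red(red + 1, green - 1)   # urn is 6 red, 1 green -> 6/7

# Law of total probability over the partition {R1, G1}.
p_R2 = p_R2_given_R1 * p_R1 + p_R2_given_G1 * p_G1
print(p_R2)   # 32/49
```

Changing the replacement rule only changes the two `p_red` calls for the second draw; the final sum keeps the same form.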
5 Using Trees to Organize the Computation
Trees are a great way to organize computations with conditional probability and the law of
total probability. The figures and examples will make clear what we mean by a tree. As
with the rule of product, the key is to organize the underlying process into a sequence of
actions.
We start by redoing Example 4. The sequence of actions is: first draw ball 1 (and add the
appropriate ball to the urn) and then draw ball 2.
             •
     5/7 /        \ 2/7
      R1            G1
 4/7 /  \ 3/7  6/7 /  \ 1/7
   R2    G2      R2    G2
You interpret this tree as follows. Each dot is called a node. The tree is organized by levels.
The top node (root node) is at level 0. The next layer down is level 1 and so on. Each level
shows the outcomes at one stage of the game. Level 1 shows the possible outcomes of the
first draw. Level 2 shows the possible outcomes of the second draw starting from each node
in level 1.
Probabilities are written along the branches. The probability of R1 (red on the first draw)
is 5/7. It is written along the branch from the root node to the one labeled R1 . At the
next level we put in conditional probabilities. The probability along the branch from R1 to
R2 is P (R2 |R1 ) = 4/7. It represents the probability of going to node R2 given that you are
already at R1 .
The multiplication rule says that the probability of getting to any node is just the product of
the probabilities along the path to get there. For example, the node labeled R2 at the far left
really represents the event R1 ∩ R2 because it comes from the R1 node. The multiplication
rule now says
$$P(R_1 \cap R_2) = P(R_1) \cdot P(R_2|R_1) = \frac{5}{7} \cdot \frac{4}{7} = \frac{20}{49}.$$
The tree given above involves some shorthand. For example, the node marked R2 at the
far left really represents the event R1 ∩ R2 , since it ends the path from the root through
R1 to R2 . Here is the same tree with everything labeled precisely. As you can see this tree
is more cumbersome to make and use. We usually use the shorthand version of trees. You
should make sure you know how to interpret them precisely.
[Figure: the same tree with its four leaves labeled R1 ∩ R2, R1 ∩ G2, G1 ∩ R2, G1 ∩ G2.]
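One way to make the precise reading of a tree concrete is to store it as a data structure and multiply along each path. The nested-dictionary layout and the function name below are illustrative choices, not part of the course material; the branch probabilities are those of Example 4.

```python
from fractions import Fraction as F

# Each node maps a branch label to (branch probability, subtree).
# An empty dict marks a leaf.
tree = {
    "R1": (F(5, 7), {"R2": (F(4, 7), {}), "G2": (F(3, 7), {})}),
    "G1": (F(2, 7), {"R2": (F(6, 7), {}), "G2": (F(1, 7), {})}),
}

def leaf_probabilities(node, path=(), weight=F(1)):
    """Yield (path, probability) for every leaf, multiplying along the path."""
    for label, (p, subtree) in node.items():
        if subtree:
            yield from leaf_probabilities(subtree, path + (label,), weight * p)
        else:
            yield path + (label,), weight * p

leaves = dict(leaf_probabilities(tree))
for path, p in leaves.items():
    print(" then ".join(path), p)

# Summing the leaves whose last step is R2 recovers P(R2).
p_R2 = sum(p for path, p in leaves.items() if path[-1] == "R2")
print(p_R2)   # 32/49
```

Each leaf corresponds to an intersection such as R1 ∩ R2, and the four leaf probabilities sum to 1 because the leaves partition the sample space.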
6 Independence
Two events are independent if knowledge that one occurred does not change the probability
that the other occurred. Informally, events are independent if they do not influence one
another.
Example 5. Toss a coin twice. We expect the outcomes of the two tosses to be independent
of one another. In real experiments this always has to be checked. If my coin lands in honey
and I don’t bother to clean it, then the second toss might be affected by the outcome of the
first toss.
More seriously, the independence of experiments can be undermined by the failure to clean or
recalibrate equipment between experiments or to isolate supposedly independent observers
from each other or a common influence. We’ve all experienced hearing the same ‘fact’ from
different people. Hearing it from different sources tends to lend it credence until we learn
that they all heard it from a common source. That is, our sources were not independent.
Translating the verbal description of independence into symbols gives
$$A \text{ is independent of } B \text{ if } P(A|B) = P(A). \qquad (5)$$
That is, knowing that B occurred does not change the probability that A occurred. In
terms of events as subsets, knowing that the realized outcome is in B does not change the
probability that it is in A.
If A and B are independent in the above sense, then the multiplication rule gives P (A ∩
B) = P (A|B) · P (B) = P (A) · P (B). This justifies the following technical definition of
independence.
Formal definition of independence: Two events A and B are independent if
$$P(A \cap B) = P(A) \cdot P(B). \qquad (6)$$
This is a nice symmetric definition which makes clear that A is independent of B if and only
if B is independent of A. Unlike the equation with conditional probabilities, this definition
makes sense even when P (B) = 0. In terms of conditional probabilities, we have:
1. If P(B) ≠ 0 then A and B are independent if and only if P(A|B) = P(A).
2. If P(A) ≠ 0 then A and B are independent if and only if P(B|A) = P(B).
Independent events commonly arise as different trials in an experiment, as in the following
example.
Example 6. Toss a fair coin twice. Let H1 = ‘heads on first toss’ and let H2 = ‘heads on
second toss’. Are H1 and H2 independent?
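The product test P(A ∩ B) = P(A) · P(B) is easy to check by enumeration. The sketch below assumes the four outcomes of two fair tosses are equally likely; the contrasting event `both` is added here only for illustration.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))   # HH, HT, TH, TT, equally likely

def prob(event):
    """Probability of an event (a set of outcomes) when outcomes are equally likely."""
    return Fraction(len(event), len(omega))

H1 = {w for w in omega if w[0] == "H"}   # heads on first toss
H2 = {w for w in omega if w[1] == "H"}   # heads on second toss

independent = prob(H1 & H2) == prob(H1) * prob(H2)
print(prob(H1 & H2), prob(H1) * prob(H2), independent)   # 1/4 1/4 True

# For contrast, H1 and the event 'both tosses are heads' fail the product test.
both = {("H", "H")}
print(prob(H1 & both) == prob(H1) * prob(both))          # False
```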
An event A with probability 0 is independent of itself, since in this case both sides of
equation (6) are 0. This appears paradoxical because knowledge that A occurred certainly
gives information about whether A occurred. We resolve the paradox by noting that since
P (A) = 0 the statement ‘A occurred’ is vacuous.
Think: For what other value(s) of P (A) is A independent of itself?
7 Bayes’ Theorem
Bayes’ theorem is a pillar of both probability and statistics and it is central to the rest of
this course. For two events A and B Bayes’ theorem (also called Bayes’ rule and Bayes’
formula) says
$$P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}. \qquad (7)$$
Comments: 1. Bayes’ rule tells us how to ‘invert’ conditional probabilities, i.e. to find
P (B|A) from P (A|B).
2. In practice, P (A) is often computed using the law of total probability.
Proof of Bayes’ rule
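The key point is that A ∩ B is symmetric in A and B. So the multiplication rule says
$$P(B|A) \cdot P(A) = P(A \cap B) = P(A|B) \cdot P(B).$$
Now divide through by P(A) to get Bayes’ rule.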
A common mistake is to confuse P (A|B) and P (B|A). They can be very different. This is
illustrated in the next example.
Example 9. Toss a coin 5 times. Let H1 = ‘first toss is heads’ and let HA = ‘all 5 tosses
are heads’. Then P (H1 |HA ) = 1 but P (HA |H1 ) = 1/16.
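Both conditional probabilities in Example 9 can be computed directly from the 32 toss sequences, and Bayes’ theorem can then be checked numerically. This sketch assumes a fair coin; the helper names `prob` and `cond` are ours.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=5))   # 32 equally likely sequences

def prob(event):
    """Probability of an event (a set of outcomes) when outcomes are equally likely."""
    return Fraction(len(event), len(omega))

def cond(a, b):
    """P(a | b) from the definition P(a and b) / P(b)."""
    return prob(a & b) / prob(b)

H1 = {w for w in omega if w[0] == "H"}                 # first toss is heads
HA = {w for w in omega if all(t == "H" for t in w)}    # all 5 tosses are heads

print(cond(H1, HA), cond(HA, H1))   # 1 1/16

# Bayes' theorem: invert P(HA | H1) to recover P(H1 | HA).
bayes = cond(HA, H1) * prob(H1) / prob(HA)
print(bayes)                        # 1
```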
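For practice, let’s use Bayes’ theorem to compute P(H1|HA) using P(HA|H1). The terms
are P(HA|H1) = 1/16, P(H1) = 1/2, P(HA) = 1/32. So,
$$P(H_1|H_A) = \frac{P(H_A|H_1) \cdot P(H_1)}{P(H_A)} = \frac{(1/16) \cdot (1/2)}{1/32} = 1,$$
which agrees with the value stated above.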
The base rate fallacy is one of many examples showing that it’s easy to confuse the meaning
of P (B|A) and P (A|B) when a situation is described in words. This is one of the key
examples from probability and it will inform much of our practice and interpretation of
statistics. You should strive to understand it thoroughly.
Example 10. The Base Rate Fallacy
Consider a routine screening test for a disease. Suppose the frequency of the disease in the
population (base rate) is 0.5%. The test is highly accurate with a 5% false positive rate
and a 10% false negative rate.
You take the test and it comes back positive. What is the probability that you have the
disease?
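answer: We will do the computation three times: using trees, tables and symbols. We’ll
use the following notation for the relevant events: D+ = ‘you have the disease’,
D− = ‘you do not have the disease’, T+ = ‘you tested positive’, T− = ‘you tested negative’.
We are given P(D+) = .005 and therefore P(D−) = .995. The false positive and false
negative rates are conditional probabilities: P(T+|D−) = .05 and P(T−|D+) = .1.
The complementary probabilities are known as the true negative and true positive rates:
P(T−|D−) = 1 − P(T+|D−) = .95 and P(T+|D+) = 1 − P(T−|D+) = .9.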
               •
     .995 /        \ .005
        D−            D+
 .05 /   \ .95    .9 /   \ .1
   T+     T−       T+     T−
The question asks for the probability that you have the disease given that you tested positive,
i.e. what is the value of P (D+|T +). We aren’t given this value, but we do know P (T +|D+),
so we can use Bayes’ theorem.
$$P(D+|T+) = \frac{P(T+|D+) \cdot P(D+)}{P(T+)}.$$
The two probabilities in the numerator are given. We compute the denominator P (T +)
using the law of total probability. Using the tree we just have to sum the probabilities for
each of the nodes marked T+:
$$P(T+) = .995 \times .05 + .005 \times .9 = .05425.$$
Thus,
$$P(D+|T+) = \frac{.9 \times .005}{.05425} = 0.082949 \approx 8.3\%.$$
Remarks: This is called the base rate fallacy because the base rate of the disease in the
population is so low that the vast majority of the people taking the test are healthy, and
even with an accurate test most of the positives will be healthy people. Ask your doctor
for his/her guess at the odds.
To summarize the base rate fallacy with specific numbers:
‘95% of all tests are accurate’ does not imply ‘95% of positive tests are accurate’.
We will refer back to this example frequently. It and similar examples are at the heart of
many statistical misunderstandings.
$$P(D+|T+) = \frac{|D+ \cap T+|}{|T+|} = \frac{45}{543} = 8.3\%.$$
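The counts in this ratio can be reproduced by applying the stated rates to a cohort. The sketch below assumes a hypothetical population of 10,000 people, a size chosen because it is consistent with the 45 and 543 above; the dictionary layout is ours.

```python
from fractions import Fraction

population = 10_000                    # hypothetical cohort size (an assumption)
base_rate = Fraction(5, 1000)          # P(D+) = 0.5%
false_pos = Fraction(5, 100)           # P(T+ | D-)
false_neg = Fraction(10, 100)          # P(T- | D+)

sick = population * base_rate          # 50
healthy = population - sick            # 9950

# Expected number of people in each (disease status, test result) cell.
counts = {
    ("D+", "T+"): sick * (1 - false_neg),     # 45 true positives
    ("D+", "T-"): sick * false_neg,           # 5 false negatives
    ("D-", "T+"): healthy * false_pos,        # 497.5 expected false positives
    ("D-", "T-"): healthy * (1 - false_pos),  # 9452.5 expected true negatives
}

positives = counts[("D+", "T+")] + counts[("D-", "T+")]
posterior = counts[("D+", "T+")] / positives
print(float(positives), float(posterior))
```

The expected number of positives is 542.5; rounding the 497.5 healthy positives up to 498 whole people gives the 543 used above. Either way the ratio is about 8.3%.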
Symbols: For completeness, we show how the solution looks when written out directly in
symbols.
$$\begin{aligned}
P(D+|T+) &= \frac{P(T+|D+) \cdot P(D+)}{P(T+)} \\
&= \frac{P(T+|D+) \cdot P(D+)}{P(T+|D+) \cdot P(D+) + P(T+|D-) \cdot P(D-)} \\
&= \frac{.9 \times .005}{.9 \times .005 + .05 \times .995} \\
&= 8.3\%
\end{aligned}$$
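The symbolic solution translates directly into a small function, which makes it easy to see how strongly the answer depends on the base rate. This is a sketch; the function name and the 30% base rate in the second call are hypothetical and chosen only for contrast.

```python
def posterior_positive(base_rate, false_pos, false_neg):
    """P(D+ | T+) via Bayes' theorem and the law of total probability."""
    true_pos = 1.0 - false_neg                                          # P(T+ | D+)
    p_positive = true_pos * base_rate + false_pos * (1.0 - base_rate)   # P(T+)
    return true_pos * base_rate / p_positive

# Values from Example 10: base rate 0.5%, false positive 5%, false negative 10%.
print(round(posterior_positive(0.005, 0.05, 0.10), 6))   # 0.082949

# Hypothetical contrast: the same test where the disease is common.
print(round(posterior_positive(0.30, 0.05, 0.10), 3))    # 0.885
```

With the same test accuracy, raising the base rate from 0.5% to 30% moves the probability of disease given a positive test from about 8% to about 89%.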
Visualization: The figure below illustrates the base rate fallacy. The large blue area
represents all the healthy people. The much smaller red area represents the sick people.
The shaded rectangle represents the people who test positive. The shaded area covers
most of the red area and only a small part of the blue area. Even so, most of the shaded
area is over the blue. That is, most of the positive tests are of healthy people.
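[Figure: the healthy population D− (large blue area) and the sick population D+ (small red area), with a shaded rectangle marking the positive tests.]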
As we said at the start of this section, Bayes’ rule is a pillar of probability and statistics.
We have seen that Bayes’ rule allows us to ‘invert’ conditional probabilities. When we learn
statistics we will see that the art of statistical inference involves deciding how to proceed
when one (or more) of the terms on the right side of Bayes’ rule is unknown.
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.