Introduction To Probability 1
[partly based on slides by Sharon Goldwater & Frank Keller and John K. Kruschke]
Example
With respect to S1, describe the event B of rolling a total of 7
with the two dice.
B = {⟨1, 6⟩, ⟨2, 5⟩, ⟨3, 4⟩, ⟨4, 3⟩, ⟨5, 2⟩, ⟨6, 1⟩}
Events
[Figure: the 36 outcomes ⟨die 1, die 2⟩ plotted on a grid, die 1 on the
horizontal axis and die 2 on the vertical axis (values 1-6 each); the
anti-diagonal cells mark the outcomes in event B.]
Events
[Figure: Venn diagrams of two events A and B, illustrating the complement,
union, intersection, and difference of events.]
Axioms of Probability
1 The probability of an event is a nonnegative real number:
p(A) ≥ 0 for any A ⊆ S.
2 p(S) = 1.
3 If A1, A2, A3, . . . is a set of mutually exclusive events of S,
then:
p(A1 ∪ A2 ∪ A3 ∪ . . .) = p(A1) + p(A2) + p(A3) + . . .
Example
Assume all strings of three lowercase letters are equally
probable. Then what's the probability of a string of three
vowels?
There are 26 letters, of which 5 are vowels. So there are
N = 26^3 three-letter strings, and n = 5^3 consisting only of
vowels. Each outcome (string) is equally likely, with probability
1/N, so event A (a string of three vowels) has probability
p(A) = n/N = 5^3 / 26^3 ≈ 0.00711.
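This counting argument is easy to check by brute force; a minimal Python sketch (not part of the original slides):

from itertools import product
import string

# Enumerate all 26^3 three-letter strings and count the all-vowel ones.
vowels = set("aeiou")
strings = list(product(string.ascii_lowercase, repeat=3))   # the sample space
n_vowel = sum(all(c in vowels for c in s) for s in strings) # 5^3 outcomes in A

print(n_vowel / len(strings))   # ≈ 0.00711
print(5**3 / 26**3)             # same value, directly from the formula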
Rules of Probability
p(B|A) = p(A ∩ B) / p(A)
p(A ∩ B) is the joint probability of A and B, also written p(A, B).
Example
A manufacturer knows that the probability of an order being
ready on time is 0.80, and the probability of an order being
ready on time and being delivered on time is 0.72.
What is the probability of an order being delivered on time,
given that it is ready on time?
R: order is ready on time; D: order is delivered on time.
p(R) = 0.80, p(R, D) = 0.72. Therefore:
p(D|R) = p(R, D) / p(R) = 0.72 / 0.80 = 0.90
Conditional Probability
Example
Consider sampling an adjacent pair of words (bigram) from a
large text T. Let BI = the set of bigrams in T (this is our sample
space), A = "first word is run" = {⟨run, w2⟩ : w2 ∈ T} ⊆ BI, and
B = "second word is amok" = {⟨w1, amok⟩ : w1 ∈ T} ⊆ BI.
If p(A) = 10^-3.5, p(B) = 10^-5.6, and p(A, B) = 10^-6.5, what is
the probability of seeing amok following run, i.e., p(B|A)? How
about run preceding amok, i.e., p(A|B)?
p(run before amok) = p(A|B) = p(A, B) / p(B) = 10^-6.5 / 10^-5.6 = 10^-0.9 ≈ .126
p(amok after run) = p(B|A) = p(A, B) / p(A) = 10^-6.5 / 10^-3.5 = 10^-3 = .001
[How do we determine p(A), p(B), p(A, B) in the first place?]
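One common answer, not spelled out on the slide, is to estimate them as relative frequencies in the corpus. A toy Python sketch (the mini-corpus and variable names are ours):

# Estimate bigram probabilities as relative frequencies; in practice
# `tokens` would be the full tokenized text T.
tokens = "they run amok whenever they run loose".split()
bigrams = list(zip(tokens, tokens[1:]))      # the sample space BI
N = len(bigrams)

p_A  = sum(w1 == "run" for w1, _ in bigrams) / N    # first word is "run"
p_B  = sum(w2 == "amok" for _, w2 in bigrams) / N   # second word is "amok"
p_AB = bigrams.count(("run", "amok")) / N           # joint probability

print(p_AB / p_A)   # estimate of p(B|A): 0.5 for this toy corpus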
(Con)Joint Probability and the Multiplication Rule
p(A, B) = p(B) p(A|B) = p(A) p(B|A)
Marginal Probability and the Rule of Total Probability
B1, B2, . . . , Bk form a partition of S if they are pairwise
mutually exclusive and if B1 ∪ B2 ∪ . . . ∪ Bk = S.
[Figure: a sample space S divided into regions B1, . . . , B7 forming a partition.]
Then, for any event A (the rule of total probability):
p(A) = Σ_{i=1}^{k} p(Bi) p(A|Bi)
Marginalization
Example
In an experiment on human memory, participants have to
memorize a set of words (B1 ), numbers (B2 ), and pictures (B3 ).
These occur in the experiment with the probabilities
p(B1 ) = 0.5, p(B2 ) = 0.4, p(B3 ) = 0.1.
Then participants have to recall the items (where A is the recall
event). The results show that p(A|B1 ) = 0.4, p(A|B2 ) = 0.2,
p(A|B3 ) = 0.1. Compute p(A), the probability of recalling an
item.
By the theorem of total probability:
p(A) = Σ_{i=1}^{k} p(Bi) p(A|Bi)
     = p(B1) p(A|B1) + p(B2) p(A|B2) + p(B3) p(A|B3)
     = 0.5 · 0.4 + 0.4 · 0.2 + 0.1 · 0.1 = 0.29
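The same computation in Python (a direct transcription of the numbers above):

# Rule of total probability for the memory experiment.
p_B   = [0.5, 0.4, 0.1]      # p(B1), p(B2), p(B3)
p_A_B = [0.4, 0.2, 0.1]      # p(A|B1), p(A|B2), p(A|B3)

p_A = sum(pb * pa for pb, pa in zip(p_B, p_A_B))
print(p_A)   # ≈ 0.29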
Joint, Marginal & Conditional Probability
Example
Proportions for a sample of University of Delaware students,
1974; N = 592. Data adapted from Snee (1974).

                         hairColor
eyeColor       black   brunette   blond   red
blue            .03      .14       .16    .03     .36
brown           .12      .20       .01    .04     .37
hazel/green     .03      .14       .04    .05     .27
                .18      .48       .21    .12

(The rightmost column and the bottom row give the marginal
probabilities of eyeColor and hairColor, respectively.)
Joint, Marginal & Conditional Probability
Example
To obtain the conditional probability p(eyeColor | hairColor = brunette),
we do two things:

i. reduction: we consider only the probabilities in the
brunette column;

eyeColor       brunette
blue             .14
brown            .20
hazel/green      .14
                 .48

ii. normalization: we divide by the marginal p(brunette),
since all the probability mass is now concentrated there.

eyeColor       brunette
blue            .14/.48
brown           .20/.48
hazel/green     .14/.48

E.g., p(eyeColor = brown | hairColor = brunette) = .20/.48 ≈ .42.
Joint, Marginal & Conditional Probability
Example
Moreover:
p(eyeColor = brown | hairColor = brunette) ≠
p(hairColor = brunette | eyeColor = brown)
Joint, Marginal & Conditional Probability
Example
To obtain p(hairColor | eyeColor = brown), we reduce,

eyeColor       black   brunette   blond   red
brown           .12      .20       .01    .04     .37

and we normalize.

eyeColor       black     brunette   blond     red
brown         .12/.37    .20/.37   .01/.37  .04/.37

So p(hairColor = brunette | eyeColor = brown) = .20/.37 ≈ .54,
whereas p(eyeColor = brown | hairColor = brunette) = .20/.48 ≈ .42.
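Both conditioning steps (reduction, then normalization) are mechanical; a minimal Python sketch of the procedure, with the table hard-coded (the function name is ours):

# Joint table p(eyeColor, hairColor), values from Snee (1974).
joint = {
    "black":    {"blue": .03, "brown": .12, "hazel/green": .03},
    "brunette": {"blue": .14, "brown": .20, "hazel/green": .14},
    "blond":    {"blue": .16, "brown": .01, "hazel/green": .04},
    "red":      {"blue": .03, "brown": .04, "hazel/green": .05},
}

def p_eye_given_hair(hair):
    column = joint[hair]                     # reduction: keep one column
    marginal = sum(column.values())          # p(hairColor = hair)
    return {e: v / marginal for e, v in column.items()}   # normalization

print(p_eye_given_hair("brunette")["brown"])   # .20/.48 ≈ 0.417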
Example
Consider the memory example again. What is the probability
that an item that is correctly recalled (A) is a picture (B3 )?
By Bayes' theorem:
p(B3|A) = p(B3) p(A|B3) / Σ_{i=1}^{k} p(Bi) p(A|Bi)
        = (0.1 · 0.1) / 0.29 ≈ 0.0345
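In Python (numbers as above):

# Bayes' theorem with the memory-experiment numbers.
p_B   = [0.5, 0.4, 0.1]      # p(B1), p(B2), p(B3)
p_A_B = [0.4, 0.2, 0.1]      # p(A|B1), p(A|B2), p(A|B3)

p_A = sum(pb * pa for pb, pa in zip(p_B, p_A_B))   # total probability, 0.29
print(p_B[2] * p_A_B[2] / p_A)                     # p(B3|A) ≈ 0.0345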
Example
A fair coin is flipped three times. There are 8 possible
outcomes, and each of them is equally likely.
For each outcome, we can count the number of heads and the
number of switches (i.e., HT or TH subsequences):

outcome   #heads   #switches
HHH          3         0
HHT          2         1
HTH          2         2
HTT          1         1
THH          2         1
THT          1         2
TTH          1         1
TTT          0         0
Example
The joint probability p(#heads, #switches) is therefore:
                   #heads
#switches      0     1     2     3
    0         1/8    0     0    1/8    2/8
    1          0    2/8   2/8    0     4/8
    2          0    1/8   1/8    0     2/8
              1/8   3/8   3/8   1/8
Note that:
p(#switches = 1|#heads = 1) = 2/3
p(#heads = 1|#switches = 1) = 1/2
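The whole table, and both conditional probabilities, can be re-derived by enumerating the eight outcomes; a short Python sketch (not from the slides):

from itertools import product
from collections import Counter

# Build the joint distribution p(#heads, #switches) by enumeration.
counts = Counter()
for flips in product("HT", repeat=3):
    heads = flips.count("H")
    switches = sum(a != b for a, b in zip(flips, flips[1:]))
    counts[(heads, switches)] += 1

joint = {hs: c / 8 for hs, c in counts.items()}
print(joint[(1, 1)])                                         # 2/8
p_heads1 = sum(p for (h, s), p in joint.items() if h == 1)   # 3/8
print(joint[(1, 1)] / p_heads1)                              # p(#switches=1 | #heads=1) = 2/3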
Bayes Theorem
Example
The joint probability p(#switches = 1, #heads = 1) = 2/8 can
be expressed in two ways:
p(#switches = 1 | #heads = 1) · p(#heads = 1) = 2/3 · 3/8 = 2/8
p(#heads = 1 | #switches = 1) · p(#switches = 1) = 1/2 · 4/8 = 2/8
In general, p(A, B) = p(A|B) p(B) = p(B|A) p(A); dividing by p(A)
gives Bayes' theorem:
p(B|A) = p(B) p(A|B) / p(A)
Independence
Events A and B are independent iff:
p(A, B) = p(A) p(B)
Example
#heads and #switches above are not independent: e.g.,
p(#heads = 0, #switches = 0) = 1/8, whereas
p(#heads = 0) · p(#switches = 0) = 1/8 · 2/8 = 1/32.
Example
A coin is flipped three times. Each of the eight outcomes is equally
likely. A: heads occurs on each of the first two flips; B: tails occurs on
the third flip; C: exactly two tails occur in the three flips. Show that A
and B are independent, and that B and C are dependent.
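A quick way to verify the claim is to enumerate the eight outcomes; a Python sketch (the event encodings are ours):

from itertools import product

outcomes = list(product("HT", repeat=3))   # 8 equally likely outcomes

def p(event):
    return sum(map(event, outcomes)) / len(outcomes)

A = lambda o: o[0] == "H" and o[1] == "H"  # heads on the first two flips
B = lambda o: o[2] == "T"                  # tails on the third flip
C = lambda o: o.count("T") == 2            # exactly two tails

print(p(lambda o: A(o) and B(o)), p(A) * p(B))   # 0.125 = 0.125  -> independent
print(p(lambda o: B(o) and C(o)), p(B) * p(C))   # 0.25 != 0.1875 -> dependent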
Example
A simple example of two attributes that are independent: the
suit and value of cards in a standard deck. There are 4 suits
{♣, ♦, ♥, ♠} and 13 values for each suit {2, . . . , 10, J, Q, K, A},
for a total of 52 cards.
Example
We can verify independence by cross-multiplying marginal
probabilities too. For every suit s ∈ {♣, ♦, ♥, ♠} and value
v ∈ {2, . . . , 10, J, Q, K, A}:
p(suit = s, value = v) = 1/52 (in a well-shuffled deck)
p(suit = s) = 13/52 = 1/4
p(value = v) = 4/52 = 1/13
p(suit = s) · p(value = v) = 1/4 · 1/13 = 1/52
Example
In a noisy room, I whisper the same number n ∈ {1, . . . , 10} to
two people A and B on two separate occasions. A and B
imperfectly (and independently) draw a conclusion about what
number I whispered. Let the numbers A and B think they heard
be n_a and n_b, respectively.
Example
Given an experiment in which we roll a pair of 4-sided dice, let
the random variable X be the total number of points rolled with
the two dice.
E.g., X = 5 picks out the set {⟨1, 4⟩, ⟨2, 3⟩, ⟨3, 2⟩, ⟨4, 1⟩}.
Specify the full function denoted by X and determine the probabilities
associated with each value of X .
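A sketch of one way to answer this in Python, treating X as a map from outcomes to totals (the representation is ours):

from itertools import product
from collections import Counter

# The full function denoted by X, and the probabilities it induces.
outcomes = list(product(range(1, 5), repeat=2))   # 16 equally likely pairs
X = {o: sum(o) for o in outcomes}                 # X maps each outcome to a total

dist = Counter(X.values())
for total in sorted(dist):
    print(total, dist[total] / len(outcomes))
# 2 -> 1/16, 3 -> 2/16, 4 -> 3/16, 5 -> 4/16, 6 -> 3/16, 7 -> 2/16, 8 -> 1/16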
Random Variables
Example
Assume a balanced coin is flipped three times. Let X be the
random variable denoting the total number of heads obtained.
Example
For the probability function defined in the previous example:
x    f(x)
0    1/8
1    3/8
2    3/8
3    1/8
Probability Distributions
[Figure: bar chart of f(x) against x = 0, 1, 2, 3; bar heights 1/8, 3/8,
3/8, 1/8, with the vertical axis running from 0 to 0.5.]
Probability Distributions
Example
In a raffle, there are 10,000 tickets. The probability of winning is
therefore 1/10,000 for each ticket. The prize is worth $4,800.
Hence the expectation per ticket is $4,800/10,000 = $0.48.
Example
A balanced coin is flipped three times. Let X be the number of
heads. Then the probability distribution of X is:

f(x) = 1/8  for x = 0
       3/8  for x = 1
       3/8  for x = 2
       1/8  for x = 3
Example
Let X be the number of points rolled with a balanced (6-sided)
die. Find the expected value of X and of g(X) = 2X^2 + 1.
The probability distribution for X is f(x) = 1/6. Therefore:

E(X) = Σ_x x f(x) = Σ_{x=1}^{6} x · 1/6 = 21/6 = 3.5

E[g(X)] = Σ_x g(x) f(x) = Σ_{x=1}^{6} (2x^2 + 1) · 1/6 = 188/6 = 94/3
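The same expectations in exact arithmetic, as a Python check (not from the slides):

from fractions import Fraction

# E(X) and E[g(X)] for one balanced 6-sided die.
f = Fraction(1, 6)
E_X  = sum(x * f for x in range(1, 7))                # 21/6 = 7/2
E_gX = sum((2 * x**2 + 1) * f for x in range(1, 7))   # 188/6 = 94/3

print(E_X, E_gX)   # 7/2 94/3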
Summary
Sample space S contains all possible outcomes of an
experiment; events A and B are subsets of S.
rules of probability: p(Ā) = 1 − p(A).
if A ⊆ B, then p(A) ≤ p(B).
0 ≤ p(B) ≤ 1.
addition rule: p(A ∪ B) = p(A) + p(B) − p(A, B).
conditional probability: p(B|A) = p(A, B) / p(A).
independence: p(A, B) = p(A) p(B).
marginalization: p(A) = Σ_{Bi} p(Bi) p(A|Bi).
Bayes' theorem: p(B|A) = p(B) p(A|B) / p(A).
any value of an r.v. picks out a subset of the sample
space.
for any value of an r.v., a distribution returns a probability.
the expectation of an r.v. is its average value over a
distribution.
References
Snee, R. D. (1974). Graphical display of two-way contingency tables. The American Statistician, 28(1), 9-12.