2 Probability
Oct 2017
1 Introduction
It is that time in the quarter (it is still week one) when we get to talk about probability. Again we are going to
build up from first principles. We will heavily use the counting that we learned earlier this week.
An Event Space, E, is some subset of the sample space S that we ascribe meaning to. In set notation, E ⊆ S. For example, if S is the set of outcomes of a die roll, E could be the set of even rolls {2, 4, 6}.
3 Probability
In the 20th century humans figured out a way to precisely define what a probability is:
P(E) = lim_{n→∞} n(E)/n
In English this reads: let's say you perform n trials of an experiment. The probability of a desired event E
is the ratio of trials that result in E to the number of trials performed (in the limit as your number of trials
approaches infinity).
That is mathematically rigorous. You can also apply other semantics to the concept of a probability. One
common meaning ascribed is that P(E) is a measure of the chance of E occurring.
I often think of a probability in another way: I don’t know everything about the world. So it goes. As a
result I have to come up with a way of expressing my belief that E will happen given my limited knowledge.
This interpretation acknowledges that there are two sources of probabilities: natural randomness and our own
uncertainty.
The different interpretations of probability are reflected in the many origins of probabilities that you will en-
counter in the wild (and not so wild) world. Some probabilities are calculated analytically using mathematical
proofs. Some probabilities are calculated from data, experiments or simulations. Some probabilities are just
made up to represent a belief.
Most probabilities are generated from a combination of the above: someone will make up a prior belief, and
that belief will be mathematically updated using data and evidence.
4 Axioms of Probability
Here are some basic truths about probabilities that we accept as axioms:
1. 0 ≤ P(E) ≤ 1 for any event E.
2. P(S) = 1, where S is the sample space.
3. P(E) + P(E^c) = 1, where E^c (the complement of E) is the set of all outcomes not in E.
You can convince yourself of the first basic truth by thinking about the mathematical definition of probability. As you
perform trials of an experiment it is not possible to get more occurrences of the event than trials (thus probabilities are at most
1) and it is not possible to get fewer than 0 occurrences of the event (thus probabilities are at least 0).
The second basic truth makes sense too. If your event space is the sample space, then each trial must produce
the event. This is sort of like saying: the probability of you eating cake (event space) if you eat cake (sample
space) is 1.
The third basic truth of probability is possibly the most useful. For any event, every outcome is either in
the event or not in the event. A wonderful implication of this is that everything in the world (including your
friends, your family, your dinner) is either a potato or not a potato. That holds for all objects.
The rest of this chapter is going to give you tools for answering questions of the form: “What is the probability
of X”, where X is some event. You should be able to answer such questions when the sample space has equally
likely events, using data or simulation, or analytically from other known probabilities.
Because every outcome is equally likely, and the probability of the sample space must be 1, we can prove
that each outcome must have probability:
P(an outcome) = 1/|S|
Probability of Equally Likely Outcomes: If S is a sample space with equally likely outcomes, then for an
event E that is a subset of the outcomes in S:
P(E) = (number of outcomes in E) / (number of outcomes in S) = |E|/|S|
There is some art to setting up a problem to calculate a probability based on the equally likely outcome
rule. (1) The first step is to explicitly define your sample space and to argue that all outcomes in your sample
space are equally likely. (2) Next, you need to count the number of elements in the sample space, and (3)
finally, you need to count the size of the event space. The event space must contain only elements of the sample
space that you defined in step (1). The first step leaves you with a lot of choice! For example, you can decide
to make indistinguishable objects distinct, as long as your calculation of the size of the event space makes the
exact same assumptions.
Example 1
What is the probability that the sum of two dice is equal to 7?
Mistake: You could define your sample space to be all the possible sum values of two dice (2 through 12).
However this sample space fails the “equally likely” test. You are not equally likely to roll a sum of 2 as you
are to roll a sum of 7.
Solution: Consider the sample space from the previous chapter where we thought of the dice as distinct and
enumerated all of the outcomes in the sample space. The first number is the roll on die 1 and the second
number is the roll on die 2. Note that (1, 2) is distinct from (2, 1). Using the product rule of counting there
are 36 outcomes, and since all outcomes in the sample space are equally likely, each outcome has probability
1/36. The event space (the subset of the sample space where the sum of the two dice is 7) is
{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}.
P(Sum of two dice is 7) = |E|/|S| = 6/36 = 1/6
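The dice calculation above can be checked by brute-force enumeration. A minimal sketch in Python (the variable names are my own):

```python
from fractions import Fraction

# Sample space: all ordered rolls of two distinguishable dice.
sample_space = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

# Event space: the outcomes whose sum is 7.
event = [outcome for outcome in sample_space if sum(outcome) == 7]

# Equally likely outcomes, so P(E) = |E| / |S|.
p_sum_is_7 = Fraction(len(event), len(sample_space))
print(p_sum_is_7)  # 1/6
```

Because the outcomes are enumerated as ordered pairs, (1, 2) and (2, 1) are counted separately, matching the equally likely sample space used above.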
Interestingly, this idea also applies to continuous sample spaces. Consider the sample space of all the outcomes
of the computer function “random”, which produces a real-valued number between 0 and 1, where all
real-valued numbers are equally likely. Now consider the event E that the number generated is in the range
[0.3, 0.7]. Since all outcomes in the sample space are equally likely, P(E) is the ratio of the size of E to the
size of S. In this case P(E) = 0.4.
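You can sanity check this with Python's built-in random number generator, which plays the role of the “random” function described above (the seed is fixed only so that repeated runs agree):

```python
import random

random.seed(109)  # fixed seed so runs are reproducible
n_trials = 100_000

# Count how often a uniform random number lands in [0.3, 0.7].
hits = sum(1 for _ in range(n_trials) if 0.3 <= random.random() <= 0.7)
print(hits / n_trials)  # close to 0.4
```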
Probabilities from Data
Machine Learning (sometimes called Data Science) is the love child of probability, data, and computers.
Sometimes machine learning involves complex algorithms. But often it's just the core ideas of probability
applied to large datasets.
As an example let us consider Netflix, a company that has thrived because of well thought out machine
learning. One of the primary probabilities that they calculate is the probability that a user has watched a
given movie.
Let E be the event that a random user has watched a given movie, given no other information about the user.
Since the number of users in the Netflix database is large, we can approximate P(E) using the definition of
probability, which comes out to some simple counting:
P(E) ≈ (number of users who have watched the movie) / (total number of users)
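In code, this kind of counting is a one-liner over a dataset. Here is a sketch using a tiny, entirely made-up watch history (the real Netflix data would of course be far larger):

```python
# Hypothetical toy dataset: each user's set of watched movies.
# (User names and movies are made up for illustration.)
watch_history = {
    "user_a": {"The Matrix", "Up"},
    "user_b": {"Up"},
    "user_c": {"The Matrix"},
    "user_d": set(),
}

movie = "The Matrix"

# Count users who watched the movie, divided by total users.
n_watched = sum(1 for movies in watch_history.values() if movie in movies)
p_watched = n_watched / len(watch_history)
print(p_watched)  # 0.5
```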
Simulations
Another way to compute probabilities is via simulation. For some complex problems where the probabilities
are too hard to compute analytically (for example, because there are complex constraints) you can run
simulations using your computer's random number generator.
If your simulations generate random instances from the sample space, then the probability of an event E is
approximately equal to the fraction of simulations that produced an outcome from E. Again, by the definition
of probability, as your number of simulations approaches infinity, the estimate becomes more accurate.
While this is a powerful tool, sometimes a simulation would take too long to generate enough samples that
satisfy the event. As a result, many probabilistic programs start with a computer scientist calculating
probabilities using probability theory.
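As a concrete sketch of simulation, we can re-estimate the dice probability from earlier in this chapter by rolling virtual dice many times (seeded only for reproducibility):

```python
import random

random.seed(109)  # reproducible simulation
n_trials = 100_000

# Simulate rolling two dice and count how often the sum is 7.
hits = sum(1 for _ in range(n_trials)
           if random.randint(1, 6) + random.randint(1, 6) == 7)

estimate = hits / n_trials
print(estimate)  # close to the analytic answer 1/6 ≈ 0.167
```

By the definition of probability, the estimate tightens around 1/6 as n_trials grows.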
How you calculate the probability of either event A or event B happening, written P(A ∪ B), is truly analogous
to counting the size of outcome spaces. The equation that you can use to calculate the probability of the “or”
of two events depends on whether or not they are “mutually exclusive”.
Figure 1: Mutually Exclusive Events E and F
Both events E and F have event spaces that are subsets of the same sample space. Visually, we can note that
the two sets do not overlap. They are mutually exclusive: there is no outcome that is in both sets. For
mutually exclusive events:
P(E ∪ F) = P(E) + P(F)
This property applies regardless of whether or not E and F are from an equally likely sample space.
Moreover, the idea extends to more than two events. Let's say you have events X1, X2, . . . , Xn where each
event is mutually exclusive of one another. Then:
P(X1 ∪ X2 ∪ · · · ∪ Xn) = ∑_{i=1}^{n} P(Xi)
In other words, if you are told that two events are mutually exclusive (or you can argue that they are),
calculating the probability of the union (“or”) of those events is truly straightforward. In the example in
figure 1, if we consider the outcomes in the sample space to be equally likely, then P(E) = 7/50 since event
E has 7 outcomes and P(F) = 4/50 since event F has 4 outcomes. Because they are mutually exclusive, we
can calculate that:
P(E ∪ F) = P(E) + P(F) = 7/50 + 4/50 = 11/50 = 0.22
Inclusion Exclusion
Unfortunately, not all events are mutually exclusive. If you want to calculate P(E ∪ F) where the events E
and F are not mutually exclusive you need to account for your double counting using the inclusion-exclusion
principle:
P(E ∪ F) = P(E) + P(F) − P(E ∩ F)
This property applies regardless of whether or not E and F are from an equally likely sample space and
whether or not the events are mutually exclusive. The Inclusion-Exclusion principle extends to more
than two events, but becomes considerably more complex. For events E1 , E2 . . . En :
P(E1 ∪ E2 ∪ · · · ∪ En) = ∑_{r=1}^{n} (−1)^{r+1} Yr
Where Yr is the sum of the probabilities of the intersections of all subsets of events of size r. Expanded
out this becomes: add all the probabilities of each of the events on their own. Then subtract the probabilities
of the intersections (“and”) of all combinations of two events. Then add the probabilities of the intersections
of all combinations of three events, and so on for all subset sizes up to n, alternating between adding and
subtracting each time.
For three events, E, F, and G, that formula mathematically expands to:
P(E ∪ F ∪ G) = P(E) + P(F) + P(G) − P(E ∩ F) − P(E ∩ G) − P(F ∩ G) + P(E ∩ F ∩ G)
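We can verify the three-event inclusion-exclusion expansion by brute force on a small equally likely sample space (the particular sets E, F, and G below are made up for illustration):

```python
from fractions import Fraction

# A small equally likely sample space and three overlapping events.
S = set(range(1, 21))
E = {x for x in S if x % 2 == 0}   # even numbers
F = {x for x in S if x % 3 == 0}   # multiples of 3
G = {x for x in S if x <= 10}      # numbers up to 10

def p(A):
    # Equally likely outcomes: P(A) = |A| / |S|.
    return Fraction(len(A), len(S))

# Direct computation of P(E or F or G) from the union.
direct = p(E | F | G)

# The three-event inclusion-exclusion formula.
incl_excl = (p(E) + p(F) + p(G)
             - p(E & F) - p(E & G) - p(F & G)
             + p(E & F & G))

print(direct == incl_excl)  # True
```

Using Fraction keeps the arithmetic exact, so the two sides match with no floating-point slop.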
Phew! Probabilities of the “or” of events gets a ton harder when those events are not mutually exclusive!
How you calculate the probability of event E and event F happening, written P(E ∩ F) or sometimes P(EF),
depends on whether or not they are “independent”.
Independence
Two events E and F are called independent if: P(E ∩ F) = P(E)P(F). In English: two events are independent
if the probability of both events happening is equal to the probability of each of the events happening on their
own, multiplied together. Otherwise, they are called dependent events.
Independence is your friend, and is going to be particularly important as you get further into machine learning.
Knowing the “joint” probability of many events (the probability of the “and” of the events) can get hard for
complex systems. By making independence claims, computers can essentially decompose how to calculate
the joint: they can calculate the joint simply by calculating the probability of each underlying event.
Independence
If two events, E and F, are independent:
P(E ∩ F) = P(E)P(F)
This property applies regardless of whether or not E and F are from an equally likely sample space and
whether or not the events are mutually exclusive. The independence principle extends to more than two
events. For events E1 , E2 . . . En where all events are independent of one another:
P(E1 ∩ E2 ∩ · · · ∩ En) = ∏_{i=1}^{n} P(Ei)
In the same way that the mutual exclusion property makes it easier to calculate the probability of the OR of
two events, independence makes it easier to calculate the AND of two events. In the case where two events
are not independent, there are still ways to directly calculate their joint probability; we will learn how after
we cover a concept in the next chapter: Conditional Probability.
Independence is a mathematical definition, and if someone tells you two events are independent, you should
be ready to pull out the implied formula. We also want you to develop an intuitive sense of what it means.
Independence occurs when the realization of one event does not affect the probability of the other happening.
For example, the event that you roll a 4 on a fair six-sided die (F) is independent of the event that you get eaten
by a herd of angry goats (G). The probability of one happening does not affect the probability of the other, and as
such if you want to know the probability of both happening, P(FG), you can calculate the probability of each
event without considering the outcome of the other.
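To see the definition in action, we can test a pair of dice events against it by direct counting. A perhaps surprising example (the helper names are my own): "the first die shows a 4" and "the two dice sum to 7" turn out to be independent, because exactly one outcome, (4, 3), satisfies both.

```python
from fractions import Fraction

# Sample space of two distinguishable dice.
S = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

def p(event):
    # Equally likely outcomes: P = (# outcomes satisfying event) / |S|.
    return Fraction(sum(1 for o in S if event(o)), len(S))

first_is_4 = lambda o: o[0] == 4        # first die shows a 4
sum_is_7 = lambda o: o[0] + o[1] == 7   # the two dice sum to 7

# Check the definition: P(E and F) == P(E) * P(F)?
p_both = p(lambda o: first_is_4(o) and sum_is_7(o))
print(p_both == p(first_is_4) * p(sum_is_7))  # True: the events are independent
```

Here P(first is 4) = 1/6, P(sum is 7) = 1/6, and P(both) = 1/36, so the product rule holds exactly.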
Often independence assumptions are made not because they are perfectly correct, but instead because they
are terribly useful for solving probability problems. When working with probabilities that come from data,
very few things will exactly match the mathematical definition of independence. That can happen for two
reasons: first, probabilities that are calculated from data or simulation are not perfectly precise. Second, in
our complex world many things actually influence each other, even if just a tiny amount. Despite that, we
often make the wrong, but useful, independence assumption.
This often is used in conjunction with the rule that the probabilities of an event and its complement sum to 1:
P(E) + P(E^c) = 1
Finally, there are often many ways to solve a problem. Try to come up with more than one. In all cases,
sanity check your answers. All probabilities should be at least 0 and at most 1.
10 Case Studies
Serendipity
I am absolutely going to write this up. But in the meantime, check out the demo on the website :-).