
8. Distributions of Probabilities

Chris Piech and Mehran Sahami
May 2017

In this chapter we are going to have a very meta discussion about how we represent probabilities. Until
now probabilities have just been numbers in the range 0 to 1. However, if we have uncertainty about our
probability, it would make sense to represent the probability itself as a random variable (and thus articulate
the relative likelihood of each possible value).

1 Estimating Probabilities
Imagine we have a coin and we would like to know its probability of coming up heads (p). We flip the
coin (n + m) times and it comes up heads n times. One way to calculate the probability is to assume that it
is exactly p = n/(n + m). That number, however, is a coarse estimate, especially if n + m is small. Intuitively it
doesn’t capture our uncertainty about the value of p. Just like with other random variables, it often makes
sense to hold a distributed belief about the value of p.
To formalize the idea that we want a distribution for p we are going to use a random variable X to represent
the probability of the coin coming up heads. Before flipping the coin, we could say that our belief about the
coin’s success probability is uniform: X ∼ Uni(0, 1).
If we let N be the number of heads that came up, then since the coin flips are independent, (N | X = x) ∼
Bin(n + m, x). We want to calculate the probability density function for X|N. We can start by applying
Bayes’ Theorem:
\[
\begin{aligned}
f_{X|N}(x \mid n) &= \frac{P(N = n \mid X = x)\, f_X(x)}{P(N = n)} && \text{Bayes' Theorem} \\
&= \frac{\binom{n+m}{n} x^n (1-x)^m \cdot 1}{P(N = n)} && \text{Binomial PMF, Uniform PDF} \\
&= \frac{\binom{n+m}{n}}{P(N = n)} \, x^n (1-x)^m && \text{Moving terms around} \\
&= \frac{1}{c} \, x^n (1-x)^m && \text{where } c = \int_0^1 x^n (1-x)^m \, dx
\end{aligned}
\]
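As a side note, for non-negative integers n and m the normalizing constant c has a closed form (a standard
Beta-function identity, included here as a worked detail rather than part of the original derivation):

\[
c = \int_0^1 x^n (1-x)^m \, dx = \frac{n! \, m!}{(n+m+1)!}
\]

For example, with n = 4 and m = 2 this gives c = 4! · 2!/7! = 48/5040 = 1/105.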

2 Beta Distribution
The equation that we arrived at when using a Bayesian approach to estimating our probability defines a
probability density function, and thus a random variable. That random variable is said to follow a Beta
distribution, which is defined as follows:
The Probability Density Function (PDF) for a Beta X ∼ Beta(a, b) is:

\[
f(x) =
\begin{cases}
\dfrac{1}{B(a,b)} \, x^{a-1} (1-x)^{b-1} & \text{if } 0 < x < 1 \\
0 & \text{otherwise}
\end{cases}
\qquad \text{where } B(a, b) = \int_0^1 x^{a-1} (1-x)^{b-1} \, dx
\]

A Beta distribution has

\[
E[X] = \frac{a}{a+b} \qquad \text{and} \qquad \mathrm{Var}(X) = \frac{ab}{(a+b)^2 (a+b+1)}.
\]

All modern programming languages have a
package for calculating Beta CDFs. You will not be expected to compute the CDF by hand in CS109.
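As a quick sketch of what such a package looks like, here is the jStat JavaScript library (the same one the
assignment example below calls), assuming Node.js with jStat installed via npm install jstat:

// A minimal sketch using the jStat library (assumes: npm install jstat).
const { jStat } = require("jstat");

const a = 5, b = 3; // e.g. Beta(5, 3)

// Closed-form mean and variance, matching the formulas above
console.log(jStat.beta.mean(a, b));     // a / (a + b) = 0.625
console.log(jStat.beta.variance(a, b)); // ab / ((a+b)^2 (a+b+1)) ≈ 0.026

// The CDF has no elementary closed form; the library computes it numerically
console.log(jStat.beta.cdf(0.5, a, b)); // P(X <= 0.5) ≈ 0.2266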
To model our estimate of the probability of a coin coming up heads as a Beta, set a = n + 1 and b = m + 1. Beta
is used as a random variable to represent a belief distribution of probabilities in contexts beyond estimating
coin flips. It has many desirable properties: its support is exactly (0, 1), matching the values
that probabilities can take on, and it has the expressive capacity to capture many different forms of belief
distributions.
Let’s imagine that we had observed n = 4 heads and m = 2 tails. The probability density function for X ∼
Beta(5, 3) is shown below.

[Figure: PDF of X ∼ Beta(5, 3), peaking at x = 4/6.]
Notice how the most likely belief for the probability of our coin is when the random variable, which represents
the probability of getting heads, equals 4/6, the fraction of heads observed. The distribution also shows that we hold
a non-zero belief that the probability could be something other than 4/6. It is unlikely that the probability is
0.01 or 0.99, but reasonably likely that it could be 0.5.
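Since the plot itself is not reproduced here, a few lines of jStat (same assumed setup as above) trace out the
shape of the Beta(5, 3) density:

// Evaluate the Beta(5, 3) density on a grid to see the shape of the curve.
const { jStat } = require("jstat");

for (let i = 1; i <= 9; i++) {
  const x = i / 10;
  const density = jStat.beta.pdf(x, 5, 3);
  // Crude text plot: one '#' per 0.1 units of density
  console.log(x.toFixed(1), "#".repeat(Math.round(density * 10)));
}
// The tallest bar appears near x = 0.7, close to the mode 4/6 ≈ 0.67.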
It works out that Beta(1, 1) = Uni(0, 1). As a result, the distribution of our belief about p before (“prior”)
and after (“posterior”) observing data can both be represented using a Beta distribution. When that happens we call Beta a
“conjugate” distribution. Practically, conjugacy means the update is easy: observing new data simply changes
the parameters of the distribution.

Beta as a Prior
You can set X ∼ Beta(a, b) as a prior to reflect how biased you think the coin is a priori to flipping it. This is
a subjective judgment that represents a + b − 2 “imaginary” trials with a − 1 heads and b − 1 tails. If you then
observe n + m real trials with n heads, you can update your belief. Your new belief would be X|(n heads in n +
m trials) ∼ Beta(a + n, b + m). Using the prior Beta(1, 1) = Uni(0, 1) is the same as saying we haven’t seen
any “imaginary” trials, so a priori we know nothing about the coin. This way of thinking about probabilities is
representative of the “Bayesian” school of thought, where computer scientists explicitly represent probabilities
as distributions (with prior beliefs). That school of thought is separate from the “frequentist” school, which
tries to calculate probabilities as single numbers evaluated by the ratio of successes to experiments.
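To make the update rule concrete, here is a small sketch (the updateBelief helper and the specific counts are
illustrative, not from the notes):

// Conjugate update: prior Beta(a, b) plus (n heads, m tails) gives Beta(a + n, b + m).
const { jStat } = require("jstat");

// Hypothetical helper implementing the update rule described above
function updateBelief(prior, heads, tails) {
  return { a: prior.a + heads, b: prior.b + tails };
}

let belief = { a: 1, b: 1 };         // Beta(1, 1) = Uni(0, 1): no imaginary trials
belief = updateBelief(belief, 4, 2); // observe 4 heads and 2 tails

console.log(belief);                              // { a: 5, b: 3 }
console.log(jStat.beta.mean(belief.a, belief.b)); // posterior mean = 0.625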

Assignment Example
In class we talked about reasons why grade distributions might be well suited to be described as a Beta
distribution. Let’s say that we are given a set of student grades for a single exam and we find that it is best
fit by a Beta distribution: X ∼ Beta(a = 8.28, b = 3.16). What is the probability that a student is below the
mean (i.e. expectation)?
The answer to this question requires two steps. First calculate the mean of the distribution, then calculate the
probability that the random variable takes on a value less than the expectation.
\[
E[X] = \frac{a}{a+b} = \frac{8.28}{8.28 + 3.16} \approx 0.7238
\]

Now we need to calculate P(X < E[X]). That is exactly the CDF of X evaluated at E[X]. We don’t have
a closed-form formula for the CDF of a Beta distribution, but all modern programming languages will have a Beta CDF
function. In JavaScript we can call jStat.beta.cdf, which takes the x parameter first, followed by the alpha
and beta parameters of your Beta distribution.

\[
P(X < E[X]) = F_X(0.7238) = \texttt{jStat.beta.cdf(0.7238, 8.28, 3.16)} \approx 0.46
\]
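Putting the whole calculation together, a runnable version of the two steps above might look like this (same
jStat setup as before):

const { jStat } = require("jstat");

const a = 8.28, b = 3.16; // best-fit Beta parameters for the exam grades

// Step 1: the mean of the distribution
const mean = a / (a + b);
console.log(mean.toFixed(4)); // 0.7238

// Step 2: P(X < mean) is the CDF evaluated at the mean
console.log(jStat.beta.cdf(mean, a, b).toFixed(2)); // ≈ 0.46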
