Lecture 2
In some sense, the entire subject of probability and statistics is about distributions of random variables.
Random variables, as the very name suggests, are quantities that vary over time or from individual to
individual, and the reason for the variability is some underlying random process. Depending on exactly
how an underlying experiment ends, the random variable takes different values. In other words, the
value of the random variable is determined by the sample point that prevails, when the underlying
experiment is actually conducted. We cannot know a priori the value of the random variable, because
we do not know a priori which sample point will prevail when the experiment is conducted. We try to
understand the behaviour of a random variable by analysing the probability structure of that underlying
random experiment. Random variables, like probabilities, originated in gambling. Therefore, the random
variables that come to us most naturally are integer-valued random variables; for example, the sum of
the two rolls when a die is rolled twice. Integer-valued random variables are special cases of what are
known as discrete random variables. Discrete or not, a common mathematical definition of all random
variables is the following.
Definition: Let Ω be a sample space corresponding to some experiment and let X: Ω → ℝ be a function
from the sample space to the real line. Then X is called a random variable.
Discrete random variables are those that take a finite or a countably infinite number of possible values.
In particular, all integer-valued random variables are discrete. From the point of view of understanding
the behaviour of a random variable, the important thing is to know the probabilities with which X takes
its different possible values.
Definition: Let X: Ω → ℝ be a discrete random variable taking a finite or countably infinite number of
values x1, x2, x3, … The probability distribution or the probability mass function (PMF) of X is the function
p(x) = P(X = x), x = x1, x2, x3, … and p(x) = 0, otherwise. Note that it is common to not explicitly mention
the phrase “p(x) = 0, otherwise.”
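For a concrete case, the PMF of a small discrete random variable can be tabulated by brute force over the sample space; a minimal Python sketch for the number of heads in two tosses of a fair coin:

```python
from itertools import product
from collections import Counter

# Sample space for two tosses of a fair coin; each of the 4 sample
# points is equally likely.
sample_space = list(product("HT", repeat=2))

# X = number of heads; p(x) counts how many sample points map to each
# value x, divided by the total number of sample points.
counts = Counter(outcome.count("H") for outcome in sample_space)
pmf = {x: c / len(sample_space) for x, c in counts.items()}

print(pmf)                # {0: 0.25, 1: 0.5, 2: 0.25} (in some order)
print(sum(pmf.values()))  # 1.0 -- any PMF must sum to 1
```

Summing a PMF to 1 is a useful sanity check whenever a distribution is built by hand.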
Independence
Definition: Let X1, X2, …, Xk be k ≥ 2 discrete random variables defined on the same sample space Ω. We
say that X1, X2, …, Xk are independent if P(X1 = x1, X2 = x2, …, Xk = xk) = P(X1 = x1) P (X2 = x2) … P(Xk = xk) for all
x1, x2, …, xk.
Example: Consider the experiment of tossing a fair coin (or any coin) four times. Suppose X1 is the
number of heads in the first two tosses, and X2 is the number of heads in the last two tosses. Then, it is
intuitively clear that X1 and X2 are independent, because the last two tosses carry no information
regarding the first two tosses. The independence can be easily mathematically verified by using the
definition of independence.
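This verification can be carried out by enumerating all 16 equally likely outcomes; a Python sketch:

```python
from itertools import product
from fractions import Fraction

# All 16 equally likely outcomes of four tosses of a fair coin.
outcomes = list(product("HT", repeat=4))
n = len(outcomes)

def prob(event):
    """Probability of an event (a predicate on outcomes) as an exact fraction."""
    return Fraction(sum(1 for o in outcomes if event(o)), n)

# Verify P(X1 = a, X2 = b) = P(X1 = a) P(X2 = b) for all pairs (a, b),
# where X1 counts heads in tosses 1-2 and X2 counts heads in tosses 3-4.
for a in range(3):
    for b in range(3):
        joint = prob(lambda o: o[:2].count("H") == a and o[2:].count("H") == b)
        marg = prob(lambda o: o[:2].count("H") == a) * prob(lambda o: o[2:].count("H") == b)
        assert joint == marg
print("X1 and X2 are independent")
```

Using exact fractions avoids any floating-point doubt about whether the two sides are truly equal.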
Example: Consider the experiment of drawing 13 cards at random from a deck of 52 cards. Suppose X1 is
the number of aces and X2 is the number of clubs among the 13 cards. Then, X1 and X2 are not
independent. For example, P(X1 = 4, X2 = 0) = 0, but P(X1 = 4) and P(X2 = 0) are both > 0, and so
P(X1 = 4) P(X2 = 0) > 0. So, X1, X2 cannot be independent.
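The two marginal probabilities in this argument can be computed by counting hands; a sketch in Python using `math.comb`:

```python
from math import comb

total = comb(52, 13)  # number of equally likely 13-card hands

# P(X1 = 4): all four aces plus 9 of the remaining 48 cards.
p_four_aces = comb(4, 4) * comb(48, 9) / total

# P(X2 = 0): all 13 cards drawn from the 39 non-clubs.
p_no_clubs = comb(39, 13) / total

print(p_four_aces)  # ~0.0026, strictly positive
print(p_no_clubs)   # ~0.0128, strictly positive
# The joint event is impossible (the ace of clubs is both an ace and a
# club), so P(X1 = 4, X2 = 0) = 0 < p_four_aces * p_no_clubs.
```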
Expected Value
By definition, a random variable takes different values on different occasions. It is natural to want to
know what value it takes on average. Averaging is a very primitive concept. A simple average of just the
possible values of the random variable will be misleading because some values may have so little
probability that they are relatively inconsequential. The average or mean value, also called the
expected value, of a random variable X is a weighted average of its different possible values, each
weighted by its probability. Here is the definition.
Definition: Let X be a discrete random variable taking the values x1, x2, x3, … If the sum
∑i |xi| P(X = xi) is finite, we say that the expected value of X exists, and it is defined as

μ = E(X) = ∑i xi P(X = xi).
Example: Let X be the number of heads obtained in two tosses of a fair coin. Here X takes the values
0, 1, 2 with probabilities 1/4, 1/2, 1/4, so E(X) = 0 × 1/4 + 1 × 1/2 + 2 × 1/4 = 1. This agrees with
intuition: because the coin is fair, we expect it to show heads 50% of the number of times it is tossed,
which is 50% of 2, that is, 1.
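The same weighted average can be computed in one line of Python from the PMF of this X:

```python
# PMF of X = number of heads in two tosses of a fair coin.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

# Expected value: probability-weighted sum of the possible values.
mu = sum(x * p for x, p in pmf.items())
print(mu)  # 1.0
```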
Example: Let X be the sum of the two rolls when a fair die is rolled twice. Each roll has expected value
(1 + 2 + ⋯ + 6)/6 = 3.5, so by linearity of expectation, E(X) = 3.5 + 3.5 = 7.
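A brute-force check over the 36 equally likely outcomes of the two rolls:

```python
from itertools import product

# Enumerate the 36 equally likely outcomes of two rolls of a fair die.
rolls = list(product(range(1, 7), repeat=2))

# E(X) is the plain average of the sums, since all outcomes are equally likely.
mu = sum(a + b for a, b in rolls) / len(rolls)
print(mu)  # 7.0
```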
Definition: Let a random variable X have a finite mean μ. The variance of X is defined as
σ² = E[(X − μ)²] and the standard deviation of X is defined as σ = √σ².
Example: Consider the experiment of two tosses of a fair coin and let X be the number of heads
obtained. With μ = 1, σ² = E[(X − 1)²] = (0 − 1)² × 1/4 + (1 − 1)² × 1/2 + (2 − 1)² × 1/4 = 1/2, and
σ = √(1/2) ≈ 0.71.
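The variance of the same X can be computed directly from its PMF; a short Python sketch:

```python
# PMF of X = number of heads in two tosses of a fair coin.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}
mu = sum(x * p for x, p in pmf.items())

# Variance: probability-weighted average of squared deviations from the mean.
var = sum((x - mu) ** 2 * p for x, p in pmf.items())
sd = var ** 0.5
print(var)  # 0.5
print(sd)   # ~0.707
```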
Bayesian Networks
A Bayesian network is a data structure that represents the dependencies among random variables.
Bayesian networks have the following properties:
- They are directed graphs.
- Each node on the graph represents a random variable.
- An arrow from X to Y means that X is a parent of Y; that is, the probability distribution of Y depends on the value of X.
- Each node X has a probability distribution P(X | Parents(X)).
Example: Consider an example of a Bayesian network that involves variables that affect whether we get
to our appointment on time.
Let’s describe this Bayesian network from the top down:
Rain (R) is the root node in this network. This means that its probability distribution is not reliant on any
prior event. In our example, Rain is a random variable that can take the values {none, light, heavy} with
the following probability distribution:
none   light  heavy
0.7    0.2    0.1
Maintenance (M), in our example, encodes whether there is train track maintenance, taking the values
{yes, no}. Rain is a parent node of Maintenance, which means that the probability distribution of
Maintenance is affected by Rain.
R      M = yes  M = no
none   0.4      0.6
light  0.2      0.8
heavy  0.1      0.9
Train (T) is the variable that encodes whether the train is on time or delayed, taking the values {on time,
delayed}. Note that Train has arrows pointing to it from both Maintenance and Rain. This means that
both are parents of Train, and their values affect the probability distribution of Train.
R      M    on time  delayed
none   yes  0.8      0.2
none   no   0.9      0.1
light  yes  0.6      0.4
light  no   0.7      0.3
heavy  yes  0.4      0.6
heavy  no   0.5      0.5
Appointment (A) is a random variable that represents whether we attend our appointment, taking the
values {attend, miss}. Note that its only parent is Train. This point about Bayesian networks is
noteworthy: parents include only direct relations. It is true that maintenance affects whether the train is
on time, and whether the train is on time affects whether we attend the appointment. However, in the
end, what directly affects our chances of attending the appointment is whether the train came on time,
and this is what is represented in the Bayesian network. For example, if the train came on time, there
could have been heavy rain and track maintenance, but that has no effect on whether we made it to our
appointment.
T        attend  miss
on time  0.9     0.1
delayed  0.6     0.4
Example: If, while working for the company, you are expected to attend 100 meetings, calculate the
expected number of meetings that will be missed.
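A minimal sketch of this calculation in Python; the Rain prior {none: 0.7, light: 0.2, heavy: 0.1} and the (heavy, yes) Train row (0.4, 0.6) are taken here as working assumptions. It marginalizes out Rain, Maintenance, and Train to get P(A = miss), then applies linearity of expectation:

```python
# CPTs of the network. The Rain prior and the (heavy, yes) Train row
# are working assumptions for this sketch.
p_rain = {"none": 0.7, "light": 0.2, "heavy": 0.1}
p_maint = {"none": 0.4, "light": 0.2, "heavy": 0.1}        # P(M = yes | R)
p_on_time = {("none", "yes"): 0.8, ("none", "no"): 0.9,
             ("light", "yes"): 0.6, ("light", "no"): 0.7,
             ("heavy", "yes"): 0.4, ("heavy", "no"): 0.5}  # P(T = on time | R, M)
p_miss = {"on time": 0.1, "delayed": 0.4}                  # P(A = miss | T)

# Marginalize over R and M to get P(T = on time), then over T for P(A = miss).
p_t = sum(p_rain[r] * (m_yes * p_on_time[(r, "yes")]
                       + (1 - m_yes) * p_on_time[(r, "no")])
          for r, m_yes in p_maint.items())
p_a_miss = p_t * p_miss["on time"] + (1 - p_t) * p_miss["delayed"]

# Expected number of missed meetings out of 100, by linearity of expectation.
print(round(100 * p_a_miss, 2))  # 16.39 under these assumptions
```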
Bayesian Inference
Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the
probability for a hypothesis as more evidence or information becomes available.
Example: We want to compute the probability distribution of the Appointment variable given the
evidence that there is light rain and no track maintenance. That is, we know that there is light rain and
no track maintenance, and we want to figure out what are the probabilities that we attend the
appointment and that we miss the appointment.
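Notice that this particular query does not need the Rain distribution at all: once we condition on R = light and M = no, only the Train and Appointment tables enter, and we simply marginalize over Train. A quick Python sketch:

```python
# Evidence fixes R = light and M = no, so read off the Train CPT directly.
p_on_time = 0.7                                # P(T = on time | light, no)
p_attend = {"on time": 0.9, "delayed": 0.6}    # P(A = attend | T)

# Marginalize over the two values of Train.
p_attend_given_evidence = (p_on_time * p_attend["on time"]
                           + (1 - p_on_time) * p_attend["delayed"])
print(round(p_attend_given_evidence, 2))       # 0.81 attend
print(round(1 - p_attend_given_evidence, 2))   # 0.19 miss
```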
Note: In this area of inference, there has been a debate for centuries as to whether the Bayesian
approach or the frequentist approach is better. The difference between the two approaches will be
outlined in the next two examples.
Example: We have a population of M&M’s, and in this population the percentage of yellow M&M’s is
either 10% or 20%. You have been hired as a statistical consultant to decide whether the true percentage
of yellow M&M’s is 10% or 20%. If you make the correct decision, your boss gives you a bonus. On the
other hand, if you make the wrong decision, you lose your job.
Let’s start with the frequentist inference. Using standard statistical notation, we can write the
hypotheses as H0: p = 0.10 versus HA: p > 0.10. Suppose we draw a random sample of n = 5 M&M’s and
observe k = 1 yellow. The p-value is then P(k ≥ 1 | p = 0.10) = 1 − 0.9⁵ ≈ 0.41, which is far above the
usual 5% significance level.
Note: The p-value is the probability of obtaining the observed or a more extreme outcome, given that
the null hypothesis is true.
Therefore, we fail to reject H0 and conclude that the data do not provide convincing evidence that the
proportion of yellow M&M’s is greater than 10%. This means that if we had to pick between 10% and
20% for the proportion of M&M’s, even though this hypothesis testing procedure does not actually
confirm the null hypothesis, we would likely stick with 10% since we couldn’t find evidence that the
proportion of yellow M&M’s is greater than 10%.
Now for the Bayesian inference. Take equal priors P(H1) = P(H2) = 0.5, where H1: p = 0.10 and
H2: p = 0.20. The likelihoods of the observed data (k = 1 yellow in n = 5 draws) are
P(k = 1 | H1) = C(5,1)(0.1)(0.9)⁴ ≈ 0.33 and P(k = 1 | H2) = C(5,1)(0.2)(0.8)⁴ ≈ 0.41. Then

Posterior: P(H1 | k = 1) = P(H1) P(k = 1 | H1) / P(k = 1) = (0.5 × 0.33) / ((0.5 × 0.33) + (0.5 × 0.41)) ≈ 0.45
P(H2 | k = 1) = 1 − P(H1 | k = 1) = 1 − 0.45 = 0.55
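Both the frequentist and the Bayesian calculations can be reproduced with a short script using exact binomial probabilities. (The exact posterior is ≈ 0.44; the ≈ 0.45 above comes from rounding the likelihoods to two decimals first.)

```python
from math import comb

n, k = 5, 1  # sample of 5 M&M's, 1 yellow observed

def binom_pmf(k, n, p):
    """P(exactly k successes in n trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Frequentist p-value: P(k or more yellows | p = 0.10).
p_value = sum(binom_pmf(j, n, 0.10) for j in range(k, n + 1))

# Bayesian posteriors with equal priors on H1: p = 0.10 and H2: p = 0.20.
like1, like2 = binom_pmf(k, n, 0.10), binom_pmf(k, n, 0.20)
post1 = 0.5 * like1 / (0.5 * like1 + 0.5 * like2)

print(round(p_value, 2))    # 0.41
print(round(post1, 2))      # 0.44 (~0.45 if likelihoods are rounded first)
print(round(1 - post1, 2))  # 0.56
```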
The posterior probabilities of whether H1 or H2 is correct are close to each other. As a result, with equal
priors and a low sample size, it is difficult to make a decision with a strong confidence, given the
observed data. However, H2 has a higher posterior probability than H1, so if we had to make a decision at
this point, we should pick H2, i.e., that the proportion of yellow M&M’s is 20%, which is the opposite of
the decision based on the frequentist approach.
The table below summarizes what the results would look like if we had chosen larger sample sizes.
Under each of these scenarios, the frequentist method yields a higher p-value than our significance
level, so we would fail to reject the null hypothesis with any of these samples. On the other hand, the
Bayesian method always yields a higher posterior for the second model where p is equal to 0.20. So, the
decisions that we would make are contradictory to each other.
Observed Data P(k or more | 10% yellow) P(10% yellow | n, k) P(20% yellow | n, k)
However, if we had set up our framework differently in the frequentist method and set our null
hypothesis to be p = 0.20 and our alternative to be p < 0.20, we would obtain different results. This
shows that the frequentist method is highly sensitive to the null hypothesis, while in the Bayesian
method, our results would be the same regardless of which order we evaluate our models.
Example: Suppose that in all the ICC tournament cricket matches between India and Pakistan, India has
won 15 times while Pakistan won 5 times.
So, if you were to bet on the winner of the next match, who would it be? You would likely say India, as
they have won 75% of the matches.
However, what if you are told that in rain-affected matches, India won 5 times and Pakistan won 5 times,
and that it will definitely rain during the next match? Intuitively, it is easy to see that the chances of
winning for Pakistan have increased. But by how much?
Suppose B is the event of Pakistan winning and A is the event of rain. Then
P(B) = 5/20 = 1/4, since Pakistan has won 5 of the 20 matches.
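The update can then be carried out with Bayes’ theorem; a short Python sketch using the counts above (10 of the 20 matches were rain-affected, since each side won 5 of them):

```python
# Match record: 20 matches in all, of which Pakistan won 5.
# Rain-affected matches: 10, of which Pakistan won 5.
total, pak_wins = 20, 5
rain_total, pak_rain_wins = 10, 5

p_b = pak_wins / total                  # P(B) = P(Pakistan wins) = 1/4
p_a = rain_total / total                # P(A) = P(rain) = 1/2
p_a_given_b = pak_rain_wins / pak_wins  # P(A | B) = 5/5 = 1

# Bayes' theorem: P(B | A) = P(A | B) P(B) / P(A).
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)  # 0.5
```

So, conditioned on rain, Pakistan’s chance of winning rises from 1/4 to 1/2, matching the intuition that rain favours Pakistan.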