
FDS - Module 3

Advanced Probability
Basic definitions
• Procedure: A procedure is an act that leads to a result. For example, throwing a
die or visiting a website.
• Event: An event is a collection of the outcomes of a procedure, such as getting a
head on a coin flip or leaving a website after only 4 seconds.
• Simple Event: A simple event is an outcome/ event of a procedure that cannot be
broken down further. For example, rolling two dice can be broken down into two
simple events: rolling die 1 and rolling die 2.
• Sample Space: The sample space of a procedure is the set of all possible simple
events. For example, an experiment is performed, in which a coin is flipped three
times in succession. The sample space would be:
{HHH, HHT, HTT, HTH, TTT, TTH, THH, THT}.

Probability
• The probability of an event is the frequency, or chance, that the event will happen.
• If 𝐴 is an event, 𝑃(𝐴) is the probability of the occurrence of the event.
• We can define the actual probability of an event, A, as follows:
P(A) = (number of ways A can occur) / (size of the sample space)
• Here, A is the event in question. Think of an entire universe of
events where anything is possible, and let's represent it as a circle.
• We can think of a single event, A, as being a smaller circle within that larger
universe, as shown in the diagram.

Bayes vs Frequentist Approach


• In a Frequentist approach, the probability of an event is calculated through
experimentation. It uses the past in order to predict the future chance of an event.
• We observe several instances of the event and count the number of times A was
satisfied. The division of these numbers is an approximation of the probability.
• The Bayesian approach differs by dictating that probabilities must be discerned
using theoretical means.
• Using the Bayes approach, we would have to think a bit more critically about
events and why they occur.

Frequentist Approach

• In a Frequentist approach, the probability of an event is calculated through
experimentation. It uses the past in order to predict the future chance of an event.
• The basic formula is as follows:
P(A) = (number of times A occurred) / (number of times the procedure was repeated)
• We observe several instances of the event and count the number of times A was
satisfied. The division of these numbers is an approximation of the probability.
• The Frequentist approach is based on relative frequency.
• The relative frequency of an event is how often an event occurs divided by the
total number of observations.

Compound Event
• A compound event is any event that combines two or more simple events.
• Given events A and B:
o The probability that A and B occur is 𝑃(𝑨 ⋂ 𝑩) = 𝑃(𝑨 𝑎𝑛𝑑 𝑩)
o The probability that either A or B occurs is 𝑃(𝑨 ⋃ 𝑩) = 𝑃(𝑨 𝑜𝑟 𝑩)
Conditional Probability
• Conditional Probability 𝑃(𝐴|𝐵) is the probability of an event 𝐴 given that another
event 𝐵 has already happened.
Rules/Axioms of Probability
• Addition rule: 𝑃 (𝐴 ⋃ 𝐵 ) = 𝑃 (𝐴) + 𝑃(𝐵 ) − 𝑃(𝐴 ⋂ 𝐵)
• Mutual Exclusivity: 𝑃(𝐴 ⋃ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵 ) − 𝑃(𝐴 ⋂ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵)
where, A and B are mutually exclusive events and cannot occur at the same time,
hence 𝑃(𝐴 ⋂ 𝐵) = 0
• Multiplication rule: 𝑃(𝐴 ⋂ 𝐵) = 𝑃(𝐴) ∗ 𝑃(𝐵|𝐴)
• Independence: 𝑃(𝐴 ⋂ 𝐵) = 𝑃(𝐴) ⋅ 𝑃(𝐵|𝐴) = 𝑃(𝐴) ∗ 𝑃(𝐵), where A and B are
independent events, i.e., one event does not affect the outcome of the other.
Hence, 𝑃 (𝐵 |𝐴) = 𝑃(𝐵)
• Complementary Events: 𝑃 (𝐴) = 1 − 𝑃(𝐴′ )
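These rules can be verified with a quick simulation. The following is a minimal Python sketch (the events A = "the die shows an even number" and B = "the die shows a number greater than 3" are chosen only for illustration):
import random

trials = 100_000
count_A = count_B = count_A_and_B = count_A_or_B = 0
for _ in range(trials):
    roll = random.randint(1, 6)
    A = (roll % 2 == 0)      # event A: the roll is even
    B = (roll > 3)           # event B: the roll is greater than 3
    count_A += A
    count_B += B
    count_A_and_B += (A and B)
    count_A_or_B += (A or B)

p_A, p_B = count_A / trials, count_B / trials
p_A_and_B, p_A_or_B = count_A_and_B / trials, count_A_or_B / trials
# Addition rule: both printed values should be close to 4/6 ≈ 0.667
print(p_A_or_B, p_A + p_B - p_A_and_B)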
Confusion Matrix
• When a Binary Classifier (one that predicts
between 2 choices) is used, we can draw
what are called confusion matrices, which
are 2 x 2 matrices that house all the four
possible outcomes of our experiment.
• False positives are sometimes called a Type 1 error whereas false negatives are
called a Type 2 error.
• Performance Measures:
1. Accuracy = (TP + TN) / (TP + FP + FN + TN)
2. Precision (P) = TP / (TP + FP)
3. Recall (R) = TP / (TP + FN)
4. F1 Score = 2 / (1/R + 1/P)
5. Specificity = TN / (TN + FP)
6. Sensitivity = TP / (TP + FN) (same as Recall)
Example of Confusion Matrix
Consider the following matrix formed from a multi-class classifier. Calculate Precision
and Recall per class, weighted average Precision and Recall.
                        Predicted
                 Apple   Banana   Cherry   Total
        Apple      15       2        3       20
Actual  Banana      7      15        8       30
        Cherry      2       3       45       50
        Total      24      20       56      100
Ans: Individual Class Precision and Recall
Precision = (correctly predicted) / (predicted total)        Recall = (correctly predicted) / (actual total)
Class Apple precision = 15/24 = 0.625        Class Apple recall = 15/20 = 0.75
Class Banana precision = 15/20 = 0.75        Class Banana recall = 15/30 = 0.5
Class Cherry precision = 45/56 = 0.80        Class Cherry recall = 45/50 = 0.90
Overall Classifier Accuracy:
Overall Accuracy = (total correctly predicted) / (total instances) = (15 + 15 + 45) / 100 = 0.75

Weighted Average Precision = Σ (actual class instances / total instances) * (class precision)
= (20/100) * 0.625 + (30/100) * 0.75 + (50/100) * 0.80 = 0.75

Weighted Average Recall = Σ (actual class instances / total instances) * (class recall)
= (20/100) * 0.75 + (30/100) * 0.5 + (50/100) * 0.90 = 0.75
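The same numbers can be reproduced with a few lines of NumPy; the following is a minimal sketch (rows of the matrix are the actual classes, columns the predicted classes):
import numpy as np

cm = np.array([[15, 2, 3],      # actual Apple
               [7, 15, 8],      # actual Banana
               [2, 3, 45]])     # actual Cherry

correct = np.diag(cm)                    # correctly predicted per class
precision = correct / cm.sum(axis=0)     # divide by predicted totals (column sums)
recall = correct / cm.sum(axis=1)        # divide by actual totals (row sums)
accuracy = correct.sum() / cm.sum()
weights = cm.sum(axis=1) / cm.sum()      # actual class instances / total instances

print(precision)                         # [0.625 0.75  0.804]
print(recall)                            # [0.75  0.5   0.9  ]
print(accuracy)                          # 0.75
print((weights * precision).sum())       # ≈ 0.75
print((weights * recall).sum())          # 0.75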
Example of Conditional Probability:
In a population of 64, 17 like only KitKat, 16 like only Bounty, 15 like both, and 16
don't like either of them.
Step 1: Draw a Contingency Table

                  Likes Bounty   Dislikes Bounty   Total
Likes KitKat           15               17           32
Dislikes KitKat        16               16           32
Total                  31               33           64

Step 2: Draw a Venn Diagram

Q1) What is the probability that someone likes both given they like KitKat?
Ans: P(likes both | likes KitKat) = 15/32 ≈ 0.47

Q2) What is the probability that a person dislikes Bounty given that they like KitKat?
Ans: P(dislikes Bounty | likes KitKat) = 17/32 ≈ 0.53

Bayesian Approach

Collectively exhaustive events


• When given a set of two or more events, if at least one of the events must occur,
then such a set of events is said to be collectively exhaustive.
• Example 1: Given a set of events {temperature < 60, temperature > 90}, these
events are not collectively exhaustive because there is a third option that is not
given in this set of events: the temperature could be between 60 and 90. However,
they are mutually exclusive because both cannot happen at the same time.
• Example 2: In a dice roll, the set of events of rolling a {1, 2, 3, 4, 5, or 6} are
collectively exhaustive because these are the only possible events, and at least
one of them must happen.

Bayes Theorem
• Recalling previously defined formulas:
o 𝑃(𝐴) = The probability that event A occurs
o 𝑃(𝐴|𝐵) = The probability that A occurs, given that B occurred
o P(A and B) = P(A, B) = P(A ∩ B) = The probability that both A and B occur
o P(A and B) = P(A, B) = P(A ∩ B) = P(A) * P(B|A)
• We also know that,
𝑃(𝐴, 𝐵) = 𝑃(𝐴) ∗ 𝑃(𝐵|𝐴)
𝑃(𝐵, 𝐴) = 𝑃(𝐵) ∗ 𝑃(𝐴|𝐵)
𝑃(𝐴, 𝐵) = 𝑃(𝐵, 𝐴)
Hence, 𝑃(𝐵) ∗ 𝑃(𝐴|𝐵) = 𝑃(𝐴) ∗ 𝑃(𝐵|𝐴)
Dividing both sides by P(B), we get Bayes' Theorem:
P(A|B) = P(A) * P(B|A) / P(B)
Now, rewriting it as
P(H|D) = P(H) * P(D|H) / P(D)
Where, H = your hypothesis about a given data and D = data that is given to you
Now,
• P(H) is the probability of the hypothesis before we observe the data, called the
prior probability or just prior
• P(H|D) is what we want to compute, the probability of the hypothesis after we
observe the data, called the posterior
• P(D|H) is the probability of the data under the given hypothesis, called the
likelihood
• P(D) is the probability of the data under any hypothesis, called the normalizing
constant.
Example 1: Consider that you have two people in charge of writing blog posts for your
company—Lucy and Avinash. From past performances, you have liked 80% of Lucy's
work and only 50% of Avinash's work. A new blog post comes to your desk in the
morning, but the author isn't mentioned. You love the article. A+. What is the
probability that it came from Avinash? Each blogger blogs at a very similar rate.
Ans: Let
𝐻 = hypothesis = the blog came from Avinash
𝐷 = data = you loved the blog post. Therefore,
𝑃(𝐻|𝐷) = the chance that it came from Avinash, given that you loved it
𝑃(𝐷|𝐻) = the chance that you loved it, given that it came from Avinash
𝑃(𝐻) = the chance that an article came from Avinash
𝑃(𝐷) = the chance that you love an article
So, we want to know P(H|D). Using Bayes theorem, as shown, here:
𝑃(𝐻) ∗ 𝑃 (𝐷 |𝐻 )
𝑃(𝐻 |𝐷 ) =
𝑃(𝐷)
Now,
P(H) = 0.5, as both Avinash and Lucy write at similar rates (as per the question)
P(D|H) = 0.5, as given in the question
P(D) = ? To find this, we must take into account both scenarios: the post came from
Lucy or it came from Avinash. If the hypotheses form a suite, then we can use our
laws of probability.
• A suite is formed when a set of hypotheses is both collectively exhaustive and
mutually exclusive. In laymen's terms, in a suite of events, exactly one and only
one hypothesis can occur.
• In our case, the two hypotheses are that the article came from Lucy, or that the
article came from Avinash. This is a suite because of the following reasons:
o At least one of them wrote it
o At most one of them wrote it
o Therefore, only one of them wrote it
When we have a suite, we can use our multiplication and addition rules, as follows:
𝑃 (𝐷) = 𝑃(𝑓𝑟𝑜𝑚 𝐴𝑣𝑖𝑛𝑎𝑠ℎ 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡 ) 𝐎𝐑 𝑃(𝑓𝑟𝑜𝑚 𝐿𝑢𝑐𝑦 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡 )
Using the multiplication rules, we get:
𝑃(𝑓𝑟𝑜𝑚 𝐴𝑣𝑖𝑛𝑎𝑠ℎ 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡) = 𝑃(𝑓𝑟𝑜𝑚 𝐴𝑣𝑖𝑛𝑎𝑠ℎ) ∗ 𝑃(𝐿𝑜𝑣𝑒𝑑 𝑖𝑡 |𝑓𝑟𝑜𝑚 𝐴𝑣𝑖𝑛𝑎𝑠ℎ)
∴ 𝑃(𝑓𝑟𝑜𝑚 𝐴𝑣𝑖𝑛𝑎𝑠ℎ 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡) = 0.5 ∗ 0.5 = 0.25
Similarly,
𝑃(𝑓𝑟𝑜𝑚 𝐿𝑢𝑐𝑦 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡) = 𝑃(𝑓𝑟𝑜𝑚 𝐿𝑢𝑐𝑦) ∗ 𝑃(𝐿𝑜𝑣𝑒𝑑 𝑖𝑡 | 𝑓𝑟𝑜𝑚 𝐿𝑢𝑐𝑦)
∴ 𝑃(𝑓𝑟𝑜𝑚 𝐿𝑢𝑐𝑦 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡) = 0.5 ∗ 0.8 = 0.4
Hence, 𝑃(𝐷) = 0.25 + 0.4 = 0.65
∴ P(H|D) = P(H) * P(D|H) / P(D) = (0.5 * 0.5) / 0.65 = 0.38
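The same arithmetic in Python (a minimal sketch; the variable names are only illustrative):
p_avinash = p_lucy = 0.5            # both bloggers post at similar rates
p_love_given_avinash = 0.5          # you like 50% of Avinash's work
p_love_given_lucy = 0.8             # you like 80% of Lucy's work

p_love = p_avinash * p_love_given_avinash + p_lucy * p_love_given_lucy   # P(D) = 0.65
posterior_avinash = p_avinash * p_love_given_avinash / p_love            # P(H|D)
print(posterior_avinash)            # 0.3846... ≈ 0.38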

Example 2: Titanic Problem - We will use an application of probability in order to
figure out if there were any demographic features that showed a relationship to
passenger survival.
First, let's read in the data, as shown here:
import pandas as pd
titanic = pd.read_csv('data/titanic.csv')
titanic = titanic[['Sex', 'Survived']]
titanic.head()
Let's start by calculating the probability that any given person on the ship survived,
regardless of their gender.
num_rows = float(titanic.shape[0]) # == 891 rows
p_survived = (titanic.Survived=="yes").sum() / num_rows # == 0.38
p_notsurvived = 1 - p_survived # == 0.62
Now, let's calculate the probability that any single passenger is male or female:
p_male = (titanic.Sex=="male").sum() / num_rows # == 0.65
p_female = 1 - p_male # == 0.35
Did having a certain gender affect the survival rate? For this, we can estimate
𝑃 (𝑆𝑢𝑟𝑣𝑖𝑣𝑒𝑑|𝐹𝑒𝑚𝑎𝑙𝑒) by tweaking the multiplication rule to get:
P(Survived|Female) = P(Female AND Survived) / P(Female)
number_of_women = titanic[titanic.Sex=='female'].shape[0] # == 314
women_who_lived = titanic[(titanic.Sex=='female') & (titanic.Survived=='yes')].shape[0] # == 233
p_survived_given_woman = women_who_lived / float(number_of_women)
p_survived_given_woman # == 0.74
Hence, gender plays a big part in this dataset.

Example 3: A classic use of Bayes theorem is the interpretation of medical trials.


Routine testing for illegal drug use is increasingly common in workplaces and schools.
• The companies that perform these tests maintain that the tests have a high:
o Sensitivity: which means that they are likely to produce a positive result if
there are drugs in their system.
o Specificity: which means that they are likely to yield a negative result if there
are no drugs.
• On average, let's assume that the sensitivity of common drug tests is about 60%
and the specificity is about 99%.
• It means that if an employee is using drugs, the test has a 60% chance of being
positive, while if an employee is not on drugs, the test has a 99% chance of being
negative.
• Now, suppose these tests are applied to a workforce where the actual rate of drug
use is 5%.
Q) Of the people who test positive, how many actually use drugs?
Ans: Let D = the event that drugs are in use
Let N = the event that drugs are NOT in use
Let E = the event that the test is positive
Given, 𝑃(𝐷) = 0.05, 𝑃(𝐸 |𝐷 ) = 0.60, 𝑃(𝐸 ) = ? We need to find 𝑃(𝐷 |𝐸 )

• To find 𝑃(𝐸) we have to consider two things: 𝑃(𝐸 𝑎𝑛𝑑 𝐷) as well as 𝑃(𝐸 𝑎𝑛𝑑 𝑁).
𝑃 (𝐸 ) = 𝑃(𝐸 𝑎𝑛𝑑 𝐷) 𝑜𝑟 𝑃(𝐸 𝑎𝑛𝑑 𝑁)
𝑃 (𝐸 ) = 𝑃(𝐷) ∗ 𝑃 (𝐸 |𝐷 ) + 𝑃(𝑁 ) ∗ 𝑃 (𝐸 |𝑁 )
𝑃 (𝐸 ) = 0.05 ∗ 0.60 + 0.95 ∗ 0.01 = 0.0395
Hence,
P(D|E) = (0.6 * 0.05) / 0.0395 = 0.76 ⇒ 76%
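A short Python sketch of the same calculation (variable names are illustrative only):
p_drugs = 0.05                          # P(D): actual rate of drug use
p_pos_given_drugs = 0.60                # sensitivity, P(E|D)
specificity = 0.99                      # P(test negative | no drugs)

p_pos = p_drugs * p_pos_given_drugs + (1 - p_drugs) * (1 - specificity)   # P(E) = 0.0395
p_drugs_given_pos = p_drugs * p_pos_given_drugs / p_pos                   # P(D|E)
print(p_pos, p_drugs_given_pos)         # 0.0395  0.759... ⇒ about 76%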
Example 4: We are given a sentence “A very
close game”, a training set of five sentences
(as shown), and their corresponding
category (Sports or Not Sports).
Predict which category the given sentence
would fall under: “A very close game”
• Calculate the probability of both “A very
close game” is Sports, as well as “A very
close game” is Not Sports.
o P(Sports | A very close game)
o P(Not Sports | A very close game)
• The one with the higher probability will be the result.
Feature Engineering:
P(a very close game | Sports) = P(a | Sports) * P(very | Sports) * P(close | Sports) *
P(game | Sports) [1]

P(a very close game | Not Sports) = P(a | Not Sports) * P(very | Not Sports) *
P(close | Not Sports) * P(game | Not Sports) [2]
• Calculating the probabilities from the word counts in the training set:
Note: The word "close" does not exist in the category Sports,
∴ P(close | Sports) = 0, leading to P(a very close game | Sports) = 0 (Eqn. [1]).
• It is problematic when a frequency-based probability is zero, because it will
wipe out the information in the other probabilities.
• Hence, we add 1 to each numerator and the vocabulary size (14 distinct words) to
each denominator (Laplace smoothing).
Finally,
P(a very close game | Sports) = 3/25 * 2/25 * 1/25 * 3/25 = 4.6 × 10^−5
P(a very close game | Not Sports) = 2/23 * 1/23 * 2/23 * 1/23 = 1.4 × 10^−5
∴ as 4.6 × 10^−5 > 1.4 × 10^−5 ⇒ the given sentence is categorized as 'Sports'.
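The whole calculation can be scripted. The word counts below are the ones implied by the fractions above (Sports contains 11 words in total, Not Sports contains 9, and the combined vocabulary has 14 distinct words); the rest of the training set is assumed, so treat this as a sketch only:
sports_counts     = {"a": 2, "very": 1, "close": 0, "game": 2}   # counts implied by the worked example
not_sports_counts = {"a": 1, "very": 0, "close": 1, "game": 0}
sports_total, not_sports_total, vocab_size = 11, 9, 14

def smoothed_score(counts, total, sentence):
    p = 1.0
    for word in sentence.split():
        p *= (counts.get(word, 0) + 1) / (total + vocab_size)   # add-1 (Laplace) smoothing
    return p

sentence = "a very close game"
print(smoothed_score(sports_counts, sports_total, sentence))          # ≈ 4.6e-05
print(smoothed_score(not_sports_counts, not_sports_total, sentence))  # ≈ 1.4e-05 → Sports wins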
Random variables
• A random variable uses real numerical values to describe a probabilistic event.
• A random variable is a function that maps values from the sample space of an
event (the set of all possible outcomes) to a probability value (between 0 and 1).
• For example, we might have:
o X = the outcome of a dice roll
o Y = the revenue earned by a company this year
o Z = the score of an applicant on an interview coding quiz (0-100%)
• There are two main types of random variables: discrete and continuous.

Discrete random variables


• A discrete random variable only takes on a countable/finite number of possible
values. For example, the outcome of a dice roll, as shown here:

• Examples: the number of children in a family, the Friday night attendance at a
cinema, the number of patients in a doctor's surgery, the number of defective light
bulbs in a box of ten.
• Random variables have many properties, two of which are their expected value
and the variance.
• For a discrete random variable, we can also use a simple formula, shown as
follows, to calculate the expected value:
𝐸𝑥𝑝𝑒𝑐𝑡𝑒𝑑 𝑣𝑎𝑙𝑢𝑒 = 𝐸 [𝑋 ] = 𝝁𝒙 = ∑ 𝑥𝑖 ∗ 𝑝(𝑥𝑖 )
• So, for our dice roll, we can find the exact expected value as being as follows:

• The variance of a random variable represents the spread of the variable. It
quantifies the variability around the expected value.
• The formula for the variance of a discrete random variable is expressed as follows:
Variance = V[X] = σ_x^2 = Σ (x_i − μ_x)^2 * p(x_i) = Σ x_i^2 * p(x_i) − μ_x^2
• Here, σ_x is the Standard Deviation and σ_x^2 is the Variance.
Example 1: Find the Expected Value, Variance and Standard Deviation of the
following discrete probability distribution.
x_i       0     1     2     3     4
p(x_i)   1/5   1/5   1/5   1/5   1/5
Expected value: (0 * 0.2) + (1 * 0.2) + (2 * 0.2) + (3 * 0.2) + (4 * 0.2) = μ_x = 2
Variance: Σ x_i^2 * p(x_i) − μ_x^2 = [(0^2 * 0.2) + (1^2 * 0.2) + (2^2 * 0.2) + (3^2 * 0.2) + (4^2 * 0.2)] − 2^2 = 6 − 4 = σ_x^2 = 2
Standard deviation: √(σ_x^2) = σ_x = √2 = 1.414
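The same expected value, variance, and standard deviation can be checked in a few lines of NumPy (a minimal sketch):
import numpy as np

x = np.array([0, 1, 2, 3, 4])
p = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

mean = (x * p).sum()                     # E[X] = 2.0
variance = (x**2 * p).sum() - mean**2    # V[X] = 2.0
std_dev = np.sqrt(variance)              # σ = 1.414
print(mean, variance, std_dev)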

Example 2: Let the X random variable represent the success of a product. X is indeed
a discrete random variable because the X variable can only take on one of five
options: 0, 1, 2, 3, or 4 (a value of 0 represents a failure and 4 represents a success)
The following is the probability distribution of our random variable, X:

E[X] = 0(0.02) + 1(0.07) + 2(0.25) + 3(0.4) + 4(0.26) = 2.81
V[X] = [0^2 * 0.02 + 1^2 * 0.07 + 2^2 * 0.25 + 3^2 * 0.4 + 4^2 * 0.26] − 2.81^2 = 0.93
We could say that our project will have an expected score of 2.81 ± 0.93, meaning
that we can expect something between 1.88 and 3.74.

Types of discrete random variables
1. Binomial Random Variables: P(X = k) = nCk * p^k * (1 − p)^(n−k)
2. Geometric Random Variables: P(X = k) = (1 − p)^(k−1) * p
3. Poisson Random Variables: P(X = x) = (e^(−μ) * μ^x) / x!

1. Binomial random variables


• A binomial random variable is a discrete random variable, X, that counts the
number of successes in a binomial setting.
• The parameters are
o n = the number of trials
o p = the chance of success of each trial.
• A binomial setting has the following four conditions:
o The possible outcomes are either success or failure
o The outcomes of trials cannot affect the outcome of another trial
o The number of trials was set (a fixed sample size)
o The chance of success of each trial must always be p (i.e., fixed)

• The probability mass function (PMF) for a binomial random variable is as follows:
P(X = k) = nCk * p^k * (1 − p)^(n−k), where nCk = n! / (k! * (n − k)!)
Shortcuts for finding E[X] and V[X] where X is a Binomial RV:
• E[X] = n * p (e.g., for n = 5, p = 0.25: E[X] = 5 * 0.25 = 1.25)
• V[X] = n * p * (1 − p) (e.g., V[X] = 1.25 * 0.75 = 0.9375)

Example 1 of Binomial RV: A new restaurant in a town has a 20% chance of surviving
its first year. If 14 restaurants open this year, find the probability that exactly 4
restaurants survive their first year of being open to the public.
Ans: First, we should prove that this is a binomial setting:
• The possible outcomes are either success or failure (each restaurant either survives
or does not survive)
• The outcomes of trials cannot affect the outcome of another trial (assume that the
opening of a restaurant doesn't affect another restaurant's opening and survival)
• The number of trials was set (14 restaurants opened)
• The chance of success of each trial must always be p (we assume that it is always
20%)
Here, we have our two parameters of n = 14 and p = 0.2.
P(X = k) = nCk * p^k * (1 − p)^(n−k)
P(X = 4) = 14C4 * 0.2^4 * (1 − 0.2)^(14−4) = 0.17
Example 2 of Binomial RV: A couple has a 25% chance of having a child with type O
blood. What is the chance that 3 of their 5 kids have type O blood?
Ans: Let X = the number of children with type O blood with n = 5 and p = 0.25, as
shown here:
P(X = 3) = 5C3 * 0.25^3 * (1 − 0.25)^(5−3) = 0.088

We can calculate this probability for the values of 0, 1, 2, 3, 4, and 5 to get a sense of
the probability distribution:

From here, we can calculate an expected value and the variance of this variable:
E[X] = 0(0.23) + 1(0.4) + 2(0.26) + 3(0.09) + 4(0.01) + 5(0.0009) = 1.25
V[X] = [0^2 * 0.23 + 1^2 * 0.4 + 2^2 * 0.26 + 3^2 * 0.09 + 4^2 * 0.01 + 5^2 * 0.0009] − 1.25^2 = 0.93
Hence, the expected number of children with type O blood is 1.25 ± 0.93, which
means 1 or 2 kids! What if we want to know the probability that at least 3 of their
kids have type O blood?

= 0.0098 + 0.01465 + 0.08789 = 0.103


• So, there is about a 10% chance that three of their kids have type O blood.
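scipy.stats provides the binomial PMF directly; a minimal sketch checking Example 2 (assuming SciPy is available):
from scipy import stats

X = stats.binom(n=5, p=0.25)       # 5 children, 25% chance of type O blood each
print(X.pmf(3))                    # P(X = 3) ≈ 0.0879
print(X.mean(), X.var())           # E[X] = 1.25, V[X] = 0.9375
print(1 - X.cdf(2))                # P(X >= 3) ≈ 0.1035, i.e. about 10%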

2. Geometric random variables


• A geometric random variable is a discrete random variable, X, that counts the
number of trials needed to obtain one success.
• Specifically, a geometric setting has the following four conditions:
o The possible outcomes are either success or failure
o The outcomes of trials cannot affect the outcome of another trial
o The number of trials was not set
o The chance of success of each trial must always be p
• Note: these are the same conditions as a Binomial RV, except the 3rd condition.
• The parameters are
o p = the chance of success of each trial
o (1 − p) = the chance of failure of each trial
• The formula for the PMF is as follows:
𝑃(𝑋 = 𝑘) = (1 − 𝑝)𝑘−1 ∗ 𝑝
Shortcuts for finding E[X] and V[X] where X is a Geometric RV:
• E[X] = 1/p;   V[X] = (1 − p) / p^2

Example 1 of Geometric RV: There is a 34% chance that it will rain on any day in April.
Find the probability that the first day of rain in April will occur on April 4th. Also find
the probability that the first rain of the month will happen within the first four days.
Ans: P(X = 4) = (1 − 0.34)^(4−1) * (0.34) = 0.0977 ≈ 0.1 ⇒ 10% chance it will rain
on April 4th.
The probability that it will rain by the fourth of April is as follows:
𝑃(𝑋 ≤ 4) = 𝑃 (1) + 𝑃(2) + 𝑃(3) + 𝑃(4) = 0.34 + 0.22 + 0.14 + 0.1 = 0.8
So, there is an 80% chance that the first rain of the month will happen within the first
four days.

Example 2 of Geometric RV: Suppose Max owns a lightbulb manufacturing company
and determines that 3 out of every 75 bulbs are defective. What is the probability
that Max will find the first faulty lightbulb on the 6th one that he tested?
Ans: p = 3/75 = 0.04, k = 6
P(X = 6) = (1 − 0.04)^(6−1) * (0.04) = 0.0326 ≈ 0.03 ⇒ 3% chance that Max would
find the first faulty lightbulb on the 6th one that he tested.
What if Max wants to know the likelihood that it takes at least six trials until he finds
the first defective lightbulb?
Ans: P(X ≥ 6) = 1 − [P(1) + P(2) + P(3) + P(4) + P(5)] = (1 − 0.04)^5 = 0.815 ⇒ ≈ 81.5%
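Both lightbulb answers can be checked with scipy.stats.geom (a minimal sketch):
from scipy import stats

X = stats.geom(p=0.04)             # trial number of the first defective bulb
print(X.pmf(6))                    # P(X = 6) ≈ 0.0326
print(1 - X.cdf(5))                # P(X >= 6) = 0.96**5 ≈ 0.815
print(X.mean(), X.var())           # E[X] = 1/p = 25, V[X] = (1-p)/p^2 = 600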
3. Poisson Random Variable
• The Poisson distribution is a discrete probability distribution that counts the
number of events that occur in a given interval of time. Formula:
P(X = x) = (e^(−μ) * μ^x) / x!
where μ is the average number of events per interval
• Consider the following examples of Poisson random variables:
o Finding the probability of having a certain number of visitors on your site
within an hour, knowing the past performance of the site
o Estimating the number of car crashes at an intersection based on past police
reports
Shortcut for finding 𝑬[𝑿] and 𝑽[𝑿] where 𝑿 is a Poisson RV:
• 𝐸 [𝑋] = 𝜇; 𝑉 [𝑋] = 𝜇

Example 1 of Poisson RV: The number of calls arriving at your call center follows a
Poisson distribution at the rate of 5 calls/hour. What is the probability that exactly 6
calls will come in between 10 and 11 PM?
Ans: Let X be the number of calls that arrive between 10 and 11 PM. Given μ = 5.
∴ P(X = 6) = (e^(−5) * 5^6) / 6! = 0.146 ⇒ 14.6% chance that exactly 6 calls will come in
between 10 and 11 PM.

Example 2 of Poisson RV: The average number of homes sold by the Acme Realty
company is 2 homes per day. What is the probability that exactly 3 homes will be sold
tomorrow?
Ans: Let X be the number of homes sold in a day. Given μ = 2.
∴ P(X = 3) = (e^(−2) * 2^3) / 3! = 0.180
⇒ 18% chance that exactly 3 homes will be sold tomorrow

Note: If the problem has the word ‘average’ in it, it is mostly a Poisson distribution.
On the other hand, if an exact probability is given, it is mostly a Binomial distribution.
Example 3 of Poisson RV: A life insurance salesman sells on the average 3 life
insurance policies per week. Use Poisson's law to calculate the probability that in a
given week he will sell
a) Some policies
b) 2 or more policies but less than 5 policies.
c) Assuming that there are 5 working days per week, what is the probability that in a
given day he will sell one policy?
Ans: Given, μ = 3
a) "Some policies" means "1 or more policies". We can work this out by finding 1
minus the "zero policies" probability:
P(X > 0) = 1 − P(X = 0) = 1 − (e^(−3) * 3^0) / 0! = 0.95 ⇒ 95%
b) The probability of selling 2 or more but less than 5 policies:
P(2 ≤ X < 5) = P(2) + P(3) + P(4) = 0.616 ⇒ 61.6%
c) Average number of policies sold per day: 3/5 = 0.6
So on a given day, P(X = 1) = (e^(−0.6) * 0.6^1) / 1! = 0.329 ⇒ 32.9%

Example 4 of Poisson RV: 20 sheets of aluminium alloy were examined for surface
flaws. The frequency of the number of sheets with a given number of flaws per sheet
was as follows:
Number of flaws   0   1   2   3   4   5   6
Frequency         4   3   5   2   4   1   1
What is the probability of finding a sheet chosen at random which contains 3 or more
surface flaws?
Ans: Total no. of flaws: (0*4)+(1*3)+(2*5)+(3*2)+(4*4)+(5*1)+(6*1) = 46
Average number of flaws per sheet: 46/20 = μ = 2.3
Required probability P(X ≥ 3) = 1 − (P(0) + P(1) + P(2)) = 1 − (0.100 + 0.231 + 0.265) = 0.404
Example 5 of Poisson RV: If electricity power failures occur according to a Poisson
distribution with an average of 3 failures every twenty weeks, calculate the
probability that there will not be more than one failure during a particular week.
Ans: μ per week = 3/20 = 0.15; P(X ≤ 1) = P(0) + P(1) = e^(−0.15) * (1 + 0.15) = 0.99 ⇒ 99%

Example 6 of Poisson RV: Vehicles pass through a junction on a busy road at an
average rate of 300 per hour.
a. Find the probability that none passes in a given minute.
b. What is the expected number passing in two minutes?
c. Find the probability that this expected number actually passes through in a given
two-minute period.
Ans: μ per minute = 300/60 = 5
a. P(X = 0) = e^(−5) = 0.0067
b. Expected number in two minutes = 2 * 5 = 10
c. With μ = 10, P(X = 10) = (e^(−10) * 10^10) / 10! = 0.125

Example 7 of Poisson RV: A company makes electric motors. The probability an
electric motor is defective is 0.01. What is the probability that a sample of 300 electric
motors will contain exactly 5 defective motors?
Ans: μ = n * p = 300 * 0.01 = 3; P(X = 5) = (e^(−3) * 3^5) / 5! = 0.1008 ≈ 0.10

NOTE: This problem looks similar to a binomial distribution problem; the exact
binomial answer, 300C5 * 0.01^5 * 0.99^295 ≈ 0.101, is very similar to the Poisson
result. We can use the binomial distribution to approximate the Poisson distribution
(and vice-versa) under certain circumstances.
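These Poisson answers, and the binomial comparison in Example 7, can be verified with scipy.stats (a minimal sketch):
from scipy import stats

print(stats.poisson(mu=5).pmf(6))            # Example 1: ≈ 0.146
print(stats.poisson(mu=2).pmf(3))            # Example 2: ≈ 0.180
print(stats.poisson(mu=0.6).pmf(1))          # Example 3c: ≈ 0.329

# Example 7: Poisson approximation vs the exact binomial probability
print(stats.poisson(mu=300 * 0.01).pmf(5))   # ≈ 0.1008
print(stats.binom(n=300, p=0.01).pmf(5))     # ≈ 0.101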

Continuous Random Variables


• Switching gears entirely, unlike a discrete random variable, a continuous random
variable can take on an infinite number of possible values, not just a few
countable ones.
• Consider the following examples of continuous variables:
o The length of a sales representative's phone call (not the number of calls)
o The actual amount of oil in a drum marked 20 gallons (not the number of oil
drums)
• If X is a continuous random variable, then there is a function, f(x), such that for any
constants a and b:
P(a ≤ X ≤ b) = ∫[a to b] f(x) dx
• The preceding f(x) function is known as the probability density function (PDF).
The PDF is the continuous random variable version of the PMF for discrete random
variables.
• In general, quantities such as pressure, height, mass, weight, density, volume,
temperature, and distance are examples of continuous random variables
• The most important continuous distribution is the normal distribution. The PDF of
this distribution is as follows:
f(x) = (1 / (σ√(2π))) * e^(−(x − μ)^2 / (2σ^2))
where,
o μ is the mean of the variable
o σ is the standard deviation
• When plotted on a graph, the standard normal distribution gives a bell-shaped
curve.
Properties of a Normal Distribution
o The normal curve is symmetrical about the mean μ;
o The mean is at the middle and divides the area into halves;
o The total area under the curve is equal to 1;
o It is completely determined by its mean μ and standard deviation σ (or
variance σ^2) – only 2 parameters
Correlation and Covariance
• Both the terms measure the relationship and the dependency between two
variables
Covariance:
• Covariance shows how the two variables differ, whereas correlation shows how
the two variables are related. In other words, it quantifies the dependence
between two random variables X and Y.
• While variance measures how a single variable deviates from its mean, covariance
measures how two variables vary in tandem from their means.

Example 1: Finding Covariance and Correlation for a Joint Probability Distribution


Q) Suppose that X and Y have the following joint probability mass function f(x, y):
              y = 1    y = 2    y = 3
x = 1          0.25     0.25     0
x = 2          0        0.25     0.25
What is the covariance of X and Y? Given: μx = 1.5, μy = 2, σx = 0.5, σy = √0.5


Ans: 𝑐𝑜𝑣(𝑥, 𝑦) = ∑(𝑥 − 𝜇𝑥 ) ∗ (𝑦 − 𝜇𝑦 ) ∗ 𝑓(𝑥, 𝑦)
= (1 − 1.5)(1 − 2)(0.25) + (1 − 1.5)(2 − 2)(0.25) + (1 − 1.5)(3 − 2)(0) +
(2 − 1.5)(1 − 2)(0) + (2 − 1.5)(2 − 2)(0.25) + (2 − 1.5)(3 − 2)(0.25) = 0.25
As cov(x, y) > 0 ⇒ x and y are positively correlated.
The correlation coefficient of X and Y, denoted corr(X, Y) or ρ_xy, is defined as:
corr(x, y) = cov(x, y) / (σx * σy)
where σx, σy are the SDs of x and y
corr(x, y) = 0.25 / (0.5 * √0.5) = 0.71
Note: the covariance of X and Y above can also be found using:
cov(x, y) = [Σ x * y * f(x, y)] − μx * μy


Interpretation of Correlation
• If 𝜌𝑥𝑦 = 1, then X and Y are perfectly, positively, linearly correlated.
• If 𝜌𝑥𝑦 = −1, then X and Y are perfectly, negatively, linearly correlated.
• If ρ_xy = 0, then X and Y have no linear correlation (they are linearly uncorrelated).
• If 𝜌𝑥𝑦 > 0, then X and Y are positively, linearly correlated, but not perfectly so.
• If 𝜌𝑥𝑦 < 0, then X and Y are negatively, linearly correlated, but not perfectly so.

Example 2: Finding Covariance and Correlation for a Dataset

Find the covariance and correlation for this dataset (only x and y will be given)

x     y     x − x̄    y − ȳ    (x − x̄) * (y − ȳ)
1     3      −2       −6.2          12.4
2     5      −1       −4.2           4.2
3    11       0        1.8           0
4    11       1        1.8           1.8
5    16       2        6.8          13.6
                      Σ(x − x̄) * (y − ȳ) = 32

From the dataset, we get x̄ = 3 and ȳ = 9.2
Using the calculator's Statistics mode, we get the following values:
σx = 1.414    σy = 4.664    (σ is used for population)
sx = 1.581    sy = 5.215    (s is used for sample)

if the dataset is a sample:      cov(x, y)_sample = Σ(x − x̄)(y − ȳ) / (n − 1) = 32 / (5 − 1) = 32/4 = 8
if the dataset is a population:  cov(x, y)_population = Σ(x − x̄)(y − ȳ) / n = 32/5 = 6.4
Hence,
corr(x, y)_sample = cov(x, y) / (sx * sy) = 8 / (1.581 * 5.215) = 0.97
corr(x, y)_population = cov(x, y) / (σx * σy) = 6.4 / (1.414 * 4.664) = 0.97

Note (confirming the values on a calculator):
• Confirm the cov value: if sample, cov = sx * sy * r; if population, cov = σx * σy * r.
• Confirm the corr coefficient value: enter Statistics mode, select y = a + bx, enter the
data, click 'OPTN' and select 'Regression Calculator'; the 'r' value is the corr coefficient.
Python Implementation of Covariance and Correlation
import numpy as np                        # np.cov and np.corrcoef return a 2D array
x = [1, 2, 3, 4, 5]
y = [3, 5, 11, 11, 16]
cov_s = np.cov(x, y)[0][1]                # cov of sample
cov_p = np.cov(x, y, bias=True)[0][1]     # cov of population
corr = np.corrcoef(x, y)[0][1]            # correlation coefficient

Population and Sample


• A population is the entire group that you want to draw conclusions about.
• Usually, populations are so large that one cannot examine the entire group.
• Therefore, a sample is selected to represent the population in a research study.
• The goal is to use the results obtained from the sample to help answer questions
about the population.

Point estimates
• A point estimate is an estimate of a population parameter based on sample data.
• We use point estimates to estimate population means, variances etc.
• To obtain these estimates, we simply apply the function that we wish to measure
for our population to a sample of the data.
• For example, suppose there is a company of 9,000 employees and we are
interested in ascertaining the average length of breaks taken by employees in a
single day.
• As we probably cannot ask every single person, we will take a sample of the 9,000
people and take a mean of the sample.
• This sample mean will be our point estimate.

Sampling distributions
• Many statistical tests rely on data that follows a normal pattern, and for the most
part, a lot of real-world data is not normal.
• Since most real-world data is not normal, many of the most popular statistics tests
may not apply.
• However, if we follow the given procedure, we can create normal data!
• We utilize what is known as a sampling distribution, which is a distribution of
point estimates of several samples of the same size.
• An example of the procedure for creating a sampling distribution:
o Take 500 samples from the population, each sample having 100 points each
o Find the mean of each sample and add it to the list of point estimates.
o Plot the histogram of the point estimates (would be a normal distribution)
• The data converges to a normal distribution because of something called the
central limit theorem (CLT).
• CLT states that the sampling distribution (the distribution of point estimates) will
approach a normal distribution as we increase the number of samples taken.
• As we take more and more samples, the mean of the sampling distribution will
approach the true population mean.
Z_score = (x̄ − μ) / (σ / √n)
where x̄ → sample mean, μ → population mean, σ → population SD, n → sample size

Numerical 1 on Central Limit Theorem
Q) A certain group of welfare recipients receives SNAP benefits of $110 per week with
a standard deviation of $20. If a random sample of 25 people is taken, what is the
probability their mean benefit will be greater than $120 per week?
Ans: Given, x̄ → 120, μ → 110, σ → 20, n → 25
Z_score = (120 − 110) / (20 / √25) = 2.5
Note: The area under a normal distribution is always 1. By looking up the z-score in a
z-score table, we get the amount of area to the left of the Z-score. The Z-score 2.5
corresponds to 0.99379 → 99.38 %. Hence, 100-99.38 = 0.62%
Thus, if a random sample of 25
people are taken, the probability
that their mean benefit will be
greater than $120 per week is
0.62%
Numerical 2 on Central Limit Theorem
Q) Let the average tennis serve be 110 mph with a standard deviation of 5. If a
random sample of 40 serves are selected, what is the probability that the mean of
this sample is between 109.5 mph and 112 mph?
Ans: Given, x̄ → 109.5, 112, μ → 110, σ → 5, n → 40
Z_score for 109.5 mph = (109.5 − 110) / (5 / √40) = −0.6325
Z_score for 112 mph = (112 − 110) / (5 / √40) = 2.53
The Z-score −0.6325 corresponds to 0.2643 and 2.53 corresponds to 0.9943.
As these represent the area to the left of the Z-scores, to get the required area,
we do: 0.9943 − 0.2643 = 0.73 → 73%
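Both numericals can be verified with SciPy instead of a z-table (a minimal sketch):
import math
from scipy import stats

# Numerical 1: P(sample mean > 120)
z = (120 - 110) / (20 / math.sqrt(25))          # 2.5
print(1 - stats.norm.cdf(z))                    # ≈ 0.0062 → 0.62%

# Numerical 2: P(109.5 < sample mean < 112)
z1 = (109.5 - 110) / (5 / math.sqrt(40))        # ≈ -0.6325
z2 = (112 - 110) / (5 / math.sqrt(40))          # ≈ 2.53
print(stats.norm.cdf(z2) - stats.norm.cdf(z1))  # ≈ 0.73 → 73%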

Descriptive Statistics
• Descriptive statistics are methods for organizing and summarizing data.
• For example, tables or graphs are used to organize data, and descriptive values
such as the mean, median and mode are used to summarize data.
• A descriptive value for a population is called a parameter
• A descriptive value for a sample is called a statistic.

Identify the population, sample, population parameters, and sample statistics:


a) In a USA Today Internet poll, readers responded voluntarily to the question “Do
you consume at least one caffeinated beverage every day?”
Ans: Population: all readers of USA Today
Sample: the readers who responded to the question
Population parameter: percentage of all USA Today readers who have at least one
caffeinated drink per day
Sample statistic: percentage of responding readers who have at least one caffeinated
drink per day
b) Astronomers typically determine the distance to a galaxy (a galaxy is a huge
collection of billions of stars) by measuring the distances to just a few stars within it
and taking the mean (average) of these distance measurements.
Ans: Population: all stars in that galaxy
Sample: the few stars selected for measurement
Population parameter: the mean of the distances between all stars in the galaxy and Earth
Sample statistic: the mean of the distances between the sampled stars and Earth

Inferential Statistics
• Inferential statistics are methods for using sample data to make general
conclusions (inferences) about populations.

Identify whether the statement describes inferential statistics or descriptive statistics:


a) The average age of the students in a statistics class is 21 years.
Ans: Descriptive - since it’s the actual mean of the entire population and not a sample
b) The chances of winning the California Lottery are one chance in twenty-two
million.
Ans: Inferential - since it’s a probability calculated from a sample from the entire
population (22 million) to make an inference about the population
c) There is a relationship between smoking cigarettes and getting emphysema.
Ans: Inferential - we’re inferring emphysema arises due to smoking cigarettes
d) From past figures, it is predicted that 39% of the registered voters in California
will vote in the June primary.
Ans: Inferential - since it’s a probability based on historical data
Note: If it is a probability, it is inferential and if it is a central tendency like mean,
median and mode of the entire population, then it is descriptive. If the mean, median
or mode of a sample is used to make a conclusion/inference over the population,
then it is again inferential.

Determine whether the data are qualitative or quantitative:


a) the colours of automobiles on a used car lot: Qualitative
b) the numbers on the shirts of a girls’ soccer team: Qualitative
c) the number of seats in a movie theatre: Quantitative
d) a list of house numbers on your street: Qualitative
e) the ages of a sample of 350 employees of a large hospital: Quantitative

Sampling Error
• The discrepancy between a sample statistic and its population parameter is called
sampling error.
• Defining and measuring sampling error is a large part of inferential statistics.

Types of Sampling
1. Simple Random Sampling: Under Random sampling,
every element of the population has an equal
probability of getting selected.
2. Stratified Random Sampling:
o Proportionate stratified sample
▪ We group the entire population into
subgroups by some common property like
class labels.
▪ We then randomly sample from those
groups individually, such that the groups are
still maintained in the same ratio as they were in the entire population.
o Disproportionate stratified sample
▪ The size of the sample selected from each subgroup is disproportional to
the size of that subgroup in the population.
▪ It requires weights to match the population’s ratio of each subgroup
3. Systematic Sampling
o Systematic sampling is about sampling
items from the population at regular
predefined intervals (basically fixed
and periodic intervals).
o For example — Every 5th element,
21st element and so on.
o The figure shows a pictorial view of the same — We sample every 9th and 7th
element in order and then repeat this pattern.
4. Cluster Sampling
o In Cluster sampling, we divide the entire
population into subgroups, wherein,
each of those subgroups has similar
characteristics to that of the population
when considered in totality.
o Also, instead of sampling individuals,
we randomly select the entire subgroups.
o Example: Class of 120 students divided into groups of 12 for a common class
project. Clustering parameters like (Designation, Class, Topic) are all similar
over here as well.

Identify the sampling technique used


1. Every fifth person boarding a plane is searched thoroughly. - Systematic
2. At a local community College, five math classes are randomly selected out of 20
and all of the students from each class are interviewed - Cluster (NOT Random)
3. A researcher randomly selects and interviews fifty male and fifty female teachers -
Stratified
4. A researcher for an airline, interviews all of the passengers on five randomly
selected flights - Cluster (NOT Random)
5. Based on 12,500 responses from 42,000 surveys sent to its alumni, a major
university estimated that the annual salary of its alumni was 92,500 - Random
6. All of the teachers from 85 randomly selected nation’s middle schools were
interviewed - Cluster (NOT Random)
7. The names of 70 contestants are written on 70 cards, The cards are placed in a bag,
and three names are picked from the bag - Random

Confidence intervals
• A confidence level does not represent a "probability of being correct"; instead, it
represents the frequency that the obtained answer will be accurate.
• For example, if you want to have a 95% chance of capturing the true population
parameter using only a single point estimate, we would have to set our confidence
level to 95%.
• Calculating a confidence interval involves finding a point estimate, and then,
incorporating a margin of error to create a range.
• The margin of error is a value that represents our certainty that our point estimate
is accurate and is based on our:
o desired confidence level,
o variance of the data, and
o how big your sample is
• There are many ways to calculate confidence intervals. We will look at a single way
of taking the confidence interval of a population mean. For this confidence
interval, we need the following:
o A point estimate
o The standard error of the sample mean, which represents the uncertainty in
the point estimate. This is calculated by taking the sample standard
deviation (the standard deviation of the sample data) and dividing that
number by the square root of the sample size.
o The degrees of freedom, which is usually = (sample size - 1)
import math
import numpy as np
from scipy import stats

sample_size = 100                          # the size of the sample we wish to take
sample = np.random.choice(a=dataset, size=sample_size)
# a sample of sample_size taken from the population (the dataset of heights)
sample_mean = sample.mean()                # the sample mean
sample_stdev = sample.std()                # sample standard deviation
pop_stdev = sample_stdev / math.sqrt(sample_size)
# standard error of the sample mean
stats.t.interval(alpha=0.95, df=sample_size - 1, loc=sample_mean, scale=pop_stdev)
# confidence level 95%, degrees of freedom, sample mean, standard error

• If, after executing the last line, we get: (171.36, 183.44), then it represents a
confidence interval for the average height with a 95% confidence.
• Our hypothesis will be that as we make our confidence level larger, we will likely
see larger confidence intervals to be surer that we catch the true population
parameter: Example:
o confidence 0.5 has an interval of size 2.56
o confidence 0.8 has an interval of size 4.88
o confidence 0.9 has an interval of size 6.29
o confidence 0.99 has an interval of size 9.94
• We can see that as we wish to be "more confident" in our interval, our interval
expands in order to compensate.

Hypothesis tests
• A hypothesis test is a statistical test that is used to ascertain whether we are
allowed to assume that a certain condition is true for the entire population, given a
data sample.
• Basically, a hypothesis test is a test for a certain hypothesis that we have about an
entire population. The result of the test then tells us whether we should believe
the hypothesis or reject it for an alternative one.
• A hypothesis test generally looks at two opposing hypotheses about a population.
We call them the null hypothesis and the alternative hypothesis.
• The null hypothesis is the statement being tested and is the default correct
answer; it is our starting point and our original hypothesis.
• The alternative hypothesis is the statement that opposes the null hypothesis.
• Our test will tell us which hypothesis we should trust and which we should reject.
• Based on sample data from a population, a hypothesis test determines whether or
not to reject the null hypothesis.
• We usually use a p-value (which is based on our significance level) to make this
conclusion.

Conducting a Hypothesis Test


1. Specify the hypotheses: Here, we formulate our 2 hypotheses: the null (𝐻0 ) and
the alternative (𝐻1 )
2. Determine the sample size for the test sample: This calculation depends on the
chosen test.
3. Choose a significance level (α): A significance level of 0.05 is common
4. Collect the data: Then collect a sample of data to conduct the test
5. Decide whether to reject or fail to reject the null hypothesis:
o This step changes slightly based on the type of test being used.
o The final result will either yield rejecting the null hypothesis in favour of
the alternative or failing to reject the null hypothesis.
We will look at the following three types of hypothesis tests:
o T-tests
o Chi-square goodness of fit
o Chi-square test for association/independence

T-tests
• The one sample t-test is a statistical test used to determine whether a quantitative
(numerical) data sample differs significantly from another dataset (the population
or another sample).
• Ex: Our objective here is to ascertain whether there is a difference between the
overall population's (company employees) break times and break times of
employees in the engineering department.
• Let us now conduct a t-test at a 95% confidence level in order to find a difference
(or not!).
• Assumptions:
o The population distribution should be normal, or the sample should be large
(n ≥ 30).
o Population size should be at least 10 times larger than the sample size
1. Specify the hypotheses:
o 𝐻0 : the engineering department takes the same breaks as the company as a
whole
o 𝐻1 : the engineering department takes different breaks from the company as
a whole
2. Determine the sample size for the test sample:
o The sample is at least 30 points (it is 400)
o The sample is less than 10% of the population (which would be 900 people)
3. Choose a significance level (α): We will choose a 95% confidence level, which
means that our α would be 1 - 0.95 = 0.05
4. Collect the data.
5. Decide whether to reject or fail to reject the null hypothesis: For a one sample t-
test, we must calculate two numbers: the test statistic and our p value. Luckily, we
can do this in one line in Python:
t_statistic, p_value = stats.ttest_1samp(a=engineering_breaks, popmean=allbreaks.mean())

We obtain the following numbers: t_statistic = -5.742, p_value = 0.00000018


• The test result shows that the t value is -5.742. This is a standardized metric that
reveals the deviation of the sample mean from the null hypothesis.
• The p value is what gives us our final answer. Our p-value is telling us how often
our result would appear by chance.
• So, for example, if our p-value was .06, then that would mean we should expect to
observe this data by chance about 6% of the time. This means that about 6% of
samples would yield results like this.
• We are interested in how our p-value compares to our significance level:
o If the p-value is less than the significance level, then we can reject the null
hypothesis
o If the p-value is greater than the significance level, then we fail to reject the
null hypothesis
• Our p value is way lower than .05 (our chosen significance level), which means that
we may reject our null hypothesis in favour of the alternative.
• This means that the engineering department seems to take different break lengths
than the company as a whole!
• Type 1 Errors: occurs if we reject the null hypothesis when it is actually true.
• Type 2 Errors: occurs if we fail to reject the null hypothesis when it is actually false.

Formula for One Sample T-test
t = (x̄ − μ₀) / (s / √n)
• x̄ → Sample mean            • s → Sample standard deviation
• μ₀ → Population mean       • n → Sample size
Q) Let us take the example of a classroom of students that appeared for a test
recently. Out of the total 150 students, a sample of 10 students has been picked. If
the mean score of the entire class is 78 and the mean score of the sample is 74 with a
standard deviation of 3.5, then calculate the sample’s t-test score. Also, comment on
whether the sample statistics are significantly different from the population at a
99.5% confidence interval.
Ans) Given, x̄ → 74, μ₀ → 78, s → 3.5, n → 10
t = (x̄ − μ₀) / (s / √n) = (74 − 78) / (3.5 / √10) = −3.61; |−3.61| = 3.61
• Degrees of Freedom, DoF = 10 - 1 = 9
• Checking the T table, for DoF = 9 and a confidence level of 99.5%, the critical
value is 3.25.
• As 3.61 > 3.25, we conclude that the sample statistic is significantly different from
the population.
Q) Two samples have means of 10 and 12, standard deviations of 1.2 and 1.4, and
sample sizes of 17 and 15. Determine if the sample’s statistics are different at a 99.5%
confidence interval.
Ans) T-test formula for 2 samples:
t = (x̄₁ − x̄₂) / √( s₁²/n₁ + s₂²/n₂ )
• Given, x̄₁ → 10, x̄₂ → 12, s₁ → 1.2, s₂ → 1.4, n₁ → 17, n₂ → 15
t = (10 − 12) / √( 1.2²/17 + 1.4²/15 ) = −4.31; |−4.31| = 4.31
• In 2 sample T-tests, DoF = n₁ + n₂ − 2 = 17 + 15 - 2 = 30
• For DoF = 30 and a confidence level of 99.5%, the critical value from the T table is
2.750. As 4.31 > 2.750, we conclude that the statistics of the two samples are
significantly different.
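The test statistics and critical values used above can be reproduced with SciPy (a minimal sketch; the 99.5% level corresponds to looking up 1 − 0.005 in the t distribution):
import math
from scipy import stats

# One-sample example: population mean 78, sample mean 74, s = 3.5, n = 10
t1 = (74 - 78) / (3.5 / math.sqrt(10))                  # ≈ -3.61
print(abs(t1), stats.t.ppf(1 - 0.005, df=9))            # critical value ≈ 3.25

# Two-sample example: means 10 and 12, SDs 1.2 and 1.4, n = 17 and 15
t2 = (10 - 12) / math.sqrt(1.2**2 / 17 + 1.4**2 / 15)   # ≈ -4.31
print(abs(t2), stats.t.ppf(1 - 0.005, df=17 + 15 - 2))  # critical value ≈ 2.75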
Chi-square goodness of fit test (NOT IN SYLLABUS)
• The one-sample t-test was used to check whether a sample mean differed from the
population mean.
• The chi-square goodness of fit test is very similar to the one sample t-test in that it
tests whether the distribution of the sample data matches an expected
distribution, while the big difference is that it is testing for categorical variables.
• Assumptions:
o All the expected counts are at least 5
o The population should be at least 10 times as large as the sample
• Example: The CDC categorizes adult BMIs into four classes: Under/Normal, Over
weight, Obesity, and Extreme Obesity. A 2009 survey showed the distribution for
adults in the U.S. to be 31.2%, 33.1%, 29.4%, and 6.3% respectively.
• A total of 500 adults are randomly sampled and their BMI categories are recorded:

Q) Is there evidence to suggest that BMI trends have changed since 2009? Test at the
0.05 significance level.
Ans: First, let's calculate our expected values. In a sample of 500, we expect 156 to be
Under/Normal (that's 31.2% of 500), and we fill in the remaining boxes similarly:

• First, check the conditions:


o All of the expected counts are greater than 5
o Each observation is independent and our population is very large
• Next, carry out a goodness of fit test. We will set our null and alternative
hypotheses:
𝐻0 : The 2009 BMI distribution is still correct.
𝐻1 : The 2009 BMI distribution is no longer correct
• Using Python:
observed = [102, 178, 186, 34]
expected = [156, 165.5, 147, 31.5]
chi_squared, p_value = stats.chisquare(f_obs = observed, f_exp = expected)
print(chi_squared, p_value)
# 30.1817679275599, 1.26374310311106e-06

• Our p-value is lower than .05; therefore, we may reject the null hypothesis in favor
of the fact that the BMI trends today are different from what they were in 2009.

Chi-square test for association/independence (NOT IN SYLLABUS)


• The chi-square test for association/independence helps us ascertain whether two
categorical variables are independent of one another.
• Assumptions:
o All expected counts are at least 5
o Individual observations are independent and the population should be at
least 10 times as large as the sample
• Ex: We ran a test and exposed half of our users to a certain landing page (Website
A), exposed the other half to a different landing page (Website B), and then,
measured the sign up rates for both sites. We obtained the following results:

Q) Is there an association between the two variables: a) the website the user was
exposed to, and b) whether the user signed up?
A) For this, we will use our chi-square test. Let's set up our hypotheses:
• 𝐻0 : There is no association between 2 categorical variables and these are
independent in the population of interest. (The two variables are independent)
• 𝐻1 : There is an association between two categorical variables and these are not
independent in the population of interest. (The two variables are dependent)
To calculate the expected values, we do the following:
Expected Value = (Row total) × (Column total) / (Grand Total)
                Did not sign up                     Signed Up                       Row Total
Website A       134 (E = 188*244/346 = 132.5)       54 (E = 188*102/346 = 55.4)        188
Website B       110 (E = 158*244/346 = 111.4)       48 (E = 158*102/346 = 46.5)        158
Column Total    244                                 102                                346

Using the above formulas:
χ² = (134 − 132.5)²/132.5 + (54 − 55.4)²/55.4 + (110 − 111.4)²/111.4 + (48 − 46.5)²/46.5 = 0.118
df = (2 − 1) * (2 − 1) = 1
From the Chi-square distribution table, for α = 0.05 and df = 1, we get Chi-square
value as 3.84.
As 0.118 < 3.84, we accept the null hypothesis (i.e., we fail to reject our null
hypothesis).
Hence, we cannot say for sure that seeing a particular website has any effect on a
user's sign up. There is no association between these variables.
Python Implementation
observed = np.array([[134, 54], [110, 48]])
chi_squared, p_value, df, matrix = stats.chi2_contingency(observed=observed)

Anova
• Analysis of Variance (ANOVA) is a hypothesis-testing technique used to test the
equality of two or more population (or treatment) means by examining the
variance of samples that are taken.
• It determines whether the differences between the samples are simply due to
random error (sampling errors) or whether there are systematic treatment effects
that causes the mean in one group to differ from the mean in another.
• Most of the time ANOVA is used to compare the equality of three or more means,
however when the means from two samples are compared using ANOVA it is
equivalent to using a t-test to compare the means of independent samples.

Anova Example
Suppose the National Transportation Safety Board (NTSB) wants to examine the
safety of compact cars, midsize cars, and full-size cars. It collects a sample of three for
each of the treatments (cars types). Using the hypothetical data provided below, test
whether the mean pressure applied to the driver’s head during a crash test is equal
for each type of car. Use α = 5%.
       Compact Cars   Midsize Cars   Full-size Cars
            643            469             484
            655            427             456
            702            525             402
x̄         666.67         473.67          447.33
s          31.18          49.17           41.68
Note: for sample S.D. (s), use sample standard deviation (s_x) in the calculator
A) 1. State the null and alternative hypotheses:
The null hypothesis for an ANOVA always assumes the population means are equal.
Hence, we may write the null hypothesis as:
𝑯𝟎 : The mean head pressure is statistically equal across the three types of cars
(i.e., 𝜇1 = 𝜇2 = 𝜇3 ) where 𝜇𝑖 is the population mean
Since the null hypothesis assumes all the means are equal, we could reject the null
hypothesis if even one mean is not equal. Thus, the alternative hypothesis is:
H₁: At least one mean pressure is not statistically equal.
2. Calculate the appropriate test statistic
The test statistic in ANOVA is the ratio of the between and within variation in the
data. It follows an F distribution.
Total Sum of Squares, SST = Σᵢ Σⱼ (xᵢⱼ − x̿)²
where r is the number of rows in the table, c is the number of columns,
x̿ is the grand mean, and xᵢⱼ is the iᵗʰ observation in the jᵗʰ column.
Grand mean, x̿ = (643 + 655 + 702 + 469 + 427 + 525 + 484 + 456 + 402) / 9 = 529.22

Between Sum of Squares (or Treatment Sum of Squares), SSTR = Σ rⱼ * (x̄ⱼ − x̿)²
where rⱼ is the number of rows in the jᵗʰ treatment and x̄ⱼ is the mean of the jᵗʰ column.
Within variation (or Error Sum of Squares), SSE = Σ Σ (xᵢⱼ − x̄ⱼ)²
Note: SST = SSTR + SSE (96303.55 = 86049.55 + 10254). Hence, we only need to find
any 2 among (SST, SSTR and SSE).
3. The next step in an ANOVA is to compute the "average" sources of variation in the
data using SST, SSTR, and SSE.
Total Mean Squares, MST = SST / (N − 1)
It is the average total variation in the data (N is the total number of observations).
Mean Square Treatment, MSTR = SSTR / (c − 1)
It is the average between variation (c is the number of columns in the data table).
Mean Square Error, MSE = SSE / (N − c)
It is the average within variation.
4. The test statistic may now be calculated.
For a one-way ANOVA the test statistic is equal to the ratio of MSTR and MSE:
F = MSTR / MSE = (86049.55 / 2) / (10254 / 6) = 43024.78 / 1709 = 25.17
5. Obtain the Critical Value
𝒅𝒇𝟏 = 𝒄 − 𝟏 = 3 - 1 = 2
𝒅𝒇𝟐 = 𝑵 − 𝒄 = 9 - 3 = 6
Hence, we need to find 𝐹𝑐𝑣 = 𝐹2,6 from the F-distribution table for numerator df = 2,
denominator df = 6. From the table, 𝐹2,6 = 5.143
6. Decision Rule
Reject hypothesis if 𝐹𝑜𝑏𝑠𝑒𝑟𝑣𝑒𝑑 > 𝐹𝑐𝑣
As 25.17 > 5.143, we reject the null hypothesis.
7. Interpretation
Since we rejected the null hypothesis, we are 95% confident (1 - α) that the mean
head pressure is not statistically equal for compact, midsize, and full-size cars.
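scipy.stats can run the whole one-way ANOVA in a single call (a minimal sketch using the car data above):
from scipy import stats

compact = [643, 655, 702]
midsize = [469, 427, 525]
fullsize = [484, 456, 402]

f_statistic, p_value = stats.f_oneway(compact, midsize, fullsize)
print(f_statistic, p_value)   # F ≈ 25.17, p ≈ 0.0012 < 0.05 → reject the null hypothesis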

One way and two-way classifications of Anova

• Definition:
o One-Way ANOVA: a test that allows one to make comparisons between the
means of three or more groups of data.
o Two-Way ANOVA: a test that allows one to make comparisons between the
means of three or more groups of data, where two independent variables are
considered.
• Number of Independent Variables:
o One-Way ANOVA: one.
o Two-Way ANOVA: two.
• What is Being Compared?
o One-Way ANOVA: the means of three or more groups of an independent
variable on a dependent variable.
o Two-Way ANOVA: the effect of multiple groups of two independent variables
on a dependent variable and on each other.
• Number of Groups of Samples:
o One-Way ANOVA: three or more.
o Two-Way ANOVA: each variable should have multiple samples.
