M3 - FDS
Advanced Probability
Basic definitions
• Procedure: A procedure is an act that leads to a result, such as rolling a
die or visiting a website.
• Event: An event is a collection of the outcomes of a procedure, such as getting a
head on a coin flip or leaving a website after only 4 seconds.
• Simple Event: A simple event is an outcome or event of a procedure that cannot be
broken down further. For example, rolling two dice can be broken down into two
simple events: rolling die 1 and rolling die 2.
• Sample Space: The sample space of a procedure is the set of all possible simple
events. For example, an experiment is performed, in which a coin is flipped three
times in succession. The sample space would be:
{HHH, HHT, HTT, HTH, TTT, TTH, THH, THT}.
Probability
• The probability of an event is the frequency, or chance, that the event will happen.
• If 𝐴 is an event, 𝑃(𝐴) is the probability of the occurrence of the event.
• We can define the actual probability of an event, A, as follows:
𝑃(𝐴) = (number of ways A occurs) / (size of sample space)
• Here, A is the event in question. Think of an entire universe of
events where anything is possible, and let's represent it as a circle.
• We can think of a single event, A, as being a smaller circle within that larger
universe, as shown in the diagram.
Frequentist Approach
Compound Event
• A compound event is any event that combines two or more simple events.
• Given events A and B:
o The probability that A and B occur is 𝑃(𝑨 ⋂ 𝑩) = 𝑃(𝑨 𝑎𝑛𝑑 𝑩)
o The probability that either A or B occurs is 𝑃(𝑨 ⋃ 𝑩) = 𝑃(𝑨 𝑜𝑟 𝑩)
Conditional Probability
• Conditional Probability 𝑃(𝐴|𝐵) is the probability of an event 𝐴 given that another
event 𝐵 has already happened.
Rules/Axioms of Probability
• Addition rule: 𝑃 (𝐴 ⋃ 𝐵 ) = 𝑃 (𝐴) + 𝑃(𝐵 ) − 𝑃(𝐴 ⋂ 𝐵)
• Mutual Exclusivity: 𝑃(𝐴 ⋃ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵 ) − 𝑃(𝐴 ⋂ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵)
where, A and B are mutually exclusive events and cannot occur at the same time,
hence 𝑃(𝐴 ⋂ 𝐵) = 0
• Multiplication rule: 𝑃(𝐴 ⋂ 𝐵) = 𝑃(𝐴) ∗ 𝑃(𝐵|𝐴)
• Independence: 𝑃(𝐴 ⋂ 𝐵) = 𝑃(𝐴) ⋅ 𝑃(𝐵|𝐴) = 𝑃(𝐴) ∗ 𝑃(𝐵), where A and B are
independent events, i.e., one event does not affect the outcome of the other.
Hence, 𝑃 (𝐵 |𝐴) = 𝑃(𝐵)
• Complementary Events: 𝑃 (𝐴) = 1 − 𝑃(𝐴′ )
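These rules can be verified by brute-force enumeration over a small sample space; a minimal sketch using two fair dice (the events A and B are illustrative choices, not from the notes):

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two fair dice.
space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) = number of favourable outcomes / size of sample space."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

A = lambda o: o[0] + o[1] == 7   # event A: the dice sum to 7
B = lambda o: o[0] % 2 == 0      # event B: the first die is even

p_a, p_b = prob(A), prob(B)
p_and = prob(lambda o: A(o) and B(o))
p_or = prob(lambda o: A(o) or B(o))

# Addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
assert p_or == p_a + p_b - p_and
# These two events happen to be independent: P(A ∩ B) = P(A) ∗ P(B)
assert p_and == p_a * p_b
# Complementary events: P(A') = 1 − P(A)
assert prob(lambda o: not A(o)) == 1 - p_a
```

Using `Fraction` keeps every probability exact, so the rules hold with equality rather than up to rounding error.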
Confusion Matrix
• When a Binary Classifier (one that predicts
between 2 choices) is used, we can draw
what are called confusion matrices, which
are 2 x 2 matrices that house all the four
possible outcomes of our experiment.
• False positives are sometimes called Type 1 errors, whereas false negatives are
called Type 2 errors.
• Performance Measures:
1. Accuracy = (TP + TN) / (TP + FP + FN + TN)
2. Precision (P) = TP / (TP + FP)
3. Recall (R) = TP / (TP + FN)
4. F1 Score = 2 / (1/R + 1/P)
5. Specificity = TN / (TN + FP)
6. Sensitivity = TP / (TP + FN) (same as Recall)
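The six measures can be wrapped in one small helper; a minimal sketch (the counts at the bottom are illustrative, not from the notes):

```python
def metrics(tp, fp, fn, tn):
    """Confusion-matrix performance measures for a binary classifier."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # also called sensitivity
    f1 = 2 / (1 / recall + 1 / precision)   # harmonic mean of P and R
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Illustrative counts: 40 TP, 10 FP, 5 FN, 45 TN
acc, p, r, f1, spec = metrics(tp=40, fp=10, fn=5, tn=45)
```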
Example of Confusion Matrix
Consider the following matrix formed from a multi-class classifier. Calculate Precision
and Recall per class, weighted average Precision and Recall.
                    Predicted
                    Apple   Banana   Cherry   Total
Actual   Apple        15        2        3      20
         Banana        7       15        8      30
         Cherry        2        3       45      50
         Total        24       20       56     100
Ans: Individual Class Precision and Recall
Precision = correctly predicted / predicted total      Recall = correctly predicted / actual total

Class Apple precision = 15/24 = 0.625     Class Apple recall = 15/20 = 0.75
Class Banana precision = 15/20 = 0.75     Class Banana recall = 15/30 = 0.5
Class Cherry precision = 45/56 = 0.80     Class Cherry recall = 45/50 = 0.90
Overall Classifier Accuracy:
Overall Accuracy = total correctly predicted / total instances = (15 + 15 + 45) / 100 = 0.75
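The per-class and weighted figures can be computed directly from the confusion matrix; a sketch with NumPy:

```python
import numpy as np

# Rows = actual class, columns = predicted class (Apple, Banana, Cherry).
cm = np.array([[15,  2,  3],
               [ 7, 15,  8],
               [ 2,  3, 45]])

correct = np.diag(cm)                 # correctly predicted, per class
precision = correct / cm.sum(axis=0)  # correctly predicted / predicted total
recall = correct / cm.sum(axis=1)     # correctly predicted / actual total
accuracy = correct.sum() / cm.sum()   # overall classifier accuracy

# Weighted averages: weight each class by its share of the actual totals.
support = cm.sum(axis=1) / cm.sum()
weighted_precision = (support * precision).sum()
weighted_recall = (support * recall).sum()
```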
Example: 64 people were surveyed on whether they like the KitKat and Bounty chocolate bars:

                  Likes Bounty   Dislikes Bounty   Total
Likes KitKat            15              17           32
Dislikes KitKat         16              16           32
Total                   31              33           64
Q1) What is the probability that someone likes both given they like KitKat?
Ans: 𝑃(𝑙𝑖𝑘𝑒𝑠 𝑏𝑜𝑡ℎ | 𝑙𝑖𝑘𝑒𝑠 𝐾𝑖𝑡𝐾𝑎𝑡) = 15/32 = 0.46
Q2) What is the probability that a person dislikes Bounty given that they like KitKat?
Ans: 𝑃(𝑑𝑖𝑠𝑙𝑖𝑘𝑒𝑠 𝐵𝑜𝑢𝑛𝑡𝑦 | 𝑙𝑖𝑘𝑒𝑠 𝐾𝑖𝑡𝐾𝑎𝑡) = 17/32 = 0.53
Bayesian Approach
Bayes Theorem
• Recalling previously defined formulas:
o 𝑃(𝐴) = The probability that event A occurs
o 𝑃(𝐴|𝐵) = The probability that A occurs, given that B occurred
o 𝑃(𝐴 and 𝐵) = 𝑃(𝐴, 𝐵) = 𝑃(𝐴 ∩ 𝐵) = The probability that both A and B occur
o 𝑃(𝐴 and 𝐵) = 𝑃(𝐴, 𝐵) = 𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴) ∗ 𝑃(𝐵|𝐴)
• We also know that,
𝑃(𝐴, 𝐵) = 𝑃(𝐴) ∗ 𝑃(𝐵|𝐴)
𝑃(𝐵, 𝐴) = 𝑃(𝐵) ∗ 𝑃(𝐴|𝐵)
𝑃(𝐴, 𝐵) = 𝑃(𝐵, 𝐴)
Hence, 𝑃(𝐵) ∗ 𝑃(𝐴|𝐵) = 𝑃(𝐴) ∗ 𝑃(𝐵|𝐴)
Dividing both sides by 𝑃(𝐵), we get the Bayes Theorem:
𝑃(𝐴|𝐵) = 𝑃(𝐴) ∗ 𝑃(𝐵|𝐴) / 𝑃(𝐵)
Now, rewriting it as
𝑃(𝐻|𝐷) = 𝑃(𝐻) ∗ 𝑃(𝐷|𝐻) / 𝑃(𝐷)
Where, H = your hypothesis about a given data and D = data that is given to you
Now,
• P(H) is the probability of the hypothesis before we observe the data, called the
prior probability or just prior
• P(H|D) is what we want to compute, the probability of the hypothesis after we
observe the data, called the posterior
• P(D|H) is the probability of the data under the given hypothesis, called the
likelihood
• P(D) is the probability of the data under any hypothesis, called the normalizing
constant.
Example 1: Consider that you have two people in charge of writing blog posts for your
company—Lucy and Avinash. From past performances, you have liked 80% of Lucy's
work and only 50% of Avinash's work. A new blog post comes to your desk in the
morning, but the author isn't mentioned. You love the article. A+. What is the
probability that it came from Avinash? Each blogger blogs at a very similar rate.
Ans: Let
𝐻 = hypothesis = the blog came from Avinash
𝐷 = data = you loved the blog post. Therefore,
𝑃(𝐻|𝐷) = the chance that it came from Avinash, given that you loved it
𝑃(𝐷|𝐻) = the chance that you loved it, given that it came from Avinash
𝑃(𝐻) = the chance that an article came from Avinash
𝑃(𝐷) = the chance that you love an article
So, we want to know P(H|D). Using Bayes theorem, as shown, here:
𝑃(𝐻|𝐷) = 𝑃(𝐻) ∗ 𝑃(𝐷|𝐻) / 𝑃(𝐷)
Now,
𝑃(𝐻) = 0.5, as both Avinash and Lucy write at similar rates (as per the question)
𝑃 (𝐷 |𝐻) = 0.5 as given in the question
𝑃(𝐷) = ? To find it, we must take into account both scenarios: the post came from
Lucy, or it came from Avinash. If the hypotheses form a suite, then we can use our
laws of probability.
• A suite is formed when a set of hypotheses is both collectively exhaustive and
mutually exclusive. In layman's terms, exactly one hypothesis in a suite can
occur.
• In our case, the two hypotheses are that the article came from Lucy, or that the
article came from Avinash. This is a suite because of the following reasons:
o At least one of them wrote it
o At most one of them wrote it
o Therefore, only one of them wrote it
When we have a suite, we can use our multiplication and addition rules, as follows:
𝑃 (𝐷) = 𝑃(𝑓𝑟𝑜𝑚 𝐴𝑣𝑖𝑛𝑎𝑠ℎ 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡 ) 𝐎𝐑 𝑃(𝑓𝑟𝑜𝑚 𝐿𝑢𝑐𝑦 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡 )
Using the multiplication rules, we get:
𝑃(𝑓𝑟𝑜𝑚 𝐴𝑣𝑖𝑛𝑎𝑠ℎ 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡) = 𝑃(𝑓𝑟𝑜𝑚 𝐴𝑣𝑖𝑛𝑎𝑠ℎ) ∗ 𝑃(𝐿𝑜𝑣𝑒𝑑 𝑖𝑡 |𝑓𝑟𝑜𝑚 𝐴𝑣𝑖𝑛𝑎𝑠ℎ)
∴ 𝑃(𝑓𝑟𝑜𝑚 𝐴𝑣𝑖𝑛𝑎𝑠ℎ 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡) = 0.5 ∗ 0.5 = 0.25
Similarly,
𝑃(𝑓𝑟𝑜𝑚 𝐿𝑢𝑐𝑦 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡) = 𝑃(𝑓𝑟𝑜𝑚 𝐿𝑢𝑐𝑦) ∗ 𝑃(𝐿𝑜𝑣𝑒𝑑 𝑖𝑡 | 𝑓𝑟𝑜𝑚 𝐿𝑢𝑐𝑦)
∴ 𝑃(𝑓𝑟𝑜𝑚 𝐿𝑢𝑐𝑦 𝐀𝐍𝐃 𝐿𝑜𝑣𝑒𝑑 𝑖𝑡) = 0.5 ∗ 0.8 = 0.4
Hence, 𝑃(𝐷) = 0.25 + 0.4 = 0.65
∴ 𝑃(𝐻|𝐷) = 𝑃(𝐻) ∗ 𝑃(𝐷|𝐻) / 𝑃(𝐷) = (0.5 ∗ 0.5) / 0.65 = 0.38
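The computation above is short enough to script; a sketch using the numbers from the example (the variable names are illustrative):

```python
# Priors: both bloggers post at the same rate.
p_avinash, p_lucy = 0.5, 0.5
# Likelihoods of loving a post, given its author (from past performance).
p_loved_given_avinash, p_loved_given_lucy = 0.5, 0.8

# Normalizing constant: sum over the suite of hypotheses.
p_loved = (p_avinash * p_loved_given_avinash
           + p_lucy * p_loved_given_lucy)                 # 0.65

# Bayes theorem: P(H|D) = P(H) * P(D|H) / P(D)
posterior = p_avinash * p_loved_given_avinash / p_loved   # ≈ 0.38
```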
• To find 𝑃(𝐸) we have to consider two things: 𝑃(𝐸 𝑎𝑛𝑑 𝐷) as well as 𝑃(𝐸 𝑎𝑛𝑑 𝑁).
𝑃 (𝐸 ) = 𝑃(𝐸 𝑎𝑛𝑑 𝐷) 𝑜𝑟 𝑃(𝐸 𝑎𝑛𝑑 𝑁)
𝑃 (𝐸 ) = 𝑃(𝐷) ∗ 𝑃 (𝐸 |𝐷 ) + 𝑃(𝑁 ) ∗ 𝑃 (𝐸 |𝑁 )
𝑃 (𝐸 ) = 0.05 ∗ 0.60 + 0.95 ∗ 0.01 = 0.0395
Hence,
𝑃(𝐷|𝐸) = (0.6 ∗ 0.05) / 0.0395 = 0.76 ⇒ 76%
Example 4: We are given a sentence “A very
close game”, a training set of five sentences
(as shown), and their corresponding
category (Sports or Not Sports).
Predict which category the given sentence
would fall under: “A very close game”
• Calculate the probability that “A very close game” is Sports, as well as the
probability that it is Not Sports:
o P(Sports | A very close game)
o P(Not Sports | A very close game)
• The one with the higher probability will be the result.
Feature Engineering:
P(a very close game | Sports) = P(a | Sports) * P(very | Sports) * P(close | Sports) *
P(game | Sports) [1]
P(a very close game | Not Sports) = P(a | Not Sports) * P(very | Not Sports) *
P(close | Not Sports) * P(game | Not Sports) [2]
• Calculating the probabilities from word counts in the training set:
Note: The word “close” does not appear in
the category Sports,
∴ P(close | Sports) = 0, leading to
P(a very close game | Sports) = 0 (Eqn. [1]).
• It is problematic when a frequency-
based probability is zero, because it will
wipe out the information in the other probabilities.
• Hence, we apply Laplace smoothing: add 1 to each numerator and the vocabulary
size (14 distinct words) to each denominator.
Finally,
P(a very close game | Sports) = 3/25 ∗ 2/25 ∗ 1/25 ∗ 3/25 = 4.6 × 10⁻⁵
P(a very close game | Not Sports) = 2/23 ∗ 1/23 ∗ 2/23 ∗ 1/23 = 1.4 × 10⁻⁵
∴ as 4.6 × 10⁻⁵ > 1.4 × 10⁻⁵ ⇒ the given sentence can be categorized as ‘Sports’
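The smoothed calculation can be scripted; a sketch assuming the word counts implied by the fractions above (11 total words in Sports, 9 in Not Sports, and a vocabulary of 14 distinct words):

```python
# Word counts implied by the smoothed fractions in the notes.
counts = {
    "Sports":     {"a": 2, "very": 1, "close": 0, "game": 2},
    "Not Sports": {"a": 1, "very": 0, "close": 1, "game": 0},
}
totals = {"Sports": 11, "Not Sports": 9}
vocab_size = 14

def likelihood(sentence, category):
    """P(sentence | category) with Laplace (add-one) smoothing."""
    p = 1.0
    for word in sentence.split():
        p *= (counts[category].get(word, 0) + 1) / (totals[category] + vocab_size)
    return p

s = "a very close game"
p_sports = likelihood(s, "Sports")       # ≈ 4.6e-5
p_not = likelihood(s, "Not Sports")      # ≈ 1.4e-5
prediction = "Sports" if p_sports > p_not else "Not Sports"
```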
Random variables
• A random variable uses real numerical values to describe a probabilistic event.
• A random variable is a function that maps outcomes from the sample space of a
procedure (the set of all possible outcomes) to real numbers.
• For example, we might have:
o X = the outcome of a dice roll
o Y = the revenue earned by a company this year
o Z = the score of an applicant on an interview coding quiz (0-100%)
• There are two main types of random variables: discrete and continuous.
Example 2: Let the X random variable represent the success of a product. X is indeed
a discrete random variable because the X variable can only take on one of five
options: 0, 1, 2, 3, or 4 (a value of 0 represents a failure and 4 represents a success)
The probability distribution of X assigns a probability to each of these five values.
• The probability mass function (PMF) for a binomial random variable is:
P(X = k) = nCk ∗ p^k ∗ (1 − p)^(n−k)
Example 1 of Binomial RV: A new restaurant in a town has a 20% chance of surviving
its first year. If 14 restaurants open this year, find the probability that exactly 4
restaurants survive their first year of being open to the public.
Ans: First, we should prove that this is a binomial setting:
• The possible outcomes are either success or failure (each restaurant either
survives or does not survive)
• The outcome of one trial cannot affect the outcome of another trial (assume that
the opening of a restaurant doesn't affect another restaurant's opening and survival)
• The number of trials was set (14 restaurants opened)
• The chance of success of each trial must always be p (we assume that it is always
20%)
Here, we have our two parameters of n = 14 and p = 0.2.
𝑃(𝑋 = 𝑘) = nCk ∗ p^k ∗ (1 − p)^(n−k)
𝑃(𝑋 = 4) = 14C4 ∗ 0.2^4 ∗ (1 − 0.2)^(14−4) = 0.17
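The binomial PMF is easy to evaluate directly; a minimal sketch using only the standard library:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) = nCk * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 14 restaurants, each with a 20% chance of surviving; exactly 4 survive:
p4 = binom_pmf(4, 14, 0.2)   # ≈ 0.172
```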
Example 2 of Binomial RV: A couple has a 25% chance of a having a child with type O
blood. What is the chance that 3 of their 5 kids have type O blood?
Ans: Let X = the number of children with type O blood with n = 5 and p = 0.25, as
shown here:
𝑃(𝑋 = 3) = 5C3 ∗ 0.25^3 ∗ (1 − 0.25)^(5−3) = 0.088
We can calculate this probability for the values of 0, 1, 2, 3, 4, and 5 to get a sense of
the probability distribution:
From here, we can calculate an expected value and the variance of this variable:
𝐸 [𝑋 ] = 0(0.23) + 1(0.4) + 2(0.26) + 3(0.09) + 4(0.01) + 5(0.0009) = 1.25
𝑉 [𝑋 ] = [02 ∗ 0.23 + 12 ∗ 0.4 + 22 ∗ 0.26 + 32 ∗ 0.09 + 42 ∗ 0.01 + 52 ∗ 0.0009]
− 1.252 = 0.93
Hence, the expected number of children with type O blood is 1.25 ± 0.93, that is,
1 or 2 kids. What if we want to know the probability that at least 3 of their kids
have type O blood?
Example 1 of Geometric RV: There is a 34% chance that it will rain on any day in April.
Find the probability that the first day of rain in April will occur on April 4th. Also find
the probability that the first rain of the month will happen within the first four days.
Ans: 𝑃(𝑋 = 4) = (1 − 0.34)4−1 ∗ (0.34) = 0.0977 ≈ 0.1 ⇒ 10% chance it will rain
on April 4th.
The probability that it will rain by the fourth of April is as follows:
𝑃(𝑋 ≤ 4) = 𝑃(1) + 𝑃(2) + 𝑃(3) + 𝑃(4) = 0.34 + 0.22 + 0.15 + 0.10 = 0.81
So, there is an 81% chance that the first rain of the month will happen within the
first four days.
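A matching sketch for the geometric PMF (first success on trial k):

```python
def geom_pmf(k, p):
    """P(first success occurs on trial k) = (1 - p)^(k - 1) * p."""
    return (1 - p) ** (k - 1) * p

p_day4 = geom_pmf(4, 0.34)                               # ≈ 0.098
p_within4 = sum(geom_pmf(k, 0.34) for k in range(1, 5))  # ≈ 0.81
```

The cumulative sum also equals 1 − (1 − p)⁴, the chance that the first four days are not all dry.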
Example 1 of Poisson RV: The number of calls arriving at your call center follows a
Poisson distribution at the rate of 5 calls/hour. What is the probability that exactly 6
calls will come in between 10 and 11 PM?
Ans: Let X be the number of calls that arrive between 10 and 11 PM. Given 𝜇 = 5.
∴ 𝑃(𝑋 = 6) = e^(−5) ∗ 5^6 / 6! = 0.146 ⇒ 14.6% chance that exactly 6 calls will
come between 10 and 11 PM.
Example 2 of Poisson RV: The average number of homes sold by the Acme Realty
company is 2 homes per day. What is the probability that exactly 3 homes will be sold
tomorrow?
Ans: Let X be the number of homes sold in a day.
Given 𝜇 = 2.
∴ 𝑃(𝑋 = 3) = e^(−2) ∗ 2^3 / 3! = 0.180
⇒ 18% chance that exactly 3 homes will be sold tomorrow
Note: If the problem has the word ‘average’ in it, it is mostly a Poisson distribution.
On the other hand, if an exact probability is given, it is mostly a Binomial distribution.
Example 3 of Poisson RV: A life insurance salesman sells on the average 3 life
insurance policies per week. Use Poisson's law to calculate the probability that in a
given week he will sell
a) Some policies
b) 2 or more policies but less than 5 policies.
c) Assuming that there are 5 working days per week, what is the probability that in a
given day he will sell one policy?
Ans: Given, 𝜇 = 3
a) "Some policies" means "1 or more policies". We can work this out by finding 1
minus the "zero policies" probability:
𝑃(𝑋 > 0) = 1 − 𝑃(𝑋 = 0) = 1 − e^(−3) ∗ 3^0 / 0! = 0.95 ⇒ 95%
b) The probability of selling 2 or more but less than 5 policies:
𝑃(2 ≤ 𝑋 < 5) = 𝑃(2) + 𝑃 (3) + 𝑃 (4) = 0.616 ⇒ 61.6%
c) Average number of policies sold per day: 3/5 = 0.6
So on a given day, 𝑃(𝑋 = 1) = e^(−0.6) ∗ 0.6^1 / 1! = 0.329 ⇒ 32.9%
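The Poisson examples above can all be checked with one small helper; a sketch using only the standard library:

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """P(X = k) for a Poisson random variable with mean rate mu."""
    return exp(-mu) * mu**k / factorial(k)

p_6_calls = poisson_pmf(6, 5)                         # Example 1: ≈ 0.146
p_3_homes = poisson_pmf(3, 2)                         # Example 2: ≈ 0.180
p_some = 1 - poisson_pmf(0, 3)                        # Example 3a: ≈ 0.95
p_2_to_4 = sum(poisson_pmf(k, 3) for k in (2, 3, 4))  # Example 3b: ≈ 0.616
p_one_per_day = poisson_pmf(1, 3 / 5)                 # Example 3c: ≈ 0.329
```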
Example 4 of Poisson RV: 20 sheets of aluminium alloy were examined for surface
flaws. The frequency of the number of sheets with a given number of flaws per sheet
was as follows:
Number of flaws:  0  1  2  3  4  5  6
Frequency:        4  3  5  2  4  1  1
What is the probability of finding a sheet chosen at random which contains 3 or more
surface flaws?
Ans: Total nos. of flaws: (0*4)+(1*3)+(2*5)+(3*2)+(4*4)+(5*1)+(6*1) = 46
Average number of flaws per sheet: 46/20 = 𝜇 = 2.3
Required probability 𝑃(𝑋 ≥ 3) = 1 − (𝑃(0) + 𝑃(1) + 𝑃(2)) = 1 − (0.100 + 0.231 + 0.265) = 0.404
Example 5 of Poisson RV: If electricity power failures occur according to a Poisson
distribution with an average of 3 failures every twenty weeks, calculate the
probability that there will not be more than one failure during a particular week.
Ans: Average failures per week: 𝜇 = 3/20 = 0.15
𝑃(𝑋 ≤ 1) = 𝑃(0) + 𝑃(1) = e^(−0.15) + e^(−0.15) ∗ 0.15 = 0.861 + 0.129 = 0.99
• A continuous random variable is described by a function 𝑓(𝑥) known as the
probability density function (PDF). The PDF is the continuous random variable
version of the PMF for discrete random variables.
• In general, quantities such as pressure, height, mass, weight, density, volume,
temperature, and distance are examples of continuous random variables
• The most important continuous distribution is the normal distribution. The PDF
of this distribution is as follows:
f(x) = (1 / (σ√(2π))) ∗ e^(−(x − μ)² / (2σ²))
where,
o 𝜇 is the mean of the variable
o 𝜎 is the standard deviation
• When plotted on a graph, the normal distribution gives a bell-shaped
curve.
Properties of a Normal Distribution
o The normal curve is symmetrical about the mean μ;
o The mean is at the middle and divides the area into halves;
o The total area under the curve is equal to 1;
o It is completely determined by its mean μ and standard deviation σ (or
variance σ²) – only 2 parameters
Correlation and Covariance
• Both the terms measure the relationship and the dependency between two
variables
Covariance:
• Covariance shows how two variables vary together, whereas correlation shows
how strongly they are related on a normalized, unit-free scale. In other words,
covariance quantifies the dependence between two random variables X and Y.
• While variance measures how a single variable deviates from its mean, covariance
measures how two variables vary in tandem from their means.
Example: consider the dataset x = {1, 2, 3, 4, 5} and y = {3, 5, 11, 11, 16}.
From the dataset, we get 𝒙̄ = 3 and 𝒚̄ = 9.2.

𝒙    𝒚    𝒙 − 𝒙̄   𝒚 − 𝒚̄   (𝒙 − 𝒙̄) ∗ (𝒚 − 𝒚̄)
1    3     −2      −6.2        12.4
2    5     −1      −4.2         4.2
3   11      0       1.8         0
4   11      1       1.8         1.8
5   16      2       6.8        13.6
                   Σ(𝒙 − 𝒙̄) ∗ (𝒚 − 𝒚̄) = 32

𝑖𝑓 𝑑𝑎𝑡𝑎𝑠𝑒𝑡 𝑖𝑠 𝑎 𝑠𝑎𝑚𝑝𝑙𝑒: 𝑐𝑜𝑣(𝑥, 𝑦)𝑠𝑎𝑚𝑝𝑙𝑒 = Σ(𝒙 − 𝒙̄)(𝒚 − 𝒚̄) / (n − 1) = 32 / (5 − 1) = 8
𝑖𝑓 𝑑𝑎𝑡𝑎𝑠𝑒𝑡 𝑖𝑠 𝑎 𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛: 𝑐𝑜𝑣(𝑥, 𝑦)𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 = Σ(𝒙 − 𝒙̄)(𝒚 − 𝒚̄) / n = 32 / 5 = 6.4

Hence,
𝑐𝑜𝑟𝑟(𝑥, 𝑦)𝑠𝑎𝑚𝑝𝑙𝑒 = 𝑐𝑜𝑣(𝑥, 𝑦) / (𝒔𝒙 ∗ 𝒔𝒚) = 8 / (1.581 ∗ 5.215) = 0.97
𝑐𝑜𝑟𝑟(𝑥, 𝑦)𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 = 𝑐𝑜𝑣(𝑥, 𝑦) / (𝝈𝒙 ∗ 𝝈𝒚) = 6.4 / (1.414 ∗ 4.664) = 0.97

Calculator check: use the calculator's Statistics mode to get 𝒔𝒙 = 1.581,
𝒔𝒚 = 5.215, 𝝈𝒙 = 1.414, 𝝈𝒚 = 4.664 (𝝈 is used for population). To confirm the
correlation coefficient: enter Statistics mode (6), select 𝑦 = 𝑎 + 𝑏𝑥 (2), enter
the data, click ‘OPTN’ and select 'Regression Calculator'; the '𝑟' value is the
correlation coefficient. To confirm the covariance: if sample, 𝑐𝑜𝑣 = 𝑠𝑥 ∗ 𝑠𝑦 ∗ 𝑟;
if population, 𝑐𝑜𝑣 = 𝜎𝑥 ∗ 𝜎𝑦 ∗ 𝑟.
Python Implementation of Covariance and Correlation
import numpy as np

x = [1, 2, 3, 4, 5]
y = [3, 5, 11, 11, 16]

# np.cov and np.corrcoef return 2D arrays; the off-diagonal
# entry [0][1] holds the value for the pair (x, y).
cov_s = np.cov(x, y)[0][1]             # covariance of sample
cov_p = np.cov(x, y, bias=True)[0][1]  # covariance of population
corr = np.corrcoef(x, y)[0][1]         # correlation coefficient
Point estimates
• A point estimate is an estimate of a population parameter based on sample data.
• We use point estimates to estimate population means, variances etc.
• To obtain these estimates, we simply apply the function that we wish to measure
for our population to a sample of the data.
• For example, suppose there is a company of 9,000 employees and we are
interested in ascertaining the average length of breaks taken by employees in a
single day.
• As we probably cannot ask every single person, we will take a sample from the
9,000 people and compute the mean of that sample.
• This sample mean will be our point estimate.
Sampling distributions
• Many statistical tests rely on data that follows a normal pattern, and for the most
part, a lot of real-world data is not normal.
• Since most real-world data is not normal, many of the most popular statistics tests
may not apply.
• However, if we follow the given procedure, we can create normal data!
• We utilize what is known as a sampling distribution, which is a distribution of
point estimates of several samples of the same size.
• An example of the procedure for creating a sampling distribution:
o Take 500 samples from the population, each sample having 100 points each
o Find the mean of each sample and add it to the list of point estimates.
o Plot the histogram of the point estimates (would be a normal distribution)
• The data converges to a normal distribution because of something called the
central limit theorem (CLT).
• The CLT states that the sampling distribution (the distribution of point estimates)
will approach a normal distribution as the size of the samples taken increases.
• As we take more and more samples, the mean of the sampling distribution will
approach the true population mean.
𝑍𝑠𝑐𝑜𝑟𝑒 = (𝑥̅ − 𝜇) / (𝜎/√𝑛)
𝑤ℎ𝑒𝑟𝑒 𝒙̅ → 𝑠𝑎𝑚𝑝𝑙𝑒 𝑚𝑒𝑎𝑛, 𝝁 → 𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 𝑚𝑒𝑎𝑛, 𝝈 → 𝑝𝑜𝑝. 𝑆𝐷, 𝒏 → 𝑠𝑎𝑚𝑝𝑙𝑒 𝑠𝑖𝑧𝑒
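The sampling-distribution procedure above can be simulated directly; a minimal sketch using the standard library (the exponential "break length" population is illustrative synthetic data):

```python
import random
import statistics

random.seed(0)
# A decidedly non-normal population: 9,000 exponential "break lengths"
# (synthetic data, mean ≈ 20 minutes).
population = [random.expovariate(1 / 20) for _ in range(9000)]

# Sampling distribution: the means of 500 samples of 100 points each.
point_estimates = [
    statistics.mean(random.sample(population, 100)) for _ in range(500)
]

# Per the CLT, a histogram of point_estimates looks roughly normal, and
# its mean approaches the true population mean.
gap = abs(statistics.mean(point_estimates) - statistics.mean(population))
```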
Descriptive Statistics
• Descriptive statistics are methods for organizing and summarizing data.
• For example, tables or graphs are used to organize data, and descriptive values
such as the mean, median and mode are used to summarize data.
• A descriptive value for a population is called a parameter
• A descriptive value for a sample is called a statistic.
Inferential Statistics
• Inferential statistics are methods for using sample data to make general
conclusions (inferences) about populations.
Sampling Error
• The discrepancy between a sample statistic and its population parameter is called
sampling error.
• Defining and measuring sampling error is a large part of inferential statistics.
Types of Sampling
1. Simple Random Sampling: Under Random sampling,
every element of the population has an equal
probability of getting selected.
2. Stratified Random Sampling:
o Proportionate stratified sample
▪ We group the entire population into
subgroups by some common property like
class labels.
▪ We then randomly sample from those
groups individually, such that the groups are
still maintained in the same ratio as they were in the entire population.
o Disproportionate stratified sample
▪ The size of the sample selected from each subgroup is disproportional to
the size of that subgroup in the population.
▪ It requires weights to match the population’s ratio of each subgroup
3. Systematic Sampling
o Systematic sampling is about sampling
items from the population at regular
predefined intervals (basically fixed
and periodic intervals).
o For example — Every 5th element,
21st element and so on.
o The figure shows a pictorial view of the same — We sample every 9th and 7th
element in order and then repeat this pattern.
4. Cluster Sampling
o In Cluster sampling, we divide the entire
population into subgroups, wherein,
each of those subgroups has similar
characteristics to that of the population
when considered in totality.
o Also, instead of sampling individuals,
we randomly select the entire subgroups.
o Example: Class of 120 students divided into groups of 12 for a common class
project. Clustering parameters like (Designation, Class, Topic) are all similar
over here as well.
Confidence intervals
• A confidence level does not represent a "probability of being correct"; instead, it
represents the long-run frequency with which intervals built this way will capture
the true population parameter.
• For example, if you want to have a 95% chance of capturing the true population
parameter using only a single point estimate, we would have to set our confidence
level to 95%.
• Calculating a confidence interval involves finding a point estimate, and then,
incorporating a margin of error to create a range.
• The margin of error is a value that represents our certainty that our point estimate
is accurate and is based on our:
o desired confidence level,
o variance of the data, and
o how big your sample is
• There are many ways to calculate confidence intervals. We will look at a single way
of taking the confidence interval of a population mean. For this confidence
interval, we need the following:
o A point estimate
o An estimate of the standard error, which represents the variance in the
point estimate. This is calculated by taking the sample standard
deviation (the standard deviation of the sample data) and dividing that
number by the square root of the sample size.
o The degrees of freedom, which is usually = (sample size - 1)
from scipy import stats

sample_size = 100                                # the size of the sample we wish to take
sample_mean = sample.mean()                      # the sample mean
sample_stdev = sample.std()                      # sample standard deviation
std_error = sample_stdev / (sample_size ** 0.5)  # estimated standard error of the mean
stats.t.interval(0.95, df=sample_size - 1, loc=sample_mean, scale=std_error)
• If, after executing the last line, we get (171.36, 183.44), then it represents a
confidence interval for the average height with 95% confidence.
• Our hypothesis will be that as we make our confidence level larger, we will see
larger confidence intervals, so that we are more sure of capturing the true
population parameter. Example:
o confidence 0.5 has an interval of size 2.56
o confidence 0.8 has an interval of size 4.88
o confidence 0.9 has an interval of size 6.29
o confidence 0.99 has an interval of size 9.94
• We can see that as we wish to be "more confident" in our interval, our interval
expands in order to compensate.
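This widening can be checked numerically; a sketch using a normal approximation over hypothetical height data (the dataset is illustrative, and a t quantile would give slightly wider intervals for a sample this small):

```python
import statistics

heights = [173, 181, 176, 169, 184, 178, 172, 180, 175, 177]  # hypothetical sample
std_error = statistics.stdev(heights) / len(heights) ** 0.5

def interval_width(confidence):
    """Width of a normal-approximation confidence interval for the mean."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided quantile
    return 2 * z * std_error

widths = [interval_width(c) for c in (0.5, 0.8, 0.9, 0.99)]
# Higher confidence level ⇒ wider interval.
assert widths == sorted(widths)
```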
Hypothesis tests
• A hypothesis test is a statistical test that is used to ascertain whether we are
allowed to assume that a certain condition is true for the entire population, given a
data sample.
• Basically, a hypothesis test is a test for a certain hypothesis that we have about an
entire population. The result of the test then tells us whether we should believe
the hypothesis or reject it for an alternative one.
• A hypothesis test generally looks at two opposing hypotheses about a population.
We call them the null hypothesis and the alternative hypothesis.
• The null hypothesis is the statement being tested and is the default correct
answer; it is our starting point and our original hypothesis.
• The alternative hypothesis is the statement that opposes the null hypothesis.
• Our test will tell us which hypothesis we should trust and which we should reject.
• Based on sample data from a population, a hypothesis test determines whether or
not to reject the null hypothesis.
• We usually use a p-value (which is based on our significance level) to make this
conclusion.
T-tests
• The one sample t-test is a statistical test used to determine whether a quantitative
(numerical) data sample differs significantly from another dataset (the population
or another sample).
• Ex: Our objective here is to ascertain whether there is a difference between the
overall population's (company employees) break times and break times of
employees in the engineering department.
• Let us now conduct a t-test at a 95% confidence level in order to find a difference
(or not!).
• Assumptions:
o The population distribution should be normal, or the sample should be large
(n ≥ 30).
o Population size should be at least 10 times larger than the sample size
1. Specify the hypotheses:
o 𝐻0 : the engineering department takes the same breaks as the company as a
whole
o 𝐻1 : the engineering department takes different breaks from the company as
a whole
2. Determine the sample size for the test sample:
o The sample is at least 30 points (it is 400)
o The sample is less than 10% of the population (which would be 900 people)
3. Choose a significance level (α): We will choose a 95% confidence level, which
means that our α would be 1 − 0.95 = 0.05
4. Collect the data.
5. Decide whether to reject or fail to reject the null hypothesis: For a one sample t-
test, we must calculate two numbers: the test statistic and our p value. Luckily, we
can do this in one line in Python:
t_statistic, p_value = stats.ttest_1samp(a=engineering_breaks, popmean=allbreaks.mean())
Q) Is there evidence to suggest that BMI trends have changed since 2009? Test at the
0.05 significance level.
Ans: First, let's calculate our expected values. In a sample of 500, we expect 156 to be
Under/Normal (that's 31.2% of 500), and we fill in the remaining boxes similarly:
• Our p-value is lower than .05; therefore, we may reject the null hypothesis in favor
of the fact that the BMI trends today are different from what they were in 2009.
Q) Is there a difference between the two variables: a) The website the user was
exposed to b) Whether the user signed up
A) For this, we will use our chi-square test. Let's set up our hypotheses:
• 𝐻0 : There is no association between 2 categorical variables and these are
independent in the population of interest. (The two variables are independent)
• 𝐻1 : There is an association between two categorical variables and these are not
independent in the population of interest. (The two variables are dependent)
To calculate the expected values, we do the following:
(𝑅𝑜𝑤 𝑡𝑜𝑡𝑎𝑙) × (𝐶𝑜𝑙𝑢𝑚𝑛 𝑡𝑜𝑡𝑎𝑙 )
𝐸𝑥𝑝𝑒𝑐𝑡𝑒𝑑 𝑉𝑎𝑙𝑢𝑒 =
𝐺𝑟𝑎𝑛𝑑 𝑇𝑜𝑡𝑎𝑙
               Did not sign up                 Signed Up                    Row Total
Website A      134 (𝐸 = 188∗244/346 = 132.6)   54 (𝐸 = 188∗102/346 = 55.4)    188
Website B      110 (𝐸 = 158∗244/346 = 111.4)   48 (𝐸 = 158∗102/346 = 46.6)    158
Column Total   244                             102                            346
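The expected values and the χ² statistic for this table can be computed directly; a minimal sketch (the statistic here comes out very small, ≈0.11 on 1 degree of freedom, so we would fail to reject independence):

```python
# Observed counts: rows = website, columns = (did not sign up, signed up).
observed = [[134, 54],
            [110, 48]]

row_totals = [sum(row) for row in observed]        # [188, 158]
col_totals = [sum(col) for col in zip(*observed)]  # [244, 102]
grand = sum(row_totals)                            # 346

# Expected value = (row total * column total) / grand total
expected = [[r * c / grand for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)
```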
Anova
• Analysis of Variance (ANOVA) is a hypothesis-testing technique used to test the
equality of two or more population (or treatment) means by examining the
variance of samples that are taken.
• It determines whether the differences between the samples are simply due to
random error (sampling error) or whether there are systematic treatment effects
that cause the mean in one group to differ from the mean in another.
• Most of the time ANOVA is used to compare the equality of three or more means,
however when the means from two samples are compared using ANOVA it is
equivalent to using a t-test to compare the means of independent samples.
Anova Example
Suppose the National Transportation Safety Board (NTSB) wants to examine the
safety of compact cars, midsize cars, and full-size cars. It collects a sample of three for
each of the treatments (cars types). Using the hypothetical data provided below, test
whether the mean pressure applied to the driver’s head during a crash test is equal
for each type of car. Use α = 5%.
      Compact Cars   Midsize Cars   Full-size Cars
        643             469             484
        655             427             456
        702             525             402
𝒙̅      666.67          473.67          447.33
𝒔        31.18           49.17           41.68

Note: for sample S.D. (𝒔), use the sample standard deviation (𝒔𝒙) on the calculator
A) 1. State the null and alternative hypotheses:
The null hypothesis for an ANOVA always assumes the population means are equal.
Hence, we may write the null hypothesis as:
𝑯𝟎 : The mean head pressure is statistically equal across the three types of cars
(i.e., 𝜇1 = 𝜇2 = 𝜇3 ) where 𝜇𝑖 is the population mean
Since the null hypothesis assumes all the means are equal, we can reject it if even
one mean differs. Thus, the alternative hypothesis is:
𝑯𝟏 : At least one mean pressure is not statistically equal.
2. Calculate the appropriate test statistic
The test statistic in ANOVA is the ratio of the between and within variation in the
data. It follows an F distribution.
Total Sum of Squares: SST = Σᵢ Σⱼ (𝑥𝑖𝑗 − 𝑥̿)²
where 𝑟 is the number of rows in the table, 𝑐 is the number of columns,
𝑥̿ is the grand mean, and 𝑥𝑖𝑗 is the 𝑖-th observation in the 𝑗-th column.

𝐺𝑟𝑎𝑛𝑑 𝑚𝑒𝑎𝑛: 𝑥̿ = (643 + 655 + 702 + 469 + 427 + 525 + 484 + 456 + 402) / 9 = 529.22

Between variation (Treatment Sum of Squares): SSTR = Σⱼ 𝑛ⱼ (𝑥̅ⱼ − 𝑥̿)²,
where 𝑥̅ⱼ is the mean of column 𝑗 and 𝑛ⱼ is the number of observations in it.
Within variation (Error Sum of Squares): SSE = Σᵢ Σⱼ (𝑥𝑖𝑗 − 𝑥̅ⱼ)²
Note: SST = SSTR + SSE (96303.55 = 86049.55 + 10254). Hence, we only need to find
any 2 among (SST, SSTR and SSE)
3. The next step in an ANOVA is to compute the “average” sources of variation in the
data using SST, SSTR, and SSE.
Total Mean Squares: MST = SST / (N − 1)
It is the average total variation in the data (N is the total number of observations)
Mean Square Treatment: MSTR = SSTR / (c − 1)
It is the average between variation (c is the number of columns in the data table)
Mean Square Error: MSE = SSE / (N − c)
It is the average within variation.
4. The test statistic may now be calculated.
For a one-way ANOVA the test statistic is equal to the ratio of MSTR and MSE:
F = MSTR / MSE = (86049.55 / 2) / (10254 / 6) = 43024.78 / 1709 = 25.17
5. Obtain the Critical Value
𝒅𝒇𝟏 = 𝒄 − 𝟏 = 3 - 1 = 2
𝒅𝒇𝟐 = 𝑵 − 𝒄 = 9 - 3 = 6
Hence, we need to find 𝐹𝑐𝑣 = 𝐹2,6 from the F-distribution table for numerator df = 2,
denominator df = 6. From the table, 𝐹2,6 = 5.143
6. Decision Rule
Reject hypothesis if 𝐹𝑜𝑏𝑠𝑒𝑟𝑣𝑒𝑑 > 𝐹𝑐𝑣
As 25.17 > 5.143, we reject the null hypothesis.
7. Interpretation
Since we rejected the null hypothesis, we are 95% confident (1 - α) that the mean
head pressure is not statistically equal for compact, midsize, and full-size cars.
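The whole ANOVA computation can be reproduced in a few lines using only the standard library; a sketch with the crash-test data:

```python
from statistics import mean

compact  = [643, 655, 702]
midsize  = [469, 427, 525]
fullsize = [484, 456, 402]
groups = [compact, midsize, fullsize]

observations = [x for g in groups for x in g]
grand_mean = mean(observations)                       # 529.22

# Between variation (SSTR) and within variation (SSE):
sstr = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
sse = sum((x - mean(g)) ** 2 for g in groups for x in g)

mstr = sstr / (3 - 1)        # c - 1 degrees of freedom
mse = sse / (9 - 3)          # N - c degrees of freedom
f_statistic = mstr / mse     # ≈ 25.17, well above the critical value 5.143
```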
A one-way ANOVA has one independent variable, whereas a two-way ANOVA has two.