Introduction To Probability
–
If you torture the data long enough, it will confess!
- Ronald Coase
–
Probability Theory - Terminologies
–
Random Experiment
• A random experiment is an experiment whose outcome cannot be predicted with certainty before it is conducted (for example, whether a visitor to a website will click on an advertisement).
–
Sample Space
• It is the universal set that consists of all possible outcomes of an experiment.
–
Event
• An event (E) is a subset of the sample space, and probability is usually calculated with respect to an event.
–
Probability Estimation using Relative
Frequency
• The classical approach to probability estimation of an event is based on the relative frequency of occurrence of that event: P(E) = (number of trials in which E occurred) / (total number of trials).
–
Example 3.1
A website displays 10 advertisements, and the revenue generated by the website depends on the number of visitors clicking on any of the advertisements displayed on the site. The data collected by the company reveal that out of 2500 visitors, 30 people clicked on 1 advertisement, 15 clicked on 2 advertisements, and 5 clicked on 3 advertisements. The remaining visitors did not click on any of the advertisements. Calculate
(a) The probability that a visitor to the website will click on an advertisement.
(b) The probability that the visitor will click on at least two advertisements.
(c) The probability that a visitor will not click on any advertisements.
–
Solution
(a) The number of visitors clicking on an advertisement is 30 + 15 + 5 = 50 and the total number of visitors is 2500. Thus, the probability that a visitor to the website will click on an advertisement is
P = 50/2500 = 0.02
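A minimal Python sketch of the relative-frequency calculation for Example 3.1 (the variable names are mine, not from the source):

```python
# Relative-frequency estimates for Example 3.1
total_visitors = 2500
clicks = {1: 30, 2: 15, 3: 5}            # visitors by number of advertisements clicked

clicked_any = sum(clicks.values())        # 50 visitors clicked at least one advertisement
p_click = clicked_any / total_visitors                        # (a) 50/2500 = 0.02
p_at_least_two = (clicks[2] + clicks[3]) / total_visitors     # (b) 20/2500 = 0.008
p_none = (total_visitors - clicked_any) / total_visitors      # (c) 2450/2500 = 0.98

print(p_click, p_at_least_two, p_none)
```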
–
Algebra of Events
• Assume that X, Y and Z are three events of a sample space. Then the
following algebraic relationships are valid and are useful while deriving
probabilities of events:
• Distributive rule: X ∩ (Y ∪ Z) = (X ∩ Y) ∪ (X ∩ Z)
X ∪ (Y ∩ Z) = (X ∪ Y) ∩ (X ∪ Z)
–
Axioms of Probability
–
The elementary rules of probability are deduced directly from the original three axioms of probability, using set theory relationships.
1. For any event A, the probability of the complementary event, written A^C, is given by
P(A^C) = 1 − P(A)
2. The probability of the empty set (an impossible event) is zero:
P(∅) = 0
–
3. If occurrence of an event A implies that an event B also occurs, so that the event class A is a subset of event class B, then the probability of A is less than or equal to the probability of B:
P(A) ≤ P(B)
4. The probability that either event A or event B occurs, or both occur, is given by
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
5. If A and B are mutually exclusive events, so that P(A ∩ B) = 0, then
P(A ∪ B) = P(A) + P(B)
6. If A1, A2, …, An are n events that form a partition of sample space S, then their probabilities must add up to 1:
P(A1) + P(A2) + … + P(An) = Σ (i = 1 to n) P(Ai) = 1
–
Joint Probability
P(A ∩ B) = (Number of observations in A ∩ B) / (Total number of observations)
–
Example 3.2
At an e-commerce customer service centre a total of 112 complaints were received. 78 customers complained about late delivery of the items and 40 complained about poor product quality. Calculate the probability that a complaint received is about both late delivery and poor product quality.
–
Solution to Example 3.2
• Let A = late delivery and B = poor quality of the product, and let n(A) and n(B) be the number of complaints in favour of A and B. So n(A) = 78 and n(B) = 40. Since the total number of complaints is 112 and every complaint is about at least one of the two issues,
n(A ∩ B) = n(A) + n(B) − n(A ∪ B) = 78 + 40 − 112 = 118 − 112 = 6
• Probability of a complaint about both late delivery and poor product quality is
P(A ∩ B) = n(A ∩ B) / Total number of complaints = 6/112 = 0.0536
• Probability that the complaint is only about poor quality is
1 − P(A) = 1 − 78/112 = 0.3036
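A short Python check of the counts in Example 3.2 (assuming, as the solution does, that every complaint is about late delivery, poor quality, or both):

```python
# Joint and conditional probabilities for Example 3.2
n_total = 112    # total complaints
n_A = 78         # complaints about late delivery
n_B = 40         # complaints about poor product quality

n_A_and_B = n_A + n_B - n_total              # 6 complaints mention both issues
p_A_and_B = n_A_and_B / n_total              # P(A and B) ~ 0.0536
p_only_B = 1 - n_A / n_total                 # only poor quality ~ 0.3036
p_B_given_A = p_A_and_B / (n_A / n_total)    # P(B | A) ~ 0.0769

print(round(p_A_and_B, 4), round(p_only_B, 4), round(p_B_given_A, 4))
```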
–
• Marginal probability is simply the probability of an event X, denoted by P(X), without any conditioning on other events.
• Conditional probability of event B given that event A has occurred is
P(B | A) = P(A ∩ B) / P(A),  P(A) > 0
–
Application of Simple Probability Rules in Analytics
–
Association Rule Mining
–
Association rule learning example – Binary representation of point-of-sale data
Transaction ID | SKU 1 | SKU 2 | SKU 3 | SKU 4 | SKU 5 | SKU 6 | SKU 7
1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2 | 0 | 1 | 0 | 0 | 0 | 1 | 1
3 | 0 | 0 | 0 | 0 | 0 | 1 | 1
4 | 1 | 0 | 0 | 0 | 1 | 0 | 0
5 | 1 | 0 | 0 | 0 | 1 | 1 | 1
6 | 0 | 1 | 1 | 0 | 0 | 0 | 1
7 | 0 | 1 | 1 | 0 | 0 | 0 | 1
–
• In the table above, transaction ID is the transaction reference number, and apple, orange, etc. are the different SKUs sold by the store. A binary code is used to represent whether the SKU was purchased (equal to 1) or not (equal to 0) during a transaction. The strength of association between two mutually exclusive subsets can be measured using ‘support’, ‘confidence’, and ‘lift’.
• Support between two sets (of products purchased) is calculated using the
joint probability of those events:
Support = P(X ∩ Y) = n(X ∩ Y) / N
–
Association Rule Learning Contd…
Confidence = P(Y | X) = P(X ∩ Y) / P(X)
• Lift: The third measure in association rule mining is lift, which is given by
Lift = P(X ∩ Y) / (P(X) × P(Y))
Association rules can be generated based on threshold values of support, confidence, and lift. For example, assume that the cut-off for support is 0.25 and for confidence is 0.5 (lift should be more than 1); only rules that meet all the thresholds are retained. A small calculation on the table above is shown below.
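A small Python sketch that computes support, confidence, and lift directly from the binary table above. The column indices and the example rule are illustrative assumptions, since the SKU labels were not reproduced here:

```python
import numpy as np

# Binary point-of-sale matrix (rows = transactions, columns = SKUs, as in the table above)
data = np.array([
    [1, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 1, 1],
    [0, 1, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 1],
])

def rule_metrics(x_col, y_col):
    """Support, confidence and lift for the rule X -> Y, where X and Y are column indices."""
    p_x = data[:, x_col].mean()
    p_y = data[:, y_col].mean()
    p_xy = (data[:, x_col] & data[:, y_col]).mean()   # joint probability P(X and Y)
    return p_xy, p_xy / p_x, p_xy / (p_x * p_y)       # support, confidence, lift

# Example rule: SKU in column 0 -> SKU in column 5
support, confidence, lift = rule_metrics(0, 5)
print(support, confidence, lift)
```

With the thresholds above (support ≥ 0.25, confidence ≥ 0.5, lift > 1), this example rule would be retained.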
–
Bayes Theorem
• Bayes theorem is one of the most important concepts in analytics, since several problems are solved using Bayesian statistics.
P(A | B) = P(A ∩ B) / P(B)  and  P(B | A) = P(A ∩ B) / P(A)
Combining the two expressions gives
P(B | A) = P(A | B) × P(B) / P(A)
–
Terminologies used to describe various components in
Bayes Theorem
1. P(B) is called the prior probability (the estimate of the probability without any additional information).
P(B | A) = P(A | B) × P(B) / P(A)
2. P(B | A) is called the posterior probability (that is, given that the event A has occurred, what is the probability of occurrence of event B). That is, after the additional information (or additional evidence) that A has occurred, what is the estimated probability of occurrence of B?
–
Monty Hall Problem
• In the Monty Hall game, a car is placed behind one of three doors and goats behind the other two. The player picks a door; Monty, who knows where the car is, opens one of the remaining doors to reveal a goat and offers the player the chance to switch. Should the player switch?
–
Monty Hall Problem Using Bayes Theorem
• Let C1, C2, and C3 be the events that the car is behind door 1, 2, and 3, respectively. Let D1, D2, and D3 be the events that Monty opens door 1, 2, and 3, respectively. The prior probabilities of C1, C2, and C3 are 1/3 each.
• Assume that the player has chosen door 1 and Monty opens door 2 to reveal a goat. We would now like to calculate the posterior probability P(C1 | D2), that is, the probability that the car is behind door 1 (the door chosen initially by the player) given the additional information that the car is not behind door 2.
–
• Using Bayes theorem,
P(C1 | D2) = P(D2 | C1) × P(C1) / P(D2) = (1/2) × (1/3) / (1/2) = 1/3
• P(D2 | C1) = 1/2 (if the car is behind door 1, then Monty can open either door 2 or door 3)
• P(D2) = (1/2) × (1/3) + 0 × (1/3) + 1 × (1/3) = 1/2
Note that P(C2 | D2) = 0. Thus P(C3 | D2) = 1 − P(C1 | D2) = 1 − 1/3 = 2/3
P(D2 | C3) = 1 (if the car is behind door 3 and the player has chosen door 1, Monty has to open door 2 with probability 1)
–
• Thus, changing the initial choice increases the probability of winning the car from 1/3 to 2/3. Alternatively,
P(C3 | D2) = P(D2 | C3) × P(C3) / P(D2) = 1 × (1/3) / (1/2) = 2/3
• P(D2 | C3) = 1 (if the car is behind door 3 and the player has chosen door 1, Monty has to open door 2 with probability 1)
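The same conclusion can be checked with a quick Monte Carlo simulation in Python (a sketch, assuming the player always starts with door 1 as in the slides):

```python
import random

def monty_hall(switch, trials=100_000):
    """Estimate the probability of winning the car when the player stays or switches."""
    wins = 0
    for _ in range(trials):
        car = random.randint(1, 3)           # door hiding the car
        choice = 1                            # player always picks door 1
        if car == 1:
            opened = random.choice([2, 3])    # Monty may open either goat door
        else:
            opened = 2 if car == 3 else 3     # Monty must open the only remaining goat door
        if switch:
            choice = next(d for d in (1, 2, 3) if d not in (choice, opened))
        wins += (choice == car)
    return wins / trials

print(monty_hall(switch=False))   # ~ 1/3
print(monty_hall(switch=True))    # ~ 2/3
```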
–
GENERALIZATION OF BAYES THEOREM
• If B1, B2, …, Bn are mutually exclusive and collectively exhaustive events (a partition of the sample space) and A is an event generated from these subsets, then
P(Bi | A) = P(A | Bi) × P(Bi) / [Σ (j = 1 to n) P(A | Bj) × P(Bj)]
Event generated from mutually exclusive subsets
–
Example 3.4
Black boxes are manufactured by three companies A, B, and C, which supply 75%, 15%, and 10% of the boxes, respectively. The probabilities that a black box is defective are 0.04, 0.06, and 0.08 when it is manufactured by A, B, and C, respectively. If a randomly selected black box is found to be defective, what is the probability that it was manufactured by company A?
–
Solution to Example 3.4
• Let A, B, and C be the events that the black box is manufactured by companies A, B, and C, respectively, and let D be the event that the black box is defective. We are interested in calculating the probability P(A | D).
P(A | D) = P(D | A) × P(A) / P(D)
• Now P(D | A) = 0.04 and P(A) = 0.75. Using the generalization of Bayes theorem,
P(D) = 0.75 × 0.04 + 0.15 × 0.06 + 0.10 × 0.08 = 0.047
P(A | D) = 0.04 × 0.75 / 0.047 = 0.6383
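A minimal Python sketch of this calculation using the generalization of Bayes theorem (dictionary keys are just labels for the three companies):

```python
# Posterior probability that a defective black box came from company A (Example 3.4)
prior = {"A": 0.75, "B": 0.15, "C": 0.10}        # share of black boxes from each company
p_defect = {"A": 0.04, "B": 0.06, "C": 0.08}     # P(D | company)

p_D = sum(prior[c] * p_defect[c] for c in prior)     # total probability of a defect = 0.047
posterior_A = p_defect["A"] * prior["A"] / p_D       # P(A | D) ~ 0.6383

print(round(p_D, 3), round(posterior_A, 4))
```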
–
Random Variables
• A random variable is a function that maps every outcome in the sample space of a random experiment to a real number.
• A random variable is a robust and convenient way of representing the outcome of a random experiment.
–
Discrete Random Variables
• If the random variable X can assume only a finite or countably infinite set of
values, then it is called a discrete random variable.
• Examples of discrete random variables are:
– Credit rating (usually classified into different categories such as low,
medium and high or using labels such as AAA, AA, A, BBB, etc.).
– Number of orders received at an e-commerce retailer which can be
countably infinite.
– Customer churn (the random variable takes binary values: 1. Churn and 2. Do not churn).
– Fraud (the random variable takes binary values: 1. Fraudulent transaction and 2. Genuine transaction).
– Any experiment that involves counting (for example, number of returns in
a day from customers of e-commerce portals such as Amazon, Flipkart;
number of customers not accepting job offers from an organization).
–
Probability mass function
• For a discrete random variable, the probability that the random variable X takes a specific value xi, P(X = xi), is called the probability mass function P(xi).
–
Expected Value
• The expected value (or mean) of a discrete random variable is given by
E(X) = Σ (i = 1 to n) xi × P(xi)
–
Variance and Standard Deviation
• The variance of a discrete random variable is given by
Var(X) = Σ (i = 1 to n) (xi − E(X))² × P(xi)
• The standard deviation is the square root of the variance: SD(X) = √Var(X)
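A small Python illustration with a hypothetical PMF (the values are mine, chosen only to show the formulas):

```python
import math

# Hypothetical PMF: P(X = 0) = 0.5, P(X = 1) = 0.3, P(X = 2) = 0.2
pmf = {0: 0.5, 1: 0.3, 2: 0.2}

mean = sum(x * p for x, p in pmf.items())                  # E(X) = 0.7
var = sum((x - mean) ** 2 * p for x, p in pmf.items())     # Var(X) = 0.61
std = math.sqrt(var)                                       # standard deviation ~ 0.781

print(mean, var, round(std, 3))
```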
–
Probability Density Function (pdf)
• The probability density function, f(x), is defined through the probability that the value of the random variable X lies in an infinitesimally small interval between xi and xi + δx:
f(x) = lim (δx → 0) P(xi ≤ X ≤ xi + δx) / δx
–
Cumulative Distribution Function (CDF)
• The cumulative distribution function (CDF) of a
continuous random variable is defined by
F(a) = P(X ≤ a) = ∫ (−∞ to a) f(x) dx
–
• The probability density function and cumulative distribution function of a continuous random variable satisfy the following properties:
f(x) ≥ 0
F(∞) = ∫ (−∞ to ∞) f(x) dx = 1
• The probability between two values a and b, P(a ≤ X ≤ b), is the area under the probability density function between a and b:
P(a ≤ X ≤ b) = ∫ (a to b) f(x) dx = F(b) − F(a)
–
• The expected value of a continuous random variable, E(X), is given by
E(X) = ∫ (−∞ to ∞) x f(x) dx
• The variance is given by
Var(X) = ∫ (−∞ to ∞) (x − E(X))² f(x) dx
–
Binomial Distribution
• A random variable X is said to follow a Binomial
distribution when
– The random variable can have only two outcomes, success and failure (such trials are known as Bernoulli trials).
– The objective is to find the probability of getting k successes out of n trials.
– The probability of success is p and thus the probability of failure is (1 − p).
– The probability p is constant and does not change between trials, and the trials are independent.
–
Probability Mass Function (PMF) of Binomial
Distribution
• The PMF of the binomial distribution (the probability that the number of successes will be exactly x out of n trials) is given by
PMF(x) = P(X = x) = C(n, x) p^x (1 − p)^(n − x),  0 ≤ x ≤ n
where
C(n, x) = n! / (x! (n − x)!)
–
Mean and Variance of Binomial Distribution
The mean of a binomial distribution is given by
Mean = E(X) = Σ (x = 0 to n) x × PMF(x) = Σ (x = 0 to n) x × C(n, x) p^x (1 − p)^(n − x) = np
Var(X) = Σ (x = 0 to n) (x − E(X))² × PMF(x) = Σ (x = 0 to n) (x − E(X))² × C(n, x) p^x (1 − p)^(n − x) = np(1 − p)
If the number of trials (n) in a binomial distribution is large, then it can be approximated by a normal distribution with mean np and variance npq, where q = 1 − p.
–
Example 3.5
Suppose the probability that a customer returns an item purchased is 0.1 and 20 customers purchase items. Calculate:
(a) The probability that exactly 5 customers will return the items purchased.
(b) The probability that a maximum of 5 customers will return the items purchased.
(c) The probability that more than 5 customers will return the items purchased.
(d) The average number of customers who are likely to return the items.
(e) The variance of the number of customers returning the items.
–
Solution
In this case, n = 20 and p = 0.1.
(a) The probability that exactly 5 customers will return the items purchased is
P(X = 5) = C(20, 5) (0.1)^5 (0.9)^15 = 0.03192
(b) The probability that a maximum of 5 customers will return the items purchased is
P(X ≤ 5) = Σ (k = 0 to 5) C(20, k) (0.1)^k (0.9)^(20 − k) = 0.9887
(c) The probability that more than 5 customers will return the product is
P(X > 5) = 1 − P(X ≤ 5) = 1 − 0.9887 = 0.0113
(d) The average number of customers who are likely to return the items is
E(X) = n × p = 20 × 0.1 = 2
(e) The variance of a binomial distribution is given by
Var(X) = np(1 − p) = 20 × 0.1 × 0.9 = 1.8
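The answers to Example 3.5 can be checked with scipy (a sketch, assuming scipy is available):

```python
from scipy.stats import binom

n, p = 20, 0.1
print(binom.pmf(5, n, p))     # (a) P(X = 5)  ~ 0.0319
print(binom.cdf(5, n, p))     # (b) P(X <= 5) ~ 0.9887
print(binom.sf(5, n, p))      # (c) P(X > 5)  ~ 0.0113
print(binom.mean(n, p))       # (d) E(X) = np = 2
print(binom.var(n, p))        # (e) Var(X) = np(1 - p) = 1.8
```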
–
Poisson Distribution
• The Poisson distribution is used to find the probability of a given number of events occurring in a fixed interval of time or space, when the events occur at a constant average rate and independently of one another.
• The probability mass function of a Poisson distribution is given by
P(X = k) = e^(−λ) λ^k / k!,  k = 0, 1, 2, …
• where λ is the rate of occurrence of the events per unit of measurement.
• The cumulative distribution function of a Poisson distribution is given by
P(X ≤ k) = Σ (i = 0 to k) e^(−λ) λ^i / i!
–
• The mean and variance of a Poisson random variable are given by E(X) = λ and Var(X) = λ.
–
Example
On average, about 20 customers per day cancel their order placed at Fashion Trends Online. Calculate the probability that the number of cancellations on a day is exactly 20 and the probability that the maximum number of cancellations is 25.
Solution
The probability that the number of cancellations is exactly 20 is given by
P(X = 20) = e^(−20) × 20^20 / 20! = 0.0888
The probability that the maximum number of cancellations will be 25 is given by
P(X ≤ 25) = Σ (k = 0 to 25) e^(−20) × 20^k / k! = 0.8878
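A quick scipy check of these two Poisson probabilities (a sketch, assuming scipy is available):

```python
from scipy.stats import poisson

rate = 20                         # average cancellations per day
print(poisson.pmf(20, rate))      # P(X = 20)  ~ 0.0888
print(poisson.cdf(25, rate))      # P(X <= 25) ~ 0.8878
```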
–
Geometric Distribution
• The geometric distribution models a random experiment in which the random variable counts the number of Bernoulli trials needed to obtain the first success (equivalently, the number of failures before the first success, plus one). Its cumulative distribution function is
F(x) = P(X ≤ x) = 1 − (1 − p)^x
• The mean and variance of a geometric distribution are given by
E(X) = 1/p  and  Var(X) = (1 − p)/p²
–
Probability mass function of a geometric distribution (p = 0.3).
Cumulative distribution function of a geometric distribution (p = 0.3).
–
Memoryless Property of Geometric Distribution
• The memoryless property is a special property of the geometric distribution: the conditional probability P(X > i + j | X > i) depends only on the value j, not on the value i. We know that
P(X > i) = 1 − P(X ≤ i) = 1 − [1 − (1 − p)^i] = (1 − p)^i
P(X > i + j | X > i) = P(X > i + j and X > i) / P(X > i) = P(X > i + j) / P(X > i) = (1 − p)^(i + j) / (1 − p)^i = (1 − p)^j
–
Example
Local Dhaniawala (LD) is an online grocery store with an innovative feature that predicts whether the customer has forgotten to buy an item that is commonly purchased by grocery customers. The probability that a customer buys milk in each shopping visit is 0.2.
(a) Calculate the probability that the customer's first purchase of milk happens during the 5th visit.
(b) Calculate the average time between purchases of milk.
(c) If a customer has not purchased milk during the past 3 shopping visits, what is the probability that the customer will not buy milk for another 2 visits?
–
Solution
(a) The probability that the customer's first purchase of milk happens on the 5th visit is given by
P(X = 5) = (1 − 0.2)^4 × 0.2 = 0.08192
(b) The average time between purchases of milk is
E(X) = 1/p = 1/0.2 = 5 visits
(c) Given that a customer has not purchased milk for the past 3 shopping visits, the probability that the customer will not buy for another 2 visits is given by
P(X > 3 + 2 | X > 3) = P(X > 2) = (1 − p)² = (1 − 0.2)² = 0.64
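A scipy check of the geometric calculations (scipy's geom counts the number of trials up to and including the first success, which matches the formulas used above):

```python
from scipy.stats import geom

p = 0.2                           # probability of buying milk on a visit
print(geom.pmf(5, p))             # (a) first purchase on the 5th visit = 0.08192
print(geom.mean(p))               # (b) average number of visits between purchases = 1/p = 5
print(geom.sf(2, p))              # (c) P(X > 2) = (1 - p)^2 = 0.64, by the memoryless property
```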
–
Parameters of Continuous Distributions
• Scale parameter: The scale parameter defines the spread (range) of a continuous distribution; the larger the scale parameter value, the larger the spread of the distribution.
–
Uniform Distribution
Probability density function:
f(x) = 1/(b − a),  x ∈ [a, b]
f(x) = 0,  otherwise
Cumulative distribution function:
F(x) = 0,  x < a
F(x) = (x − a)/(b − a),  a ≤ x ≤ b
F(x) = 1,  x > b
–
Exponential Distribution
• The exponential distribution is a single-parameter continuous distribution that is traditionally used for modelling the time to failure of electronic components.
f(x) = λ e^(−λx),  x ≥ 0
F(x) = 1 − e^(−λx)
• The parameter λ represents the rate of occurrence of the event; its reciprocal 1/λ (the mean time between events) acts as the scale parameter of the distribution.
–
Probability density function of an
exponential distribution
–
Memoryless Property of Exponential Distribution
• The exponential distribution is the only continuous probability distribution that has the memoryless property. That is,
P(X > t + s | X > t) = P(X > s)
P(X > t + s | X > t) = P(X > t + s and X > t) / P(X > t) = P(X > t + s) / P(X > t) = e^(−λ(t + s)) / e^(−λt) = e^(−λs)
–
Example
The time to failure of an avionic system follows an exponential
distribution with a mean time between failures (MTBF) of 1000
hours.
(a) Calculate the probability that the system will fail before 1000
hours.
(b) Calculate the probability that it will not fail up to 2000 hours.
(c) Calculate the time by which 10% of the systems will fail (that
is calculate P10 life)
–
Solution
(a) The probability that the system will fail by 1000 hours is
F(1000) = 1 − e^(−λt)
In this case λ = 1/1000 and t = 1000, so F(1000) = 1 − e^(−1000/1000) = 1 − e^(−1) = 0.6321
(b) The probability that the system will not fail up to 2000 hours is
P(X > 2000) = 1 − P(X ≤ 2000) = 1 − F(2000) = e^(−2000/1000) = e^(−2) = 0.1353
(c) The P10 life is the time t for which F(t) = 0.10, that is, 1 − e^(−t/1000) = 0.10, which gives t = −1000 × ln(0.9) ≈ 105.36 hours.
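A scipy check of the exponential calculations (scipy parameterizes the exponential by its scale, which equals the MTBF 1/λ):

```python
from scipy.stats import expon

mtbf = 1000                       # mean time between failures, in hours
dist = expon(scale=mtbf)          # scale = 1/lambda

print(dist.cdf(1000))             # (a) P(failure before 1000 h) = 1 - exp(-1) ~ 0.6321
print(dist.sf(2000))              # (b) P(no failure up to 2000 h) = exp(-2) ~ 0.1353
print(dist.ppf(0.10))             # (c) P10 life = -1000 ln(0.9) ~ 105.4 h
```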
–
Normal Distribution
• The normal distribution, also known as the Gaussian distribution, is one of the most popular continuous distributions in the field of analytics, especially due to its use in multiple contexts.
• The probability density function and the cumulative distribution function are given by
f(x) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²)),  −∞ < x < ∞
F(x) = ∫ (−∞ to x) (1 / (σ √(2π))) e^(−(t − μ)² / (2σ²)) dt
• Here μ and σ are the mean and standard deviation of the normal distribution.
–
The Excel function NORM.DIST(x, mean, standard_dev, cumulative) can be used for calculating the probability density function (cumulative = FALSE) and the cumulative distribution function (cumulative = TRUE) of a normal distribution with mean μ and standard deviation σ.
–
Properties of Normal Distribution
1. Theoretical normal density functions are defined between −∞ and +∞.
–
4. For any normal distribution, the areas between specific values measured in terms of μ and σ are given by:
Value of Random Variable | Area under the Normal Distribution
μ − σ ≤ X ≤ μ + σ (area between one standard deviation on either side of the mean) | 0.6828
μ − 3σ ≤ X ≤ μ + 3σ (area between three standard deviations on either side of the mean) | 0.9973
–
• If X1 and X2 are two independent normal random variables with means μ1 and μ2 and variances σ1² and σ2², respectively, then X1 + X2 is also normally distributed, with mean μ1 + μ2 and variance σ1² + σ2².
–
Standard Normal Variable
• A standard normal random variable Z is a normal random variable with mean 0 and standard deviation 1. Its density and distribution functions are
f(z) = (1/√(2π)) e^(−z²/2)
F(z) = ∫ (−∞ to z) (1/√(2π)) e^(−t²/2) dt
–
• By using the following transformation, any normal random variable X can be converted into a standard normal variable:
Z = (X − μ) / σ
• The random variable X can be written in terms of a standard normal random variable using the relationship
X = μ + σZ
–
• A simple approximation of the standard normal CDF is given by Tocher (1963):
P(Z ≤ z) = F(z) ≈ e^(2kz) / (1 + e^(2kz)),  where k = √(2/π)
• Another approximation is
P(Z ≤ z) = F(z) ≈ 1 − [(z² + A1 z + A2) / (√(2π) z³ + B1 z² + B2 z + 2A2)] e^(−z²/2)
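A short sketch comparing the Tocher approximation, as reconstructed above, with the exact standard normal CDF from scipy:

```python
import math
from scipy.stats import norm

def tocher_cdf(z):
    """Tocher's approximation of the standard normal CDF with k = sqrt(2/pi)."""
    k = math.sqrt(2 / math.pi)
    return math.exp(2 * k * z) / (1 + math.exp(2 * k * z))

for z in (0.5, 1.0, 1.8333):
    print(z, round(tocher_cdf(z), 4), round(norm.cdf(z), 4))   # approximate vs exact CDF
```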
–
Example
According to a survey on use of smart phones in India, the smart
phone users spend 68 minutes in a day on average in sending
messages and the corresponding standard deviation is 12
minutes. Assume that the time spent in sending messages
follows a normal distribution.
(a) What proportion of the smart phone users are spending more
than 90 minutes in sending messages daily?
(b) What proportion of customers are spending less than 20
minutes?
(c) What proportion of customers are spending between 50
minutes and 100 minutes?
–
Solution
It is given that μ = 68 minutes and σ = 12 minutes.
(a) The proportion of customers spending more than 90 minutes is given by P(X ≥ 90) = 1 − P(X ≤ 90) = 1 − F(90)
The standard normal random variable value for X = 90 is given by
Z = (x − μ)/σ = (90 − 68)/12 = 1.8333
That is, F(X = 90) = F(Z = 1.8333). From the standard normal distribution table, the area under the curve for Z = 1.8333 is 0.9666. Thus, P(X ≥ 90) = 1 − P(X ≤ 90) = 1 − F(90) = 1 − 0.9666 = 0.0334
–
(b) The proportion of customers spending less than 20 minutes is
P(X ≤ 20) = F(20)
Using the Excel function, NORM.DIST(20, 68, 12, TRUE) = 3.1671 × 10^(−5)
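A scipy check of all three parts of this example (part (c) was posed above but not worked out in the slides):

```python
from scipy.stats import norm

mu, sigma = 68, 12
print(norm.sf(90, mu, sigma))                              # (a) P(X > 90)  ~ 0.0334
print(norm.cdf(20, mu, sigma))                             # (b) P(X < 20)  ~ 3.17e-05
print(norm.cdf(100, mu, sigma) - norm.cdf(50, mu, sigma))  # (c) P(50 < X < 100) ~ 0.9294
```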
–
Chi-Square Distribution
• The chi-square distribution with k degrees of freedom [denoted as χ²(k)] is the distribution obtained by adding the squares of k independent standard normal random variables; it is widely used in non-parametric tests.
• Consider a normal random variable X1 with mean μ1 and standard deviation σ1. Then we can define Z1 (the standard normal random variable) as
Z1 = (X1 − μ1) / σ1
• Then
Z1² = ((X1 − μ1) / σ1)²
–
• Let X2 be a normal random variable with mean μ2 and standard deviation σ2, and let Z2 be the corresponding standard normal variable. Then the random variable Z1² + Z2², given by
Z1² + Z2² = ((X1 − μ1)/σ1)² + ((X2 − μ2)/σ2)²
follows a chi-square distribution with 2 degrees of freedom.
–
The probability density function of χ²(k) is given by
f(x) = (1 / (2^(k/2) Γ(k/2))) x^(k/2 − 1) e^(−x/2),  x > 0
where Γ(k) is the gamma function:
Γ(k) = ∫ (0 to ∞) x^(k − 1) e^(−x) dx
–
• The cumulative distribution function of a chi-square distribution with k degrees of freedom is given by
F(x) = γ(k/2, x/2) / Γ(k/2)
where γ(·, ·) is the lower incomplete gamma function.
–
Probability density function of chi-square distribution for different values of k.
Cumulative distribution function of chi-square distribution with k degrees of freedom.
–
Properties of chi-square distribution
• The mean and standard deviation of a chi-square distribution are k and √(2k), where k is the degrees of freedom.
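A short simulation sketch confirming the construction and the moments of the chi-square distribution (assuming numpy and scipy are available):

```python
import numpy as np
from scipy.stats import chi2

k = 5
rng = np.random.default_rng(0)
samples = (rng.standard_normal((100_000, k)) ** 2).sum(axis=1)  # sum of k squared standard normals

print(samples.mean(), samples.std())   # ~ k and ~ sqrt(2k)
print(chi2.mean(k), chi2.std(k))       # exact values: 5 and sqrt(10) ~ 3.162
```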
–
Student’s t-Distribution
• Student’s t-distribution (or simply the t-distribution) arises while estimating the population mean of a normal distribution using a sample that is small and/or when the population standard deviation is unknown.
–
• Assume that X1, X2, …, Xn are n observations (that is, a sample of size n) from a normal distribution with mean μ and standard deviation σ. Let
X̄ = (1/n) Σ (i = 1 to n) Xi
S = √[ (1/(n − 1)) Σ (i = 1 to n) (Xi − X̄)² ]
• where X̄ and S are the mean and standard deviation estimated from the sample X1, X2, …, Xn. Then the random variable t defined by
t = (X̄ − μ) / (S / √n)
follows a t-distribution with (n − 1) degrees of freedom.
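A minimal sketch of the t statistic for a hypothetical small sample (the data values and the hypothesized mean μ = 50 are made up for illustration):

```python
import numpy as np
from scipy.stats import t

sample = np.array([47.0, 52.5, 49.1, 51.2, 46.8, 50.3, 48.9, 53.4])   # hypothetical sample
mu = 50                                                               # hypothesized population mean
n = sample.size

x_bar = sample.mean()
s = sample.std(ddof=1)                        # sample standard deviation (n - 1 in the denominator)
t_stat = (x_bar - mu) / (s / np.sqrt(n))      # follows a t-distribution with n - 1 df

p_value = 2 * t.sf(abs(t_stat), df=n - 1)     # two-sided tail probability
print(t_stat, p_value)
```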
–
• The probability density function of the t-distribution with n degrees of freedom is given by
f(x) = [Γ((n + 1)/2) / (√(nπ) Γ(n/2))] (1 + x²/n)^(−(n + 1)/2)
–
Cumulative distribution function of student’s t-distribution
–
Properties of t-distribution:
• The mean of a t distribution with 2 or more degrees of freedom is 0.
• The standard deviation of the t-distribution is √(n/(n − 2)) for n > 2, where n is the number of degrees of freedom.
–
F-Distribution
The F-distribution (short for Fisher’s distribution, named after statistician Ronald Fisher) is the distribution of a ratio of two chi-square distributions. Let Y1 and Y2 be two independent chi-square random variables with k1 and k2 degrees of freedom, respectively. Then the random variable X defined as
X = (Y1 / k1) / (Y2 / k2)
follows an F-distribution. The probability density function of an F-distribution is given by
f(x) = [Γ((k1 + k2)/2) / (Γ(k1/2) Γ(k2/2))] × (k1/k2)^(k1/2) × x^(k1/2 − 1) / (1 + k1 x / k2)^((k1 + k2)/2)
–
Probability density function of F-distribution.
Cumulative distribution function of F-distribution.
–
Properties of F distribution:
• The mean of the F-distribution is k2 / (k2 − 2), for k2 > 2.
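A quick simulation sketch of the F-distribution as a ratio of two scaled chi-square variables, checking the mean k2/(k2 − 2):

```python
import numpy as np
from scipy.stats import chi2, f

k1, k2 = 4, 10
rng = np.random.default_rng(1)

# Ratio of two independent chi-square variables, each divided by its degrees of freedom
x = (chi2.rvs(k1, size=100_000, random_state=rng) / k1) / \
    (chi2.rvs(k2, size=100_000, random_state=rng) / k2)

print(x.mean())          # ~ k2 / (k2 - 2) = 1.25
print(f.mean(k1, k2))    # exact mean of the F(k1, k2) distribution
```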
–
Summary
• The concepts of probability, random variables, and probability distributions are foundations of data science. Knowledge of these concepts is important for framing and solving analytics problems.
–
• Discrete probability distributions such as binomial distribution, Poisson
distribution and geometric distribution are used for modelling discrete
random variables.