
INTRODUCTION TO PROBABILITY

If you torture the data long enough, it will confess!
- Ronald Coase

Probability Theory - Terminologies

Random Experiment

• A random experiment is an experiment in which the outcome is not known with certainty.

• Predictive analytics mainly deals with random experiments such as:
– Predicting quarterly revenue of an organization
– Customer churn
– Demand for a product at a future time period, etc.


Sample Space
• It is the universal set that consists of all possible outcomes of an experiment.

• It is represented using the letter "S".

• Individual outcomes are called elementary events.

• A sample space can be finite or infinite.


Event
• An event (E) is a subset of a sample space, and probability is usually calculated with respect to an event.

• The Venn diagram indicates that the event E is a subset of the sample space S, that is, E ⊆ S.


Probability Estimation using Relative Frequency
• The classical approach to probability estimation of an event is based on the relative frequency of the occurrence of that event.

• According to frequency estimation, the probability of an event X, P(X), is given by

P(X) = (Number of observations in favour of event X) / (Total number of observations) = n(X) / N


Example 3.1
A website displays 10 advertisements, and the revenue generated by the website depends on the number of visitors clicking on any of the advertisements displayed on the website. The data collected by the company reveal that out of 2500 visitors, 30 clicked on 1 advertisement, 15 clicked on 2 advertisements, and 5 clicked on 3 advertisements. The remaining visitors did not click on any of the advertisements. Calculate

(a) The probability that a visitor to the website will click on an advertisement.
(b) The probability that a visitor will click on at least two advertisements.
(c) The probability that a visitor will not click on any advertisement.


Solution
(a) The number of visitors clicking on an advertisement is 50 and the total number of visitors is 2500. Thus, the probability that a visitor to the website will click on an advertisement is

50 / 2500 = 0.02

(b) The number of visitors clicking on at least 2 advertisements is 20. Thus, the probability that a visitor will click on at least 2 advertisements is

20 / 2500 = 0.008

(c) The probability that a visitor will not click on any advertisement is

2450 / 2500 = 0.98
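The relative-frequency calculation above can be sketched in a few lines of Python; the counts are taken from the example and the helper name is illustrative:

```python
# Relative-frequency estimate P(X) = n(X) / N, using the counts of Example 3.1.
def relative_frequency(favourable, total):
    return favourable / total

N = 2500                       # total visitors
clicks = {1: 30, 2: 15, 3: 5}  # visitors clicking on 1, 2 and 3 advertisements

p_click = relative_frequency(sum(clicks.values()), N)          # (a) 0.02
p_at_least_two = relative_frequency(clicks[2] + clicks[3], N)  # (b) 0.008
p_no_click = relative_frequency(N - sum(clicks.values()), N)   # (c) 0.98
```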


Algebra of Events
• Assume that X, Y and Z are three events of a sample space. Then the following algebraic relationships are valid and are useful while deriving probabilities of events:

• Commutative rule: X ∪ Y = Y ∪ X and X ∩ Y = Y ∩ X

• Associative rule: (X ∪ Y) ∪ Z = X ∪ (Y ∪ Z) and (X ∩ Y) ∩ Z = X ∩ (Y ∩ Z)

• Distributive rule: X ∪ (Y ∩ Z) = (X ∪ Y) ∩ (X ∪ Z)
  X ∩ (Y ∪ Z) = (X ∩ Y) ∪ (X ∩ Z)


Contd…

• The following rules, known as De Morgan's laws on complementary sets, are useful while deriving probabilities:

(X ∪ Y)ᶜ = Xᶜ ∩ Yᶜ
(X ∩ Y)ᶜ = Xᶜ ∪ Yᶜ

where Xᶜ and Yᶜ are the complementary events of X and Y, respectively.


Axioms of Probability

According to the axiomatic theory of probability, the probability of an event E satisfies the following axioms:

1. The probability of event E always lies between 0 and 1. That is, 0 ≤ P(E) ≤ 1.

2. The probability of the universal set S is 1. That is, P(S) = 1.

3. P(X ∪ Y) = P(X) + P(Y), where X and Y are two mutually exclusive events.


The elementary rules of probability are directly deduced from the original three axioms of probability, using set theory relationships:

1. For any event A, the probability of the complementary event, written Aᶜ, is given by

P(Aᶜ) = 1 - P(A)

If P(A) is the probability of observing a fraudulent transaction at an e-commerce portal, then P(Aᶜ) is the probability of observing a genuine transaction.

2. The probability of an empty or impossible event, ∅, is zero:

P(∅) = 0


3. If the occurrence of an event A implies that an event B also occurs, so that the event class A is a subset of event class B, then the probability of A is less than or equal to the probability of B:

P(A) ≤ P(B)

4. The probability that either event A or B occurs, or both occur, is given by

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

5. If A and B are mutually exclusive events, so that P(A ∩ B) = 0, then

P(A ∪ B) = P(A) + P(B)

6. If A1, A2, …, An are n events that form a partition of sample space S, then their probabilities must add up to 1:

P(A1) + P(A2) + … + P(An) = Σ_{i=1}^{n} P(Ai) = 1


Joint Probability

• Let A and B be two events in a sample space. Then the joint probability of the two events, written as P(A ∩ B), is given by

P(A ∩ B) = (Number of observations in A ∩ B) / (Total number of observations)


Example 3.2
At an e-commerce customer service centre, a total of 112 complaints were received. 78 customers complained about late delivery of the items and 40 complained about poor product quality.

(a) Calculate the probability that a customer will complain about both late delivery and product quality.
(b) What is the probability that a complaint is only about poor quality of the product?


Solution to Example 3.2
• Let A = late delivery and B = poor quality of the product. Let n(A) and n(B) be the number of complaints in favour of A and B, so n(A) = 78 and n(B) = 40. Since the total number of complaints is 112, every complaint belongs to A ∪ B, and

n(A ∩ B) = n(A) + n(B) - n(A ∪ B) = 118 - 112 = 6

• The probability of a complaint about both late delivery and poor product quality is

P(A ∩ B) = n(A ∩ B) / (Total number of complaints) = 6 / 112 = 0.0536

• The probability that the complaint is only about poor quality = 1 - P(A) = 1 - 78/112 = 0.3036
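The inclusion-exclusion step in this solution can be checked numerically; a minimal sketch using the counts from Example 3.2:

```python
# n(A ∩ B) = n(A) + n(B) - n(A ∪ B); every one of the 112 customers
# complained about something, so n(A ∪ B) = 112.
n_union = 112
n_A, n_B = 78, 40  # late delivery, poor quality

n_both = n_A + n_B - n_union               # complaints about both issues
p_both = n_both / n_union                  # P(A ∩ B)
p_only_quality = (n_B - n_both) / n_union  # complaints only about quality
```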


• Marginal probability is simply the probability of an event X, denoted by P(X), without any conditions.

• Independent events: Two events A and B are independent when the occurrence of one event (say event A) does not affect the probability of occurrence of the other event (event B). Mathematically, two events A and B are independent when

P(A ∩ B) = P(A) × P(B)

• Conditional probability: If A and B are events in a sample space, then the conditional probability of the event B given that the event A has already occurred, denoted by P(B|A), is defined as

P(B|A) = P(A ∩ B) / P(A),   P(A) > 0
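A small sketch of the conditional-probability definition, reusing the counts of Example 3.2 (A = late delivery, B = poor quality); the helper name is illustrative:

```python
# P(B|A) = P(A ∩ B) / P(A), defined only when P(A) > 0.
def conditional_probability(p_joint, p_condition):
    if p_condition <= 0:
        raise ValueError("conditioning event must have positive probability")
    return p_joint / p_condition

p_A = 78 / 112     # P(late delivery)
p_joint = 6 / 112  # P(late delivery and poor quality)
p_B_given_A = conditional_probability(p_joint, p_A)  # = 6/78 ≈ 0.077
```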


Application of Simple Probability Rules in Analytics

• Association rule mining is one of the popular algorithms used to solve problems such as market basket analysis and recommender systems.

• Market basket analysis (MBA) is used frequently by retailers to predict products a customer is likely to buy together, which can further be used for designing planograms and product promotions.


Association Rule Mining

• Association rule learning (also known as association rule mining) is a method of finding associations between different entities in a database.

• An association rule is a relationship of the form X → Y (that is, X implies Y).


Association rule learning Example - Binary
representation of point of sale data

Transaction ID Apple Orange Grapes Strawberry Plums Green Apple Banana

1 1 1 1 0 1 1 1

2 0 1 0 0 0 1 1

3 0 0 0 0 0 1 1

4 1 0 0 0 1 0 0

5 1 0 0 0 1 1 1

6 0 1 1 0 0 0 1

7 0 1 1 0 0 0 1


• In the table, transaction ID is the transaction reference number and apple, orange, etc. are the different SKUs sold by the store. Binary code is used to represent whether the SKU was purchased (equal to 1) or not (equal to 0) during a transaction. The strength of association between two mutually exclusive subsets can be measured using 'support', 'confidence', and 'lift'.

• Support between two sets (of products purchased) is calculated using the joint probability of those events:

Support = P(X ∩ Y) = n(X ∩ Y) / N

• where n(X ∩ Y) is the number of times both X and Y are purchased together and N is the total number of transactions.


Association Rule Learning Contd…

• Confidence is the conditional probability of purchasing product Y given that product X is purchased. It measures the probability of event Y (customer buying product Y) given that event X has occurred (the customer has already purchased product X). That is,

Confidence = P(Y | X) = P(X ∩ Y) / P(X)

• Lift: The third measure in association rule mining is lift, which is given by

Lift = P(X ∩ Y) / (P(X) × P(Y))

Association rules can be generated based on threshold values of support, confidence and lift. For example, assume that the cut-off for support is 0.25 and for confidence is 0.5 (lift should be more than 1).
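Support, confidence and lift can be computed directly from the binary point-of-sale table above; a sketch, with SKU names lower-cased for convenience:

```python
# The 7 transactions from the table, each as a set of purchased SKUs.
baskets = [
    {"apple", "orange", "grapes", "plums", "green apple", "banana"},
    {"orange", "green apple", "banana"},
    {"green apple", "banana"},
    {"apple", "plums"},
    {"apple", "plums", "green apple", "banana"},
    {"orange", "grapes", "banana"},
    {"orange", "grapes", "banana"},
]
N = len(baskets)

def p(*items):
    """Relative frequency of transactions containing all the given items."""
    return sum(1 for b in baskets if set(items) <= b) / N

def support(x, y):
    return p(x, y)            # P(X ∩ Y)

def confidence(x, y):
    return p(x, y) / p(x)     # P(Y | X)

def lift(x, y):
    return confidence(x, y) / p(y)
```

For the rule orange → grapes, support is 3/7 ≈ 0.43, confidence is 0.75 and lift is 1.75, so this rule passes the example cut-offs (support > 0.25, confidence > 0.5, lift > 1).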


Bayes Theorem
• Bayes theorem is one of the most important concepts in analytics since several problems are solved using Bayesian statistics.

P(A|B) = P(A ∩ B) / P(B)   and   P(B|A) = P(A ∩ B) / P(A)

• Using the two equations, we can show that

P(B|A) = P(A|B) P(B) / P(A)


Terminologies used to describe various components in Bayes Theorem

P(B|A) = P(A|B) P(B) / P(A)

1. P(B) is called the prior probability (the estimate of the probability without any additional information).

2. P(B|A) is called the posterior probability (that is, given that the event A has occurred, what is the probability of occurrence of event B). That is, after the additional information (or additional evidence) that A has occurred, what is the estimated probability of occurrence of B?

3. P(A|B) is called the likelihood of observing evidence A if B is true.

4. P(A) is the prior probability of A.
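A minimal Bayes-update sketch in the spirit of the earlier fraud-detection example; all the numbers below are hypothetical, chosen only to illustrate the prior-versus-posterior distinction:

```python
# Posterior P(B|A) = P(A|B) P(B) / P(A), with P(A) from total probability.
p_fraud = 0.01               # prior P(B): assumed fraud rate
p_flag_given_fraud = 0.90    # likelihood P(A|B): assumed detection rate
p_flag_given_genuine = 0.05  # assumed false-alarm rate

p_flag = (p_flag_given_fraud * p_fraud
          + p_flag_given_genuine * (1 - p_fraud))            # P(A)
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag   # posterior
```

Even with a 90% detection rate, the posterior here is only about 0.15, because the prior fraud rate is so low.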


Monty Hall Problem


Monty Hall Problem Using Bayes Theorem
• Let C1, C2, and C3 be the events that the car is behind door 1, 2, and 3,
respectively. Let D1, D2, and D3 be the events that Monty opens door
1, 2, and 3, respectively. Prior probabilities of C1, C2, and C3 are

P(C1) = P(C2) = P(C3) = 1/3

• Assume that the player has chosen door 1 and Monty opens door 2 to
reveal a goat. Now we would like to calculate the posterior
probability P(C1|D2), that is, the probability that the car is behind
door 1 (door chosen initially by the player) when Monty has provided
the additional information that the car is not behind door 2


• Using Bayes theorem,

P(C1|D2) = P(D2|C1) × P(C1) / P(D2) = (1/2) × (1/3) / (1/2) = 1/3

• P(D2|C1) = 1/2 (if the car is behind door 1, then Monty can open either door 2 or 3)

• P(D2) = P(D2|C1)P(C1) + P(D2|C2)P(C2) + P(D2|C3)P(C3) = (1/2)(1/3) + 0 + 1 × (1/3) = 1/2

Note that P(C2|D2) = 0. Thus P(C3|D2) = 1 - P(C1|D2) = 1 - 1/3 = 2/3.

Thus, changing the initial choice will increase the probability of winning the car. Alternatively,

P(C3|D2) = P(D2|C3) × P(C3) / P(D2) = 1 × (1/3) / (1/2) = 2/3

P(D2|C3) = 1 (if the car is behind door 3 and the player has chosen door 1, Monty has to open door 2 with probability 1)

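The Bayes result can be cross-checked with a Monte Carlo simulation; a sketch in which door indices 0–2 stand for doors 1–3, and Monty's tie-break when both goat doors are available is arbitrary (which does not affect the switching probability):

```python
import random

def play_and_switch(rng):
    """One game: player picks door 0, Monty opens a goat door, player switches."""
    car = rng.randrange(3)
    pick = 0
    # Monty opens a door that is neither the player's pick nor the car
    opened = next(d for d in (1, 2) if d != car)
    switched = next(d for d in range(3) if d not in (pick, opened))
    return switched == car

rng = random.Random(42)
trials = 100_000
wins = sum(play_and_switch(rng) for _ in range(trials)) / trials  # close to 2/3
```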



GENERALIZATION OF BAYES THEOREM
Event generated from mutually exclusive subsets

Example 3.4

Black boxes used in aircraft are manufactured by three companies A, B and C. 75% are manufactured by A, 15% by B, and 10% by C. The defect rates of black boxes manufactured by A, B, and C are 4%, 6%, and 8%, respectively. If a randomly tested black box is found to be defective, what is the probability that it was manufactured by company A?


Solution to Example 3.4
• Let A, B, C be the events corresponding to the black box being manufactured by companies A, B, and C, respectively, and let D be the event that a black box is defective. We are interested in calculating the probability P(A|D).

P(A|D) = P(D|A) × P(A) / P(D)

• Now P(D|A) = 0.04 and P(A) = 0.75. Using the law of total probability,

P(D) = 0.75 × 0.04 + 0.15 × 0.06 + 0.10 × 0.08 = 0.047

So,

P(A|D) = (0.04 × 0.75) / 0.047 = 0.6383
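The total-probability and Bayes steps of Example 3.4 in a few lines (a sketch; the dictionary keys are the company labels):

```python
# P(D) = sum of P(D|maker) P(maker); then P(A|D) = P(D|A) P(A) / P(D).
priors = {"A": 0.75, "B": 0.15, "C": 0.10}
defect_rates = {"A": 0.04, "B": 0.06, "C": 0.08}

p_defect = sum(priors[m] * defect_rates[m] for m in priors)    # 0.047
p_A_given_defect = defect_rates["A"] * priors["A"] / p_defect  # ≈ 0.638
```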


Random Variables
• A random variable is a function that maps every outcome in the sample space to a real number.

• That is, it assigns a real number to each sample point in the sample space S.

• A random variable is a robust and convenient way of representing the outcome of a random experiment.


Discrete Random Variables
• If the random variable X can assume only a finite or countably infinite set of
values, then it is called a discrete random variable.
• Examples of discrete random variables are:
– Credit rating (usually classified into different categories such as low,
medium and high or using labels such as AAA, AA, A, BBB, etc.).
– Number of orders received at an e-commerce retailer which can be
countably infinite.
– Customer churn (the random variables take binary values, 1. Churn and 2.
Do not churn).
– Fraud (the random variables take binary values, 1. Fraudulent transaction
and 2. Genuine transaction).
– Any experiment that involves counting (for example, number of returns in
a day from customers of e-commerce portals such as Amazon, Flipkart;
number of customers not accepting job offers from an organization).


Probability mass function
• For a discrete random variable, the probability that a random variable X takes a specific value xi, P(X = xi), is called the probability mass function P(xi).

• That is, a probability mass function is a function that maps each outcome of a random experiment to a probability.

Expected Value
• The expected value (or mean) of a discrete random variable is given by

E(X) = Σ_{i=1}^{n} xi P(xi)

• where xi is the specific value taken by a discrete random variable X and P(xi) is the corresponding probability, that is, P(X = xi).


Variance and Standard Deviation

The variance of a discrete random variable is given by

Var(X) = Σ_{i=1}^{n} [xi - E(X)]² P(xi)

The standard deviation of a discrete random variable is given by

σ = √Var(X)
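The expectation and variance formulas can be sketched directly from a PMF stored as a dictionary; the PMF below reuses the click counts of Example 3.1 (number of advertisements clicked per visitor):

```python
# E(X) = Σ x P(x); Var(X) = Σ (x - E(X))^2 P(x); sd = sqrt(Var(X)).
from math import sqrt

def expected_value(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

pmf = {0: 2450 / 2500, 1: 30 / 2500, 2: 15 / 2500, 3: 5 / 2500}
mu = expected_value(pmf)   # 0.03 clicks per visitor
sd = sqrt(variance(pmf))
```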


Probability Density Function (pdf)
• The probability density function, f(x), is defined through the probability that the value of random variable X lies in an infinitesimally small interval defined by x and x + Δx:

f(x) = lim_{Δx→0} P(x ≤ X ≤ x + Δx) / Δx


Cumulative Distribution Function (CDF)
• The cumulative distribution function (CDF) of a continuous random variable is defined by

F(a) = P(X ≤ a) = ∫_{-∞}^{a} f(x) dx

Cumulative distribution function of a continuous random variable.

The probability density function and cumulative distribution function of a continuous random variable satisfy the following properties:

f(x) ≥ 0

F(∞) = ∫_{-∞}^{∞} f(x) dx = 1

The probability between two values a and b, P(a ≤ X ≤ b), is the area under the probability density function between the values a and b:

P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx = F(b) - F(a)


• The expected value of a continuous random variable, E(X), is given by

E(X) = ∫_{-∞}^{∞} x f(x) dx

• The variance of a continuous random variable, Var(X), is given by

Var(X) = ∫_{-∞}^{∞} [x - E(X)]² f(x) dx


Binomial Distribution
• A random variable X is said to follow a binomial distribution when:
– The random variable can have only two outcomes, success and failure (such trials are known as Bernoulli trials).
– The objective is to find the probability of getting k successes out of n trials.
– The probability of success is p, and thus the probability of failure is (1 - p).
– The probability p is constant and does not change between trials.


Probability Mass Function (PMF) of Binomial Distribution
• The PMF of the binomial distribution (the probability that the number of successes will be exactly x out of n trials) is given by

PMF(x) = P(X = x) = C(n, x) p^x (1 - p)^(n-x),   0 ≤ x ≤ n

where

C(n, x) = n! / (x!(n - x)!)


Mean and Variance of Binomial Distribution
The mean of a binomial distribution is given by

Mean = E(X) = Σ_{x=0}^{n} x × PMF(x) = Σ_{x=0}^{n} x × C(n, x) p^x (1 - p)^(n-x) = np

The variance of a binomial distribution is given by

Var(X) = Σ_{x=0}^{n} (x - E(X))² × PMF(x) = Σ_{x=0}^{n} (x - E(X))² × C(n, x) p^x (1 - p)^(n-x) = np(1 - p)

If the number of trials (n) in a binomial distribution is large, then it can be approximated by a normal distribution with mean np and variance npq, where q = 1 - p.


Example 3.5

Fashion Trends Online (FTO) is an e-commerce company that sells women's apparel. It is observed that about 10% of their customers return the items purchased by them for many reasons (such as size, colour, and material mismatch). On a particular day, 20 customers purchased items from FTO. Calculate:
(a) Probability that exactly 5 customers will return the items.
(b) Probability that a maximum of 5 customers will return the items.
(c) Probability that more than 5 customers will return the items purchased by them.
(d) Average number of customers who are likely to return the items.
(e) The variance and the standard deviation of the number of returns.


Solution
In this case, the value of n = 20 and p = 0.1.
(a) Probability that exactly 5 customers will return the items purchased is

P(X = 5) = C(20, 5) × (0.1)^5 × (0.9)^15 = 0.03192

(b) Probability that a maximum of 5 customers will return the items purchased is

P(X ≤ 5) = Σ_{k=0}^{5} C(20, k) × (0.1)^k × (0.9)^(20-k) = 0.9887

(c) Probability that more than 5 customers will return the product is

P(X > 5) = 1 - P(X ≤ 5) = 1 - 0.9887 = 0.0113

(d) The average number of customers who are likely to return the items is
E(X) = n × p = 20 × 0.1 = 2
(e) Variance of a binomial distribution is given by

Var(X) = n × p × (1 - p) = 20 × 0.1 × 0.9 = 1.8

and the corresponding standard deviation is √1.8 = 1.3416
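Example 3.5 can be reproduced with nothing more than the standard library's `math.comb`; a sketch:

```python
from math import comb, sqrt

# Binomial PMF and CDF for n = 20 trials with success probability p = 0.1.
def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 20, 0.1
p_exactly_5 = binom_pmf(5, n, p)     # (a) ≈ 0.0319
p_at_most_5 = binom_cdf(5, n, p)     # (b) ≈ 0.9887
p_more_than_5 = 1 - p_at_most_5      # (c) ≈ 0.0113
mean, var = n * p, n * p * (1 - p)   # (d) 2, (e) 1.8
sd = sqrt(var)                       # ≈ 1.3416
```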


Poisson Distribution
• The Poisson distribution is used when we have to find the probability of the number of events occurring in a fixed interval of time or space.
• The probability mass function of a Poisson distribution is given by

P(X = k) = e^(-λ) λ^k / k!,   k = 0, 1, 2, …

• where λ is the rate of occurrence of the events per unit of measurement.
• The cumulative distribution function of a Poisson distribution is given by

P(X ≤ k) = Σ_{i=0}^{k} e^(-λ) λ^i / i!


• The mean and variance of a Poisson random variable are given by E(X) = λ and Var(X) = λ.

Probability mass function of a Poisson random variable (λ = 4). Cumulative distribution function of a Poisson random variable (λ = 4).


Example
On average, about 20 customers per day cancel their order placed at Fashion Trends Online. Calculate the probability that the number of cancellations on a day is exactly 20 and the probability that the maximum number of cancellations is 25.

Solution
The probability that the number of cancellations is exactly 20 is given by

P(X = 20) = e^(-20) × 20^20 / 20! = 0.0888

The probability that the maximum number of cancellations will be 25 is given by

P(X ≤ 25) = Σ_{k=0}^{25} e^(-20) × 20^k / k! = 0.8878


Geometric Distribution
• The geometric distribution represents a random experiment in which the random variable counts the number of trials needed to obtain the first success.

• The probability mass function of a geometric distribution is given by

P(X = x) = P(success at the xth trial) = (1 - p)^(x-1) p,   x = 1, 2, 3, …

• The cumulative distribution function is given by

F(x) = P(X ≤ x) = 1 - (1 - p)^x

• The mean and variance of a geometric distribution are given by

E(X) = 1/p   and   Var(X) = (1 - p)/p²


Probability mass function of a geometric Cumulative distribution function of a
distribution (p = 0.3). geometric distribution (p = 0.3).


Memoryless Property of Geometric Distribution
• The memoryless property is a special property of the geometric distribution in which the conditional probability P(X > i + j | X > i) depends only on the value j, not on the value i. We know that

P(X > i) = 1 - P(X ≤ i) = 1 - [1 - (1 - p)^i] = (1 - p)^i

P(X > i + j | X > i) = P(X > i + j and X > i) / P(X > i) = P(X > i + j) / P(X > i) = (1 - p)^(i+j) / (1 - p)^i = (1 - p)^j

• Note that P(X > j) = (1 - p)^j. Thus, P(X > i + j | X > i) = P(X > j).

• The memoryless property is an important property that simplifies calculations associated with conditional probabilities.


Example
Local Dhaniawala (LD) is an online grocery store and has an
innovative feature which predicts whether the customer has forgotten to
buy an item which is very common among customers of grocery items.
The probability that a customer buys milk in each shopping visit is 0.2.

(a) Calculate the probability that the customer’s first purchase of milk
happens during the 5th visit.
(b) Calculate the average time between purchases of milk.
(c) If a customer has not purchased milk during the past 3 shopping
visits, what is the probability that the customer will not buy milk for
another 2 visits?


Solution
(a) The probability that the customer's first purchase of milk happens on the 5th visit is given by

P(X = 5) = (1 - 0.2)^4 × 0.2 = 0.08192

(b) The average number of visits between purchases of milk is

E(X) = 1/p = 1/0.2 = 5

(c) Given that a customer has not purchased milk for the past 3 shopping visits, the probability that the customer will not buy for another 2 visits is given by

P(X > 3 + 2 | X > 3) = P(X > 2) = (1 - p)² = (1 - 0.2)² = 0.64
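A sketch of the geometric PMF and a numerical check of the memoryless property, using the milk example's p = 0.2:

```python
# P(X = x) = (1-p)^(x-1) p; survival function P(X > x) = (1-p)^x.
def geom_pmf(x, p):
    return (1 - p) ** (x - 1) * p

def geom_survival(x, p):
    return (1 - p) ** x

p = 0.2
p_first_on_5th = geom_pmf(5, p)                     # (a) 0.08192
mean_visits = 1 / p                                 # (b) 5
# (c) memoryless: P(X > 3 + 2 | X > 3) equals the unconditional P(X > 2)
p_cond = geom_survival(5, p) / geom_survival(3, p)  # 0.64
```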


Parameters of Continuous Distributions
• Scale parameter: The scale parameter defines the range of the continuous distribution. The larger the scale parameter value, the larger the spread of the distribution.

• Shape parameter: The shape parameter defines the shape of the probability distribution. Changes to the value of the shape parameter change the shape of the distribution.

• Location parameter: The location parameter locates (or shifts) the distribution on the horizontal axis.


Uniform Distribution
The probability density function is

f(x) = 1/(b - a),   x ∈ [a, b]
f(x) = 0,   otherwise

The cumulative distribution function is

F(x) = 0,   x < a
F(x) = (x - a)/(b - a),   a ≤ x ≤ b
F(x) = 1,   x > b

The mean and variance of the uniform distribution are

E(X) = (a + b)/2   and   Var(X) = (b - a)²/12


Exponential Distribution
• The exponential distribution is a single-parameter continuous distribution that is traditionally used for modelling the time to failure of electronic components.

• The probability density function and cumulative distribution function of the exponential distribution are given by

f(x) = λ e^(-λx),   x ≥ 0, λ > 0

F(x) = 1 - e^(-λx)

• The parameter λ is the scale parameter and represents the rate of occurrence of the event; (1/λ) is the mean time between events.


Probability density function of an exponential distribution.

The mean and variance of an exponential distribution are given by

E(X) = 1/λ   and   Var(X) = 1/λ²

The expected value (1/λ) is the mean time between events.


Memoryless Property of Exponential Distribution
• The exponential distribution is the only continuous probability distribution that has the memoryless property. That is,

P(X > t + s | X > t) = P(X > s)

P(X > t + s | X > t) = P(X > t + s and X > t) / P(X > t) = P(X > t + s) / P(X > t) = e^(-λ(t+s)) / e^(-λt) = e^(-λs)


Example
The time to failure of an avionic system follows an exponential
distribution with a mean time between failures (MTBF) of 1000
hours.

(a) Calculate the probability that the system will fail before 1000
hours.
(b) Calculate the probability that it will not fail up to 2000 hours.
(c) Calculate the time by which 10% of the systems will fail (that
is calculate P10 life)


Solution
(a) The probability that the system will fail by 1000 hours is

F(1000) = 1 - e^(-λt)

In this case λ = 1/1000 and t = 1000, so F(1000) = 1 - e^(-1000/1000) = 1 - e^(-1) = 0.6321

(b) The probability that the system will not fail up to 2000 hours is

P(X > 2000) = 1 - P(X ≤ 2000) = 1 - F(2000) = e^(-2000/1000) = e^(-2) = 0.1353

(c) The time by which 10% of the systems will fail satisfies

F(t) = 0.10  ⟹  1 - e^(-λt) = 0.1  ⟹  e^(-λt) = 0.9

So,

t = -(1/λ) × ln(0.9) = -1000 × ln(0.9) = 105.36 hours

That is, by 105.36 hours, 10% of the items will fail.
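The exponential-distribution arithmetic above in a short sketch:

```python
from math import exp, log

# Time to failure with MTBF 1000 h, so rate lam = 1/1000 per hour.
lam = 1 / 1000

def exp_cdf(t):
    return 1 - exp(-lam * t)

p_fail_by_1000 = exp_cdf(1000)      # (a) 1 - e^-1 ≈ 0.6321
p_survive_2000 = 1 - exp_cdf(2000)  # (b) e^-2 ≈ 0.1353
t_p10 = -log(0.9) / lam             # (c) P10 life ≈ 105.36 hours
```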


Normal Distribution
• The normal distribution, also known as the Gaussian distribution, is one of the most popular continuous distributions in the field of analytics, especially due to its use in multiple contexts.
• The probability density function and the cumulative distribution function are given by

f(x) = (1/(σ√(2π))) e^(-(1/2)((x-μ)/σ)²),   -∞ < x < ∞

F(x) = ∫_{-∞}^{x} (1/(σ√(2π))) e^(-(1/2)((t-μ)/σ)²) dt,   -∞ < x < ∞

• Here μ and σ are the mean and standard deviation of the normal distribution.


NORM.DIST(x, μ, σ, true) can be used in Excel to calculate the cumulative distribution function of a normal distribution with mean μ and standard deviation σ (with the last argument set to false, it returns the probability density function).

Probability density function of a normal distribution. Cumulative distribution function of a normal distribution.


Properties of Normal Distribution
1. Theoretical normal density functions are defined between -∞ and +∞.

2. It is a two-parameter distribution, where the parameter μ is the mean (location parameter) and the parameter σ is the standard deviation (scale parameter).

3. All normal distributions have a symmetrical bell shape around the mean μ (thus it is also the median). μ is also the mode of the normal distribution; that is, μ is the mean, median as well as the mode.


4. For any normal distribution, the areas between specific values measured in terms of μ and σ are given by:

Value of Random Variable                                    Area under the Normal Distribution (CDF)
μ - σ ≤ X ≤ μ + σ (area within one sigma of the mean)       0.6828
μ - 2σ ≤ X ≤ μ + 2σ (area within two sigma of the mean)     0.9545
μ - 3σ ≤ X ≤ μ + 3σ (area within three sigma of the mean)   0.9973

5. Any linear transformation of a normal random variable is also a normal random variable. That is, if X is a normal random variable, then the linear transformation AX + B (where A and B are two constants) is also a normal random variable.


• If X1 and X2 are two independent normal random variables with means μ1 and μ2 and variances σ1² and σ2², respectively, then X1 + X2 is also a normal random variable with mean μ1 + μ2 and variance σ1² + σ2².

• The sampling distribution of the mean of a large sample drawn from a population of any distribution is likely to follow a normal distribution; this result is known as the central limit theorem.


Standard Normal Variable

• A normal random variable with mean μ = 0 and σ = 1 is called the standard normal variable and is usually represented by Z.
• The probability density function and cumulative distribution function of a standard normal variable are given by

f(z) = (1/√(2π)) e^(-z²/2)

F(z) = ∫_{-∞}^{z} (1/√(2π)) e^(-t²/2) dt


• By using the following transformation, any normal random variable X can be converted into a standard normal variable:

Z = (X - μ)/σ

• The random variable X can be written in the form of a standard normal random variable using the relationship

X = μ + σZ


• A simple approximation of the standard normal CDF is given by Tocher (1963):

P(Z ≤ z) = F(z) ≈ e^(2kz) / (1 + e^(2kz)),   where k = √(2/π)

• Another, more accurate approximation is provided by Bryc (2002):

P(Z ≤ z) = F(z) ≈ 1 - ((z² + A1 z + A2) / (√(2π) z³ + B1 z² + B2 z + 2A2)) e^(-z²/2)


Example
According to a survey on use of smart phones in India, the smart
phone users spend 68 minutes in a day on average in sending
messages and the corresponding standard deviation is 12
minutes. Assume that the time spent in sending messages
follows a normal distribution.

(a) What proportion of the smart phone users are spending more
than 90 minutes in sending messages daily?
(b) What proportion of customers are spending less than 20
minutes?
(c) What proportion of customers are spending between 50
minutes and 100 minutes?


Solution
It is given that μ = 68 minutes and σ = 12 minutes.
(a) The proportion of customers spending more than 90 minutes is given by P(X > 90) = 1 - P(X ≤ 90) = 1 - F(90).
The standard normal random variable value for X = 90 is given by

Z = (x - μ)/σ = (90 - 68)/12 = 1.8333

That is, F(X = 90) = F(Z = 1.8333). From the standard normal distribution table, the area under the curve for Z = 1.8333 is 0.9666. Thus, P(X > 90) = 1 - P(X ≤ 90) = 1 - F(90) = 1 - 0.9666 = 0.0334

Alternatively, using Excel, we get

P(X > 90) = 1 - P(X ≤ 90) = 1 - NORMDIST(90, 68, 12, TRUE) = 0.0334


(b) The proportion of customers spending less than 20 minutes is
P(X ≤ 20) = F(20)
Using the Excel function, we have NORMDIST(20, 68, 12, TRUE) = 3.1671 × 10⁻⁵

(c) The proportion of customers spending between 50 and 100 minutes is given by

P(50 ≤ X ≤ 100) = F(100) - F(50)
= NORMDIST(100, 68, 12, TRUE) - NORMDIST(50, 68, 12, TRUE)
= 0.9293


Chi-Square Distribution
• The chi-square distribution with k degrees of freedom [denoted χ²(k)] is the distribution obtained by adding the squares of k independent standard normal random variables.
• Consider a normal random variable X1 with mean μ1 and standard deviation σ1. Then we can define Z1 (the standard normal random variable) as

Z1 = (X1 - μ1)/σ1

• Then

Z1² = ((X1 - μ1)/σ1)²

is a chi-square distribution with one degree of freedom [χ²(1)].


• Let X2 be a normal random variable with mean μ2 and standard deviation σ2, and let Z2 be the corresponding standard normal variable. Then the random variable Z1² + Z2², given by

Z1² + Z2² = ((X1 - μ1)/σ1)² + ((X2 - μ2)/σ2)²

is a chi-square distribution with 2 degrees of freedom.

• A chi-square distribution with k degrees of freedom is given by the sum of squares of standard normal random variables Z1, Z2, …, Zk obtained by transforming normal random variables X1, X2, …, Xk with mean values μ1, μ2, …, μk and corresponding standard deviations σ1, σ2, …, σk. That is,

χ²(k) = Z1² + Z2² + … + Zk² = ((X1 - μ1)/σ1)² + ((X2 - μ2)/σ2)² + … + ((Xk - μk)/σk)²


The probability density function of χ²(k) is given by

f(x) = (1 / (2^(k/2) Γ(k/2))) x^(k/2 - 1) e^(-x/2)

where Γ(k/2) is the Gamma function, given by

Γ(k) = ∫_{0}^{∞} x^(k-1) e^(-x) dx


• The cumulative distribution function of a chi-square distribution with k degrees of freedom is given by

F(x) = γ(k/2, x/2) / Γ(k/2)

• where γ(k/2, x/2) is the lower incomplete Gamma function, given by

γ(k, x) = ∫_{0}^{x} t^(k-1) e^(-t) dt


Probability density function of a chi-square distribution for different values of k. Cumulative distribution function of a chi-square distribution with k degrees of freedom.


Properties of chi-square distribution
• The mean and standard deviation of a chi-square distribution are k and √(2k), where k is the degrees of freedom.

• As the degrees of freedom k increases, the probability density function of a chi-square distribution approaches the normal distribution.

• The chi-square goodness-of-fit test is one of the popular tests for checking whether a data set follows a specific probability distribution.


Student's t-Distribution
• Student's t-distribution (or simply the t-distribution) arises while estimating the population mean of a normal distribution using a sample that is small and/or when the population standard deviation is unknown.

• The distribution was developed by William Gosset under the pseudonym 'Student' while working for the Guinness Brewery in Dublin, Ireland (Student, 1908), and is thus called Student's distribution.


• Assume that X1, X2, …, Xn are n observations (that is, a sample of size n) from a normal distribution with mean μ and standard deviation σ. Let

X̄ = (1/n) Σ_{i=1}^{n} Xi

S = √( (1/(n - 1)) Σ_{i=1}^{n} (Xi - X̄)² )

• where X̄ and S are the mean and standard deviation estimated from the sample X1, X2, …, Xn. Then the random variable t defined by

t = (X̄ - μ) / (S/√n)

follows a t-distribution with (n - 1) degrees of freedom.


• The probability density function of a t-distribution with n degrees of freedom is given by

f(x) = (Γ((n + 1)/2) / (√(nπ) Γ(n/2))) (1 + x²/n)^(-(n+1)/2)

Cumulative distribution function of Student's t-distribution.


Properties of t-distribution:
• The mean of a t-distribution with 2 or more degrees of freedom is 0.
• The standard deviation of a t-distribution is √(n/(n - 2)) for n > 2, where n is the number of degrees of freedom.

• As the degrees of freedom n increases, the probability density function of a t-distribution approaches the density function of the standard normal distribution. For n > 120, the area under the probability density function of a t-distribution is very close to the area under a standard normal distribution.
• The t-distribution is an important distribution for hypothesis testing of the mean of a population and for comparing the means of two populations.


F-Distribution
The F-distribution (short for Fisher's distribution, named after statistician Ronald Fisher) is a ratio of two chi-square distributions. Let Y1 and Y2 be two independent chi-square distributions with k1 and k2 degrees of freedom, respectively. Then the random variable X defined as

X = (Y1/k1) / (Y2/k2)

is an F-distribution. The probability density function of an F-distribution is given by

f(x) = (Γ((k1 + k2)/2) / (Γ(k1/2) Γ(k2/2))) (k1/k2)^(k1/2) x^(k1/2 - 1) (1 + k1 x/k2)^(-(k1+k2)/2)


Probability density function of F-distribution. Cumulative distribution function of F-distribution.


Properties of F-distribution:
• The mean of the F-distribution is k2/(k2 - 2), for k2 > 2.

• The standard deviation of the F-distribution is √( 2k2²(k1 + k2 - 2) / (k1(k2 - 2)²(k2 - 4)) ), for k2 > 4.

• The F-distribution is non-symmetrical and the shape of the distribution depends on the values of k1 and k2.

• The F-distribution is used in Analysis of Variance (ANOVA) to test the mean values of multiple groups.


Summary
• The concepts of probability, random variables and probability distributions are foundations of data science. Knowledge of these concepts is important for framing and solving analytics problems.

• A random variable is a function that maps an outcome of a random experiment to a real number and plays an important role in analytics, since many key performance indicators used across industries are random variables.

• Basic probability concepts such as joint events, independent events, conditional probability and Bayes' theorem are useful for predicting the probability of an event of interest. These concepts are used in algorithms such as association rule learning, which is used in solving analytics problems such as market basket analysis and recommender systems.


• Discrete probability distributions such as binomial distribution, Poisson
distribution and geometric distribution are used for modelling discrete
random variables.

• Continuous distributions such as normal distribution, chi-square


distribution, t-distribution and F-distribution play an important role in
hypothesis testing.
