Introduction to probability
1. What is Probability?
Probability is a measure of how likely an event is to occur, expressed as a number between 0 and 1, where 0 means the event is impossible and 1 means it is certain.
2. Sample Space
The sample space of an experiment is the set of all possible outcomes. For example, tossing a coin has the sample space {H, T}, and rolling a die has the sample space {1, 2, 3, 4, 5, 6}.
3. Events
An event is a subset of the sample space. For example, "rolling an even number" on a fair die is the event {2, 4, 6}.
4. Calculating Probability
For equally likely outcomes, the probability of an event E occurring is given by:
P(E) = (number of outcomes in E) / (total number of outcomes in the sample space)
5. Probability Rules
• Addition Rule: The probability that either of two events A or B occurs is:
P(A∪B)=P(A)+P(B)−P(A∩B)
If A and B are mutually exclusive (i.e., they cannot happen at the same
time), then P(A∩B) = 0, and the formula simplifies to:
P(A∪B)=P(A)+P(B)
• Multiplication Rule: If A and B are independent events, the probability that both occur is:
P(A∩B) = P(A) × P(B)
In general (whether or not A and B are independent):
P(A∩B) = P(A) × P(B∣A)
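As a quick check of these rules, here is a short Python sketch (not part of the original notes) that verifies the addition rule by enumerating the outcomes of a single die roll, with A = "even number" and B = "greater than 3" as illustrative events:

from fractions import Fraction

# Sample space: one roll of a fair die
omega = {1, 2, 3, 4, 5, 6}
A = {x for x in omega if x % 2 == 0}   # event "even number"
B = {x for x in omega if x > 3}        # event "greater than 3"

def prob(event):
    # Equally likely outcomes: P(E) = |E| / |sample space|
    return Fraction(len(event), len(omega))

# Addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)   # both print 2/3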
6. Conditional Probability
The conditional probability of A given that B has occurred is:
P(A∣B) = P(A∩B) / P(B), provided P(B) > 0
This is useful in many real-world situations, like determining the likelihood of rain given that it is already cloudy.
7. Complementary Events
The complement of an event A (denoted Ac) is the event that A does not occur.
The probability of the complement is:
P(Ac)=1−P(A)
For example, if the probability of it raining tomorrow is 0.3, the probability that it does not rain is:
P(Ac) = 1 − 0.3 = 0.7
8. Common Distributions
• Uniform Distribution: All outcomes are equally likely (e.g., rolling a fair die).
• Binomial Distribution: Models the number of successes in a fixed number
of independent trials, each with two possible outcomes (success/failure).
• Normal Distribution: A continuous probability distribution that is
symmetric about the mean, often used in statistics and natural sciences.
9. Bayes' Theorem
P(A∣B) = P(B∣A) × P(A) / P(B)
This theorem is frequently used in areas like medical testing, machine learning,
and decision-making under uncertainty.
For example, when we toss a coin, we get either a Head or a Tail, so only two outcomes are possible: (H, T). But when two coins are tossed, there are four possible outcomes: {(H, H), (H, T), (T, H), (T, T)}.
1) There are 6 pillows in a bed, 3 are red, 2 are yellow and 1 is blue. What is the
probability of picking a yellow pillow?
Ans: The probability is equal to the number of yellow pillows in the bed divided by
the total number of pillows, i.e. 2/6 = 1/3.
2) There is a container full of coloured bottles: red, blue, green and orange. Some of the bottles are picked out and then replaced. Sumit did this 1000 times and found that 450 of the picks were green.
a) What is the probability of picking a green bottle?
Ans: For every 1000 bottles picked out, 450 are green, so the probability of picking a green bottle is 450/1000 = 0.45.
b) If there are 100 bottles in the container, how many of them are likely to be green?
Ans: The experiment implies that 450 out of every 1000 bottles picked are green, so out of 100 bottles, about 45 are likely to be green.
Conclusion
Probability assigns numbers between 0 and 1 to events, and a small set of rules (addition, multiplication, complements and conditional probability) is enough to analyse everyday experiments such as coin tosses, die rolls and draws from a container.
Probability Theory
2. Events
• Event: A subset of the sample space. An event may consist of one or more
outcomes. For example, the event "rolling an even number" when tossing a
fair die is E={2,4,6}.
3. Probability of an Event
For equally likely outcomes, P(E) = (number of outcomes in E) / (number of outcomes in the sample space).
For example, the probability of rolling an even number on a fair die is:
P(E) = 3/6 = 1/2
4. Probability Rules
• Addition Rule (for mutually exclusive events): If A and B are two mutually exclusive events (they cannot both happen at the same time), then the probability of either A or B occurring is:
P(A∪B) = P(A) + P(B)
• Multiplication Rule (for independent events): If A and B are independent, the probability that both occur is:
P(A∩B) = P(A) × P(B)
• Complement Rule: The probability that event A does not occur is:
P(Ac)=1−P(A)
5. Conditional Probability
P(A∣B) = P(A∩B) / P(B), provided P(B) > 0.
6. Random Variables
A random variable X assigns a numerical value to each outcome of an experiment. For a discrete random variable, the expected value (mean) is:
E(X) = ∑ xi P(xi)
where xi are the possible values of X, and P(xi) is the probability of xi.
The variance measures how far X tends to be from its mean:
Var(X) = E[(X − E(X))²]
The standard deviation is the square root of the variance, providing a measure of
how spread out the values of X are around the expected value.
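As an illustration (assumed example, not from the original notes), the following Python sketch computes E(X), Var(X) and the standard deviation for a fair die using these formulas:

import math
from fractions import Fraction

# Fair die: values 1..6, each with probability 1/6
values = range(1, 7)
p = Fraction(1, 6)

mean = sum(x * p for x in values)                # E(X) = ∑ xi P(xi) = 7/2 = 3.5
var = sum((x - mean) ** 2 * p for x in values)   # Var(X) = E[(X − E(X))²] = 35/12
std = math.sqrt(var)                             # standard deviation ≈ 1.71

print(mean, var, round(std, 4))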
7. Central Limit Theorem
A fundamental theorem in probability theory that states that the sum (or average) of a large number of independent, identically distributed random variables, regardless of their original distribution, will be approximately normally distributed.
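A rough way to see the Central Limit Theorem is by simulation; the Python sketch below (illustrative only, using uniform(0, 1) draws as the assumed source distribution) shows that averages of many independent draws cluster in a bell-shaped way around the true mean:

import random
import statistics

random.seed(0)
n, reps = 100, 10_000
# Each entry is the average of n independent uniform(0, 1) draws.
sample_means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(reps)]

print(round(statistics.fmean(sample_means), 4))   # close to 0.5, the mean of uniform(0, 1)
print(round(statistics.stdev(sample_means), 4))   # close to sqrt(1/12)/sqrt(n) ≈ 0.0289
# A histogram of sample_means would look approximately normal (bell-shaped).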
8. Cumulative Distribution Function (CDF)
A function that describes the probability that a random variable X takes on a value less than or equal to a specific value x. It is given by:
FX(x) = P(X ≤ x)
These are just a few of the fundamental terms in probability theory. Many of
these concepts are interrelated and form the foundation for more advanced
studies in statistics, machine learning, and other fields involving uncertainty and
randomness.
The axioms of probability form the foundation for the mathematical theory of
probability. They are a set of three basic rules that must be satisfied by any
probability measure on a sample space. These axioms were formalized by the
Russian mathematician Andrey Kolmogorov in 1933, and they provide a rigorous
framework for understanding probability.
Here are the three axioms:
1. Non-negativity: For any event A,
P(A) ≥ 0
This means that the probability of any event must be a non-negative number; it can never be negative.
2. Normalization: The probability of the sample space Ω (the set of all possible outcomes) is 1:
P(Ω) = 1
This axiom asserts that some outcome in the sample space must occur, i.e., one of the outcomes will happen with certainty.
3. Additivity: For any two mutually exclusive (disjoint) events A and B, the probability of their union is the sum of their probabilities:
P(A∪B) = P(A) + P(B), if A∩B = ∅
If two events A and B cannot occur simultaneously (i.e., their intersection is the empty set), the probability of their union is simply the sum of the individual probabilities. More generally, for a countable sequence of pairwise disjoint events A1, A2, A3, …, the probability of their union is the sum of their probabilities:
P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + …
Derived properties from the axioms:
• Complement rule: P(Ac) = 1 − P(A)
• Addition rule (inclusion–exclusion): P(A∪B) = P(A) + P(B) − P(A∩B)
Let’s break down how simple probability rules can be applied within association
rule learning:
1. Support (Frequency-based probability)
The support of an item set X is the fraction of transactions that contain X:
Support(X) = (number of transactions containing X) / (total number of transactions)
This is the empirical probability of observing the item set X in the dataset.
For example, if in a retail dataset, 100 out of 1,000 transactions contain both bread and butter, the support of the item set {bread, butter} is:
Support({bread, butter}) = 100/1000 = 0.1
This means that 10% of transactions contain both bread and butter.
2. Confidence (Conditional probability)
Confidence is the probability that an item Y will be purchased given that item X has been purchased. It is a measure of the reliability of the rule X→Y, and can be thought of as a conditional probability:
Confidence(X→Y) = P(Y∣X) = Support(X∪Y) / Support(X)
For instance, if 80 transactions contain both bread and butter, and 100
transactions contain bread, then the confidence of the rule {bread}→{butter} is:
P({butter}∣{bread})=80/100=0.8
This means that if a customer buys bread, there’s an 80% chance they will also
buy butter.
3. Lift
Lift is a measure of how much more likely two items are to appear together than if they were independent. It compares the joint probability of X and Y with the product of their individual probabilities (a small numeric sketch of support, confidence and lift follows the list below):
Lift(X→Y) = P(X∩Y) / (P(X) × P(Y))
Lift measures whether the occurrence of one item increases the likelihood of the
other item occurring:
• Lift > 1: Items are positively associated (purchased together more often
than expected by chance).
• Lift = 1: Items are independent (purchased together as expected by
chance).
• Lift < 1: Items are negatively associated (purchased together less often than
expected by chance).
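To make support, confidence and lift concrete, here is a small Python sketch over a made-up list of transactions (the items and counts are assumptions for illustration, not data from the text):

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"bread", "butter"}, {"bread", "butter", "milk"}, {"bread"},
    {"milk"}, {"bread", "butter"}, {"butter"}, {"bread", "milk"},
    {"bread", "butter"}, {"milk", "butter"}, {"bread"},
]
N = len(transactions)

def support(items):
    # Fraction of transactions containing every item in `items`
    items = set(items)
    return sum(items <= t for t in transactions) / N

sup_xy = support({"bread", "butter"})                       # Support(X ∪ Y)
confidence = sup_xy / support({"bread"})                    # P(butter | bread)
lift = sup_xy / (support({"bread"}) * support({"butter"}))  # > 1, = 1 or < 1

print(round(sup_xy, 3), round(confidence, 3), round(lift, 3))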
A variant of confidence is the expected confidence, which assumes that the items
are independent. It is used as a baseline to evaluate how much better the actual
rule is compared to what would be expected under the assumption of
independence.
If the actual confidence is significantly higher than the expected confidence, then
the items are considered to have a strong association.
For example, if the expected confidence works out to 0.008, then an actual confidence P(Y∣X) much higher than 0.008 indicates a strong association.
In some cases, the associations found in rules might be evaluated for statistical
significance using tests like the chi-squared test. This is useful to determine if the
observed association is due to chance or if it is statistically significant.
Bayes' Theorem:
P(H∣E) = P(E∣H) ⋅ P(H) / P(E)
Where H is the hypothesis and E is the observed evidence.
Intuitive Breakdown:
1. Prior Probability P(H): This is your belief about the hypothesis before
considering the new evidence. For instance, it could be based on past
knowledge or experience.
2. Likelihood P(E∣H): This represents how likely the observed evidence is,
assuming the hypothesis is true. For example, if you have a hypothesis that
it will rain today, the likelihood is how probable it is that you would see
cloudy skies if it were indeed going to rain.
3. Marginal Likelihood P(E): This is the total probability of observing the
evidence, considering all possible hypotheses. It can be seen as a
normalizing constant that ensures the probabilities sum to 1.
4. Posterior Probability P(H∣E): This is the updated probability of the
hypothesis after considering the new evidence. It's the quantity that Bayes'
Theorem helps us compute.
Example:
Suppose a patient tests positive for a disease (the evidence E), and we want the probability that the patient actually has the disease (the hypothesis H). A numeric sketch follows the list below.
Where:
• P(H) is the prior probability that the patient has the disease (e.g., based on
general population statistics).
• P(E∣H) is the probability of a positive test result, given that the patient has
the disease (i.e., the sensitivity of the test).
• P(E) is the total probability of a positive test result, regardless of whether
the patient has the disease or not. This can be calculated as
P(E)=P(E∣H)⋅P(H)+P(E∣¬H)⋅P(¬H), where ¬H is the hypothesis that the patient
does not have the disease.
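Here is a minimal Python sketch of this calculation, using assumed example numbers (1% prevalence, 95% sensitivity, 10% false-positive rate) purely for illustration:

# Assumed example numbers (hypothetical): 1% prevalence, 95% sensitivity, 10% false-positive rate.
p_h = 0.01              # P(H): prior probability of having the disease
p_e_given_h = 0.95      # P(E|H): positive test given disease (sensitivity)
p_e_given_not_h = 0.10  # P(E|not H): positive test given no disease

# Total probability of a positive test: P(E) = P(E|H)·P(H) + P(E|¬H)·P(¬H)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' Theorem: P(H|E) = P(E|H)·P(H) / P(E)
posterior = p_e_given_h * p_h / p_e
print(round(posterior, 3))   # ≈ 0.088 with these assumed numbers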
Bayes' Theorem allows for updating beliefs in light of new data, and it is widely
used in various fields such as medical diagnosis, machine learning, risk analysis,
and even legal reasoning. It essentially helps refine decisions by incorporating
new information in a statistically sound way.
Random Variable:
A random variable is a variable whose value is determined by the outcome of a random experiment; it assigns a numerical value to each outcome in the sample space. Random variables can be discrete (taking countably many values) or continuous (taking values in an interval).
Probability Distribution
The possible outcomes of a random variable and their probabilities are often
described using a probability distribution. For a discrete random variable, the
distribution is given by a probability mass function (PMF), which assigns a
probability to each possible outcome. For continuous random variables, the
distribution is described by a probability density function (PDF).
• Expected Value (Mean): The probability-weighted average of the possible values of X:
E[X] = ∑ xi P(xi)
• Variance: Var(X) = E[(X − E[X])²], which measures how spread out the values of X are around the mean.
• Standard Deviation: The square root of the variance, which also measures the spread but in the same units as the original random variable.
Imagine a simple game where you roll a fair die, and the random variable X
represents the outcome of the roll. The values of X are 1, 2, 3, 4, 5, and 6, each
with probability 1/6. The expected value of X would be the average of these
values, weighted by their probabilities:
E[X] = (1 + 2 + 3 + 4 + 5 + 6) × 1/6 = 21/6 = 3.5
This means that on average, you would expect the roll to result in a 3.5 (though in reality, you can't roll a 3.5 on a die).
These are both fundamental concepts in probability theory, but they apply to
different types of random variables:
Example: Suppose you roll a fair six-sided die. The random variable X represents the outcome (1 through 6), and the PMF for X is:
P(X = x) = 1/6, for x = 1, 2, 3, 4, 5, 6
This means each outcome has a probability of 1/6.
3. The probability that X lies within an interval [a, b] is given by the area under the PDF curve between a and b:
P(a ≤ X ≤ b) = ∫ from a to b of fX(x) dx
Key Differences:
• Discrete: The number of heads when flipping a fair coin three times.
o The PMF gives the probability of each outcome (e.g., 0, 1, 2, or 3
heads).
• Continuous: The exact weight of a person.
o The PDF gives the likelihood of being a certain weight, but the
probability of having an exact weight (like 70.0 kg) is technically zero;
instead, you would calculate the probability of a range of weights.
Cumulative Distribution Function (CDF)
Definition:
FX(x) = P(X ≤ x)
Where:
• P( X ≤ x) is the probability that the random variable X takes a value less than
or equal to x.
• x is any value in the sample space of X.
Properties of the CDF:
• FX(x) is non-decreasing in x, with 0 ≤ FX(x) ≤ 1; FX(x) → 0 as x → −∞ and FX(x) → 1 as x → +∞.
• For a discrete random variable X, the CDF is a step function that jumps at each possible value that X can take.
• For a continuous random variable X, the CDF is the integral of the probability density function (PDF), fX(x), from −∞ to x:
FX(x) = ∫ from −∞ to x of fX(t) dt
Example:
Discrete CDF:
Consider a random variable X that can take the values 1, 2, or 3 with the following probabilities:
• P(X=1)=0.2
• P(X=2)=0.5
• P(X=3)=0.3
The CDF is then:
FX(x) = 0 for x < 1, 0.2 for 1 ≤ x < 2, 0.7 for 2 ≤ x < 3, and 1 for x ≥ 3.
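A small Python sketch (illustrative, using the probabilities above) that evaluates this step-function CDF:

from fractions import Fraction

# PMF from the example above
pmf = {1: Fraction(2, 10), 2: Fraction(5, 10), 3: Fraction(3, 10)}

def cdf(x):
    # F_X(x) = P(X ≤ x): sum the PMF over all values ≤ x
    return sum((p for v, p in pmf.items() if v <= x), Fraction(0))

print(cdf(0.5), cdf(1), cdf(2.5), cdf(3))   # 0, 1/5, 7/10, 1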
Continuous CDF:
For a continuous random variable, say X, with PDF fX(x), the CDF is given by:
FX(x) = ∫ from −∞ to x of fX(t) dt
For example, if X follows a normal distribution with mean μ (mu) and standard
deviation σ (sigma), then the CDF FX(x) is related to the standard normal CDF,
often denoted Φ(x).
Binomial Distribution
The Binomial distribution models the number of successes X in n independent trials, each with two outcomes (success/failure) and success probability p. Its probability mass function is:
P(X = k) = C(n, k) p^k (1 − p)^(n−k), for k = 0, 1, …, n
Where:
• n is the number of trials,
• k is the number of successes,
• p is the probability of success on a single trial,
• C(n, k) = n! / (k!(n − k)!) is the number of ways to choose which k of the n trials are successes.
Example:
Suppose you flip a coin 5 times, and you're interested in the probability of getting
exactly 3 heads (successes), with the probability of heads being 0.5 on each flip
(fair coin).
Here:
• n = 5 (number of flips), k = 3 (number of heads), p = 0.5
P(X = 3) = C(5, 3) × (0.5)^3 × (0.5)^2 = 10 × 0.125 × 0.25 = 0.3125
For a binomial random variable X, the mean and variance are given by:
E[X] = np and Var(X) = np(1 − p)
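The same computation can be checked with a short Python sketch (illustrative only) that implements the binomial PMF directly:

from math import comb

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) p^k (1 − p)^(n − k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 5, 0.5
print(binom_pmf(3, n, p))        # 0.3125: exactly 3 heads in 5 fair flips
print(n * p, n * p * (1 - p))    # mean = 2.5, variance = 1.25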
Applications:
Modeling the number of successes in a fixed number of independent yes/no trials, such as the number of heads in a series of coin flips or the number of defective items in a sampled batch.
Poisson Distribution
Key Properties:
The Poisson distribution models the number of events occurring in a fixed interval of time or space, when events occur independently and at a constant average rate. The probability of observing exactly x events in a given interval, when the average rate of occurrences is λ, is given by the Poisson probability mass function:
P(X = x) = (λ^x e^(−λ)) / x!, for x = 0, 1, 2, …
Where:
• λ is the average number of events per interval,
• x is the observed number of events,
• e ≈ 2.71828 is Euler's number.
Important Properties:
Mean: E[X]=λ
Variance: Var(X)=λ
Example:
Suppose a call center receives an average of 5 calls per hour. The number of calls
received in a given hour follows a Poisson distribution with λ=5. The probability of
receiving exactly 3 calls in the next hour is:
P(X = 3) = (5^3 e^(−5)) / 3! = (125 × e^(−5)) / 6 ≈ 0.1404
Thus, the probability of receiving exactly 3 calls in the next hour is approximately 14.04%.
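A short Python sketch (illustrative) of the same calculation using the Poisson PMF:

from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) = λ^x e^(−λ) / x!
    return lam ** x * exp(-lam) / factorial(x)

print(round(poisson_pmf(3, 5), 4))   # ≈ 0.1404: exactly 3 calls when λ = 5 per hour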
Approximation:
When the number of trials n is large and the success probability p is small, the Binomial(n, p) distribution is well approximated by a Poisson distribution with λ = np.
Geometric Distribution
1. Trial Characteristics:
o The trials are independent.
o Each trial has two possible outcomes: success or failure.
o The probability of success, p, is constant across trials.
o The random variable represents the number of trials until the first
success occurs.
2. Probability Mass Function (PMF): The probability that the first success
occurs on the k-th trial (where k=1,2,3,…) is given by the formula:
P(X = k) = (1 − p)^(k−1) ⋅ p
where:
• p is the probability of success on a single trial,
• k is the trial on which the first success occurs.
3. Cumulative Distribution Function (CDF): The probability that the first success occurs on or before the k-th trial is:
P(X ≤ k) = 1 − (1 − p)^k
4. Mean (Expected Value): The expected number of trials until the first success is:
E[X] = 1/p
This tells you, on average, how many trials it will take to get the first success.
5. Variance: The variance of the number of trials until the first success is:
Var(X) = (1 − p) / p²
Example:
Suppose you have a game where the probability of winning (success) on any given
trial is p=0.2. What is the probability that the first win happens on the 3rd trial?
P(X = 3) = (1 − 0.2)^(3−1) ⋅ 0.2 = (0.8)² ⋅ 0.2 = 0.64 ⋅ 0.2 = 0.128
So, the probability that the first success occurs on the 3rd trial is 0.128 or 12.8%.
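A short Python sketch (illustrative) of the geometric formulas above, using the same p = 0.2:

p = 0.2   # probability of success on each trial

def geom_pmf(k, p):
    # P(X = k) = (1 − p)^(k − 1) · p
    return (1 - p) ** (k - 1) * p

print(round(geom_pmf(3, p), 3))     # ≈ 0.128: first win on the 3rd trial
print(round(1 - (1 - p) ** 3, 3))   # P(X ≤ 3) = 1 − (1 − p)^3 ≈ 0.488
print(1 / p, (1 - p) / p ** 2)      # mean = 5 trials, variance ≈ 20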
Applications:
Modeling the number of attempts needed until a first success, such as the number of coin flips until the first head or the number of games played until the first win.
Uniform Distribution
The Uniform distribution assigns equal probability to all outcomes (discrete case) or equal density to all values in an interval (continuous case).
• Probability Mass Function (PMF): If you have n equally likely outcomes, the probability of each outcome xi is given by:
P(X = xi) = 1/n
where n is the total number of possible outcomes.
Example: If you roll a fair die, the probability of rolling any particular number (e.g., 1) is:
P(X = 1) = 1/6
• Probability Density Function (PDF): For a continuous uniform distribution over [a, b], the density is:
f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise
The total probability over the entire range [a, b] must be 1, so the constant 1/(b − a) ensures that the area under the curve is 1.
For a uniform distribution over [0, 10], the probability that X falls within any subinterval [c, d] of [0, 10] is given by:
P(c ≤ X ≤ d) = (d − c) / 10
For example, the probability that X falls between 3 and 7 is:
P(3 ≤ X ≤ 7) = (7 − 3) / 10 = 0.4
• Mean:
o For a continuous uniform distribution over [a, b], the mean is the midpoint of the interval:
E[X] = (a + b) / 2
• Variance:
o For a discrete uniform distribution with n outcomes (e.g., the values 1, 2, …, n), the variance is:
Var(X) = (n² − 1) / 12
o For a continuous uniform distribution over [a, b], the variance is:
Var(X) = (b − a)² / 12
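A short Python sketch (illustrative, using the [0, 10] interval assumed above) for the continuous uniform case:

a, b = 0, 10   # continuous uniform on [0, 10]

def uniform_prob(c, d):
    # P(c ≤ X ≤ d) = (d − c) / (b − a) for [c, d] inside [a, b]
    return (d - c) / (b - a)

print(uniform_prob(3, 7))                # 0.4
print((a + b) / 2, (b - a) ** 2 / 12)    # mean = 5.0, variance ≈ 8.33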
Applications:
• Discrete Uniform Distribution: Used for scenarios where all outcomes are
equally likely, such as rolling dice, drawing cards from a well-shuffled deck,
or flipping a fair coin.
• Continuous Uniform Distribution: Often used in simulations and models
where any value within a range is equally likely, such as generating random
numbers within a specific interval or modeling certain types of noise in
data.
The Exponential Distribution is a continuous probability distribution that is often
used to model the time between events in a Poisson process. A Poisson process is
a type of random process where events occur independently and at a constant
average rate. The Exponential distribution describes the waiting time between
successive events in this process.
Key Features:
The probability density function (PDF) of the Exponential distribution is:
f(x; λ) = λ e^(−λx), x ≥ 0
Where:
• λ > 0 is the rate parameter (the average number of events per unit time),
• x ≥ 0 is the waiting time until the next event.
The cumulative distribution function, which gives the probability that the time until the next event is less than or equal to some value x, is:
F(x; λ) = 1 − e^(−λx), x ≥ 0
Mean and Variance:
E[X] = 1/λ and Var(X) = 1/λ²
Applications:
Modeling waiting times, such as the time between arrivals (e.g., buses or calls), the time until a component fails, or the time between events in any Poisson process.
Example:
If a bus arrives at a station on average every 15 minutes, the time between two
consecutive bus arrivals follows an Exponential distribution with a rate parameter
λ=1/15 buses per minute.
To calculate the probability that the time until the next bus is less than 10 minutes:
P(X < 10) = F(10; 1/15) = 1 − e^(−10/15) ≈ 1 − 0.5134 = 0.4866
So, there's approximately a 48.66% chance that the next bus will arrive in less than 10 minutes.
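A short Python sketch (illustrative) of the same exponential calculation:

from math import exp

lam = 1 / 15   # rate: on average one bus per 15 minutes

def expon_cdf(x, lam):
    # F(x; λ) = 1 − e^(−λx)
    return 1 - exp(-lam * x)

print(round(expon_cdf(10, lam), 4))   # ≈ 0.4866: next bus within 10 minutes
print(1 / lam, 1 / lam ** 2)          # mean ≈ 15 minutes, variance ≈ 225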
Chi-Square Distribution
If Z1, Z2, …, Zk are independent standard normal random variables, then the sum of their squares
X = Z1² + Z2² + ⋯ + Zk²
follows a chi-square distribution with k degrees of freedom. Its probability density function is:
f(x; k) = x^(k/2 − 1) e^(−x/2) / (2^(k/2) Γ(k/2)), for x > 0
where Γ(⋅) is the Gamma function. The function shows how the probabilities are distributed over the values of x.
Example Usage:
• Chi-Square Goodness-of-Fit Test: Suppose you want to test if a die is fair. You roll it 60 times and record the outcomes. The chi-square test could be used to determine if the observed frequencies match the expected frequencies (which would be 10 for each of the six sides, assuming a fair die). The chi-square statistic would be computed as:
χ² = ∑ (Oi − Ei)² / Ei
where Oi are the observed frequencies and Ei the expected frequencies (a small numeric sketch follows this list).
• Chi-Square Test for Independence: You may use the chi-square test for
independence to test if two categorical variables are independent. For example, you
might want to test if there is an association between gender and whether people prefer
tea or coffee. The contingency table of observed frequencies is compared to the
expected frequencies under the assumption of independence.
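A minimal Python sketch of the goodness-of-fit computation, using made-up observed counts for 60 die rolls (the numbers are assumptions for illustration):

# Made-up observed counts for 60 rolls of a die (for illustration only).
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6   # fair die: 60 rolls / 6 faces

# χ² = ∑ (Oi − Ei)² / Ei
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)   # 1.0 here; compare with a chi-square critical value at 5 degrees of freedom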
1. Student's t-Distribution:
The Student's t-distribution is used in situations where the sample size is small
and/or the population variance is unknown. It was introduced by William Sealy
Gosset under the pseudonym "Student" in 1908.
Characteristics:
• Shape: Bell-shaped and symmetric about zero, like the standard normal distribution, but with heavier tails.
• Degrees of freedom: Its exact shape depends on the degrees of freedom (related to the sample size); as the degrees of freedom increase, the t-distribution approaches the standard normal distribution.
Common Uses:
• One-sample and two-sample t-tests for comparing means when the population variance is unknown.
• Confidence intervals for a mean based on small samples.
2. F-Distribution:
The F-distribution describes the ratio of two independent chi-square random variables, each divided by its degrees of freedom. It is used primarily for comparing variances and in analysis of variance (ANOVA).
Characteristics:
• Shape: The F-distribution is positively skewed (i.e., it has a longer right tail),
and its shape depends on two parameters: the degrees of freedom for the
numerator (df1) and the denominator (df2).
• Range: The F-distribution only takes positive values because it represents a
ratio of variances, which are always non-negative.
• Non-central F-distribution: When applied to non-central situations (i.e., in
the presence of a non-zero mean), the F-distribution may shift.
Degrees of Freedom:
The F-distribution is indexed by two degrees-of-freedom parameters: df1 for the numerator and df2 for the denominator, corresponding to the two variance estimates being compared.
Common Uses:
• Analysis of variance (ANOVA), which compares the means of several groups.
• Testing whether two population variances are equal.
• Testing the overall significance of a regression model.
Key Differences:
Feature        t-Distribution                            F-Distribution
Shape          Symmetric, bell-shaped, heavier tails     Positively skewed (long right tail)
Range          All real numbers                          Positive values only
Parameters     One degrees-of-freedom value              Two degrees-of-freedom values (df1, df2)
Typical use    Testing and estimating means              Comparing variances, ANOVA, regression
Example of Application:
A t-test can be used to compare the mean of a small sample against a hypothesized value or to compare two group means, while an F-test (ANOVA) can be used to compare variability across several groups (a small code sketch follows at the end of this section).
Both distributions are fundamental for hypothesis testing, but they serve
different purposes depending on the structure of the data and the hypothesis
being tested.
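As an illustration of typical applications, the following Python sketch runs a two-sample t-test and a one-way ANOVA (F-test) on made-up data; it assumes the SciPy library is available, and the numbers are purely hypothetical:

from scipy import stats   # assumes SciPy is available

# Hypothetical measurements for two groups (made-up numbers).
group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
group_b = [5.6, 5.8, 5.5, 5.9, 5.7, 5.4]

# Two-sample t-test: compares the two group means (t-distribution).
t_res = stats.ttest_ind(group_a, group_b)
# One-way ANOVA: compares group means via a variance ratio (F-distribution).
f_res = stats.f_oneway(group_a, group_b)

print(round(t_res.statistic, 3), round(t_res.pvalue, 4))
print(round(f_res.statistic, 3), round(f_res.pvalue, 4))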