
ES 691
Mathematics for Machine Learning
with Dr. Naveed R. Butt @ GIKI - FES
Recall…
Our next module is all about that…

Probability
- Modeling Uncertainties
- Collecting Probabilities (distributions)
- Extracting Key Indicators (moments)
- Reviewing Common Distributions
- Estimating Parameters

What is Probability?

• Probability is a “lack of knowledge”!

• We know you are here today. We are sure.
• But will you be here in the next lecture? We are not sure anymore! There is
now a “lack of knowledge”: “perhaps”, “maybe”, “probably”.
• Another Example: You know your height. But what’s the height of the next
student who enters the room?
Why Do We Sometimes Lack Knowledge?

Future: A die you haven’t rolled yet!
(How can we know which number it will show?)
Too hard to collect all the information
- Which places did you visit today?
- It may be possible to have a drone camera follow you all the time.
- Then we will not have “lack of knowledge” about the places you go to.
- But this is too hard a thing to do.

Another Example
- To perform facial recognition, we cannot ask a person to provide thousands of
their photos (different moods, lighting conditions, times of day, grooming levels).
Quantum Randomness
- Where’s the electron?
- According to the current consensus, processes and properties at the quantum
level are probabilistic by their very nature.
Statistics: Our way of making sense of the uncertain world through whatever
data we have…

What’s the height of the next student who enters the room?

We don’t know. There is lack of knowledge!

But is our lack of knowledge “absolute”? (Do we have absolutely no idea what
their height could be?)

Well, we do know that the next student is most likely taller than the shortest
man on record and shorter than the tallest man on record (otherwise, let’s
call Guinness World Records!).
In fact, we also do know that human heights have a typical distribution (very
short and very tall less common, etc.). We can see these as graphs of
“relative likelihoods” of heights.

We could even make some “guesses” based on whatever information we have
(based on observation, experience, statistics).

Can you spot examples of such guesses around you?
Using Probability & Statistics …to make design decisions based on educated guesses…

Examples

Why don’t we make doors this size?
- The lecture hall doors are of a width and height that allow most humans to
pass through comfortably (of course, making them too big would be a waste
of resources).
- Your seats are designed with human dimensions in mind!

- Probability and Statistics help us make smart guesses about uncertain events.
- Based on these smart guesses we can plan, design, or take steps to better
control the situation.
How Do We Assign Probabilities?

- Through repeated experimentation/observation (i.e., via collected statistics).
- Through some belief!
A Tale of Two Philosophies

Frequentist
- Probabilities can be found (in principle) by a repeatable objective process
(and are thus ideally devoid of opinion).
- Inferences are based on data only.
- Hypotheses are tested and declared true or false.
- In practice: take as much data as you can, and use it to make an educated guess.

Bayesian
- Probability expresses a degree of belief in an event. The degree of belief may
be based on prior knowledge about the event, such as the results of previous
experiments, or on personal beliefs about the event.
- Inferences are based on both data and prior beliefs.
- Hypotheses are tested and assigned a probability of being true or false.
- In practice: make a first guess about what to expect, then update the guess
based on new data.

Will be clearer with formulae.
Bayesian Approach Naturally Allows “Learning Cycles”.
Assigning Probabilities – Four Steps

1. Clearly define your experiment…
- Rolling a six-sided die
- Tossing a coin
- Picking a student at random from a class

2. Define the Sample Space…
- Set of all distinct possibilities.
- For rolling a die, Ω = {1, 2, 3, 4, 5, 6}

3. Define Events (of interest)…
- Event = some outcome of interest.
- Event = a subset of Ω (could even be ∅ or Ω)
- 𝐴 = {𝜔 ∈ Ω ∶ 𝜔 satisfies some conditions}

4. Assign probabilities (real numbers) to events such that the assignment
scheme (the “probability measure”) makes sense (i.e., satisfies some axioms),
as sketched in the example below.
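The four steps can be mirrored in a few lines of Python. A minimal sketch for a
single coin toss, assuming (purely for illustration) a biased coin with
𝑃[heads] = 0.6; the helper P is ours, not from the slides:

# Step 1: the experiment is tossing a coin once.
# Step 2: the sample space.
omega = {"H", "T"}

# Step 3: events of interest are subsets of omega (including the empty set and omega).
heads = {"H"}

# Step 4: a probability measure on the simple events
# (non-negative, summing to 1 -- see the axioms below).
p = {"H": 0.6, "T": 0.4}   # assumed bias, for illustration only

def P(event):
    """Probability of an event = sum over its sample points."""
    return sum(p[w] for w in event)

print(P(heads))   # 0.6
print(P(set()))   # 0.0 (null event)
print(P(omega))   # 1.0 (sure event)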
Axioms of Probability (Kolmogorov)

- Assigned probabilities should not be negative.
- Assigned probabilities should add up to 1.
- Probabilities assigned to mutually exclusive events (events that cannot occur
together) should make sense: they should simply add up.

For the case of an infinite number of possible events, Axiom 3 is replaced by
Axiom 4 (countable additivity).
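The formal statements (shown as formulas on the original slides) are the
standard Kolmogorov axioms; a reconstruction in the slides’ P[·] notation:

Axiom 1: 𝑃[𝐴] ≥ 0 for every event 𝐴
Axiom 2: 𝑃[Ω] = 1
Axiom 3: 𝑃[𝐴 ∪ 𝐵] = 𝑃[𝐴] + 𝑃[𝐵] whenever 𝐴 ∩ 𝐵 = ∅
Axiom 4: 𝑃[𝐴1 ∪ 𝐴2 ∪ ⋯] = 𝑃[𝐴1] + 𝑃[𝐴2] + ⋯ for pairwise mutually exclusive 𝐴1, 𝐴2, …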
Formally speaking…

[Diagrams: a sample space Ω containing sample points 𝜔1, …, 𝜔5, and a
probability measure 𝑃 assigning numbers to its events.]
Events - Some Terminology and Results

Sample Space (Ω)
- Set of all possible distinct outcomes
- 𝜔𝑖 = Sample Point

Simple Event
- An event containing only one sample point
- E.g., 𝐴 = {𝜔3} is a simple event

Compound Event
- An event containing more than one sample point
- E.g., 𝐵 = {𝜔1, 𝜔3} is a compound event

Null Event
- An event containing no sample points
- E.g., 𝐶 = { } is a null event
Sure Event
- An event consisting of all the sample points
- E.g., 𝐶 = Ω is a sure event

Equally Likely Events
- Events having the same probability of occurring
- E.g., if 𝑃[𝐴] = 𝑃[𝐵] then events 𝐴 and 𝐵 are equally likely

Mutually Exclusive Events
- Events that cannot occur at the same time
- E.g., if 𝐴 = {𝜔1, 𝜔2} and 𝐵 = {𝜔3, 𝜔4}, then clearly 𝐴 and 𝐵 cannot occur
at the same time.
Events - Some Terminology and Results

de Morgan’s Laws
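The laws themselves (shown graphically on the slides) are the standard set
identities, stated here for reference:

(𝐴 ∪ 𝐵)ᶜ = 𝐴ᶜ ∩ 𝐵ᶜ
(𝐴 ∩ 𝐵)ᶜ = 𝐴ᶜ ∪ 𝐵ᶜ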
Joint Probability

We are often interested in finding the probability of two events occurring at
the same time. This is called “Joint Probability”.

𝑃[𝐴 𝑎𝑛𝑑 𝐵] = 𝑃[𝐴 ∩ 𝐵] = 𝑃[𝐴, 𝐵]

e.g., 𝐴 = {Ali is in lecture}, 𝐵 = {Ali is sleeping}
𝑃[𝐴 𝑎𝑛𝑑 𝐵] = probability that Ali is in lecture and sleeping

𝐶 = {die shows even number}, 𝐷 = {die shows greater than 3}
𝑃[𝐶 𝑎𝑛𝑑 𝐷] = probability of getting an even number above 3 (i.e., 4 or 6)
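For the die events, the joint probability is easy to verify by enumeration; a
minimal Python sketch, assuming a fair die (which the slide leaves implicit):

from fractions import Fraction

omega = set(range(1, 7))                    # fair six-sided die (assumed)
C = {w for w in omega if w % 2 == 0}        # die shows an even number
D = {w for w in omega if w > 3}             # die shows greater than 3

P = lambda E: Fraction(len(E), len(omega))  # equally likely outcomes
print(P(C & D))                             # P[C and D] = 2/6 = 1/3 (outcomes 4 and 6)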
Conditional Probability

Sometimes knowledge of one random event can help us assign probability to
another random event. For this, the concept of “conditional probability”
comes in handy.

𝐴 = {Ali is in lecture at 8 am}
𝐵 = {Ali is sleeping at 8 am}

Suppose I tell you that 𝐴 has occurred (i.e., Ali is in lecture at 8 am); now
what is the probability that he is sleeping at 8 am?

𝑃[𝐵 𝑔𝑖𝑣𝑒𝑛 𝐴] = 𝑃[𝐵|𝐴]

Not zero, but rather low (as students do sometimes fall asleep in my class).
Understanding Conditional Probability

- Suppose I tell you that I’ve written an integer from 1 to 4 on a piece of
paper, but do not tell you the number.

- Under the assumption (“belief”) that I could have picked any of the four
numbers with equal chances, what is the probability that I wrote 3?

𝐴 = {3},   𝑃[𝐴] = 1/4

- Suppose I tell you now (additional information) that I wrote an odd number.
Now what is the probability that I wrote 3?

New information: 𝐵 = {1, 3} has occurred!

𝑃[𝐴 𝑔𝑖𝑣𝑒𝑛 𝐵] = 𝑃[𝐴|𝐵] = 1/2

What just happened? Let’s look at the sample space.

Sample space before the additional information: Ω = {1, 2, 3, 4}
Sample space after the additional information: Ωnew = {1, 3}

Clearly, a relevant piece of information has shrunk the sample space, leading
to a more accurate assignment of probability in light of the new knowledge.
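The “shrinking sample space” view translates directly into code; a small sketch
(the uniform choice over {1, 2, 3, 4} is the slide’s stated assumption):

from fractions import Fraction

omega = {1, 2, 3, 4}                 # equally likely by assumption
A = {3}                              # "I wrote 3"
B = {1, 3}                           # new information: "the number is odd"

P = lambda E: Fraction(len(E), len(omega))
print(P(A))                          # P[A] = 1/4

# Conditioning = shrinking the sample space to B and renormalizing.
P_given_B = lambda E: Fraction(len(E & B), len(B))
print(P_given_B(A))                  # P[A|B] = 1/2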
Understanding Conditional Probability

Q. What if the additional information I give you is “I wrote a positive integer”,
or “I had cake for breakfast”?

𝐶 = {wrote a positive integer}
𝐷 = {had cake for breakfast}

Clearly…
𝑃[𝐴|𝐶] = 𝑃[𝐴]
𝑃[𝐴|𝐷] = 𝑃[𝐴]

Why? Note that the two new pieces of information are quite useless/irrelevant
in the sense that they fail to shrink the sample space (Ωnew = Ω)!
Linking Conditional and Joint Probabilities

𝑃[𝐵|𝐴] = 𝑃[𝐴 ∩ 𝐵] / 𝑃[𝐴]     (defined only when 𝑃[𝐴] ≠ 0)

Q. Apart from the mathematical reason (division by zero), can you think of a
logical reason for this restriction?

A. Note that if we say that event 𝐴 is impossible (𝑃[𝐴] = 0), then the question
“find the probability of 𝐵 given that 𝐴 has occurred” is illogical to begin with,
since 𝐴 could never have occurred!
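As a quick numeric check using the earlier die events (fair die assumed):

𝑃[𝐷|𝐶] = 𝑃[𝐶 ∩ 𝐷] / 𝑃[𝐶] = (1/3) / (1/2) = 2/3

i.e., given that the die shows an even number, two of the three equally likely
possibilities {2, 4, 6} are greater than 3.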
Statistical Independence (Independent Events)

When knowledge of one event does not change the probability of another
event, we say that the two are statistically independent.

We already saw an example of this…

𝐶 = {wrote a positive integer}

𝑃[𝐴|𝐶] = 𝑃[𝐴]

𝐴 and 𝐶 are independent, since knowledge of 𝐶 does not change the
probability of 𝐴: the sample space does not shrink (Ωnew = Ω).

So, in other words, two events are statistically independent when knowledge
of one adds no useful/new information towards the possibility of the other.
In fact, this is one of the primary ways of checking independence:

Test 1: 𝑃[𝐴|𝐵] = 𝑃[𝐴]          i.e., the conditional probability is the same as
the unconditional probability.

Test 2: 𝑃[𝐴 ∩ 𝐵] = 𝑃[𝐴] 𝑃[𝐵]    i.e., the joint probability is separable.

Where do we get Test 2 from?
Recall 𝑃[𝐴|𝐵] = 𝑃[𝐴 ∩ 𝐵] / 𝑃[𝐵].
Setting 𝑃[𝐴|𝐵] = 𝑃[𝐴] (from Test 1) leads to Test 2.
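Both tests are easy to run by enumeration. A sketch on the fair-die sample
space (the particular events are chosen here purely for illustration):

from fractions import Fraction

omega = set(range(1, 7))                    # fair die (assumed)
P = lambda E: Fraction(len(E), len(omega))

def independent(A, B):
    """Test 2: is the joint probability separable?"""
    return P(A & B) == P(A) * P(B)

A = {2, 4, 6}                               # die shows even
B = {1, 2, 3, 4}
D = {4, 5, 6}                               # die shows greater than 3

print(independent(A, B))   # True:  P[A and B] = 1/3 = (1/2)(2/3)
print(independent(A, D))   # False: P[A and D] = 1/3, not (1/2)(1/2)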
Some Examples and Consequences of the Axioms

Given a sample space Ω containing six simple events, can you find 𝑃[𝜔1]?

Is 𝑃[𝜔1] = 1/6?

No! In fact, we cannot claim this unless we have the information (or are ready
to assume) that all the simple events here are equally likely!

The equally likely assumption implies that
𝑃[𝜔1] = 𝑃[𝜔2] = ⋯ = 𝑃[𝜔6]

In this case we can say that 𝑃[𝜔1] = 1/6.
Another Example

Let’s say we have a sample space Ω containing four simple events, and I give
you the information that
𝑃[𝜔1] = 𝑃[𝜔2] = 𝑃[𝜔3] = 1/5

- Find 𝑃[𝜔4].
- Are the events in Ω equally likely?

From the axioms we do know that
𝑃[𝜔1] + 𝑃[𝜔2] + 𝑃[𝜔3] + 𝑃[𝜔4] = 𝑃[Ω] = 𝑃[sure event] = 1

This leads to
𝑃[𝜔4] = 1 − 1/5 − 1/5 − 1/5 = 2/5

Clearly, while 𝜔1, 𝜔2 and 𝜔3 are equally likely, 𝜔4 is not.

Establishing whether events are equally likely or not…
- In practice, we either assume events to be equally likely (“belief/experience”),
- Or we find out through other calculations that their probabilities are the
same (hence they are equally likely).
Probabilities of Mutually Exclusive and Overlapping Events

𝒂𝒏𝒅 vs. 𝒐𝒓

- 𝐴 = {𝜔1, 𝜔2} and 𝐵 = {𝜔3, 𝜔4} are compound events.
- Clearly, the two cannot occur at the same time (since 𝐴 ∩ 𝐵 = ∅).
- 𝑃[𝐴 𝑎𝑛𝑑 𝐵] = 𝑃[𝐴 ∩ 𝐵] = 0
- Such events, we’ve already defined as being “Mutually Exclusive”.

But what about 𝑃[𝐴 ∪ 𝐵]?
- From the third axiom we know that for mutually exclusive events we must have:
𝑃[𝐴 𝑜𝑟 𝐵] = 𝑃[𝐴 ∪ 𝐵] = 𝑃[𝐴] + 𝑃[𝐵]

Interesting question: can mutually exclusive events be independent?
No! Knowledge of one occurring immediately changes the probability of the
other to zero: in the example shown, 𝑃[𝐴] = 𝑃[𝜔1] + 𝑃[𝜔2] > 0, but 𝑃[𝐴|𝐵] = 0.
What if 𝐴 and 𝐵 are overlapping, i.e., 𝐴 ∩ 𝐵 ≠ ∅? What about 𝑃[𝐴 𝑜𝑟 𝐵] now?

𝑃[𝐴 ∪ 𝐵] = 𝑃[𝐴] + 𝑃[𝐵] − 𝑃[𝐴 ∩ 𝐵]

The last term removes the overlapping part (that was counted twice). In a
sense, it converts the two overlapping events into mutually exclusive ones:
𝐴 − (𝐴 ∩ 𝐵) and 𝐵, or equivalently 𝐴 and 𝐵 − (𝐴 ∩ 𝐵).

Note that for the mutually exclusive case, the formula reduces to the axiom
𝑃[𝐴 ∪ 𝐵] = 𝑃[𝐴] + 𝑃[𝐵] (since the last term 𝑃[𝐴 ∩ 𝐵] becomes zero).
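A quick enumeration check of the formula on the fair-die space (events chosen
for illustration):

from fractions import Fraction

omega = set(range(1, 7))
P = lambda E: Fraction(len(E), len(omega))

A = {2, 4, 6}                  # even
B = {4, 5, 6}                  # greater than 3

lhs = P(A | B)                 # P[A or B], computed directly
rhs = P(A) + P(B) - P(A & B)   # the formula above
print(lhs, rhs, lhs == rhs)    # 2/3 2/3 True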
Another look at 𝑃[𝐴 𝑜𝑟 𝐵]:

𝑃[𝐴 ∪ 𝐵] ≤ 𝑃[𝐴] + 𝑃[𝐵]

Equality holds for mutually exclusive events!
Some other interesting consequences of the axioms… (for any events 𝐴, 𝐵)
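The slide’s list here was graphical; standard consequences of the axioms
include, for instance:

𝑃[𝐴ᶜ] = 1 − 𝑃[𝐴]
𝑃[∅] = 0
𝑃[𝐴] ≤ 1
If 𝐴 ⊆ 𝐵, then 𝑃[𝐴] ≤ 𝑃[𝐵]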
A Smart Use of Conditional Probabilities … Bayes’ Theorem

- Suppose that a certain system would fail if two of its distinct components
𝛼 and 𝛽 both fail.

- If the probability that 𝛼 fails is 0.01, the probability that 𝛽 fails is 0.005, and
the probability that 𝛽 fails if 𝛼 has failed is 0.015, find the probability that 𝛼
will fail if 𝛽 has failed.

Always good to define events first…
𝐴 = {𝛼 fails}
𝐵 = {𝛽 fails}

In terms of these events, what we are given is:
𝑃[𝐴] = 0.01, 𝑃[𝐵] = 0.005, 𝑃[𝐵|𝐴] = 0.015

And we have to find: 𝑃[𝐴|𝐵] = ?

We’ll use the previously discussed link between conditional and joint probabilities:
𝑃[𝐴|𝐵] = 𝑃[𝐴 ∩ 𝐵] / 𝑃[𝐵]     … (⋆)

Though we do not have 𝑃[𝐴 ∩ 𝐵], we can derive it by using the conditional in reverse:
𝑃[𝐵|𝐴] = 𝑃[𝐴 ∩ 𝐵] / 𝑃[𝐴]  ⇒  𝑃[𝐴 ∩ 𝐵] = 𝑃[𝐵|𝐴] 𝑃[𝐴]

Plugging this into (⋆) gives
𝑃[𝐴|𝐵] = 𝑃[𝐵|𝐴] 𝑃[𝐴] / 𝑃[𝐵]

We can now plug in all the given values to get 𝑃[𝐴|𝐵] = 0.03.
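Spelled out with the given values:

𝑃[𝐴|𝐵] = 𝑃[𝐵|𝐴] 𝑃[𝐴] / 𝑃[𝐵] = (0.015 × 0.01) / 0.005 = 0.00015 / 0.005 = 0.03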
𝑃[𝐴|𝐵] = 𝑃[𝐵|𝐴] 𝑃[𝐴] / 𝑃[𝐵]

- The formulation we got in the previous example is in fact one of the most
important results in probability, called Bayes’ Theorem.
- It forms the foundation of Bayesian Statistics.
- Also, the idea is used heavily in learning systems, since the relation can be
seen as an “update” equation for the probability of event 𝐴.
“Learning” Probabilities

𝑃[𝐴|𝐵] = 𝑃[𝐵|𝐴] 𝑃[𝐴] / 𝑃[𝐵]

- 𝑃[𝐴] is the initial probability of the event (assumed or based on past data).
- 𝑃[𝐵|𝐴] and 𝑃[𝐵] capture the statistics of the new data and its statistical
relevance to the event of interest.
- 𝑃[𝐴|𝐵] is the learned/updated probability of the event given the new
data/information.
- This enables iterative “learning” as new data arrives.
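A small sketch of such a learning cycle (a hypothetical two-hypothesis example,
not from the slides): let 𝐴 = “the coin is biased with 𝑃[heads] = 0.8”, with the
alternative being a fair coin, and update 𝑃[𝐴] after each observed toss. The
denominator 𝑃[𝐵] is computed by total probability (covered shortly).

p_A = 0.5                                   # initial P[A] (assumed prior)

def update(p_A, toss):
    """One Bayes step: P[A|toss] = P[toss|A] P[A] / P[toss]."""
    like_A = 0.8 if toss == "H" else 0.2    # P[toss | biased coin]
    like_not = 0.5                          # P[toss | fair coin]
    p_toss = like_A * p_A + like_not * (1 - p_A)   # total probability
    return like_A * p_A / p_toss

for toss in ["H", "H", "T", "H"]:           # new data arriving one piece at a time
    p_A = update(p_A, toss)                 # today's posterior = tomorrow's prior
    print(round(p_A, 3))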
Bayes’ Theorem – Formal Definition

Let {𝐴𝑗} be a collection of events that are mutually exclusive and whose union
spans Ω (i.e., a partition of Ω).
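The statement itself appeared as a formula on the slide; the standard form, in
the slides’ notation, is:

𝑃[𝐴𝑗|𝐵] = 𝑃[𝐵|𝐴𝑗] 𝑃[𝐴𝑗] / 𝑃[𝐵],  where 𝑃[𝐵] = Σ𝑘 𝑃[𝐵|𝐴𝑘] 𝑃[𝐴𝑘]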
Another Smart Use of Conditional Probabilities … Total Probability

- Suppose your team is to play a PSL match against one of three teams:
Quetta Gladiators, Peshawar Zalmi, or Lahore Qalandars.

- You reckon your chances of winning against Zalmi are 0.05, against
Gladiators 0.04, and against Qalandars 0.1.

- What is your probability of winning if your opponent is chosen randomly
with equal probabilities?

Let’s call our team Topi Drama, and define some events:
𝑊 = {Drama wins}
𝑍 = {Zalmi chosen}
𝐺 = {Gladiators chosen}
𝑄 = {Qalandars chosen}

Clearly, 𝑍, 𝐺, and 𝑄 are mutually exclusive!

In terms of these events, we are given:
𝑃[𝑊|𝑍] = 0.05, 𝑃[𝑊|𝐺] = 0.04, 𝑃[𝑊|𝑄] = 0.1
and
𝑃[𝑍] = 𝑃[𝐺] = 𝑃[𝑄] = 1/3

And we have to find: 𝑃[𝑊] = ?   How do we solve this?
Let’s look at matters logically…

𝑊 = (You play Zalmi and win) or (You play Gladiators and win) or
(You play Qalandars and win)

Note that these three events are mutually exclusive! Recall that for mutually
exclusive events 𝑃[𝐴 𝑜𝑟 𝐵] = 𝑃[𝐴 ∪ 𝐵] = 𝑃[𝐴] + 𝑃[𝐵], so

𝑃[𝑊] = 𝑃[𝑍 ∩ 𝑊] + 𝑃[𝐺 ∩ 𝑊] + 𝑃[𝑄 ∩ 𝑊]

Or, using the link between joint and conditional probabilities,
i.e., 𝑃[𝐴 ∩ 𝐵] = 𝑃[𝐵|𝐴] 𝑃[𝐴]:

𝑃[𝑊] = 𝑃[𝑊|𝑍] 𝑃[𝑍] + 𝑃[𝑊|𝐺] 𝑃[𝐺] + 𝑃[𝑊|𝑄] 𝑃[𝑄]

𝑃[𝑊] = 0.05 × 1/3 + 0.04 × 1/3 + 0.1 × 1/3 = 0.19/3 ≈ 0.0633

This method of calculating the probability of an event by splitting it into all
possible conditional events is in fact a very useful result called “Total Probability”.
Total Probability – Formal Definition

The probability of an event is split into conditionals (or joint probabilities)
over a partition of Ω.

Logic: since the 𝐴𝑗’s are mutually exclusive and exhaustive, any 𝐵 ⊆ Ω must
necessarily be made up of disjoint overlaps with some (or all) of the 𝐴𝑗’s.
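In the slides’ notation, for a partition {𝐴𝑗} of Ω:

𝑃[𝐵] = Σ𝑗 𝑃[𝐵 ∩ 𝐴𝑗] = Σ𝑗 𝑃[𝐵|𝐴𝑗] 𝑃[𝐴𝑗]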
We have previously defined several spaces (Topological Space, Vector Space,
Metric Space, Inner Product Space…).

Let us now formally define the Probability Space.
Probability Space: A Set, A Field, and A Measure

(Ω, 𝐹, 𝑃)

- Ω: a set of all possibilities (the Sample Space),

- 𝐹: a collection of subsets (events) of Ω that allows closure under several set
operations (a 𝜎-field),

- 𝑃: a measure that assigns values to the members of 𝐹 following the axioms
of probability (the Probability Measure).
- These closure requirements (under union, intersection, etc.) ensure that
probability measure assignments make sense.
- E.g., the probability axioms require 𝑃[𝐴 ∪ 𝐴ᶜ] = 𝑃[𝐴] + 𝑃[𝐴ᶜ] = 1, but if
only 𝐴 ∈ 𝐹 while 𝐴ᶜ is not included in 𝐹, things will become problematic.
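For reference, the standard requirements on a 𝜎-field 𝐹 are:

1. Ω ∈ 𝐹
2. If 𝐴 ∈ 𝐹, then 𝐴ᶜ ∈ 𝐹 (closure under complement)
3. If 𝐴1, 𝐴2, … ∈ 𝐹, then 𝐴1 ∪ 𝐴2 ∪ ⋯ ∈ 𝐹 (closure under countable unions)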
With the basic notions covered, we are now ready to embrace uncertainty by
modeling it precisely…
Embracing Uncertainty/Variation… Prefix: Random

Deterministic → Random
- Event → Random Event
- Variable → Random Variable
- Process → Random Process

Events
- Moon passing between the earth and the sun (deterministic)
- Earthquake (random)
- The next student who enters being taller than me (random)

Random Event
- An outcome of an uncertain happening/experiment.
Embracing Uncertainty/Variation… Prefix: Random

- A placeholder that holds data or helps give


relations in algebraic form.
Variable
- 𝑎 = 𝜋𝑟 2
- 𝑋 = height of tallest person on record To emphasize that 𝑟 is based on a random
event, we sometimes write it as 𝑟(𝜔).

- A variable that takes numeric values based 𝜔 = result of a die roll


Random Variable on a random event, and has an associated Ω, 𝐹, 𝑃 = Probability space of 𝜔
probability space (Ω, 𝐹, 𝑃).
- 𝑋 = height of next student entering the
room
- Consider a circle whose radius is decided
through a random event (e.g., roll of a die),
then both the radius (𝑟) and area (𝑎) of the
circle are random variables.

ES691 - Mathematics for Machine Learning / Dr. Naveed R. Butt @ GIKI - FES 169
Recall… (we previously saw a random variable as a function/mapping)

A random variable is a mapping from the probability space to the real number line.
Back to the radius example… 𝑟(𝜔) = radius of a circle decided through the outcome of a random event 𝜔 coming from a probability space.

- We note that the random variable 𝑟(𝜔) could perform the mapping in many ways.
- E.g., let 𝜔 be the number that shows up upon rolling a die; then (see the sketch below):

Mapping 1: 𝑟(𝜔) = 𝜔
𝑟(𝜔) = 1, 2, 3, 4, 5, 6 for 𝜔 = 1, 2, 3, 4, 5, 6 respectively

Mapping 2: 𝑟(𝜔) = 𝜔²
𝑟(𝜔) = 1, 4, 9, 16, 25, 36 for 𝜔 = 1, 2, 3, 4, 5, 6 respectively

Mapping 3: with 𝐴 = {𝜔: 𝜔 even} = {2, 4, 6} and 𝐵 = {𝜔: 𝜔 odd} = {1, 3, 5},
𝑟(𝜔) = 1 if 𝜔 ∈ 𝐴, and 𝑟(𝜔) = 4 if 𝜔 ∈ 𝐵
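To make this concrete, here is a minimal Python sketch of the three mappings above, all built on the same die-roll experiment (the function names are ours, chosen for illustration):

```python
import random

# Sample space of the die roll: Ω = {1, 2, 3, 4, 5, 6}
OMEGA = [1, 2, 3, 4, 5, 6]

def r_identity(omega):
    """Mapping 1: r(ω) = ω."""
    return omega

def r_square(omega):
    """Mapping 2: r(ω) = ω²."""
    return omega ** 2

def r_parity(omega):
    """Mapping 3: r(ω) = 1 if ω is even, 4 if ω is odd."""
    return 1 if omega % 2 == 0 else 4

omega = random.choice(OMEGA)  # one roll of a fair die
print(omega, r_identity(omega), r_square(omega), r_parity(omega))
```

All three random variables share the same underlying probability space; only the mapping differs.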
Random Variable As a Mapping
- Thus, from the same random experiment, we can create infinitely many different random variables (mappings).
- Of course, in practice, we assign the mapping that is most meaningful/useful for us.

Random Variable vs. Random Event
- A random variable takes a numeric value based on a random event.
- A random event can be non-numeric (descriptive); a random variable cannot.
- A random event can always be converted into a random variable through a mapping.
𝑍 = Next student who enters the room, is he from the Swabi region? Not a random variable yet.

Possible outcomes (descriptive) → Possible outcomes (numeric)
Yes → 1
No → 0

𝑍 = 1 if the next student entering the room is from Swabi
𝑍 = 0 if the next student entering the room is not from Swabi

Now 𝑍 is a random variable!!
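A minimal sketch of this indicator-style mapping (assuming, purely for illustration, that we can draw the descriptive outcome at random):

```python
import random

# Descriptive outcomes of the random event
outcomes = ["Yes", "No"]

def Z(outcome):
    """Map the descriptive outcome to a number: Yes -> 1, No -> 0."""
    return 1 if outcome == "Yes" else 0

omega = random.choice(outcomes)  # placeholder draw; the true probabilities are unknown
print(omega, "->", Z(omega))
```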
Realization
Once the outcome of a random experiment is available (or assumed), the value of the random variable becomes fixed, and is called a realization.

Let 𝑟(𝜔) be the radius of a circle chosen based on 𝜔, the number that shows up upon rolling a die. Then the mapping

𝑟(𝜔) = 𝜔

gives the following possible realizations for the radius and the area of the circle:

𝑟 ∈ {1, 2, 3, 4, 5, 6}
𝑎 ∈ {𝜋, 4𝜋, 9𝜋, 16𝜋, 25𝜋, 36𝜋}
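A short sketch of drawing one realization of the radius and the corresponding area:

```python
import math
import random

omega = random.randint(1, 6)  # outcome of the die roll
r = omega                     # realization of the radius, r(ω) = ω
a = math.pi * r ** 2          # corresponding realization of the area

print(f"ω = {omega}, r = {r}, a = {a:.3f}")
```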
Random Process: a collection of random variables.

𝑋 = height of next student entering the room → Random Variable

{𝑋1, 𝑋2, 𝑋3, 𝑋4} = heights of next four students entering the room → Random Process

[Figure: one realization of the random process observed in Lecture 1, and a different realization observed in Lecture 2.]
In general, a random process can be represented as

𝑋(𝜔, 𝑡), 𝑡 ∈ 𝑇

where the index set 𝑇 can be discrete-time or continuous-time, and the state (the values 𝑋 takes) can be discrete or continuous.

Examples (a small simulation of the heights process is sketched below)
- Discrete-State, Discrete-Time: number of siblings the next four students entering the room have.
- Continuous-State, Discrete-Time: heights of the next four students entering the room.
- Continuous-State, Continuous-Time: temperature of the room over a day.
- Discrete-State, Continuous-Time: number of students in a classroom over a day.
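As a quick illustration, here is a sketch that simulates two realizations of the heights process (the Normal(170, 8) model for heights in cm is our assumption, used only to have something to draw from):

```python
import numpy as np

rng = np.random.default_rng(0)

def realize_process(n_students=4):
    """One realization of {X1, X2, X3, X4}: heights (cm) of the next four students."""
    return rng.normal(loc=170.0, scale=8.0, size=n_students)

lecture1 = realize_process()  # realization of the random process in Lecture 1
lecture2 = realize_process()  # a different realization in Lecture 2

print("Lecture 1:", np.round(lecture1, 1))
print("Lecture 2:", np.round(lecture2, 1))
```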
Having defined probabilities and randomness, we now define functions that collect probabilities…
Discrete vs. Continuous Case
For collecting probabilities, we need to split the problem into two cases depending on whether the random variable 𝑋 is discrete-state or continuous-state.

The number of values the random variable 𝑋 takes can be
- Finite → discrete random variable
- Countably Infinite → discrete random variable
- Uncountably Infinite → continuous random variable

Discrete Random Variable → 𝑋 takes isolated values in 𝑅
- 𝑋 ∈ {1, 2, 3, 4, 5, 6} : Finite
- 𝑋 ∈ {1, 2, 3, 4, …} : Countably Infinite

Continuous Random Variable → 𝑋 takes a continuum of values in 𝑅
- 𝑋 ∈ 𝑅
- 𝑋 is a real number in the interval [0, 1]
We look at the discrete case first…

We collect the probabilities for a discrete random variable in the Probability Distribution Function 𝑝𝑋(𝑥) (also commonly called the probability mass function).
Being defined on probabilities, the Distribution Function 𝑝𝑋(𝑥) must satisfy the three axioms of probability, leading to the conditions

𝑝𝑋(𝑥𝑖) ≥ 0 — assigned probabilities cannot be negative

Σ∀𝑖 𝑝𝑋(𝑥𝑖) = 1 — the probabilities of all possible values of 𝑋 must add up to 1 (neither more nor less than 1)

𝑝𝑋(𝑥𝑖) = 𝑃[𝑋 = 𝑥𝑖] — 𝑝𝑋(𝑥) must represent probabilities
Let’s play …
• Let’s assume there is a bag with three balls in it, with numbers 1 – 3 written on them. You draw one ball at random (each equally likely to be picked).
• 𝑋 = number written on the ball (random variable)
• 𝑥 = 1, 2, 3 (possible values of 𝑋)
• 𝑝𝑋(𝑥) = 𝑃(𝑋 = 𝑥)

𝒙    𝒑𝑿(𝒙)
1    1/3
2    1/3
3    1/3

Note that 𝑝𝑋(𝑥) satisfies the three conditions:
1. It is always non-negative
2. It sums up to 1
3. It assigns a probability to each value of 𝑋
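A tiny sketch that encodes this distribution and verifies the three conditions numerically:

```python
from fractions import Fraction

# PMF of the three-balls example: p_X(x) = 1/3 for x = 1, 2, 3
p_X = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

assert all(p >= 0 for p in p_X.values())    # 1. non-negativity
assert sum(p_X.values()) == 1               # 2. probabilities sum to exactly 1
print({x: str(p) for x, p in p_X.items()})  # 3. a probability for each value of X
```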
We often find it useful to discuss whether a random variable lies in a certain range…
E.g., what is the probability that the next student entering is shorter than me?

This is done with the help of the Cumulative Distribution Function (CDF)

𝐹𝑋(𝑥) = 𝑃[𝑋 ≤ 𝑥]    (the probability that 𝑋 takes a value less than or equal to 𝑥)

Clearly, the CDF of a discrete random variable is just the sum of the 𝑝𝑋 values for 𝑥𝑖 ≤ 𝑥:

𝐹𝑋(𝑥) = Σ_{𝑥𝑖 ≤ 𝑥} 𝑝𝑋(𝑥𝑖)
Properties of the CDF
- The CDF cannot be negative.
- The CDF is a non-decreasing function (as every new term added has to be non-negative).
Example (based on a PMF depicted on the slide, with 𝑝𝑋(𝑥1) = 1/3 and 𝑝𝑋(𝑥2) = 1/2):

𝑃[𝑋 ≤ 𝑥2] = sum of all these possibilities = 1/3 + 1/2 = 5/6
Let’s play …
• Let’s write 𝐹𝑋(𝑥) for the three-balls example

𝒙    𝒑𝑿(𝒙)    𝐹𝑋(𝑥) = 𝑃(𝑋 ≤ 𝑥)
1    1/3      𝐹𝑋(1) = 𝑝𝑋(1) = 1/3
2    1/3      𝐹𝑋(2) = 𝑝𝑋(1) + 𝑝𝑋(2) = 2/3
3    1/3      𝐹𝑋(3) = 𝑝𝑋(1) + 𝑝𝑋(2) + 𝑝𝑋(3) = 1

Note that 𝐹𝑋(𝑥) satisfies the conditions.

𝐹𝑋(0.5) = 𝑃[𝑋 ≤ 0.5] = 0
𝐹𝑋(−∞) = 𝑃[𝑋 ≤ −∞] = 0
𝐹𝑋(2.5) = 𝑃[𝑋 ≤ 2.5] = 𝑃[𝑋 ≤ 2] = 𝐹𝑋(2) = 2/3
𝐹𝑋(5) = 𝑃[𝑋 ≤ 5] = 𝑃[𝑋 ≤ 3] = 𝐹𝑋(3) = 1
𝐹𝑋(∞) = 𝑃[𝑋 ≤ ∞] = 𝑃[𝑋 ≤ 3] = 𝐹𝑋(3) = 1
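The same evaluations in a short sketch: the discrete CDF is just a running sum of the PMF, and it is a right-continuous step function, which is why 𝐹𝑋(2.5) = 𝐹𝑋(2):

```python
from fractions import Fraction

p_X = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

def F_X(x):
    """CDF of a discrete random variable: sum of p_X(x_i) over all x_i <= x."""
    return sum(p for xi, p in p_X.items() if xi <= x)

for x in (0.5, 1, 2, 2.5, 3, 5):
    print(f"F_X({x}) = {F_X(x)}")
# F_X(0.5) = 0, F_X(2.5) = F_X(2) = 2/3, F_X(5) = F_X(3) = 1
```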
Now let’s repeat the above for Continuous Random Variables…

Recall… Continuous Random Variable → 𝑋 takes a continuum of values in 𝑅
- 𝑋 ∈ 𝑅
- 𝑋 is a real number in the interval [0, 1]

Now, for a continuous random variable, it only makes sense to talk about the probability that it lies in some interval, say (𝑎, 𝑏], 𝑏 > 𝑎:

𝑃[𝑎 < 𝑋 ≤ 𝑏] = 𝐹𝑋(𝑏) − 𝐹𝑋(𝑎)

where 𝐹𝑋(𝑥) = 𝑃[𝑋 ≤ 𝑥] is the CDF of the continuous random variable 𝑋.

Why?
- If 𝑋 is continuous, it can take infinitely many values over a range.
- Clearly, the probability that 𝑋 takes any one value among these infinitely many values would have to be zero.
- i.e., for continuous random variables, 𝑃[𝑋 = 𝑥] = 0.

It’s kind of like: a line passes through a continuum of points and has a length, even though each point has no notion of length.
Distributions of Continuous Random Variables… Probability Density Function (PDF)

For a continuous random variable, we start here:

𝑃[𝑎 < 𝑋 ≤ 𝑏] = 𝐹𝑋(𝑏) − 𝐹𝑋(𝑎)

We could make the interval as small as desired. In fact, if the derivative of 𝐹𝑋(𝑥) exists, we could use it to define a Probability Density Function (PDF) 𝑓𝑋(𝑥) for the continuous random variable as

𝑓𝑋(𝑥) = d𝐹𝑋(𝑥)/d𝑥,  or equivalently  𝐹𝑋(𝑥) = ∫_{−∞}^{𝑥} 𝑓𝑋(𝑢) d𝑢

So, in the case of continuous random variables, we talk of the Probability Density Function (PDF), which can be used to find the probability that 𝑋 lies in a certain range.

e.g., for the range (−∞, 𝑥]:

𝑃[𝑋 ≤ 𝑥] = 𝐹𝑋(𝑥) = ∫_{−∞}^{𝑥} 𝑓𝑋(𝑢) d𝑢

And for the range (𝑎, 𝑏], 𝑏 > 𝑎:

𝑃[𝑎 < 𝑋 ≤ 𝑏] = 𝐹𝑋(𝑏) − 𝐹𝑋(𝑎) = ∫_{𝑎}^{𝑏} 𝑓𝑋(𝑢) d𝑢
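A numerical sketch of these relations for the uniform density on [0, 1] (chosen only because its integrals are easy to check by eye):

```python
import numpy as np

def f(x):
    """PDF of the uniform distribution on [0, 1]."""
    return np.where((x >= 0.0) & (x <= 1.0), 1.0, 0.0)

a, b = 0.2, 0.7
x = np.linspace(a, b, 100_001)
dx = x[1] - x[0]
prob = float(np.sum(f(x)) * dx)  # Riemann sum for ∫_a^b f_X(u) du

print(round(prob, 3))  # ≈ 0.5 = F_X(0.7) − F_X(0.2)
```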
Some Properties of 𝑓𝑋(𝑥)

𝑓𝑋(𝑥) ≥ 0 — since it is the derivative of a non-decreasing function (the CDF).

∫_{−∞}^{∞} 𝑓𝑋(𝑥) d𝑥 = 1 — since this represents the probability of the sure event.
- Q. If 𝑃[𝑋 = 𝑥] = 0 for a continuous random variable 𝑋, and 𝑓𝑋(𝑥) ≠ 𝑃[𝑋 = 𝑥], then what does 𝑓𝑋(𝑥) really represent?

Mathematically,

- It can be seen either as the derivative of the CDF,

- Or as the probability density (the spread of probability over a continuum) that gives the probability of 𝑋 lying in an interval when 𝑓𝑋(𝑥) is integrated over that interval.

Intuitively,

- It represents the “relative likelihood” of 𝑋 taking a certain value.

- E.g., consider the case of a continuous uniform distribution depicted here. The fact that 𝑓𝑋(0.5) = 𝑓𝑋(0.6) = 1 does not mean that the two points have 100% probability. Instead, it only means that they have the same “relative likelihood” of occurring.

- As another example, consider the Gaussian distribution depicted here. The values 𝑓𝑋(0.5) = 0.1 and 𝑓𝑋(0.7) = 0.2 only indicate that 𝑋 = 0.7 is twice as likely as 𝑋 = 0.5, in the relative, per-unit-length sense.
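The “density, not probability” point can be checked numerically: for a small interval of width δ, 𝑃[𝑥 < 𝑋 ≤ 𝑥 + δ] ≈ 𝑓𝑋(𝑥)·δ. A sketch for the standard normal (our choice, whose CDF we can write with the error function):

```python
import math

mu, sigma = 0.0, 1.0  # standard normal, purely for illustration

def f(x):
    """Gaussian PDF."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def F(x):
    """Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

x, delta = 0.5, 1e-4
exact = F(x + delta) - F(x)  # P[x < X <= x + δ]
approx = f(x) * delta        # density × interval width
print(exact, approx)         # nearly identical: density is probability per unit length
```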
All the Info We Need!
The Probability Distribution Function 𝑝𝑋(𝑥) (discrete case) and the Probability Density Function 𝑓𝑋(𝑥) (continuous case) are both very important, as they contain all the probability information about a random variable.

𝑋 ~ 𝑝𝑋(𝑥): Discrete Random Variable — our knowledge of the probabilities 𝑃[𝑋 = 𝑥]
𝑋 ~ 𝑓𝑋(𝑥): Continuous Random Variable — our knowledge of the relative likelihoods
From One Random Variable to Two — Bivariate Distributions

Just as, for probabilities, we were interested in the probability of two random events occurring together, 𝑃[𝐴 ∩ 𝐵], we can be interested in the joint distribution of two random variables 𝑋 and 𝑌.

Bivariate CDF (probability that 𝑋 ≤ 𝑥 and 𝑌 ≤ 𝑦):

𝐹𝑋𝑌(𝑥, 𝑦) = 𝑃[𝑋 ≤ 𝑥, 𝑌 ≤ 𝑦]

Bivariate Distribution Function (Discrete Case):

𝑝𝑋𝑌(𝑥, 𝑦) = 𝑃[𝑋 = 𝑥, 𝑌 = 𝑦], with 𝑝𝑋𝑌(𝑥, 𝑦) ≥ 0

Bivariate Density Function (Continuous Case):

𝑓𝑋𝑌(𝑥, 𝑦) = ∂²𝐹𝑋𝑌(𝑥, 𝑦) / ∂𝑥 ∂𝑦
Bivariate Distributions — Important Properties and Relations

Marginal Distribution of 𝑌: the distribution of 𝑌 extracted from the joint distribution/density,

𝑝𝑌(𝑦) = Σ_{𝑥} 𝑝𝑋𝑌(𝑥, 𝑦)   (discrete case)
𝑓𝑌(𝑦) = ∫_{−∞}^{∞} 𝑓𝑋𝑌(𝑥, 𝑦) d𝑥   (continuous case)

Marginal CDFs:

𝐹𝑋(𝑥) = 𝐹𝑋𝑌(𝑥, ∞),  𝐹𝑌(𝑦) = 𝐹𝑋𝑌(∞, 𝑦)
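A sketch of marginalization for a discrete joint PMF stored as a table (the numbers are made up for illustration):

```python
import numpy as np

# Joint PMF p_XY over x ∈ {0, 1} (rows) and y ∈ {0, 1, 2} (columns)
p_XY = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
assert np.isclose(p_XY.sum(), 1.0)

p_X = p_XY.sum(axis=1)  # marginal of X: sum the joint over y
p_Y = p_XY.sum(axis=0)  # marginal of Y: sum the joint over x
print("p_X =", p_X)     # [0.4 0.6]
print("p_Y =", p_Y)     # [0.35 0.35 0.3]
```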
Conditional Distributions

Just as for two events we had the conditional probability 𝑃[𝐴|𝐵], for two random variables we have conditional distributions:

𝑝𝑋|𝑌(𝑥|𝑦) = 𝑝𝑋𝑌(𝑥, 𝑦) / 𝑝𝑌(𝑦)   (discrete case)
𝑓𝑋|𝑌(𝑥|𝑦) = 𝑓𝑋𝑌(𝑥, 𝑦) / 𝑓𝑌(𝑦)   (continuous case)

Compare with:

𝑃[𝐴|𝐵] = 𝑃[𝐴 ∩ 𝐵] / 𝑃[𝐵]

with properties analogous to ordinary distributions (non-negative; summing/integrating to 1 over 𝑥 for each fixed 𝑦).
Independence of Two Random Variables

Just as for two events to be independent we had

𝑃[𝐴 ∩ 𝐵] = 𝑃[𝐴] 𝑃[𝐵],

for random variables we can check any of the following:

𝐹𝑋𝑌(𝑥, 𝑦) = 𝐹𝑋(𝑥) 𝐹𝑌(𝑦) ∀𝑥, 𝑦

𝑓𝑋𝑌(𝑥, 𝑦) = 𝑓𝑋(𝑥) 𝑓𝑌(𝑦) ∀𝑥, 𝑦
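A sketch of the independence check in the discrete case: independence holds exactly when the joint table equals the outer product of its marginals:

```python
import numpy as np

def is_independent(p_XY, tol=1e-12):
    """True iff p_XY(x, y) == p_X(x) * p_Y(y) for all x, y."""
    p_X = p_XY.sum(axis=1, keepdims=True)
    p_Y = p_XY.sum(axis=0, keepdims=True)
    return np.allclose(p_XY, p_X * p_Y, atol=tol)

p_ind = np.outer([0.4, 0.6], [0.5, 0.5])  # built as a product: independent
p_dep = np.array([[0.5, 0.0],
                  [0.0, 0.5]])            # mass on the diagonal: dependent

print(is_independent(p_ind))  # True
print(is_independent(p_dep))  # False
```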
From Two to Many Random Variables — Multivariate Distributions

Let 𝑋1, 𝑋2, …, 𝑋𝑀 be random variables defined on a probability space. Then we can define their multivariate Joint CDF as

𝐹𝑿(𝒙) = 𝑃[𝑋1 ≤ 𝑥1, 𝑋2 ≤ 𝑥2, …, 𝑋𝑀 ≤ 𝑥𝑀],  where 𝑿 = [𝑋1 𝑋2 … 𝑋𝑀] and 𝒙 = [𝑥1 𝑥2 … 𝑥𝑀]

𝑓𝑿(𝒙) and 𝑝𝑿(𝒙) can be defined analogously. HW!

Independence conditions may also be extended easily from two to many random variables.
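The multivariate CDF has a direct Monte-Carlo reading: 𝐹𝑿(𝒙) is the fraction of draws whose coordinates are all ≤ the corresponding entries of 𝒙. A sketch with M = 3 independent standard-normal coordinates (our choice, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(size=(100_000, 3))  # rows are draws of X = [X1 X2 X3]

def F_hat(x):
    """Monte-Carlo estimate of F_X(x) = P[X1 <= x1, X2 <= x2, X3 <= x3]."""
    return float(np.mean(np.all(samples <= x, axis=1)))

print(F_hat(np.array([0.0, 0.0, 0.0])))  # ≈ 0.5³ = 0.125 by independence
```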
Next time, we look at what key information we can extract from these distribution functions…
Questions?? Thoughts??
