Module-5 complete notes-Quantifying Uncertainty 20th February 2024
Dr AJ KAMESWARA PRASAD
PROFESSOR
Department of Acharya Institute of Technology
Bangalore
Copyright notice: Most examples and images displayed in the slides of this course are taken from [Russell &
Norvig, “Artificial Intelligence, a Modern Approach”, 3rd ed., Pearson], including explicitly figures from the
above-mentioned book, so their copyright is retained by the authors. The copyright of the few other materials
included is retained by AJK.
These slides cannot be displayed in public without the permission of the author. Following the fair use doctrine,
these slides shall be shared with students.
Fundamentals of Artificial Intelligence
Chapter 13: Quantifying Uncertainty
Module-5
Uncertain Knowledge and Reasoning: Quantifying Uncertainty: Acting
under Uncertainty, Basic Probability Notation, Inference using Full Joint
Distributions, Independence, Bayes’ Rule and its use, Wumpus World
Revisited.
Text Book 1: Chapter 13 - 13.1, 13.2, 13.3, 13.4, 13.5, 13.6
Outline
2 Basics on Probability
Acting Under Uncertainty
Agents often make decisions based on incomplete information
partial observability
nondeterministic actions
Partial solution (see previous chapters): maintain belief states
represent the set of all possible world states the agent might be in
generating a contingency plan handling every possible eventuality
Several drawbacks:
must consider every possible explanation for the observation (even
very-unlikely ones)
=⇒ impossibly complex belief-states
contingent plans handling every eventuality grow arbitrarily large
sometimes there is no plan that is guaranteed to achieve the
goal
Agent’s knowledge cannot guarantee a successful outcome ...
... but can provide some degree of belief (likelihood) on it
A rational decision depends on both the relative importance of (sub)goals and the likelihood that they will be
achieved
Acting Under Uncertainty: Example
Automated taxi to Airport
Goal: deliver a passenger to the airport on time
Action A_t: leave for airport t minutes before flight (e.g., A90 leaves 90 minutes before)
How can we be sure that A90 will succeed?
Too many sources of uncertainty:
partial observability (ex: road state, other drivers’ plans, etc.)
uncertainty in action outcome (ex: flat tire, etc.)
noisy sensors (ex: unreliable traffic reports)
complexity of modelling and predicting traffic
=⇒ With a purely logical approach it is difficult to anticipate everything that can go wrong
risks falsehood: “A25 will get me there on time”, or
leads to conclusions that are too weak for decision making:
“A25 will get me there on time if there’s no accident on the bridge, and it doesn’t rain, and my tires
remain intact, and...”
Acting Under Uncertainty: Example (2)
A medical diagnosis
Given the symptoms (toothache) infer the cause (cavity)
How to encode this relation in logic?
diagnostic rules:
Toothache → Cavity (wrong)
Toothache → (Cavity ∨ GumProblem ∨ Abscess ∨ ...) (too many possible causes, some very unlikely)
causal rules:
Cavity → Toothache (wrong)
(Cavity ∧ ...) → Toothache (many possible (con)causes)
Problems in specifying the correct logical rules:
Complexity: too many possible antecedents or consequents
Theoretical ignorance: no complete theory for the domain
Practical ignorance: no complete knowledge of the patient
Summarizing Uncertainty
Making Decisions Under Uncertainty
Probabilities Basics: an Artificial Intelligentish Introduction
Probabilistic assertions: state how likely possible worlds are
Sample space Ω: the set of all possible worlds
ω ∈ Ω is a possible world (aka sample point or atomic event)
ex: the dice roll (1,4)
the possible worlds are mutually exclusive and exhaustive
ex: the 36 possible outcomes of rolling two dice: (1,1), (1,2), ...
A probability model (aka probability space) is a sample space with an assignment P(ω) for
every ω ∈ Ω s.t.
0 ≤ P(ω) ≤ 1, for every ω ∈ Ω
Σ_{ω∈Ω} P(ω) = 1
Ex: 1-die roll: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
An Event A is any subset of Ω, s.t. P(A) = Σ_{ω∈A} P(ω)
events can be described by propositions in some formal language
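The definitions above can be tried out on the two-dice sample space. The following is a minimal sketch, assuming fair dice; the helper name prob and the two example events are illustrative choices, not part of the original notes.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 possible worlds for two dice.
omega = list(product(range(1, 7), repeat=2))

# Probability model: every world gets the same probability (fair-dice assumption).
P = {w: Fraction(1, len(omega)) for w in omega}

def prob(event):
    """P(A) = sum of P(w) over the worlds w in which the proposition holds."""
    return sum((P[w] for w in omega if event(w)), Fraction(0))

total_is_11 = prob(lambda w: w[0] + w[1] == 11)  # worlds (5,6) and (6,5)
doubles = prob(lambda w: w[0] == w[1])           # worlds (1,1) ... (6,6)

print(total_is_11)  # 1/18
print(doubles)      # 1/6
```

Exact fractions are used so that the two axioms (each P(ω) in [0, 1], all P(ω) summing to 1) can be checked without rounding error.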
Random Variables
Propositions and Probabilities
We think of a proposition a as the event A (set of sample points) where the proposition is true
Odd is a propositional random variable with range {true, false}
notation: a ⇐⇒ “A = true”
Given Boolean random variables A and B:
a: set of sample points where A(ω) = true
¬a: set of sample points where A(ω) = false
a ∧ b: set of sample points where A(ω) = true, B(ω) = true
=⇒ with Boolean random variables, sample points are propositional-logic (PL) models
Proposition: disjunction of the sample points in which it is true
ex: (a ∨ b) ≡ (¬a ∧ b) ∨ (a ∧ ¬b) ∨ (a ∧ b)
=⇒ P(a ∨ b) = P(¬a ∧ b) + P(a ∧ ¬b) + P(a ∧ b)
Some derived facts:
P(¬a) = 1 − P(a)
Probability Distributions
Probability Distribution gives the probabilities of all the possible values of a random variable
ex: Weather: {sunny, rain, cloudy, snow}
P(Weather = sunny) = 0.6
P(Weather = rain) = 0.1
P(Weather = cloudy) = 0.29
P(Weather = snow) = 0.01
⇐⇒ P(Weather) = ⟨0.6, 0.1, 0.29, 0.01⟩
normalized: their sum is 1
Joint Probability Distribution for multiple variables
gives the probability of every sample point
ex: P(Weather, Cavity) =
Weather =        sunny   rain   cloudy   snow
Cavity = true    0.144   0.02   0.016    0.02
Cavity = false   0.576   0.08   0.064    0.08
Every event is a sum of sample points,
=⇒ its probability is determined by the joint distribution
Probability for Continuous Variables
Express continuous probability distributions:
density functions f(x) ≥ 0 s.t. ∫_{−∞}^{+∞} f(x) dx = 1
P(x ∈ [a, b]) = ∫_a^b f(x) dx
=⇒ P(x ∈ [val, val]) = 0, P(x ∈ [−∞, +∞]) = 1
ex: P(x ∈ [20, 22]) = ∫_20^22 0.125 dx = 0.25
Density: P(x) = P(X = x) =def lim_{dx→0} P(X ∈ [x, x + dx])/dx
ex: P(20.1) = lim_{dx→0} P(X ∈ [20.1, 20.1 + dx])/dx = 0.125
note: P(v) ≠ P(x ∈ [v, v]) = 0
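The interval probabilities above can be checked numerically. This is a rough sketch that assumes the density is uniform on [18, 26] (an assumption made here so that f(x) = 0.125; the notes do not state the interval); uniform_density and integrate are hypothetical helper names.

```python
def uniform_density(x, lo=18.0, hi=26.0):
    """Uniform density on [lo, hi]; the interval is an assumption giving f(x) = 0.125."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint Riemann sum approximating the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

p_20_22 = integrate(uniform_density, 20.0, 22.0)  # P(x in [20, 22])
p_total = integrate(uniform_density, 10.0, 30.0)  # range covering the whole support
p_point = integrate(uniform_density, 20.1, 20.1)  # P(x in [v, v])

print(round(p_20_22, 6))         # 0.25
print(p_point)                   # 0.0
print(uniform_density(20.1))     # 0.125, the density value, not a probability
```

The last two lines illustrate the note above: the density at 20.1 is 0.125, while the probability of the degenerate interval [20.1, 20.1] is 0.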
Logic                        Probability
a                            P(a) = 1
¬a                           P(a) = 0
a → b                        P(b|a) = 1
(a, a → b) ⊢ b               P(a) = 1, P(b|a) = 1 =⇒ P(b) = 1
(a → b, b → c) ⊢ a → c       P(b|a) = 1, P(c|b) = 1 =⇒ P(c|a) = 1
Proof of P(b|a) = 1, P(c|b) = 1 =⇒ P(c|a) = 1:
P(b|a) = 1 =⇒ P(¬b, a) =def P(¬b|a)P(a) = 0
P(c|b) = 1 =⇒ P(¬c, b) =def P(¬c|b)P(b) = 0
P(¬c, a) = P(¬c, a, b) + P(¬c, a, ¬b) ≤ P(¬c, b) + P(a, ¬b) = 0
P(¬c|a) = P(¬c, a)/P(a) = 0
P(c|a) = 1 − P(¬c|a) = 1
(Courtesy of Maria Simi, UniPI)
Probabilistic Inference via Enumeration
Basic Ideas
Start with the joint distribution P(Toothache, Catch, Cavity )
For any proposition ϕ, sum the atomic events where ϕ is true: P(ϕ) = Σ_{ω : ω ⊨ ϕ} P(ω)
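The idea of summing atomic events can be sketched in a few lines. The full joint table is not reproduced in these notes, so the eight entries below are assumed from AIMA Figure 13.3 (they agree with the four toothache entries quoted in the marginalization example); the helper name prob is illustrative.

```python
# Full joint distribution P(Toothache, Catch, Cavity), keyed by
# (toothache, catch, cavity). Assumption: values as in AIMA Figure 13.3.
JOINT = {
    (True, True, True): 0.108,
    (True, False, True): 0.012,
    (True, True, False): 0.016,
    (True, False, False): 0.064,
    (False, True, True): 0.072,
    (False, False, True): 0.008,
    (False, True, False): 0.144,
    (False, False, False): 0.576,
}

def prob(phi):
    """P(phi): sum the atomic events in which the proposition phi is true."""
    return sum(p for world, p in JOINT.items() if phi(*world))

p_toothache = prob(lambda toothache, catch, cavity: toothache)
p_cavity_or_toothache = prob(lambda toothache, catch, cavity: cavity or toothache)
p_cavity_given_toothache = (
    prob(lambda toothache, catch, cavity: cavity and toothache) / p_toothache
)

print(round(p_toothache, 3))               # 0.2
print(round(p_cavity_or_toothache, 3))     # 0.28
print(round(p_cavity_given_toothache, 3))  # 0.6
```

Any proposition over the three variables can be passed to prob as a Boolean function, so marginal and conditional queries both reduce to the same enumeration.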
Probabilistic Inference via Enumeration: Example
Marginalization
Marginalization: Example
Start with the joint distribution P(Toothache, Catch, Cavity )
Marginalization (aka summing out):
sum up the probabilities for each possible value of the other variables:
P(Y) = Σ_{z∈Z} P(Y, z)
Ex: P(Toothache) = Σ_{z∈{Catch,Cavity}} P(Toothache, z)
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
P(¬toothache) = 1 − P(toothache) = 1 − 0.2 = 0.8
=⇒ P(Toothache) = ⟨0.2, 0.8⟩
Conditional Probability via Enumeration: Example
Normalization
Independence
Variables X and Y are independent iff P(X, Y ) = P(X )P(Y )
(or equivalently, iff P(X|Y ) = P(X ) or P(Y|X ) = P(Y ))
ex: P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
=⇒ e.g. P(toothache, catch, cavity, cloudy) = P(toothache, catch, cavity) P(cloudy)
typically based on domain knowledge
May drastically reduce the number of entries and computation
=⇒ ex: 32-element table decomposed into one 8-element and one 4-element table
Unfortunately, absolute independence is quite rare
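The product test for independence can be sketched on a small joint table. The example below uses the P(Weather, Cavity) table from the Probability Distributions slide, assuming its columns are ordered sunny, rain, cloudy, snow (the check itself does not depend on that order); the marginals are computed from the table, and the names are illustrative.

```python
WEATHER = ("sunny", "rain", "cloudy", "snow")

# Joint P(Weather, Cavity); the column order is an assumption.
JOINT = {
    ("sunny", True): 0.144, ("rain", True): 0.02,
    ("cloudy", True): 0.016, ("snow", True): 0.02,
    ("sunny", False): 0.576, ("rain", False): 0.08,
    ("cloudy", False): 0.064, ("snow", False): 0.08,
}

# Marginals obtained by summing out the other variable.
p_weather = {w: JOINT[(w, True)] + JOINT[(w, False)] for w in WEATHER}
p_cavity = {c: sum(JOINT[(w, c)] for w in WEATHER) for c in (True, False)}

def independent(tol=1e-9):
    """True iff P(w, c) = P(w) P(c) for every pair of values."""
    return all(
        abs(JOINT[(w, c)] - p_weather[w] * p_cavity[c]) < tol
        for w in WEATHER
        for c in (True, False)
    )

print(independent())  # True

# Table sizes for the four-variable example: one joint table vs. two factors.
full_entries = 2 * 2 * 2 * 4      # P(Toothache, Catch, Cavity, Weather)
factored_entries = 2 * 2 * 2 + 4  # P(Toothache, Catch, Cavity) and P(Weather)
print(full_entries, factored_entries)  # 32 12
```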
Conditional Independence
Variables X and Y are conditionally independent given Z iff P(X, Y|Z) = P(X|Z)P(Y|Z)
(or equivalently, iff P(X|Y, Z) = P(X|Z) or P(Y|X, Z) = P(Y|Z))
Consider P(Toothache, Cavity, Catch)
if I have a cavity, the probability that the probe catches in it doesn’t depend on whether I have
a toothache: P(catch|toothache, cavity ) = P(catch|cavity )
the same independence holds if I haven’t got a cavity:
P(catch|toothache, ¬cavity ) = P(catch|¬cavity )
=⇒ Catch is conditionally independent of Toothache given Cavity:
P(Catch|Toothache, Cavity ) = P(Catch|Cavity )
or, equivalently: P(Toothache|Catch, Cavity ) = P(Toothache|Cavity ), or
P(Toothache, Catch|Cavity ) = P(Toothache|Cavity )P(Catch|Cavity )
Hint: Toothache and Catch are two (mutually-independent) effects of the same cause
Cavity
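The two conditional-independence statements can be verified by enumeration. This is a sketch only: the joint table is assumed from AIMA Figure 13.3, since these notes do not reproduce it, and prob and cond are hypothetical helper names.

```python
# Full joint P(Toothache, Catch, Cavity), keyed by (toothache, catch, cavity).
# Assumption: values as in AIMA Figure 13.3.
JOINT = {
    (True, True, True): 0.108, (True, False, True): 0.012,
    (True, True, False): 0.016, (True, False, False): 0.064,
    (False, True, True): 0.072, (False, False, True): 0.008,
    (False, True, False): 0.144, (False, False, False): 0.576,
}

def prob(phi):
    """P(phi): sum of the atomic events where phi holds."""
    return sum(p for world, p in JOINT.items() if phi(*world))

def cond(phi, given):
    """P(phi | given) = P(phi and given) / P(given)."""
    return prob(lambda *w: phi(*w) and given(*w)) / prob(given)

results = {}
for cavity_value in (True, False):
    # P(catch | toothache, Cavity = cavity_value)
    with_toothache = cond(lambda t, c, cav: c,
                          lambda t, c, cav: t and cav == cavity_value)
    # P(catch | Cavity = cavity_value)
    without_toothache = cond(lambda t, c, cav: c,
                             lambda t, c, cav: cav == cavity_value)
    results[cavity_value] = (with_toothache, without_toothache)
    print(cavity_value, round(with_toothache, 3), round(without_toothache, 3))
# True 0.9 0.9
# False 0.2 0.2
```

In both rows the two numbers coincide, which is exactly the statement P(Catch|Toothache, Cavity) = P(Catch|Cavity).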
Conditional Independence [cont.]
In many cases, the use of conditional independence reduces the size of the representation
of the joint distribution dramatically
even from exponential to linear!
Ex:
P(Toothache, Catch, Cavity)
= P(Toothache|Catch, Cavity) P(Catch, Cavity)
= P(Toothache|Catch, Cavity) P(Catch|Cavity) P(Cavity)
= P(Toothache|Cavity) P(Catch|Cavity) P(Cavity)
=⇒ Passes from 7 to 2+2+1=5 independent numbers
P(Toothache, Catch, Cavity) contains 7 independent entries (the 8th can be obtained as 1 − Σ ...)
P(Toothache|Cavity), P(Catch|Cavity) contain 2 independent entries each (2 × 2 matrix, each
row sums to 1)
P(Cavity) contains 1 independent entry
General Case: if one cause has n independent effects:
P(Cause, Effect_1, ..., Effect_n) = P(Cause) Π_i P(Effect_i|Cause)
=⇒ reduces from 2^(n+1) − 1 to 2n + 1 independent entries
Exercise
Consider the joint probability distribution described in the table in previous section (slide 20
onwards): P(Toothache, Catch, Cavity )
Consider the example in previous slide:
P(Toothache, Catch, Cavity )
= P(Toothache|Catch, Cavity )P(Catch, Cavity )
= P(Toothache|Catch, Cavity )P(Catch|Cavity )P(Cavity )
= P(Toothache|Cavity )P(Catch|Cavity )P(Cavity )
Compute separately the distributions P(Toothache|Catch, Cavity ), P(Catch|Cavity ),
P(Cavity ), P(Toothache|Cavity ).
Recompute P(Toothache, Catch, Cavity) in two ways:
P(Toothache|Catch, Cavity) P(Catch|Cavity) P(Cavity)
P(Toothache|Cavity) P(Catch|Cavity) P(Cavity)
and compare the result with P(Toothache, Catch, Cavity)
Bayes’ Rule
Bayes’ Rule/Theorem/Law
Bayes’ rule: P(a|b) = P(a ∧ b)/P(b) = P(b|a)P(a)/P(b)
In distribution form: P(Y|X) = P(X|Y)P(Y)/P(X) = αP(X|Y)P(Y)
α =def 1/P(X): normalization constant to make P(Y|X) entries sum to 1
(different α’s for different values of X)
A version conditionalized on some background evidence e:
P(Y|X, e) = P(X|Y, e)P(Y|e)/P(X|e)
Using Bayes’ Rule: The Simple Case
Used to assess diagnostic probability from causal probability:
P(cause|effect) = P(effect|cause)P(cause)/P(effect)
P(cause|effect ) goes from effect to cause (diagnostic direction)
P(effect|cause) goes from cause to effect (causal direction)
Example
An expert doctor is likely to have causal knowledge ... P(symptoms|disease) (i.e., P(effect|cause))
... and needs to produce diagnostic knowledge P(disease|symptoms) (i.e., P(cause|effect))
Ex: let m be meningitis, s be stiff neck
P(m) = 1/50000, P(s) = 0.01 (prior knowledge, from statistics)
“meningitis causes a stiff neck in the patient in 70% of cases”: P(s|m) = 0.7 (doctor’s experience)
=⇒ P(m|s) = P(s|m)P(m)/P(s) = (0.7 · 1/50000)/0.01 = 0.0014
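The meningitis computation is a direct application of the rule, as the short sketch below shows; the function name bayes and its parameter names are illustrative.

```python
def bayes(p_effect_given_cause, p_cause, p_effect):
    """Diagnostic probability P(cause | effect) from causal knowledge."""
    return p_effect_given_cause * p_cause / p_effect

# m = meningitis, s = stiff neck, with the numbers given in the example.
p_m_given_s = bayes(p_effect_given_cause=0.7, p_cause=1 / 50000, p_effect=0.01)

print(round(p_m_given_s, 6))  # 0.0014
```

Even though a stiff neck is likely given meningitis, the posterior stays small because the prior P(m) is tiny compared with P(s).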
Using Bayes’ Rule: Combining Evidence
A naive Bayes model is a probability model that assumes the effects are conditionally
independent, given the cause
=⇒ P(Cause, Effect_1, ..., Effect_n) = P(Cause) Π_i P(Effect_i|Cause)
total number of parameters is linear in n
ex: P(Cavity, Toothache, Catch) = P(Cavity) P(Toothache|Cavity) P(Catch|Cavity)
Q: How can we compute P(Cause|Effect1, ..., Effectk )?
ex P(Cavity|toothache ∧ catch)?
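Under the naive Bayes assumption, the posterior follows by multiplying the prior with each observed effect's likelihood and normalizing: P(Cause|e_1, ..., e_k) = α P(Cause) Π_i P(e_i|Cause). A minimal sketch for the dental query is given below; the conditional probabilities are not stated in these notes and are assumed here as derived from the AIMA Figure 13.3 joint table.

```python
def normalize(values):
    """Scale non-negative numbers so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

# Assumed parameters, indexed by the value of Cavity (derived from AIMA Fig. 13.3).
P_CAVITY = {True: 0.2, False: 0.8}
P_TOOTHACHE_GIVEN = {True: 0.6, False: 0.1}  # P(toothache | Cavity)
P_CATCH_GIVEN = {True: 0.9, False: 0.2}      # P(catch | Cavity)

# alpha * P(Cavity) * P(toothache | Cavity) * P(catch | Cavity)
unnormalized = [
    P_CAVITY[c] * P_TOOTHACHE_GIVEN[c] * P_CATCH_GIVEN[c] for c in (True, False)
]
posterior = normalize(unnormalized)

print([round(p, 3) for p in posterior])  # [0.871, 0.129]
```

The first entry is P(cavity|toothache ∧ catch) and the second is P(¬cavity|toothache ∧ catch).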
An Example: The Wumpus World
A probability model of the Wumpus World
Consider again the Wumpus World (restricted to pit detection)
Evidence: no pit in (1,1), (1,2), (2,1), breezy in (1,2), (2,1)
Q. Given the evidence, what is the probability of having a pit in (1,3), (2,2) or (3,1)?
Two groups of variables:
Pi,j = true iff [i, j] contains a pit (“causes”)
Bi,j = true iff [i, j] is breezy (“effects”, consider only B1,1, B1,2, B2,1)
Joint Distribution: P(P1,1, ..., P4,4, B1,1, B1,2, B2,1)
Known facts (evidence):
b∗ =def ¬b1,1 ∧ b1,2 ∧ b2,1
p∗ =def ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Queries: P(P1,3|b∗, p∗)? P(P2,2|b∗, p∗)? (P(P3,1|b∗, p∗) symmetric)
(© S. Russell & P. Norvig, AIMA)
An Example: The Wumpus World [cont.]
Inference by enumeration
Case P1,3:
General form of query: P(Y|E = e) = αP(Y, E = e) = α Σ_h P(Y, E = e, H = h)
Y: query vars; E,e: evidence vars/values; H,h: hidden vars/values
Our case: P(P1,3|p∗, b∗), s.t. the evidence is
b∗ =def ¬b1,1 ∧ b1,2 ∧ b2,1
p∗ =def ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Sum over hidden variables:
P(P1,3|p∗, b∗) = α Σ_unknown P(P1,3, unknown, p∗, b∗)
An Example: The Wumpus World [cont.]
Using conditional independence
Basic insight: Given the fringe squares (see below), b∗ is conditionally independent of the
other hidden squares
Unknown =def Fringe ∪ Other
=⇒ P(b∗|p∗, P1,3, Unknown) = P(b∗|p∗, P1,3, Fringe, Other) = P(b∗|p∗, P1,3, Fringe)
Next: manipulate the query into a form
where this equation can be used
An Example: The Wumpus World [cont.]
Using conditional independence
Fringe in Normalization:
In the context of normalization, which often involves scaling data to fit within a certain
range or distribution, the fringe refers to the extreme values of the original data that
may lie outside the normalized range.
For instance, if you're normalizing data to a range between 0 and 1, the fringe would
consist of the original data points that are closest to the minimum and maximum
values. Normalization techniques like min-max scaling or z-score
normalization aim to handle these fringe values effectively to ensure that
they are appropriately represented in the normalized data.
Handling fringe values properly during normalization is important to prevent them
from unduly skewing the analysis or the performance of machine learning algorithms
trained on the data.
In both cases, understanding and appropriately dealing with the fringe are essential
for accurate modeling, analysis, and decision-making in uncertain or data-driven
contexts.
(© S. Russell & P. Norvig, AIMA)
An Example: The Wumpus World [cont.]
Using conditional independence
Min-max scaling, also known as min-max normalization, transforms the original data
into a new range, typically between 0 and 1. It does this by linearly scaling each
feature in the dataset based on the minimum and maximum values observed in that
feature.
The formula for min-max scaling is as follows:
x_scaled = (x − x_min) / (x_max − x_min)
Where:
• x is an original data point.
• x_min is the minimum value of the feature.
• x_max is the maximum value of the feature.
• x_scaled is the scaled value of x within the range [0, 1].
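The min-max formula can be sketched for a single feature as follows. The function name min_max_scale, the sample values, and the handling of a constant feature (mapped to 0.0) are assumptions made for illustration.

```python
def min_max_scale(values):
    """Linearly rescale one feature to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the formula would divide by zero; 0.0 is a convention chosen here.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

scaled = min_max_scale([18.0, 20.0, 22.0, 26.0])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```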
An Example: The Wumpus World [cont.]
Manipulating the query P(P1,3|p∗, b∗), one step at a time:
P(p∗, b∗) is a scalar; use it as a normalization constant
Sum over the unknowns
Use the product rule
Separate unknown into fringe and other
b∗ is conditionally independent of other given fringe
Move P(b∗|p∗, P1,3, fringe) outward
All of the pit locations are independent
Move P(p∗), P(P1,3), and P(fringe) outward
Remove Σ_other P(other) because it equals 1
P(p∗) is scalar, so make it part of the normalization constant
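The equations belonging to these steps are not preserved in the notes. A reconstruction that follows the standard derivation in AIMA Section 13.6 and ends in the result stated on the next slide could read as follows, one line per step (a sketch assuming the amsmath package; the symbols unknown, fringe and other are as defined earlier):

```latex
\begin{align*}
\mathbf{P}(P_{1,3} \mid p^*, b^*)
 &= \mathbf{P}(P_{1,3}, p^*, b^*) / P(p^*, b^*)
  = \alpha\, \mathbf{P}(P_{1,3}, p^*, b^*) \\
 &= \alpha \sum_{\mathit{unknown}} \mathbf{P}(P_{1,3}, \mathit{unknown}, p^*, b^*) \\
 &= \alpha \sum_{\mathit{unknown}} \mathbf{P}(b^* \mid p^*, P_{1,3}, \mathit{unknown})\,
    \mathbf{P}(p^*, P_{1,3}, \mathit{unknown}) \\
 &= \alpha \sum_{\mathit{fringe}} \sum_{\mathit{other}}
    \mathbf{P}(b^* \mid p^*, P_{1,3}, \mathit{fringe}, \mathit{other})\,
    \mathbf{P}(p^*, P_{1,3}, \mathit{fringe}, \mathit{other}) \\
 &= \alpha \sum_{\mathit{fringe}} \sum_{\mathit{other}}
    \mathbf{P}(b^* \mid p^*, P_{1,3}, \mathit{fringe})\,
    \mathbf{P}(p^*, P_{1,3}, \mathit{fringe}, \mathit{other}) \\
 &= \alpha \sum_{\mathit{fringe}} \mathbf{P}(b^* \mid p^*, P_{1,3}, \mathit{fringe})
    \sum_{\mathit{other}} \mathbf{P}(p^*, P_{1,3}, \mathit{fringe}, \mathit{other}) \\
 &= \alpha \sum_{\mathit{fringe}} \mathbf{P}(b^* \mid p^*, P_{1,3}, \mathit{fringe})
    \sum_{\mathit{other}} P(p^*)\, \mathbf{P}(P_{1,3})\, P(\mathit{fringe})\, P(\mathit{other}) \\
 &= \alpha\, P(p^*)\, \mathbf{P}(P_{1,3})
    \sum_{\mathit{fringe}} \mathbf{P}(b^* \mid p^*, P_{1,3}, \mathit{fringe})\, P(\mathit{fringe})
    \sum_{\mathit{other}} P(\mathit{other}) \\
 &= \alpha\, P(p^*)\, \mathbf{P}(P_{1,3})
    \sum_{\mathit{fringe}} \mathbf{P}(b^* \mid p^*, P_{1,3}, \mathit{fringe})\, P(\mathit{fringe}) \\
 &= \alpha'\, \mathbf{P}(P_{1,3})
    \sum_{\mathit{fringe}} \mathbf{P}(b^* \mid p^*, P_{1,3}, \mathit{fringe})\, P(\mathit{fringe})
\end{align*}
```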
An Example: The Wumpus World [cont.]
We have obtained: P(P1,3|p∗, b∗) = α′ P(P1,3) Σ_fringe P(b∗|p∗, P1,3, fringe) P(fringe)
We know that P(P1,3) = ⟨0.2, 0.8⟩ (see slide 38)
We can compute the normalization coefficient α′ afterwards
Σ_fringe P(b∗|p∗, P1,3, fringe) P(fringe): only 4 possible fringes
Start by rewriting as two separate equations:
P(p1,3|p∗, b∗) = α′ P(p1,3) Σ_fringe P(b∗|p∗, p1,3, fringe) P(fringe)
P(¬p1,3|p∗, b∗) = α′ P(¬p1,3) Σ_fringe P(b∗|p∗, ¬p1,3, fringe) P(fringe)
For each of them, P(b∗|...) is 1 if the breezes occur, 0 otherwise:
Σ_fringe P(b∗|p∗, p1,3, fringe) P(fringe) = 1 · 0.04 + 1 · 0.16 + 1 · 0.16 + 0 · 0.64 = 0.36
Σ_fringe P(b∗|p∗, ¬p1,3, fringe) P(fringe) = 1 · 0.04 + 1 · 0.16 + 0 · 0.16 + 0 · 0.64 = 0.2
=⇒ P(P1,3|p∗, b∗) = α′ P(P1,3) Σ_fringe P(b∗|p∗, P1,3, fringe) P(fringe)
= α′⟨0.2, 0.8⟩⟨0.36, 0.2⟩ = α′⟨0.072, 0.16⟩ = (normalization, s.t. α′ ≈ 4.31) ≈ ⟨0.31, 0.69⟩
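The final computation can be reproduced by enumerating the four fringe configurations. The sketch below assumes the fringe consists of squares (2,2) and (3,1) and that each square holds a pit independently with prior 0.2 (taken from P(P1,3) = ⟨0.2, 0.8⟩); the function names are illustrative.

```python
from itertools import product

PIT_PRIOR = 0.2  # assumed prior probability of a pit in any single square

def breezes_explained(p13, p22, p31):
    """P(b* | p*, P1,3, fringe): 1 if the pits account for both observed breezes, else 0."""
    breeze_12 = p13 or p22  # unvisited neighbours of (1,2): (1,3) and (2,2)
    breeze_21 = p22 or p31  # unvisited neighbours of (2,1): (2,2) and (3,1)
    return 1.0 if (breeze_12 and breeze_21) else 0.0

def prior(pit):
    """Prior probability of one square having (or not having) a pit."""
    return PIT_PRIOR if pit else 1.0 - PIT_PRIOR

def fringe_sum(p13):
    """Sum over the 4 fringe configurations of P(b* | ...) P(fringe)."""
    return sum(
        breezes_explained(p13, p22, p31) * prior(p22) * prior(p31)
        for p22, p31 in product((True, False), repeat=2)
    )

unnormalized = [prior(p13) * fringe_sum(p13) for p13 in (True, False)]
alpha = 1.0 / sum(unnormalized)
posterior = [alpha * u for u in unnormalized]

print(round(fringe_sum(True), 2), round(fringe_sum(False), 2))  # 0.36 0.2
print([round(p, 2) for p in posterior])                         # [0.31, 0.69]
```

Only four terms are summed for each value of P1,3, instead of the 2^12 terms that plain enumeration over all unknown squares would need.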