CS6364 Lecture12 - AI Ch13 Prob Reasoning - Rev4(1)

The document discusses probabilistic reasoning in artificial intelligence, particularly in the context of medical diagnosis. It highlights the challenges of using first-order logic due to the complexity and uncertainty in medical conditions, emphasizing the importance of probability theory to manage degrees of belief. Key concepts such as random variables, atomic events, probability distributions, and conditional probabilities are explained to illustrate how to infer probabilities and make decisions under uncertainty.


Artificial Intelligence

CS 6364

Section 12
Probabilistic Reasoning
Acting under uncertainty
• Logical agents assume propositions are
- True
- False
- Unknown → acting under uncertainty
Example: medical diagnosis. A dental diagnosis rule in first-order logic:
∀p Symptom(p, Toothache) ⇒ Disease(p, Cavity)
This rule is wrong, because it claims that all patients with toothaches must have cavities.
Why? Not all patients p with toothaches have cavities!
Some have gum disease, an abscess, or other problems.

Conclusion: to make the rule true, we would have to add an almost unlimited list of possible causes:
∀p Symptom(p, Toothache) ⇒ Disease(p, Cavity) ∨ Disease(p, GumDisease) ∨ Disease(p, Abscess) ∨ …
Medical Diagnosis
Trying to use first-order logic to cope with a domain like
medical diagnosis fails because
1. Laziness: too much work to list the complete set of rules, and too hard to use such rules.
Example of a causal rule:
∀p Disease(p, Cavity) ⇒ Symptom(p, Toothache)
Wrong: not all cavities cause pain → we would need to augment the antecedent with all the conditions under which a cavity causes a toothache.
2. Theoretical ignorance: Medical science has no complete
theory for the domain
3. Practical ignorance: Even if we know all the rules, we
might be uncertain about a particular patient, because
not all necessary tests have been or can be run (too
costly or too time-consuming)

Degree of belief
• When propositions are not known to be true or false, the agent can at best provide a degree of belief in relevant sentences.
• The main tool for dealing with degrees of belief is probability theory.

function DT-AGENT(percept) returns an action
  static: belief_state, probabilistic beliefs about the current state of the world
          action, the agent's action
  1. update belief_state based on action and percept
  2. calculate outcome probabilities for actions,
     given action descriptions and current belief_state
  3. select action with highest expected utility,
     given probabilities of outcomes and utility information
  return action
Basic Probability Notation
• Random variables can be thought of as referring to a "part" of the world whose "status" is unknown.
• Random variables have domains: the values they may take. Depending on the domain, random variables may be classified as
- Boolean random variables → the domain is <true, false>
  Example: Cavity = true; Cavity = false (written ¬cavity)
- Discrete random variables → take values from a countable domain
- Continuous random variables → take values from the real numbers
  Example: the proposition X = 4.02 asserts that the random variable X has the exact value 4.02. We can also have propositions that use inequalities, such as X ≤ 4.20.
Atomic Events
• An atomic event is a complete specification of the state of the world about which the agent is uncertain. If the world is described by a set of random variables, an atomic event is a particular assignment of values to all the random variables.

Example: 2 random variables, Cavity and Toothache.
How many atomic events? 4:
e1: (cavity = false) ∧ (toothache = false)
e2: (cavity = false) ∧ (toothache = true)
e3: (cavity = true) ∧ (toothache = false)
e4: (cavity = true) ∧ (toothache = true)

Properties:
a) Mutually exclusive: at most one can be true
b) The set of all possible atomic events is exhaustive
Axioms of Probability
1. 0 ≤ P(a) ≤ 1
2. P(true) = 1, P(false) = 0
3. P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

We can deduce P(¬a) = 1 − P(a):
P(true) = P(a ∨ ¬a) = 1 and P(false) = P(a ∧ ¬a) = 0,
so applying axiom 3:
P(true) = P(a) + P(¬a) − P(false)
⇒ P(a) + P(¬a) = 1
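As a quick check (a sketch, not part of the slides), the axioms and the derived complement rule can be verified numerically. The joint distribution over Cavity and Toothache below is an illustrative assumption, not course data:

```python
joint = {  # assumed joint over (cavity, toothache); values sum to 1
    (True, True): 0.12, (True, False): 0.08,
    (False, True): 0.08, (False, False): 0.72,
}

def P(event):
    """Probability of a proposition, summed over the atomic events where it holds."""
    return sum(p for world, p in joint.items() if event(world))

cavity = lambda w: w[0]
toothache = lambda w: w[1]

# Axiom 3: P(a or b) = P(a) + P(b) - P(a and b)
lhs = P(lambda w: cavity(w) or toothache(w))
rhs = P(cavity) + P(toothache) - P(lambda w: cavity(w) and toothache(w))
assert abs(lhs - rhs) < 1e-12

# Derived complement rule: P(not a) = 1 - P(a)
assert abs(P(lambda w: not cavity(w)) - (1 - P(cavity))) < 1e-12
```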
Prior probability
 Prior probability (or unconditional probability) associated
with proposition a is the degree of belief accorded to it in the
absence of any other information
Example: P(cavity = true) = 0.1
 Important: P(a) can be used only when there is no other
information. As soon as some new information is known, we
must reason with the conditional probability of a, given that
new information
 Sometimes we are interested in all possible values of a
random variable  use expressions such as P(weather)
which denotes a vector of values for the probabilities of each
individual state of the weather
Example: P(weather = sunny) = 0.7
P(weather = rain) = 0.2
P(weather = cloudy) = 0.08
P(weather = snow) = 0.02
also written as:
P(weather) = <0.7, 0.2, 0.08, 0.02>
Probability distributions
Example 1:
Hair_color is a discrete random variable.
It has the domain <blond, brown, red, black, white, none>
Out of a sample of 10,000 people, we find that 1872 had blond hair, 4325 had brown hair, 2135 had black hair, 652 had red hair, 321 had white hair, and the remaining 695 were bald.
The probability distribution (over <blond, brown, red, black, white, none>) is:
P(Hair_color) = <0.1872, 0.4325, 0.0652, 0.2135, 0.0321, 0.0695>

Example 2:
UTD_student is a binary random variable → the domain is <true, false>.
Out of a sample of 1000 youngsters between 19 and 21 in the Dallas area, 321 were students at UTD.
P(UTD_student) = <0.321, 0.679>
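The count-to-distribution computation above can be sketched in a few lines (the dictionary layout is an illustrative choice, not course code):

```python
# Turn the hair-color sample counts into a probability distribution
# (domain: blond, brown, red, black, white, none).
counts = {"blond": 1872, "brown": 4325, "red": 652,
          "black": 2135, "white": 321, "none": 695}
total = sum(counts.values())                   # 10000 people sampled
dist = {color: n / total for color, n in counts.items()}
assert abs(sum(dist.values()) - 1.0) < 1e-12   # a valid distribution
print(dist["brown"], dist["red"])              # 0.4325 0.0652
```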
Full Joint Probability Distribution
Suppose the world consists only of the variables:
- Cavity: a binary random variable (2 values)
- Toothache: a binary random variable (2 values)
- Weather: 4 values (sunny, rain, cloudy, snow)
→ 2 × 2 × 4 = 16 entries for the joint distribution

P(Cavity) = <0.23, 0.77>
P(Toothache) = <0.35, 0.65>
P(Weather) = <0.7, 0.2, 0.08, 0.02>

Represent the joint probability distribution (we need to know it to compute conditional probabilities):

                 Cavity                  ¬Cavity
Weather   Toothache  ¬Toothache   Toothache  ¬Toothache
Sunny         ?          ?            ?          ?
Rain          ?          ?            ?          ?
Cloudy        ?          ?            ?          ?
Snow          ?          ?            ?          ?
Probability Density Functions
• For continuous variables, it is not possible to write out the entire distribution as a table, because there are infinitely many values.
• Instead, we define the probability that a random variable takes on some value x as a parameterized function of x.
• Example: let the random variable X denote tomorrow's maximum rainfall in Dallas.
• The sentence P(X = x) = U[1,3](x) expresses the belief that X is distributed uniformly between 1 in and 3 in.
Conditional Probability
• When new evidence concerning a previously unknown random variable is found, prior probabilities no longer apply.
• We use conditional probabilities:
- a: a random variable
- b: a random variable
- P(a | b) denotes "the probability of a, given that all we know is b"
• Example: P(cavity | toothache) = 0.8
• How do we compute P(a | b)?
P(a | b) = P(a ∧ b) / P(b)
• The same rule rearranged: P(a ∧ b) = P(a | b) P(b) → the product rule!
Conditional Distributions
• If two random variables X and Y define the world, P(X | Y) gives the values of P(X = xi | Y = yj) for each possible pair i and j.
• Expressed with the product rule, entry by entry:
P(X = x1 ∧ Y = y1) = P(X = x1 | Y = y1) P(Y = y1)
P(X = x1 ∧ Y = y2) = P(X = x1 | Y = y2) P(Y = y2)
…
• This can be combined into a single equation:
P(X, Y) = P(X | Y) P(Y)
• This denotes a set of equations relating the corresponding individual entries in the tables (not a matrix multiplication of the tables).
Inference Using the Full Joint Distribution
Example: the domain consists of three Boolean variables: Toothache, Cavity, and Catch (the dentist's nasty steel probe catches in my tooth).

              Toothache               ¬Toothache
           Catch    ¬Catch         Catch    ¬Catch
Cavity     0.108    0.012          0.072    0.008
¬Cavity    0.016    0.064          0.144    0.576

How many atomic events? 2^3 = 8 (as many as the entries in the table!)

e1: (cavity=false) ∧ (toothache=false) ∧ (catch=false)   P(e1) = 0.576
e2: (cavity=false) ∧ (toothache=false) ∧ (catch=true)    P(e2) = 0.144
e3: (cavity=false) ∧ (toothache=true) ∧ (catch=false)    P(e3) = 0.064
e4: (cavity=false) ∧ (toothache=true) ∧ (catch=true)     P(e4) = 0.016
e5: (cavity=true) ∧ (toothache=false) ∧ (catch=false)    P(e5) = 0.008
e6: (cavity=true) ∧ (toothache=false) ∧ (catch=true)     P(e6) = 0.072
e7: (cavity=true) ∧ (toothache=true) ∧ (catch=false)     P(e7) = 0.012
e8: (cavity=true) ∧ (toothache=true) ∧ (catch=true)      P(e8) = 0.108
Inferring Probabilities
Given any proposition a, we can derive its probability as the sum of the probabilities of the atomic events in which it holds:
P(a) = Σ_{ei ∈ e(a)} P(ei)

Example: a = cavity ∨ toothache → six events: e3, e4, e5, e6, e7, e8
P(cavity ∨ toothache) = 0.064 + 0.016 + 0.008 + 0.072 + 0.012 + 0.108 = 0.28

When adding all the probabilities in one row of the joint table, we obtain the unconditional or marginal probability:
P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2
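Summing atomic events can be sketched directly from the table (the helper name `P` is an illustrative choice, not from the lecture):

```python
# Full joint over (cavity, toothache, catch), from the lecture's table.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def P(prop):
    """Sum the probabilities of the atomic events where `prop` holds."""
    return sum(p for event, p in joint.items() if prop(event))

print(round(P(lambda e: e[0]), 3))           # 0.2  (marginal P(cavity))
print(round(P(lambda e: e[0] or e[1]), 3))   # 0.28 (P(cavity or toothache))
```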
e5 (cavity=true)(toothache=false)(catch=fal P(e5)=0.00
Probabilistic Inference by Enumeration
• Given a full joint distribution to work with, ENUMERATE-JOINT-ASK is a complete algorithm for answering probabilistic queries for discrete variables.
Probabilistic Inference by Enumeration
• Problem: for a domain described by n Boolean variables, the table has size O(2^n) and takes O(2^n) time to process!
• The full joint distribution in tabular form is not a practical tool for building reasoning systems.
• It can be viewed as the theoretical foundation on which more effective approaches may be built.
Marginalization and Conditioning Rules
• Given any two random variables Y and Z:
P(Y) = Σ_z P(Y, z)    (marginalization rule)
• A variant involves conditional probabilities instead of joint probabilities:
P(Y) = Σ_z P(Y | z) P(z)    (conditioning rule)

Using it on the dental table:
P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2, so P(¬cavity) = 0.8
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2, so P(¬toothache) = 0.8
P(catch) = 0.108 + 0.016 + 0.072 + 0.144 = 0.34, so P(¬catch) = 0.66
Marginalization and Conditioning Rules
• If we have three random variables X, Y, and Z:
P(X) = Σ_y Σ_z P(X, y, z)    (marginalization rule)
P(X) = Σ_y Σ_z P(X | y, z) P(y | z) P(z)    (conditioning rule)
• From the product rule:
P(a | b, c) = P(a, b, c) / P(b, c) = P(a, b, c) / (P(b | c) P(c))
⇒ P(a, b, c) = P(a | b, c) P(b | c) P(c)
P ( a, b)
P ( a | b) 
P (b)

Conditional Probabilities P ( a, b) P ( a | b) P (b)


P ( a | b)  P ( a | b) 1

P (cavity  toothache)
P (cavity | toothache)  toothache toothache
P (toothache) catch catch catch catch

0.108  0.012 cavity 0.108 0.012 0.07 0.008


 0.6 2
0.108  0.012  0.016  0.064 cavity 0.016 0.064 0.14 0.576
4
P( cavity  toothache)
Also : P( cavity | toothache) 
P(toothache)
0.016  0.064 Added to 1
 0.4
0.108  0.012  0.016  0.064 P ( a | b)  P ( a | b) 1

 Notice: P(toothache) remains the same in both calculations  it


acts like a normalization constant for the distribution P(cavity|
toothache) with P( cavity|toochache), added to a sum of 1.
22
Normalization Constants
α: the normalization constant

P(Cavity | toothache)
= α P(Cavity, toothache)
= α [ P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch) ]
= α [ <0.108, 0.016> + <0.012, 0.064> ] = α <0.12, 0.08>

(In each vector, the first component is the cavity entry and the second the ¬cavity entry, read from the joint table.)

Normalization constants are useful shortcuts in many probability computations!
Normalization Constants
P(Cavity | toothache)
= α P(Cavity, toothache)
= α [ P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch) ]
= α [ <0.108, 0.016> + <0.012, 0.064> ] = α <0.12, 0.08>

Since the entries must sum to 1:
α (0.12 + 0.08) = 1 → α (0.2) = 1 → α = 1/0.2 = 5

Thus α × 0.12 = 5 × 0.12 = 0.6 for P(cavity | toothache)
and α × 0.08 = 5 × 0.08 = 0.4 for P(¬cavity | toothache)
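A minimal sketch of the normalization shortcut, using the joint-table entries from the slides:

```python
# Unnormalized entries of P(Cavity, toothache), summing out Catch.
p_cavity_toothache = 0.108 + 0.012      # cavity entry
p_nocavity_toothache = 0.016 + 0.064    # no-cavity entry

# alpha makes the two entries sum to 1.
alpha = 1.0 / (p_cavity_toothache + p_nocavity_toothache)
print(round(alpha, 3))                         # 5.0
print(round(alpha * p_cavity_toothache, 3))    # 0.6
print(round(alpha * p_nocavity_toothache, 3))  # 0.4
```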
General Inference Procedure
• Notation:
- X is the query variable (Cavity in the example)
- E is the set of evidence variables (Toothache in the example)
- e are the observed values of the evidence
- Y are the remaining unobserved (hidden) variables
• The query: P(X | e)
• Evaluated as:
P(X | e) = α P(X, e) = α Σ_y P(X, e, y)

Example: P(Cavity | toothache)
= α [ P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch) ]
summing the hidden variable Catch over both of its values.
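The general procedure can be sketched as a small enumeration function over the dental joint distribution (the function name and index-based variable encoding are illustrative assumptions):

```python
# Full joint over (cavity, toothache, catch), indexed 0, 1, 2.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def query(x_index, evidence):
    """P(X | e) = alpha * sum_y P(X, e, y): keep worlds consistent with the
    evidence dict {var_index: value}, sum out the hidden variables, normalize."""
    dist = {}
    for world, p in joint.items():
        if all(world[i] == v for i, v in evidence.items()):
            dist[world[x_index]] = dist.get(world[x_index], 0.0) + p
    alpha = 1.0 / sum(dist.values())
    return {value: alpha * p for value, p in dist.items()}

res = query(0, {1: True})   # P(Cavity | toothache = true)
print(round(res[True], 3), round(res[False], 3))  # 0.6 0.4
```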
Independence
              Toothache               ¬Toothache
           Catch    ¬Catch         Catch    ¬Catch
Cavity     0.108    0.012          0.072    0.008
¬Cavity    0.016    0.064          0.144    0.576

• Let us add a fourth variable, Weather, with 4 values.
• The full distribution becomes P(Toothache, Catch, Cavity, Weather), which has 8 × 4 = 32 entries.
• This table contains four "editions" of the 8-entry table above, one for each kind of weather.
• What relation do these editions have to each other and to the original 3-variable table?
• How are P(toothache, catch, cavity, Weather=cloudy) and P(toothache, catch, cavity) related?
• Use the product rule:
P(toothache, catch, cavity, Weather=cloudy) =
P(Weather=cloudy | toothache, catch, cavity) × P(toothache, catch, cavity)
• If Weather is independent of the others, then
P(Weather=cloudy | toothache, catch, cavity) = P(Weather=cloudy)
Absolute Independence
• Weather is independent of one's dental problems:
P(Toothache, Catch, Cavity, Weather)
= P(Toothache, Catch, Cavity) P(Weather)
• The 32-element table can be constructed from one 8-element table and one 4-element table.
Independence in Equations
• If propositions a and b are independent:
P(a ∧ b) = P(a) P(b)
P(a | b) = P(a)
P(b | a) = P(b)
(any one of these implies the other two)
• Independence between variables X and Y is written:
P(X, Y) = P(X) P(Y)
P(X | Y) = P(X)
P(Y | X) = P(Y)
Bayes' Rule
• From the product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
• Then: P(a | b) P(b) = P(b | a) P(a), so

P(b | a) = P(a | b) P(b) / P(a)    (Bayes' rule)

• For multi-valued variables, Bayes' rule is:

P(Y | X) = P(X | Y) P(Y) / P(X)

a set of equations, each dealing with specific values of the variables.
Examples
P(cavity | toothache) = P(toothache | cavity) P(cavity) / P(toothache)

P(¬cavity | toothache) = P(toothache | ¬cavity) P(¬cavity) / P(toothache)

P(cavity | ¬toothache) = P(¬toothache | cavity) P(cavity) / P(¬toothache)

P(¬cavity | ¬toothache) = P(¬toothache | ¬cavity) P(¬cavity) / P(¬toothache)
Applying Bayes' Rule: The Simple Case
• Example: medical diagnosis
• Meningitis is a disease caused by the inflammation of the protective membranes covering the brain and spinal cord, known as the meninges.
• A doctor knows that meningitis causes a stiff neck 50% of the time: P(s | m) = 0.5
The doctor also knows some unconditional facts:
- the prior probability that a patient has meningitis is 1/50,000 → P(m) = 1/50,000
- the prior probability that any patient has a stiff neck is 1/20 → P(s) = 1/20

P(m ∧ s) = P(s ∧ m) → P(m | s) P(s) = P(s | m) P(m)

P(m | s) = P(s | m) P(m) / P(s) = (0.5 × 1/50,000) / (1/20) = 0.0002

Only 1 in 5,000 patients with a stiff neck is expected to have meningitis.
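The arithmetic above can be reproduced in a few lines:

```python
# Bayes' rule for the meningitis example: P(m | s) = P(s | m) P(m) / P(s)
p_s_given_m = 0.5        # meningitis causes a stiff neck 50% of the time
p_m = 1 / 50_000         # prior probability of meningitis
p_s = 1 / 20             # prior probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(round(p_m_given_s, 6))   # 0.0002, i.e. 1 in 5000
```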
Another way
P(m | s) = P(s | m) P(m) / P(s) = (0.5 × 1/50,000) / (1/20) = 0.0002
is small because P(m) = 1/50,000 << P(s) = 1/20.

• We can still compute P(m | s) without knowing P(s):
instead, compute the posterior probability for each value of the query variable (here m and ¬m) and normalize the results, using
P(s) = P(s | m) P(m) + P(s | ¬m) P(¬m)

Then:
P(m | s) = P(s | m) P(m) / [ P(s | m) P(m) + P(s | ¬m) P(¬m) ]
Similarly:
P(¬m | s) = P(s | ¬m) P(¬m) / [ P(s | m) P(m) + P(s | ¬m) P(¬m) ]

• In vector form: P(M | s) = α P(M, s) = α < P(s | m) P(m), P(s | ¬m) P(¬m) >
• This can also be obtained by applying Bayes' rule with normalization:
P(Y | X) = α P(X | Y) P(Y)
where α is the normalization constant needed to make the entries in P(Y | X) sum to 1.
Example of Bayes' Rule with Normalization
We have two discrete random variables:
• X describing weather conditions, with the domain X = {sunny, rain, cloudy, snow}
• Y describing clothes, with the domain Y = {t-shirt, long-sleeves, coat}
The distributions of X and Y are:
• P(X) = <0.331, 0.26, 0.159, 0.25>
• P(Y) = <0.5, 0.3, 0.2>
We also have the values of the joint probabilities:
P(t-shirt, sunny)=0.32    P(long-sleeves, sunny)=0.01    P(coat, sunny)=0.001
P(t-shirt, rain)=0.08     P(long-sleeves, rain)=0.15     P(coat, rain)=0.03
P(t-shirt, cloudy)=0.09   P(long-sleeves, cloudy)=0.05   P(coat, cloudy)=0.019
P(t-shirt, snow)=0.01     P(long-sleeves, snow)=0.09     P(coat, snow)=0.15
Example of Bayes' Rule with Normalization
• From these, calculate the conditional probabilities P(Y | X), that is, the probability of clothing given the weather:

P(t-shirt | sunny) = P(t-shirt, sunny)/P(sunny) = 0.32/0.331 = 0.967
P(long-sleeves | sunny) = P(long-sleeves, sunny)/P(sunny) = 0.01/0.331 = 0.0302
P(coat | sunny) = P(coat, sunny)/P(sunny) = 0.001/0.331 = 0.003
P(t-shirt | rain) = P(t-shirt, rain)/P(rain) = 0.08/0.26 = 0.307
P(long-sleeves | rain) = P(long-sleeves, rain)/P(rain) = 0.15/0.26 = 0.577
P(coat | rain) = P(coat, rain)/P(rain) = 0.03/0.26 = 0.1154
P(t-shirt | cloudy) = P(t-shirt, cloudy)/P(cloudy) = 0.09/0.159 = 0.566
P(long-sleeves | cloudy) = P(long-sleeves, cloudy)/P(cloudy) = 0.05/0.159 = 0.314
P(coat | cloudy) = P(coat, cloudy)/P(cloudy) = 0.019/0.159 = 0.1195
P(t-shirt | snow) = P(t-shirt, snow)/P(snow) = 0.01/0.25 = 0.04
P(long-sleeves | snow) = P(long-sleeves, snow)/P(snow) = 0.09/0.25 = 0.36
P(coat | snow) = P(coat, snow)/P(snow) = 0.15/0.25 = 0.6
Computing the Normalization Constant
• Bayes' rule: trying to guess the weather from the clothes people wear:
P(X | Y) = α P(Y | X) P(X) = α P(X, Y)
The entries of P(X | Y) add up to 1 for each value of Y.

1. P(sunny | t-shirt) = α1 P(t-shirt, sunny) = α1 × 0.32
2. P(rain | t-shirt) = α1 P(t-shirt, rain) = α1 × 0.08
3. P(cloudy | t-shirt) = α1 P(t-shirt, cloudy) = α1 × 0.09
4. P(snow | t-shirt) = α1 P(t-shirt, snow) = α1 × 0.01
Since P(sunny ∨ rain ∨ cloudy ∨ snow | t-shirt) = α1 (0.32 + 0.08 + 0.09 + 0.01) = 1,
α1 = 1/0.5 = 2. Thus:
1. P(sunny | t-shirt) = 2 × 0.32 = 0.64
2. P(rain | t-shirt) = 2 × 0.08 = 0.16
3. P(cloudy | t-shirt) = 2 × 0.09 = 0.18
4. P(snow | t-shirt) = 2 × 0.01 = 0.02
Computing the Normalization Constant
• For long-sleeves: α2 (0.01 + 0.15 + 0.05 + 0.09) = 1, so α2 = 1/0.3 = 3.333
1. P(sunny | long-sleeves) = α2 × 0.01 = 0.033
2. P(rain | long-sleeves) = α2 × 0.15 = 0.5
3. P(cloudy | long-sleeves) = α2 × 0.05 = 0.167
4. P(snow | long-sleeves) = α2 × 0.09 = 0.3

• For coat: α3 (0.001 + 0.03 + 0.019 + 0.15) = 1, so α3 = 1/0.2 = 5
1. P(sunny | coat) = α3 × 0.001 = 0.005
2. P(rain | coat) = α3 × 0.03 = 0.15
3. P(cloudy | coat) = α3 × 0.019 = 0.095
4. P(snow | coat) = α3 × 0.15 = 0.75
Example - Summary
Joint probabilities P(X, Y):

               sunny   rain   cloudy   snow  |  P(Y)
t-shirt         .32    .08     .09     .01   |  .5
long-sleeves    .01    .15     .05     .09   |  .3
coat            .001   .03     .019    .15   |  .2
P(X)            .331   .26     .159    .25

Conditional probabilities P(Y | X), clothes given weather (each row sums to 1):

               t-shirt  long-sleeves  coat
sunny           .967      .0302       .003
rain            .307      .577        .1154
cloudy          .566      .314        .1195
snow            .04       .36         .6

Conditional probabilities P(X | Y), weather given clothes (each row sums to 1):

               sunny   rain   cloudy   snow
t-shirt         .64     .16    .18     .02
long-sleeves    .033    .5     .167    .3
coat            .005    .15    .095    .75
More on Bayes' rule
• What happens if one has the conditional probability in one direction but not the other?
Example: the meningitis domain.
Suppose instead that the doctor knows that a stiff neck implies meningitis in 1 of 5000 cases
→ the doctor has quantitative information in the diagnostic direction, from symptoms to causes. Lucky case: the doctor has no need to use Bayes' rule.
• Note: unfortunately, diagnostic knowledge is often more fragile than causal knowledge.
Causal Knowledge
• Note: unfortunately, diagnostic knowledge is often more fragile than causal knowledge.
Why? If there is a sudden epidemic of meningitis, the prior probability of meningitis P(m) will go up. A doctor who estimated the diagnostic probability P(m | s) directly from statistical information will not know how to update it.

But P(m | s) should go up proportionally with P(m):
P(m | s) = P(s | m) P(m) / P(s)
Important: P(s | m) is unaffected by the epidemic, because it simply reflects how meningitis works (how often meningitis causes a stiff neck).

• Conclusion: using causal or model-based knowledge provides robustness → feasible probabilistic reasoning.
Combining Evidence
• Until now, we considered probabilistic information available in the form P(effect | cause), i.e., a single piece of evidence.
• What happens when there are multiple pieces of evidence?
Example: the dentist domain:
P(Cavity | toothache ∧ catch) = α <0.108, 0.016> ≈ <0.871, 0.129>
This will not scale up to larger numbers of variables:
P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity)
→ we need to know the conditional probabilities of the conjunction toothache ∧ catch for all values of Cavity.
• If we have n possible evidence variables (X-rays, diet, oral hygiene, …), there are 2^n possible combinations of observed values, and for each we need to know the conditional probabilities.
Solution
• Consider the notion of independence.
• Three variables: Cavity, Toothache, Catch.
• Which pairs are independent?
- (Cavity, Catch)? Not independent: if the probe catches in the tooth, the tooth probably has a cavity.
- (Toothache, Catch)? Not independent in general: a catch suggests a cavity, and a cavity makes a toothache more likely.
- (Toothache, Cavity)? Not independent: a cavity may cause a toothache, though toothaches are not caused only by cavities.
Conditional Independence
P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity)
• Conditional independence of toothache and catch, given Cavity.
• We also know (Bayes' rule with normalization):
P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity)
• Then, using conditional independence, we can interpret:
P(Cavity | toothache ∧ catch) = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
                                   (effect1 | cause)  (effect2 | cause)   (cause)
where Cavity is the cause, and toothache and catch are effect1 and effect2.
Computational Complexity
• General definition of conditional independence of two variables X and Y, given a third variable Z:
P(X, Y | Z) = P(X | Z) P(Y | Z)
Example:
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
We can now derive the decomposition:
P(Toothache, Catch, Cavity)
= P(Toothache, Catch | Cavity) P(Cavity)    (product rule)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
      Table 1               Table 2            Table 3
Initial table size for (Toothache, Catch, Cavity): 2^3 − 1 = 7 values (2^3 in all, but since they sum to 1 we do not need the last one).
Table 1 size: 2 values (one per value of Cavity); Table 2 size: same; Table 3 size: 2^1 − 1 = 1 value. Total: 2 + 2 + 1 = 5 values.
With n conditionally independent symptoms, the size of the representation grows as O(n) instead of O(2^n).
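A sketch of the factored representation: it stores only the 5 parameters named above yet reconstructs all 8 atomic-event probabilities. The conditional probability values (other than P(cavity) = 0.2) are illustrative assumptions, not course data:

```python
p_cavity = 0.2                              # Table 3: P(cavity)
p_tooth_given = {True: 0.6, False: 0.1}     # Table 1: assumed P(toothache | Cavity)
p_catch_given = {True: 0.9, False: 0.2}     # Table 2: assumed P(catch | Cavity)

def joint(cav, tooth, catch):
    """Reconstruct an atomic-event probability from the 5 stored parameters,
    using P(T, C, Cav) = P(T | Cav) P(C | Cav) P(Cav)."""
    p = p_cavity if cav else 1 - p_cavity
    p *= p_tooth_given[cav] if tooth else 1 - p_tooth_given[cav]
    p *= p_catch_given[cav] if catch else 1 - p_catch_given[cav]
    return p

# The 8 reconstructed entries still form a valid distribution.
total = sum(joint(c, t, k)
            for c in (True, False) for t in (True, False) for k in (True, False))
assert abs(total - 1.0) < 1e-12
print(round(joint(True, True, True), 4))   # 0.2 * 0.6 * 0.9 = 0.108
```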
Separation (Conditional Independence)
P(Cause, Effect1, …, Effectn) = P(Cause) Π_i P(Effecti | Cause)

P(Toothache, Catch, Cavity)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)

• Here Cavity separates Toothache and Catch because it is a cause of both of them (with the naïve assumption that Toothache and Catch are, or are believed to be, conditionally independent given Cavity).
• This is the naïve Bayes model. It is called naïve because it is often applied even when the effect variables are not actually conditionally independent; even then, it works surprisingly well in practice.
The Wumpus World Revisited
After finding a breeze in both [1,2] and [2,1], the agent is stuck: there is no provably safe place to explore.
(Grid: the agent has visited [1,1], [1,2], and [2,1], all marked OK; breezes were observed in [1,2] and [2,1].)

Goal: compute the probability that each of the three neighboring squares contains a pit:
P(Pit[1,3]), P(Pit[2,2]), P(Pit[3,1])

Information:
- a pit causes a breeze in all neighboring squares
- each square other than [1,1] contains a pit with probability 0.2

Step 1: identify the random variables:
- Pij = true if square [i,j] contains a pit
- Bij = true if square [i,j] is breezy; included only for the observed squares, [1,1], [1,2], and [2,1] so far
- Pij and Bij are Boolean variables


Probabilistic Reasoning for the Wumpus
Notation: Pij for pit, Bij for breezy.
Next step: specify the full joint distribution:
P(P11, …, P44, B11, B12, B21)
= P(B11, B12, B21 | P11, …, P44) P(P11, …, P44)

The prior probability of a pit configuration (assuming independence of each cell):
P(P11, …, P44) = Π_{i,j} P(Pij)

If a particular configuration has n pits [where P(a cell has a pit) = 0.2], then:
P(p11, …, p44) = (0.2)^n (0.8)^(16−n)
Combining Evidence
The evidence: the observed breezes in the visited squares, plus the fact that each visited square contains no pit:
b = ¬b11 ∧ b12 ∧ b21    ([1,2] and [2,1] have a breeze; [1,1] does not)
known = ¬p11 ∧ ¬p12 ∧ ¬p21

Query: P(P13 | known, b)
How likely is it that [1,3] contains a pit, given the observations so far?

(Grid: the known squares are [1,1], [1,2], [2,1]; the frontier squares [2,2] and [3,1] border the visited squares; all remaining squares are "other".)
Answering the Query
• To answer P(P13 | known, b), we sum over entries from the full joint distribution.
• Let Unknown be a composite variable consisting of the Pij variables for squares other than the known squares and the query square [1,3]:

P(P13 | known, b) = α Σ_unknown P(P13, unknown, known, b)

• How many terms? With 4 × 4 = 16 squares, there are 16 − 3 − 1 = 12 unknown squares → the summation contains 2^12 = 4096 terms (too many!)
Careful Computation
Why would the contents of [4,4] affect whether [1,3] has a pit?
• Let Frontier be the set of pit variables of the unvisited squares adjacent to visited squares: Frontier = {[2,2], [3,1]}.
• Let Other be the set of pit variables of the remaining unknown squares (10 of them).
• The observed breezes are conditionally independent of all other variables, given the known, frontier, and query variables:

P(P13 | known, b)
= α Σ_unknown P(P13, known, b, unknown)
= α Σ_unknown P(b | P13, known, unknown) P(P13, known, unknown)
= α Σ_frontier Σ_other P(b | known, P13, frontier, other) P(P13, known, frontier, other)
= α Σ_frontier Σ_other P(b | known, P13, frontier) P(P13, known, frontier, other)

The final step uses conditional independence: b is independent of other, given known, P13, and frontier.
Continue Computation
P(P13 | known, b) = α Σ_frontier Σ_other P(b | known, P13, frontier) P(P13, known, frontier, other)

The first term in this expression does not depend on the Other variables, so we can move the summation inwards:

P(P13 | known, b) = α Σ_frontier P(b | known, P13, frontier) Σ_other P(P13, known, frontier, other)

The prior term can be factored (by independence of the pit variables) and the terms reordered:

P(P13 | known, b)
= α Σ_frontier P(b | known, P13, frontier) Σ_other P(P13) P(known) P(frontier) P(other)
= α P(known) P(P13) Σ_frontier P(b | known, P13, frontier) P(frontier) Σ_other P(other)

Since Σ_other P(other) = 1:

P(P13 | known, b) = α′ P(P13) Σ_frontier P(b | known, P13, frontier) P(frontier)
Finishing
P(P13 | known, b) = α′ P(P13) Σ_frontier P(b | known, P13, frontier) P(frontier)

frontier = {[2,2], [3,1]}; e.g., if both [2,2] and [3,1] have pits, P(frontier) = 0.2 × 0.2 = 0.04

How do we build the models of the frontier? Since the breezes B12 and B21 each signal a pit in a neighboring square, a pit may be in P13, P22, or P31. Only frontier models consistent with the observed breezes contribute; the prior P(P13) is factored out front:

Case (a): [1,3] has a pit (P13 = 1)
  P22=1, P31=1   Model 1   0.2 × 0.2 = 0.04
  P22=1, P31=0   Model 2   0.2 × 0.8 = 0.16
  P22=0, P31=1   Model 3   0.8 × 0.2 = 0.16

Case (b): [1,3] has no pit (P13 = 0)
  P22=1, P31=1   Model 4   0.2 × 0.2 = 0.04
  P22=1, P31=0   Model 5   0.2 × 0.8 = 0.16
Likelihood of Pit at [1,3]
P(P13 | known, b) = α′ P(P13) Σ_frontier P(b | known, P13, frontier) P(frontier)

Models 1, 2, and 3 apply for P13 = 1; models 4 and 5 for P13 = 0:

P(P13 | known, b) = α′ <0.2 × (0.04 + 0.16 + 0.16), 0.8 × (0.04 + 0.16)>
= α′ <0.2 × 0.36, 0.8 × 0.2>
= α′ <0.072, 0.16>

Because α′ (0.072 + 0.16) = α′ × 0.232 = 1,
α′ = 1/0.232 ≈ 4.31

→ P(P13 | known, b) ≈ <0.3103, 0.6897>
Interpretation
• From P(P13 | known, b) = <0.3103, 0.6897>, we know that [1,3] contains a pit with roughly 31% probability. (By symmetry, the same holds for [3,1].)
• Similarly, computing P(P22 | known, b) shows that [2,2] contains a pit with roughly 86% probability.
• That is, the agent should avoid [2,2]!

Lessons:
• Seemingly complicated problems can be formulated precisely in probability theory and solved using simple algorithms.
• Efficient solutions are obtained when independence and conditional independence relationships are used to simplify the summations.
• Independence corresponds to our natural understanding of how the problem should be decomposed.
