Decision Making Under Uncertainty
Uncertainty
VARIOUS STEPS
Utility-Based Agent
[Diagram: the agent perceives the environment through sensors and acts on it through actuators.]
Non-deterministic vs. Probabilistic Uncertainty
Non-deterministic model: an action's outcome is some state in the set {a, b, c}; choose the decision that is best for the worst case (~ adversarial search).
Probabilistic model: an action's outcome is a state drawn from {a(pa), b(pb), c(pc)} with known probabilities; choose the decision that maximizes the expected utility value.
Expected Utility
Random variable X with n values x1,…,xn
and distribution (p1,…,pn)
E.g.: X is the state reached after doing
an action A under uncertainty
Function U of X
E.g., U is the utility of a state
The expected utility of A is
EU[A] = Σi=1,…,n P(xi | A) U(xi)
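As a quick illustration, here is a minimal Python sketch (not from the slides) of this formula; the outcome list uses the numbers of the one-action example that follows.

```python
# Expected utility of an action: EU[A] = sum_i P(xi | A) * U(xi)
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for the states
    the action can lead to."""
    return sum(p * u for p, u in outcomes)

# Numbers from the one-state/one-action example below:
print(expected_utility([(0.2, 100), (0.7, 50), (0.1, 70)]))  # 62.0
```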
One State/One Action Example
From s0, a single action A1 leads to s1 with probability 0.2 (U = 100), to s2 with probability 0.7 (U = 50), and to s3 with probability 0.1 (U = 70), so EU(A1) = 0.2·100 + 0.7·50 + 0.1·70 = 62.
One State/Two Actions Example
From s0, action A1 leads to s1 (0.2, U = 100), s2 (0.7, U = 50), or s3 (0.1, U = 70); action A2 leads to s2 (0.2, U = 50) or s4 (0.8, U = 80).
• EU(A1) = 0.2·100 + 0.7·50 + 0.1·70 = 62
• EU(A2) = 0.2·50 + 0.8·80 = 74
• EU(s0) = max{EU(A1), EU(A2)} = 74
Introducing Action Costs
Same example, but each action now has a cost, shown as -5 on A1 and -25 on A2.
• EU(A1) = 62 – 5 = 57
• EU(A2) = 74 – 25 = 49
• EU(s0) = max{EU(A1), EU(A2)} = 57
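A small Python sketch of this action-selection step, using the transition probabilities and utilities of the example above (the A2 outcomes are as reconstructed there):

```python
# Pick the action with maximum expected utility, net of its cost.
def best_action(actions):
    """actions: dict mapping name -> (cost, [(probability, utility), ...])."""
    def net_eu(cost, outcomes):
        return sum(p * u for p, u in outcomes) - cost
    return max(actions, key=lambda name: net_eu(*actions[name]))

actions = {
    "A1": (5,  [(0.2, 100), (0.7, 50), (0.1, 70)]),  # EU = 62 - 5  = 57
    "A2": (25, [(0.2, 50),  (0.8, 80)]),             # EU = 74 - 25 = 49
}
print(best_action(actions))  # A1
```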
MEU Principle
A rational agent should choose the action that maximizes its expected utility.
This is the basis of the field of decision theory.
MEU is a normative criterion for the rational choice of action.
Not quite…
Must have a complete model of:
Actions
Utilities
States
Even with a complete model, the computation is generally intractable.
In fact, a truly rational agent also takes the utility of reasoning itself into account (bounded rationality).
Nevertheless, great progress has been made in this area recently, and we can now solve much more complex decision-theoretic problems than ever before.
We’ll look at
Decision Theoretic Planning
Simple decision making (ch. 16)
Sequential decision making (ch. 17)
Decision Networks
Extend BNs to handle actions and
utilities
Also called Influence diagrams
Make use of BN inference
Can do Value of Information
calculations
Decision Networks cont.
Chance nodes: random variables, as in
BNs
Decision nodes: actions that decision
maker can take
Utility/value nodes: the utility of the
outcome state.
R&N example
Umbrella Network
[Decision network: decision node Take Umbrella (take/don't take); chance nodes rain, with P(rain) = 0.4, and umbrella, with P(umb | take) = 1.0 and P(~umb | ~take) = 1.0; utility node happiness, which depends on umbrella and rain.]
Umbrella Network (cont.)
Utilities: U(~umb, ~rain) = 100, U(~umb, rain) = -100, U(umb, ~rain) = 0, U(umb, rain) = -25.
#1: Fill in P(umb, rain | take) for the four combinations of umb and rain.
#2: Compute EU(take).
Umbrella Network (cont.)
Same network and utilities as above.
#1: Fill in P(umb, rain | ~take) for the four combinations of umb and rain.
#2: Compute EU(~take).
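A short Python sketch of exercises #1 and #2 above. Because P(umb | take) = 1.0 and P(~umb | ~take) = 1.0, the umbrella variable is fully determined by the decision, so the joint P(umb, rain | action) reduces to P(rain):

```python
# EU(take) and EU(~take) in the umbrella network.
P_RAIN = 0.4
U = {  # utility indexed by (umb, rain)
    (0, 0): 100, (0, 1): -100,
    (1, 0): 0,   (1, 1): -25,
}

def eu(take):
    umb = 1 if take else 0  # umbrella is determined by the decision
    return sum((P_RAIN if rain else 1 - P_RAIN) * U[(umb, rain)]
               for rain in (0, 1))

print(eu(take=True))   # 0.6*0 + 0.4*(-25)    = -10.0
print(eu(take=False))  # 0.6*100 + 0.4*(-100) = 20.0
```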
Value of Information (VOI)
Suppose the agent's current knowledge is E. The value of the current best action α is
EU(α | E) = maxA Σi U(Resulti(A)) P(Resulti(A) | E, Do(A))
[The umbrella network is extended with a chance node forecast, a child of rain; the rest of the network and the utilities are as before.]
R F P(F | R)
0 0 0.8
0 1 0.2
1 0 0.3
1 1 0.7
VOI
VOI(forecast) = P(rainy) · EU(αrainy | rainy) + P(~rainy) · EU(α~rainy | ~rainy) – EU(α)
where αF is the best action given forecast F, and α is the best action with no forecast.
P(F = rainy) = 0.4
Umbrella Network (cont.)
Posterior probability of rain given the forecast:
F R P(R | F)
0 0 0.8
0 1 0.2
1 0 0.3
1 1 0.7
[Network and utilities as before.]
Fill in the four joint distributions P(umb, rain | take, rainy), P(umb, rain | take, ~rainy), P(umb, rain | ~take, rainy), and P(umb, rain | ~take, ~rainy), then use them to compute the expected utilities needed for VOI(forecast).
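A Python sketch of the complete VOI calculation, using only the numbers above. The intermediate expected utilities are not given on the slides; they follow from the stated probabilities and utilities:

```python
# VOI(forecast) for the umbrella network.
P_FORECAST_RAINY = 0.4
P_RAIN_GIVEN_F = {1: 0.7, 0: 0.2}   # P(rain | F=rainy), P(rain | F=~rainy)
U = {(0, 0): 100, (0, 1): -100, (1, 0): 0, (1, 1): -25}

def eu(take, p_rain):
    umb = 1 if take else 0
    return (1 - p_rain) * U[(umb, 0)] + p_rain * U[(umb, 1)]

def best_eu(p_rain):
    return max(eu(True, p_rain), eu(False, p_rain))

eu_no_info = best_eu(0.4)                  # 20.0  (best: don't take)
eu_rainy   = best_eu(P_RAIN_GIVEN_F[1])    # -17.5 (best: take)
eu_sunny   = best_eu(P_RAIN_GIVEN_F[0])    # 60.0  (best: don't take)

voi = (P_FORECAST_RAINY * eu_rainy
       + (1 - P_FORECAST_RAINY) * eu_sunny
       - eu_no_info)
print(voi)  # 9.0
```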
Sequence of Actions
[4×3 grid world; the robot is at [3,2].]
• Planned sequence of actions: (U, R)
• U is executed
Histories
[Because actions have uncertain outcomes, executing (U, R) from [3,2] can produce several histories; after both steps the robot may be in [3,1], [3,2], [3,3], [4,1], [4,2], or [4,3].]
• Planned sequence of actions: (U, R)
• U has been executed
• R is executed
P([4,3] | (U,R).[3,2]) = P([4,3] | R.[3,3]) × P([3,3] | U.[3,2]) + P([4,3] | R.[4,2]) × P([4,2] | U.[3,2])
P([4,3] | R.[3,3]) = 0.8    P([3,3] | U.[3,2]) = 0.8
P([4,3] | R.[4,2]) = 0.1    P([4,2] | U.[3,2]) = 0.1
P([4,3] | (U,R).[3,2]) = 0.8 × 0.8 + 0.1 × 0.1 = 0.65
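The same calculation as a tiny Python sketch, marginalizing over the state reached after the first action (only the two intermediate states with a nonzero contribution are listed, as on the slide):

```python
# P([4,3] | (U,R).[3,2]) = sum over intermediate states s' of
#   P(s' | U.[3,2]) * P([4,3] | R.s')
P_after_U   = {(3, 3): 0.8, (4, 2): 0.1}   # P(s' | U.[3,2])
P_goal_by_R = {(3, 3): 0.8, (4, 2): 0.1}   # P([4,3] | R.s')

p = sum(P_after_U[s] * P_goal_by_R[s] for s in P_after_U)
print(round(p, 2))  # 0.65
```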
Utility Function
[4×3 grid world: the square [4,3] has utility +1 and the square [4,2] has utility -1. The histories produced by (U, R) end in one of [3,1], [3,2], [3,3], [4,1], [4,2], [4,3].]
Repeat:
  s ← sensed state
  If s is terminal then exit
  a ← π(s)
  Perform a
Optimal Policy
[4×3 grid world with +1 at [4,3] and -1 at [4,2].]
• A policy is a complete mapping from states to actions.
• The optimal policy π* is the one that always yields a history with maximal expected utility.
This problem is called a Markov Decision Problem (MDP).
Reward
Additive Utility
History H = (s0, s1, …, sn)
The utility of H is additive iff:
U(s0, s1, …, sn) = R(0) + U(s1, …, sn) = Σi R(i)
Robot navigation example:
R(n) = +1 if sn = [4,3]
R(n) = -1 if sn = [4,2]
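A tiny Python sketch of additive utility: the utility of a history is the sum of per-step rewards. Only the two terminal rewards are stated above; every other state is given reward 0 here purely for illustration.

```python
# Utility of a history as a sum of rewards: U(H) = sum_i R(i).
def reward(state):
    return {(4, 3): +1, (4, 2): -1}.get(state, 0)  # 0 elsewhere (assumption)

def history_utility(history):
    return sum(reward(s) for s in history)

print(history_utility([(3, 2), (3, 3), (4, 3)]))  # 1
```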
First-step analysis
U(i) = R(i) + maxa Σk P(k | a.i) U(k)
For t = 0, 1, 2, …, do:
  Ut+1(i) ← R(i) + maxa Σk P(k | a.i) Ut(k)
[4×3 grid world with +1 at [4,3] and -1 at [4,2].]
Value Iteration
Initialize the utility of each non-terminal state si to 0. (Note the importance of the terminal states.)
For t = 0, 1, 2, …, do:
  Ut+1(i) ← R(i) + maxa Σk P(k | a.i) Ut(k)
[Converged utilities for the 4×3 grid world (columns 1–4):
  row 3: 0.812  0.868  0.918  +1
  row 2: 0.762  (blocked square)  0.660  -1
  row 1: 0.705  0.655  0.611  0.388
Plot of Ut for several states, e.g. Ut([3,1]), against t = 0…30.]
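A runnable Python sketch of value iteration for this grid world. The transition model (0.8 in the intended direction, 0.1 to each side, consistent with the 0.8/0.1 probabilities used earlier), the blocked square at [2,2], and the step reward R(i) = -0.04 for non-terminal states are the textbook (R&N) assumptions that reproduce the utilities shown above; they are not spelled out on the slide itself.

```python
# Value iteration sketch for the 4x3 grid world (assumed parameters noted above).
GAMMA = 1.0
R_STEP = -0.04                        # reward of non-terminal states (assumption)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
OBSTACLE = (2, 2)                     # blocked square (assumption)
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != OBSTACLE]
ACTIONS = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
SIDEWAYS = {"U": ("L", "R"), "D": ("L", "R"), "L": ("U", "D"), "R": ("U", "D")}

def move(s, a):
    """Cell reached by heading in direction a from s; bumping into a wall or
    the blocked square leaves the robot where it is."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def transitions(s, a):
    """(probability, next state) pairs: 0.8 intended, 0.1 each perpendicular."""
    left, right = SIDEWAYS[a]
    return [(0.8, move(s, a)), (0.1, move(s, left)), (0.1, move(s, right))]

def value_iteration(iterations=100):
    U = {s: 0.0 for s in STATES}
    U.update(TERMINALS)               # terminal utilities stay fixed at +1 / -1
    for _ in range(iterations):
        new_U = dict(U)
        for s in STATES:
            if s in TERMINALS:
                continue
            new_U[s] = R_STEP + GAMMA * max(
                sum(p * U[k] for p, k in transitions(s, a)) for a in ACTIONS)
        U = new_U
    return U

U = value_iteration()
print(round(U[(3, 3)], 3), round(U[(3, 1)], 3))  # approx. 0.918 0.611
```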
Policy Iteration
POMDP (Partially Observable Markov Decision Problem)
• A sensing operation returns a probability distribution over possible states, rather than a single observed state.