Decision Making Under Uncertainty

Overview: A rational agent should choose the action that maximizes expected utility; this is the basis of decision theory. These slides cover expected utility calculations, decision networks (Bayesian networks extended with decision and utility nodes), the value of information, and sequential decision making with probabilistic transitions.

Decision Making Under Uncertainty

VARIOUS STEPS
Utility-Based Agent

[Diagram: the agent perceives the environment through sensors, decides what to do, and acts on the environment through actuators.]
Non-deterministic vs. Probabilistic Uncertainty

• Non-deterministic model: an action has a set of possible outcomes {a, b, c}; pick the decision that is best for the worst case (~ adversarial search).
• Probabilistic model: an action has outcomes with probabilities {a (pa), b (pb), c (pc)}; pick the decision that maximizes the expected utility value.
Expected Utility

• Random variable X with n values x1, …, xn and distribution (p1, …, pn)
  E.g.: X is the state reached after doing an action A under uncertainty
• Function U of X
  E.g.: U is the utility of a state
• The expected utility of A is
  EU[A] = Σi=1,…,n P(xi | A) U(xi)
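As a minimal sketch (not from the slides), the formula translates directly into code; expected_utility is a hypothetical helper name, with probs standing for P(xi | A) and utils for U(xi):

```python
def expected_utility(probs, utils):
    """EU[A] = sum_i P(x_i | A) * U(x_i)."""
    return sum(p * u for p, u in zip(probs, utils))

# The one-state/one-action example from the next slide:
print(expected_utility([0.2, 0.7, 0.1], [100, 50, 70]))  # 62.0
```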
One State/One Action Example

[Diagram: action A1 from state s0 leads to s1, s2, s3 with probabilities 0.2, 0.7, 0.1 and utilities 100, 50, 70.]

EU(A1) = 100 x 0.2 + 50 x 0.7 + 70 x 0.1
       = 20 + 35 + 7
       = 62
One State/Two Actions Example
• EU(AI) = 62
s0 • EU(A2) = 74
• EU(S0) = max{EU(A1),EU(A2)}
= 74

A1 A2

s1 s2 s3 s4
0.2 0.7 0.2 0.1 0.8
100 50 70 80
Introducing Action Costs

[Same diagram, with action costs: A1 costs 5, A2 costs 25.]

• EU(A1) = 62 - 5 = 57
• EU(A2) = 74 - 25 = 49
• EU(s0) = max{EU(A1), EU(A2)} = 57
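A small sketch tying the last two slides together; the A2 branches (0.2 to s2 with utility 50, 0.8 to s4 with utility 80) are read off the diagram above and reproduce EU(A2) = 74:

```python
def expected_utility(probs, utils):
    return sum(p * u for p, u in zip(probs, utils))

# One-state/two-actions example with action costs (numbers from the slides).
eu_a1 = expected_utility([0.2, 0.7, 0.1], [100, 50, 70]) - 5   # 62 - 5  = 57
eu_a2 = expected_utility([0.2, 0.8], [50, 80]) - 25            # 74 - 25 = 49
print(max([('A1', eu_a1), ('A2', eu_a2)], key=lambda t: t[1])) # ('A1', 57.0)
```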
MEU Principle

• A rational agent should choose the action that maximizes the agent's expected utility.
• This is the basis of the field of decision theory.
• It is a normative criterion for rational choice of action.
Not quite…

• Must have a complete model of:
  – Actions
  – Utilities
  – States
• Even with a complete model, the computation may be intractable.
• In fact, a truly rational agent takes into account the utility of reasoning as well (bounded rationality).
• Nevertheless, great progress has been made in this area recently, and we are able to solve much more complex decision-theoretic problems than ever before.
We'll look at

Decision-Theoretic Planning
• Simple decision making (ch. 16)
• Sequential decision making (ch. 17)
Decision Networks

• Extend BNs to handle actions and utilities
• Also called influence diagrams
• Make use of BN inference
• Can do Value of Information calculations
Decision Networks cont.

• Chance nodes: random variables, as in BNs
• Decision nodes: actions that the decision maker can take
• Utility/value nodes: the utility of the outcome state
• R&N example
Umbrella Network

[Diagram: decision node "Take Umbrella" (take/don't take) → chance node "umbrella"; chance node "rain"; both "umbrella" and "rain" feed the utility node "happiness".]

P(rain) = 0.4
P(umb | take) = 1.0
P(~umb | ~take) = 1.0

U(~umb, ~rain) = 100
U(~umb, rain) = -100
U(umb, ~rain) = 0
U(umb, rain) = -25
Evaluating Decision Networks

• Set the evidence variables for the current state.
• For each possible value of the decision node:
  – Set the decision node to that value.
  – Calculate the posterior probability of the parent nodes of the utility node, using BN inference.
  – Calculate the resulting expected utility for the action.
• Return the action with the highest expected utility.
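A minimal sketch of this procedure for the umbrella network, using exact enumeration in place of a general BN inference engine (the identifiers are my own; the CPTs and utilities are the ones given on the slide above):

```python
P_RAIN = 0.4
P_UMB = {'take': 1.0, 'dont_take': 0.0}                    # P(umb = 1 | decision)
U = {(0, 0): 100, (0, 1): -100, (1, 0): 0, (1, 1): -25}    # U(umb, rain)

def expected_utility(decision):
    eu = 0.0
    for umb in (0, 1):
        p_umb = P_UMB[decision] if umb == 1 else 1.0 - P_UMB[decision]
        for rain in (0, 1):
            p_rain = P_RAIN if rain == 1 else 1.0 - P_RAIN
            eu += p_umb * p_rain * U[(umb, rain)]
    return eu

for d in ('take', 'dont_take'):
    print(d, expected_utility(d))   # take: -10.0, dont_take: 20.0 -> best action: dont_take
```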
Umbrella Network (exercise)

[Same network and parameters as the previous slide.]

#1: Fill in P(umb, rain | take) for each combination of umb, rain ∈ {0, 1}.
#2: Compute EU(take).
Umbrella Network (exercise)

[Same network and parameters as before.]

#1: Fill in P(umb, rain | ~take) for each combination of umb, rain ∈ {0, 1}.
#2: Compute EU(~take).
Value of Information (VOI)

• Suppose the agent's current knowledge is E. The value of the current best action α is
  EU(α | E) = max_A Σi U(Result_i(A)) P(Result_i(A) | E, Do(A))
• The value of the new best action α' (after new evidence E' is obtained) is
  EU(α' | E, E') = max_A Σi U(Result_i(A)) P(Result_i(A) | E, E', Do(A))
• The value of information for E' is
  VOI(E') = Σk P(E' = e_k | E) EU(α_{e_k} | E, E' = e_k) - EU(α | E)
Umbrella Network with Forecast

[Diagram: as before, plus a chance node "forecast" whose parent is "rain".]

P(rain) = 0.4
P(umb | take) = 1.0
P(~umb | ~take) = 1.0

P(F | R):
  R=0: P(F=0 | R=0) = 0.8, P(F=1 | R=0) = 0.2
  R=1: P(F=0 | R=1) = 0.3, P(F=1 | R=1) = 0.7

U(~umb, ~rain) = 100
U(~umb, rain) = -100
U(umb, ~rain) = 0
U(umb, rain) = -25
VOI

VOI(forecast) = P(rainy) EU(α_rainy | rainy) + P(~rainy) EU(α_~rainy | ~rainy) - EU(α)

P(F = rainy) = 0.4
Umbrella Network (exercise)

[Same network and parameters as before, plus the posterior of rain given the forecast.]

P(R | F):
  F=0: P(R=0 | F=0) = 0.8, P(R=1 | F=0) = 0.2
  F=1: P(R=0 | F=1) = 0.3, P(R=1 | F=1) = 0.7

#1: Fill in P(umb, rain | take, rainy) and compute EU(take | rainy).
#2: Fill in P(umb, rain | ~take, rainy) and compute EU(~take | rainy).
#3: Fill in P(umb, rain | take, ~rainy) and compute EU(take | ~rainy).
#4: Fill in P(umb, rain | ~take, ~rainy) and compute EU(~take | ~rainy).
VOI

VOI(forecast) = P(rainy) EU(α_rainy | rainy) + P(~rainy) EU(α_~rainy | ~rainy) - EU(α)
Sequential Decision Making
Finite Horizon
Infinite Horizon
Simple Robot Navigation Problem

[Diagram: a 4x3 grid of squares; the robot occupies one square at a time.]

• In each state, the possible actions are U, D, R, and L.
Probabilistic Transition Model

• In each state, the possible actions are U, D, R, and L.
• The effect of U is as follows (transition model):
  – With probability 0.8 the robot moves up one square (if the robot is already in the top row, it does not move).
  – With probability 0.1 the robot moves right one square (if the robot is already in the rightmost column, it does not move).
  – With probability 0.1 the robot moves left one square (if the robot is already in the leftmost column, it does not move).
Markov Property

The transition properties depend only on the current state, not on the previous history (how that state was reached).
Sequence of Actions

[Diagram: the robot starts at [3,2] on the 4x3 grid.]

• Planned sequence of actions: (U, R)
Sequence of Actions

[Diagram: after U, the robot may be in [3,2], [3,3], or [4,2].]

• Planned sequence of actions: (U, R)
• U is executed
Histories

[Diagram: after U the robot may be in [3,2], [3,3], or [4,2]; after R it may end up in [3,1], [3,2], [3,3], [4,1], [4,2], or [4,3].]

• Planned sequence of actions: (U, R)
• U has been executed
• R is executed

• There are 9 possible sequences of states – called histories – and 6 possible final states for the robot!
Probability of Reaching the Goal

Note the importance of the Markov property in this derivation.

• P([4,3] | (U,R).[3,2]) = P([4,3] | R.[3,3]) x P([3,3] | U.[3,2]) + P([4,3] | R.[4,2]) x P([4,2] | U.[3,2])
• P([4,3] | R.[3,3]) = 0.8    P([3,3] | U.[3,2]) = 0.8
• P([4,3] | R.[4,2]) = 0.1    P([4,2] | U.[3,2]) = 0.1
• P([4,3] | (U,R).[3,2]) = 0.8 x 0.8 + 0.1 x 0.1 = 0.65
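A minimal sketch that encodes the transition model and chains the two steps. It assumes square [2,2] is blocked (consistent with the tree on the earlier slides, where slipping left from [3,2] leaves the robot in place), although that assumption does not change this particular number:

```python
COLS, ROWS, WALL = 4, 3, (2, 2)
MOVES = {'U': (0, 1), 'D': (0, -1), 'R': (1, 0), 'L': (-1, 0)}
SLIP = {'U': ('L', 'R'), 'D': ('L', 'R'), 'R': ('U', 'D'), 'L': ('U', 'D')}

def move(s, d):
    """Deterministic move; bumping into the border or the blocked square does nothing."""
    nxt = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    ok = 1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS and nxt != WALL
    return nxt if ok else s

def transition(s, a):
    """P(next state | s, a): 0.8 in the intended direction, 0.1 to each side."""
    dist = {}
    for d, p in [(a, 0.8), (SLIP[a][0], 0.1), (SLIP[a][1], 0.1)]:
        dist[move(s, d)] = dist.get(move(s, d), 0.0) + p
    return dist

# Sum over the intermediate states after U (Markov property), then apply R.
p = sum(p1 * transition(mid, 'R').get((4, 3), 0.0)
        for mid, p1 in transition((3, 2), 'U').items())
print(p)   # 0.65
```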
Utility Function

[Diagram: 4x3 grid with +1 at [4,3] and -1 at [4,2].]

• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape
• The robot needs to recharge its batteries
• [4,3] and [4,2] are terminal states
Utility of a History

[Same grid: +1 at [4,3], -1 at [4,2]; both are terminal states.]

• The utility of a history is defined by the utility of the last state (+1 or -1) minus n/25, where n is the number of moves.
Utility of an Action Sequence

[Diagram: the tree of histories for (U, R) from [3,2], on the grid with +1 at [4,3] and -1 at [4,2].]

• Consider the action sequence (U, R) from [3,2]
• A run produces one among 7 possible histories, each with some probability
• The utility of the sequence is the expected utility of the histories:
  U = Σh Uh P(h)
Optimal Action Sequence

[Same tree of histories as before.]

• Consider the action sequence (U, R) from [3,2]
• A run produces one among 7 possible histories, each with some probability
• The utility of the sequence is the expected utility of the histories
• The optimal sequence is the one with maximal utility (but only if the sequence is executed blindly!)
• But is the optimal action sequence what we want to compute?
Reactive Agent Algorithm

Repeat:
  – s ← sensed state (assumes an accessible/observable state)
  – If s is terminal then exit
  – a ← choose action (given s)
  – Perform a
Policy (Reactive/Closed-Loop Strategy)

[Diagram: 4x3 grid with +1 at [4,3] and -1 at [4,2].]

• A policy π is a complete mapping from states to actions.
Reactive Agent Algorithm

Repeat:
  – s ← sensed state
  – If s is terminal then exit
  – a ← π(s)
  – Perform a
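A minimal sketch of this closed-loop execution; sense, is_terminal, and perform are hypothetical stand-ins for the robot's actual sensing and actuation interface:

```python
def run_policy(pi, sense, is_terminal, perform):
    """Execute a policy pi (a dict mapping states to actions) reactively."""
    while True:
        s = sense()            # s <- sensed state
        if is_terminal(s):     # if s is terminal then exit
            return
        perform(pi[s])         # a <- pi(s); perform a
```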
Optimal Policy

[Diagram: 4x3 grid with +1 at [4,3] and -1 at [4,2]. Note that [3,2] is a "dangerous" state that the optimal policy tries to avoid.]

• A policy π is a complete mapping from states to actions.
• The optimal policy π* is the one that always yields a history (ending at a terminal state) with maximal expected utility.
• This makes sense because of the Markov property.
Optimal Policy

[Same grid as before.]

• A policy π is a complete mapping from states to actions.
• The optimal policy π* is the one that always yields a history with maximal expected utility.
• This problem is called a Markov Decision Problem (MDP).

How to compute π*?
Additive Utility

• History H = (s0, s1, …, sn)
• The utility of H is additive iff:
  U(s0, s1, …, sn) = R(0) + U(s1, …, sn) = Σi R(i)
  where R(i) is the reward received in state si.
• Robot navigation example:
  – R(n) = +1 if sn = [4,3]
  – R(n) = -1 if sn = [4,2]
  – R(i) = -1/25 for i = 0, …, n-1
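A tiny sketch of this history utility for the navigation example (encoding states as (column, row) tuples is my own choice):

```python
def history_utility(history):
    """Additive utility: terminal reward (+1 / -1) minus 1/25 per move."""
    terminal_reward = {(4, 3): +1.0, (4, 2): -1.0}.get(history[-1], 0.0)
    n_moves = len(history) - 1
    return terminal_reward - n_moves / 25.0

print(history_utility([(3, 2), (3, 3), (4, 3)]))   # 1 - 2/25 = 0.92
```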


Principle of Max Expected Utility

• History H = (s0, s1, …, sn)
• Utility of H: U(s0, s1, …, sn) = Σi R(i)

[Diagram: 4x3 grid with +1 at [4,3] and -1 at [4,2].]

First-step analysis:
  U(i) = R(i) + max_a Σk P(k | a.i) U(k)
  π*(i) = arg max_a Σk P(k | a.i) U(k)
Value Iteration

• Initialize the utility of each non-terminal state si to U0(i) = 0.
• For t = 0, 1, 2, …, do:
  Ut+1(i) ← R(i) + max_a Σk P(k | a.i) Ut(k)

[Diagram: 4x3 grid with +1 at [4,3] and -1 at [4,2].]
Value Iteration

Note the importance of terminal states and of the connectivity of the state-transition graph.

• Initialize the utility of each non-terminal state si to U0(i) = 0.
• For t = 0, 1, 2, …, do:
  Ut+1(i) ← R(i) + max_a Σk P(k | a.i) Ut(k)

[Figure: converged utilities on the grid:
  row 3: 0.812, 0.868, 0.918, +1
  row 2: 0.762, (blocked square), 0.660, -1
  row 1: 0.705, 0.655, 0.611, 0.388
 and a plot of Ut([3,1]) converging to about 0.611 as t goes from 0 to 30.]
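A minimal value-iteration sketch for this grid world. Assumptions not spelled out on the slides: square [2,2] is blocked, bumping into the border or the blocked square leaves the robot in place, and terminal utilities stay fixed at +1 / -1. Under those assumptions the printed utilities should approach the numbers shown above (e.g. U([3,1]) ≈ 0.611):

```python
WALL = (2, 2)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STEP_REWARD = -1.0 / 25                     # R(i) = -1/25 for non-terminal states
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
MOVES = {'U': (0, 1), 'D': (0, -1), 'R': (1, 0), 'L': (-1, 0)}
SLIP = {'U': ('L', 'R'), 'D': ('L', 'R'), 'R': ('U', 'D'), 'L': ('U', 'D')}

def move(s, d):
    nxt = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    return nxt if nxt in STATES else s      # blocked or off-grid: stay put

def transition(s, a):
    """0.8 in the intended direction, 0.1 to each perpendicular direction."""
    dist = {}
    for d, p in [(a, 0.8), (SLIP[a][0], 0.1), (SLIP[a][1], 0.1)]:
        dist[move(s, d)] = dist.get(move(s, d), 0.0) + p
    return dist

def value_iteration(sweeps=60):
    U = {s: TERMINALS.get(s, 0.0) for s in STATES}
    for _ in range(sweeps):
        U = {s: TERMINALS[s] if s in TERMINALS else
                STEP_REWARD + max(sum(p * U[k] for k, p in transition(s, a).items())
                                  for a in MOVES)
             for s in STATES}
    return U

U = value_iteration()
for y in (3, 2, 1):
    print([round(U[(x, y)], 3) for x in range(1, 5) if (x, y) in U])
```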
Policy Iteration

• Pick a policy π at random.
• Repeat:
  – Compute the utility of each state for π:
      Ut+1(i) ← R(i) + Σk P(k | π(i).i) Ut(k)
    or solve the set of linear equations:
      U(i) = R(i) + Σk P(k | π(i).i) U(k)
    (often a sparse system)
  – Compute the policy π' given these utilities:
      π'(i) = arg max_a Σk P(k | a.i) U(k)
  – If π' = π then return π.
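A minimal policy-iteration sketch on a tiny made-up two-state MDP (not from the slides), just to make the evaluate/improve loop concrete; it adds a discount factor γ because this toy MDP has no terminal states:

```python
import random

# P[s][a] = {successor: probability}; R[s] = reward in state s (hypothetical example).
P = {
    's1': {'a': {'s1': 0.5, 's2': 0.5}, 'b': {'s1': 0.9, 's2': 0.1}},
    's2': {'a': {'s1': 0.2, 's2': 0.8}, 'b': {'s1': 0.6, 's2': 0.4}},
}
R = {'s1': 0.0, 's2': 1.0}
GAMMA = 0.9

def evaluate(pi, U, sweeps=100):
    """Iterative policy evaluation: U(i) = R(i) + gamma * sum_k P(k | pi(i).i) U(k)."""
    for _ in range(sweeps):
        U = {s: R[s] + GAMMA * sum(p * U[k] for k, p in P[s][pi[s]].items()) for s in P}
    return U

def policy_iteration():
    pi = {s: random.choice(list(P[s])) for s in P}       # pick a policy at random
    U = {s: 0.0 for s in P}
    while True:
        U = evaluate(pi, U)
        new_pi = {s: max(P[s], key=lambda a: sum(p * U[k] for k, p in P[s][a].items()))
                  for s in P}
        if new_pi == pi:                                 # policy unchanged: done
            return pi, U
        pi = new_pi

print(policy_iteration())
```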
Example: Tracking a Target

[Diagram: a robot and a moving target in an environment with obstacles.]

• The robot must keep the target in view.
• The target's trajectory is not known in advance.
Infinite Horizon

What if the robot lives forever?

• In many problems, e.g., the robot navigation example, histories are potentially unbounded and the same state can be reached many times.
• One trick: use discounting to make the infinite-horizon problem mathematically tractable.

[Diagram: the same 4x3 grid with +1 at [4,3] and -1 at [4,2].]
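For reference, a minimal sketch of what discounting means here (the slides do not give the formula): with a discount factor γ strictly less than 1, the utility of an infinite history is taken as

```latex
U(s_0, s_1, s_2, \dots) = \sum_{t \ge 0} \gamma^{t} R(s_t), \qquad 0 \le \gamma < 1
```

which stays bounded as long as the rewards R(s_t) are bounded.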
POMDP (Partially Observable Markov Decision Problem)

• A sensing operation returns multiple states, with a probability distribution.
• Choosing the action that maximizes the expected utility of this state distribution, assuming "state utilities" computed as above, is not good enough, and actually does not make sense (is not rational).
Example: Target Tracking

• There is uncertainty in the robot's and the target's positions, and this uncertainty grows with further motion.
• There is a risk that the target escapes behind the corner, requiring the robot to move appropriately.
• But there is a positioning landmark nearby. Should the robot try to reduce its position uncertainty?
Summary

• Decision making under uncertainty
• Utility function
• Optimal policy
• Maximal expected utility
• Value iteration
• Policy iteration
