Lecture 4: Sequential Decision Making
Simon Parsons
Department of Informatics
King's College London
Version 1
1 / 92
Today
Introduction
Probabilistic Reasoning I
Probabilistic Reasoning II
Sequential Decision Making
Argumentation I
Argumentation II
Temporal Probabilistic Reasoning
Game Theory
(A peek at) Machine Learning
AI & Ethics
2 / 92
What to do?
3 / 92
What to do?
(mystorybook.com/books/42485)
4 / 92
What to do?
5 / 92
Sequential decision making?
(mystorybook.com/books/42485)
One decision leads to another.
Each decision depends on the ones before, and affects
the ones after.
6 / 92
How to decide what to do
Start simple.
Single decision.
Consider being offered a bet in which you pay 2 if an odd
number is rolled on a die, and win 3 if an even number
appears.
Is this a good bet?
7 / 92
How to decide what to do
8 / 92
How to decide what to do
9 / 92
How to decide what to do
E(X) = 0.5 × 3 + 0.5 × (−2)
10 / 92
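Completing the arithmetic:

E(X) = 1.5 − 1 = 0.5

A positive expected value: on average you gain 0.5 per play, so the bet is worth taking.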
How to decide what to do
11 / 92
How to decide what to do
12 / 92
How to decide what to do
(fivethirtyeight.com)
13 / 92
How to decide what to do
which is 0.33.
14 / 92
Example
Pacman is at a T-junction
Based on their knowledge, Pacman estimates that if they go Left:
Probability of 0.3 of getting a payoff of 10
Probability of 0.2 of getting a payoff of 1
Probability of 0.5 of getting a payoff of -5
15 / 92
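Putting those numbers together, the expected payoff of going Left is:

EU(Left) = 0.3 × 10 + 0.2 × 1 + 0.5 × (−5) = 3 + 0.2 − 2.5 = 0.7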
Example
16 / 92
How an agent might decide what to do
17 / 92
How an agent might decide what to do
18 / 92
How an agent might decide what to do
(Diagram: a state s with two available actions a1 and a2, each leading to several possible outcome states s1–s6.)
19 / 92
How an agent might decide what to do
That is, it picks the action that has the greatest expected utility.
The right thing to do.
20 / 92
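In symbols, the standard statement of this maximum-expected-utility principle is:

EU(a) = Σ_{s'} P(s' | a) U(s')

a* = argmax_{a ∈ A} EU(a)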
Non-deterministic
(Diagram: a state s with two available actions a1 and a2, each leading to several possible outcome states s1–s6.)
A given action has several possible outcomes.
We don't know, in advance, which one will happen.
21 / 92
Non-deterministic
(fivethirtyeight.com)
A lot like life.
22 / 92
Other notions of rational
23 / 92
Other notions of rational
This will ignore possible bad outcomes and just focus on the
best outcome of each action.
24 / 92
Example
Pacman is at a T-junction
Based on their knowledge, Pacman estimates that if they go Left:
Probability of 0.3 of getting a payoff of 10
Probability of 0.2 of getting a payoff of 1
Probability of 0.5 of getting a payoff of -5
If they go Right:
Probability of 0.5 of getting a payoff of -5
Probability of 0.4 of getting a payoff of 3
Probability of 0.1 of getting a payoff of 15
25 / 92
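A minimal Python sketch of the two decision rules on this example (the function names are just for illustration): maximising expected utility favours Left, while the optimistic rule from the previous slide, which looks only at the best outcome of each action, favours Right.

# Pacman's two options, each a list of (probability, payoff) pairs.
left = [(0.3, 10), (0.2, 1), (0.5, -5)]
right = [(0.5, -5), (0.4, 3), (0.1, 15)]

def expected_utility(outcomes):
    # Weight every payoff by its probability and sum.
    return sum(p * u for p, u in outcomes)

def best_case(outcomes):
    # Optimistic rule: ignore probabilities, look only at the best payoff.
    return max(u for _, u in outcomes)

print(round(expected_utility(left), 2), round(expected_utility(right), 2))  # 0.7 vs 0.2 -> Left
print(best_case(left), best_case(right))                                    # 10  vs 15  -> Right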
Example
26 / 92
Sequential decision problems
27 / 92
Sequential decision problems
(pillsbury.com)
(Damn fine pie.)
28 / 92
An example
29 / 92
An example
30 / 92
An example
31 / 92
An example
32 / 92
An example
0.8⁵ = 0.32768
33 / 92
An example
34 / 92
An example
P(s' | s, a)
35 / 92
An example
36 / 92
An example
37 / 92
How do we tackle this?
38 / 92
Markov decision process
39 / 92
Markov decision process
40 / 92
Markov decision process
41 / 92
Markov decision process
Naturally wed prefer not just any policy but the optimum
policy.
But how to find it?
Need to compare policies by the reward they generate.
Since actions are stochastic, policies wont give the same
reward every time.
So compare the expected value.
The optimum policy is the policy with the highest expected
value.
At every stage the agent should do ps q.
42 / 92
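In symbols, assuming the discounted-reward setup introduced later in the lecture, the value of following a policy π from state s and the optimum policy are usually written:

U^π(s) = E[ Σ_{t ≥ 0} γ^t R(S_t) ]   (expectation over the stochastic outcomes, starting from S_0 = s, with actions chosen by π)

π* = argmax_π U^π(s)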
Markov decision process
43 / 92
An example
44 / 92
An example
45 / 92
How utilities are calculated
46 / 92
How utilities are calculated
47 / 92
But are they?
48 / 92
How utilities are calculated
as above.
Discounted rewards:
49 / 92
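In the usual notation, the two choices for valuing a state sequence are:

Additive rewards:    U_h([s_0, s_1, s_2, ...]) = R(s_0) + R(s_1) + R(s_2) + ...
Discounted rewards:  U_h([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ² R(s_2) + ...   with 0 < γ ≤ 1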
How utilities are calculated
50 / 92
How utilities are calculated
51 / 92
How utilities are calculated
52 / 92
How utilities are calculated
53 / 92
How utilities are calculated
54 / 92
How utilities are calculated
55 / 92
Optimal policies
56 / 92
Optimal policies
π* = argmax_π U^π(s)
57 / 92
Optimal policies
58 / 92
Optimal policies
59 / 92
Example
Wrong!
60 / 92
Example
61 / 92
Optimal policies
π*(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
62 / 92
Optimal policies
63 / 92
Bellman equation
γ is a discount factor.
64 / 92
Not this Bellman
Lewis Carroll
65 / 92
Bellman equation
Apply:
U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
and we get:
66 / 92
Bellman equation
U(1,1) = −0.04 + γ max[ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),   (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                 (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                 (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) ]   (Right)
67 / 92
Value iteration
68 / 92
Value iteration
69 / 92
Value iteration
70 / 92
Value iteration
U(4,3) is pinned to 1.
U(3,3) quickly settles to a value close to 1.
U(1,1) becomes negative, and then grows as positive utility from the goal feeds back to it.
71 / 92
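A minimal Python sketch of value iteration on the 4 × 3 world of the running example. The layout (wall at (2,2), terminals at (4,3) = +1 and (4,2) = −1), the −0.04 step reward and the 0.8/0.1/0.1 motion model are taken from the example; γ, the number of sweeps and the names are illustrative choices, not the lecture's own code.

# Value iteration on the 4x3 grid world.
GAMMA = 1.0                      # the example is usually presented undiscounted
STEP_REWARD = -0.04
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}
WALL = (2, 2)
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
ACTIONS = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}

def move(s, d):
    # Bumping into the wall or the edge of the grid leaves the agent where it is.
    nxt = (s[0] + d[0], s[1] + d[1])
    return nxt if nxt in STATES else s

def transitions(s, a):
    # P(s'|s,a): intended direction with probability 0.8, each perpendicular with 0.1.
    dx, dy = ACTIONS[a]
    return [(0.8, move(s, (dx, dy))), (0.1, move(s, (dy, dx))), (0.1, move(s, (-dy, -dx)))]

U = {s: 0.0 for s in STATES}
for _ in range(100):             # plenty of sweeps for an 11-state problem
    new_U = {}
    for s in STATES:
        if s in TERMINALS:
            new_U[s] = TERMINALS[s]
        else:
            new_U[s] = STEP_REWARD + GAMMA * max(
                sum(p * U[t] for p, t in transitions(s, a)) for a in ACTIONS)
    U = new_U

print({s: round(U[s], 2) for s in [(1, 1), (3, 3), (4, 3)]})
# Roughly U(1,1) ≈ 0.71, U(3,3) ≈ 0.92, U(4,3) = 1, matching the behaviour described above.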
Rewards
R(s), c(s, a)
72 / 92
Policy iteration
73 / 92
Policy improvement
Easy
Calculate a new policy π_{i+1} by applying:
π_{i+1}(s) = argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U_i(s')
74 / 92
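A sketch of that improvement step in Python, reusing the grid-world definitions (STATES, ACTIONS, TERMINALS, transitions) from the value-iteration sketch above:

def policy_improvement(U):
    # One improvement sweep: in each non-terminal state, pick the action that
    # maximises the expected utility of the next state under the current U.
    pi = {}
    for s in STATES:
        if s in TERMINALS:
            continue
        pi[s] = max(ACTIONS, key=lambda a: sum(p * U[t] for p, t in transitions(s, a)))
    return pi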
Policy evaluation
75 / 92
Policy iteration
76 / 92
Policy iteration
77 / 92
Policy evaluation
78 / 92
Approximate you say?
79 / 92
Approximate policy evaluation
80 / 92
Modified policy iteration
81 / 92
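A sketch of the approximate evaluation step this uses: instead of solving the linear system for U^π exactly, run a fixed number k of simplified Bellman updates with the action pinned to π(s). Again the grid-world definitions come from the value-iteration sketch above, and k = 20 is an arbitrary illustrative choice.

def approximate_policy_evaluation(pi, U, k=20):
    # Estimate the utilities of a fixed policy pi with k simplified Bellman updates.
    for _ in range(k):
        new_U = {}
        for s in STATES:
            if s in TERMINALS:
                new_U[s] = TERMINALS[s]
            else:
                new_U[s] = STEP_REWARD + GAMMA * sum(
                    p * U[t] for p, t in transitions(s, pi[s]))
        U = new_U
    return U

Alternating approximate_policy_evaluation and policy_improvement until the policy stops changing is the modified policy iteration loop.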
Solving MDPs
82 / 92
Bellman redux
83 / 92
Bellman redux
84 / 92
Limitations of MDPs?
85 / 92
Partially observable MDPs
86 / 92
Partially observable MDPs
P(e | s)
87 / 92
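Together with the transition model, this sensor model lets the agent maintain a belief state b, a probability distribution over the states it might be in. The usual update after doing action a and observing evidence e is:

b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s)

where α is a normalising constant making the new belief sum to 1.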
Partially observable MDPs
88 / 92
Partially observable MDPs
89 / 92
Partially observable MDPs
90 / 92
Partially observable MDPs
91 / 92
Partially observable MDPs
92 / 92
Mathematical!
93 / 92
Summary
94 / 92