Lecture 4: Sequential decision making
Simon Parsons
Department of Informatics
King's College London
Version 1
1 / 92
Today
Introduction
Probabilistic Reasoning I
Probabilistic Reasoning II
Sequential Decision Making
Argumentation I
Argumentation II
Temporal Probabilistic Reasoning
Game Theory
(A peek at) Machine Learning
AI & Ethics
2 / 92
What to do?
(40 Acres and a Mule Filmworks/Universal Pictures)
3 / 92
What to do?
(mystorybook.com/books/42485)
4 / 92
What to do?
(Sebastian Thrun & Chris Urmson/Google )
5 / 92
Sequential decision making?
(mystorybook.com/books/42485)
One decision leads to another.
Each decision depends on the ones before, and affects
the ones after.
6 / 92
How to decide what to do
Start simple.
Single decision.
Consider being offered a bet in which you pay 2 if an odd
number is rolled on a die, and win 3 if an even number
appears.
Is this a good bet?
7 / 92
How to decide what to do
Consider being offered a bet in which you pay 2 if an odd
number is rolled on a die, and win 3 if an even number
appears.
Is this a good bet?
To analyse this, we need the expected value of the bet.
8 / 92
How to decide what to do
We do this in terms of a random variable, which we will call X.
X can take two values:
3 if the die rolls even
-2 if the die rolls odd
And we can also calculate the probability of these two values:

Pr(X = 3) = 0.5
Pr(X = -2) = 0.5
9 / 92
How to decide what to do
The expected value is then the weighted sum of the values,
where the weights are the probabilities.
Formally the expected value of X is defined by:

E(X) = \sum_k k \cdot \Pr(X = k)

where the summation is over all values of k for which Pr(X = k) ≠ 0.
Here the expected value is:

E(X) = 0.5 \times 3 + 0.5 \times (-2) = 0.5

Thus the expected value of X is 0.5, and we take this to be the value of the bet.
10 / 92
How to decide what to do
Do you take the bet?
Compare that 0.5 with not taking the bet.
Not taking the bet has (expected) value 0
11 / 92
How to decide what to do
0.5 is not the value you will get.
You can think of it as the long run average if you were offered
the bet many times.
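A minimal simulation sketch (my own illustration, not part of the slides) that checks this long-run-average reading of the expected value:

    import random

    def play_once():
        # One roll of a fair die: win 3 on even, lose 2 on odd.
        return 3 if random.randint(1, 6) % 2 == 0 else -2

    n = 100_000
    average = sum(play_once() for _ in range(n)) / n
    print(average)  # typically close to the expected value of 0.5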
12 / 92
How to decide what to do
(fivethirtyeight.com)
13 / 92
How to decide what to do
Another bet: you get 1 if a 2 or a 3 is rolled, 5 if a six is
rolled, and pay 3 otherwise.
The expected value here is:
E(X) = 0.333 \times 1 + 0.166 \times 5 + 0.5 \times (-3)

which is approximately -0.33.
14 / 92
Example
Pacman is at a T-junction
Based on their knowledge, Pacman estimates that if they go Left:
Probability of 0.3 of getting a payoff of 10
Probability of 0.2 of getting a payoff of 1
Probability of 0.5 of getting a payoff of -5
What is the expected value of Left?
15 / 92
Example
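Working, reconstructed from the definition of expected value given earlier:

E(Left) = 0.3 \times 10 + 0.2 \times 1 + 0.5 \times (-5) = 3 + 0.2 - 2.5 = 0.7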
16 / 92
How an agent might decide what to do
Consider an agent with a set of possible actions A.
Each a ∈ A has a set of possible outcomes s_a.
Which action should the agent pick?
17 / 92
How an agent might decide what to do
The action a which a rational agent should choose is the one which maximises the agent's utility.
In other words the agent should pick:

a^* = \arg\max_{a \in A} u(s_a)

The problem is that in any realistic situation, we don't know which s_a will result from a given a, so we don't know the utility of a given action.
Instead we have to calculate the expected utility of each action and make the choice on the basis of that.
18 / 92
How an agent might decide what to do
In other words, for each action a with a set of outcomes s_a, the agent should calculate:

E(u(a)) = \sum_{s' \in s_a} u(s') \cdot \Pr(s_a = s')

and pick the best.
[Figure: a state with two available actions, a1 and a2, each leading to several possible outcome states s1, ..., s6.]
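A minimal Python sketch of this expected-utility calculation (my own illustration; the outcome utilities and probabilities below are made up for the example):

    # Each action maps to a list of (probability, utility) pairs for its outcomes.
    actions = {
        "a1": [(0.6, 4.0), (0.4, -1.0)],   # hypothetical outcome distribution
        "a2": [(0.5, 2.0), (0.5, 1.0)],
    }

    def expected_utility(outcomes):
        return sum(p * u for p, u in outcomes)

    # Pick the action with the greatest expected utility (the MEU choice).
    best = max(actions, key=lambda a: expected_utility(actions[a]))
    print(best, expected_utility(actions[best]))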
19 / 92
How an agent might decide what to do
That is, it picks the action that has the greatest expected utility.
The right thing to do.
(40 Acres and a Mule Filmworks/Universal Pictures)
Here rational means rational in the sense of maximising
expected utility.
20 / 92
Non-deterministic
Note that we are dealing with non-deterministic actions here.
[Figure: a state with two available actions, a1 and a2, each leading to several possible outcome states s1, ..., s6.]
A given action has several possible outcomes.
We don't know, in advance, which one will happen.
21 / 92
Non-deterministic
(fivethirtyeight.com)
A lot like life.
22 / 92
Other notions of rational
There are other criteria for decision-making than maximising
expected utility.
One approach is to look at the option which has the least-bad
worst outcome.
This maximin criterion can be formalised in the same
framework as MEU, making the rational (in this sense) action:
a^* = \arg\max_{a \in A} \left\{ \min_{s' \in s_a} u(s') \right\}
Its effect is to ignore the probability of outcomes and
concentrate on optimising the worst case outcome.
23 / 92
Other notions of rational
The opposite attitude, that of the optimistic risk-seeker, is captured by the maximax criterion:

a^* = \arg\max_{a \in A} \left\{ \max_{s' \in s_a} u(s') \right\}
This will ignore possible bad outcomes and just focus on the
best outcome of each action.
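A sketch of how the three criteria can differ on the same set of outcomes (the numbers are made up for illustration; only MEU uses the probabilities):

    # Hypothetical outcomes: (probability, utility) pairs per action.
    actions = {
        "safe":  [(0.9, 2.0), (0.1, 1.0)],     # expected utility 1.9, worst case 1.0
        "risky": [(0.5, 12.0), (0.5, -8.0)],   # expected utility 2.0, worst case -8.0
    }

    meu     = max(actions, key=lambda a: sum(p * u for p, u in actions[a]))
    maximin = max(actions, key=lambda a: min(u for _, u in actions[a]))
    maximax = max(actions, key=lambda a: max(u for _, u in actions[a]))
    print(meu, maximin, maximax)   # risky, safe, risky for these made-up numbers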
24 / 92
Example
Pacman is at a T-junction
Based on their knowledge, Pacman estimates that if they go Left:
Probability of 0.3 of getting a payoff of 10
Probability of 0.2 of getting a payoff of 1
Probability of 0.5 of getting a payoff of -5
If they go Right:
Probability of 0.5 of getting a payoff of -5
Probability of 0.4 of getting a payoff of 3
Probability of 0.1 of getting a payoff of 15
Should they choose Left or Right (MEU)?
25 / 92
Example
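Working, reconstructed from the definitions above:

E(Left) = 0.3 \times 10 + 0.2 \times 1 + 0.5 \times (-5) = 0.7
E(Right) = 0.5 \times (-5) + 0.4 \times 3 + 0.1 \times 15 = -2.5 + 1.2 + 1.5 = 0.2

So under MEU Pacman should choose Left.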
26 / 92
Sequential decision problems
These approaches give us a battery of techniques to apply to
individual decisions by agents.
However, they aren't really sufficient.
Agents aren't usually in the business of taking single decisions.
Life is a series of decisions.
The best overall result is not necessarily obtained by a greedy approach to a series of decisions.
The current best option isn't necessarily the best thing in the long run.
27 / 92
Sequential decision problems
Otherwise I'd only ever eat cherry pie.
(pillsbury.com)
(Damn fine pie.)
28 / 92
An example
The agent has to pick a sequence of actions.
A(s) = {Up, Down, Left, Right}
for all states s.
29 / 92
An example
The world is fully observable.
End states have values +1 or -1.
30 / 92
An example
If the world were deterministic, the choice of actions would be
easy here.
Up, Up, Right, Right, Right
But actions are stochastic.
31 / 92
An example
80% of the time the agent moves as intended.
20% of the time the agent moves perpendicular to the
intended direction. Half the time to the left, half the time to the
right.
The agent doesn't move if it hits a wall.
32 / 92
An example
So Up, Up, Right, Right, Right succeeds with probability:
0.8^5 = 0.32768
33 / 92
An example
There is also a small chance of reaching the goal by accidentally going around the obstacle the other way, with probability 0.1^4 \times 0.8 = 0.00008, for a total of about 0.32776.
34 / 92
An example
We can write a transition model to describe these actions.
Since the actions are stochastic, the model looks like:
P(s' | s, a)
where a is the action that takes the agent from s to s'.
Transitions are assumed to be (first order) Markovian.
They only depend on the current and next states.
So, we could write a large set of probability tables that would
describe all the possible actions executed in all the possible
states.
This would completely specify the actions.
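As a sketch, one slice of such a table could be written as a Python dictionary (the format is my own illustration; the numbers follow the 80/10/10 slip model described above):

    # P[(s, a)] maps each possible next state s' to P(s' | s, a).
    # State (1, 1), action "Up": 0.8 up, 0.1 slip left (wall, so stay), 0.1 slip right.
    P = {
        ((1, 1), "Up"): {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1},
        # ... one entry per (state, action) pair specifies the whole model.
    }

    assert abs(sum(P[((1, 1), "Up")].values()) - 1.0) < 1e-9  # probabilities sum to 1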
35 / 92
An example
The full description of the problem also has to include the
utility function.
This is defined over sequences of states, "runs" in the terminology of the first lecture.
We will assume that in each state s the agent receives a
reward R ps q.
This may be positive or negative.
36 / 92
An example
The reward for non-terminal states is -0.04.
We will assume that the utility of a run is the sum of the utilities of states, so the -0.04 is an incentive to take fewer steps to get to the terminal state.
(You can also think of it as the cost of an action).
37 / 92
How do we tackle this?
(Pendleton Ward/Cartoon Network)
38 / 92
Markov decision process
The overall problem the agent faces here is a Markov decision
process (MDP)
Mathematically we have:
a set of states s ∈ S with an initial state s_0;
a set of actions A(s) in each state;
a transition model P(s' | s, a); and
a reward function R(s).
Captures any fully observable non-deterministic environment
with a Markovian transition model and additive rewards.
Leslie Pack Kaelbling
39 / 92
Markov decision process
What does a solution to an MDP look like?
40 / 92
Markov decision process
A solution is a policy, which we write as π.
This is a choice of action for every state.
That way if we get off track, we still know what to do.
In any state s, π(s) identifies what action to take.
41 / 92
Markov decision process
Naturally we'd prefer not just any policy but the optimum policy.
But how to find it?
Need to compare policies by the reward they generate.
Since actions are stochastic, policies won't give the same reward every time.
So compare the expected value.
The optimum policy π* is the policy with the highest expected value.
At every stage the agent should do π*(s).
42 / 92
Markov decision process
(40 Acres and a Mule Filmworks/Universal Pictures)
π*(s) is the right thing.
43 / 92
An example
(a) Optimal policy for the original problem.
(b) Optimal policies for different values of R(s).
44 / 92
An example
R(s) < -1.6284: life is so painful that the agent heads straight for the nearest exit, even if it is a bad state.
-0.4278 < R(s) < -0.0850: life is unpleasant, so the agent heads for the +1 state and is prepared to risk falling into the -1 state.
-0.0221 < R(s) < 0: life isn't so bad, and the optimal policy doesn't take any risks.
R(s) > 0: the agent doesn't want to leave.
45 / 92
How utilities are calculated
So far we have assumed that utilities are summed along a
run.
Not the only way.
In general we need to compute U_r([s_0, s_1, ..., s_n]).
Can consider finite and infinite horizons.
Is it game over at some point?
Turns out that infinite horizons are mostly easier to deal with.
That is what we will use.
46 / 92
How utilities are calculated
Also have to consider whether utilities are stationary or
non-stationary.
Does the same state always have the same value?
Normally, if we prefer one state to another (passing the AI module to failing it), then when we have the exam, today or next week, is irrelevant.
So we can reasonably assume utilities are stationary.
47 / 92
But are they?
Not clear that utilities are always stationary.
In truth, I don't always most want to eat cherry pie.
Despite this, we will assume that utilities are stationary.
48 / 92
How utilities are calculated
With stationary utilities, there are two ways to establish U_r([s_0, s_1, ..., s_n]) from R(s).
Additive rewards:

U_r([s_0, s_1, \ldots, s_n]) = R(s_0) + R(s_1) + \ldots + R(s_n)

as above.
Discounted rewards:

U_r([s_0, s_1, \ldots, s_n]) = R(s_0) + \gamma R(s_1) + \ldots + \gamma^n R(s_n)

where the discount factor γ is a number between 0 and 1.
The discount factor γ models the preference of the agent for current over future rewards.
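For instance, with an illustrative discount factor of γ = 0.5 (not a value from the slides) and a run of three states each with reward 1:

U_r([s_0, s_1, s_2]) = 1 + 0.5 \times 1 + 0.5^2 \times 1 = 1.75

so rewards received later count for less.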
49 / 92
How utilities are calculated
There is an issue with infinite sequences with additive,
undiscounted rewards.
What will the utility of a policy be?
50 / 92
How utilities are calculated
There is an issue with infinite sequences with additive,
undiscounted rewards.
What will the utility of a policy be?
+∞ or -∞.
This is problematic if we want to compare policies.
51 / 92
How utilities are calculated
Some solutions are:
Proper policies
Average reward
Discounted rewards
As follows . . .
52 / 92
How utilities are calculated
Proper policies always end up in a terminal state eventually.
Thus they have a finite expected utility.
53 / 92
How utilities are calculated
We can compute the average reward per time step.
Even for an infinite policy this will (usually) be finite.
54 / 92
How utilities are calculated
With discounted rewards the utility of an infinite sequence is
finite:
U_r([s_0, s_1, \ldots]) = \sum_{t=0}^{\infty} \gamma^t R(s_t) \le \sum_{t=0}^{\infty} \gamma^t R_{max} = \frac{R_{max}}{1 - \gamma}

where 0 ≤ γ < 1 and rewards are bounded by R_max.
55 / 92
Optimal policies
With discounted rewards we compare policies by computing
their expected values.
The expected utility of executing π starting in s is given by:

U^\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t R(S_t) \right]

where S_t is the state the agent gets to at time t.
St is a random variable and we compute the probability of all
its values by looking at all the runs which end up there after t
steps.
56 / 92
Optimal policies
The optimal policy is then:

\pi^* = \arg\max_{\pi} U^\pi(s)

It turns out that this is independent of the state the agent starts in.
57 / 92
Optimal policies
Here we have the values of states if the agent executes an
optimal policy
U(s)
58 / 92
Optimal policies
Here we have the values of states if the agent executes an
optimal policy
U(s)
What should the agent do if it is in (3, 1)?
59 / 92
Example
Wrong!
60 / 92
Example
The answer is Left.
The best action is the one that maximises expected utility.
(You have to calculate the expected utility of all the actions to
see why Left is the best choice.)
61 / 92
Optimal policies
If we have these values, the agent has a simple decision
process
It just picks the action a that maximises the expected utility of
the next state:
\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
Only have to consider the next step.
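A one-step-lookahead sketch of this rule (illustrative; it assumes the dictionary-style transition model used earlier and a utility table U):

    def best_action(s, actions, P, U):
        # Greedy choice: the action maximising expected utility of the next state.
        return max(
            actions(s),
            key=lambda a: sum(p * U[s2] for s2, p in P[(s, a)].items()),
        )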
The big question is how to compute U(s).
62 / 92
Optimal policies
Note that this is specific to the value of the reward R(s) for non-terminal states; different rewards will give different values and policies.
63 / 92
Bellman equation
How do we find the best policy (for a given set of rewards)?
Turns out that there is a neat way to do this, by first computing
the utility of each state.
We compute this using the Bellman equation:

U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')

γ is a discount factor.
64 / 92
Not this Bellman
"Just the place for a Snark!" the Bellman cried,
As he landed his crew with care;
Supporting each man on the top of the tide
By a finger entwined in his hair.
"Just the place for a Snark! I have said it twice:
That alone should encourage the crew.
Just the place for a Snark! I have said it thrice:
What I tell you three times is true."
Lewis Carroll
(Mervyn Peake's illustrations to The Hunting of the Snark).
65 / 92
Bellman equation
Apply:
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
and we get:
66 / 92
Bellman equation
U(1,1) = -0.04 + \gamma \max [\, 0.8\,U(1,2) + 0.1\,U(2,1) + 0.1\,U(1,1),   (Up)
                               0.9\,U(1,1) + 0.1\,U(1,2),                   (Left)
                               0.9\,U(1,1) + 0.1\,U(2,1),                   (Down)
                               0.8\,U(2,1) + 0.1\,U(1,2) + 0.1\,U(1,1) \,]  (Right)
67 / 92
Value iteration
In an MDP with n states, we will have n Bellman equations.
(Pendleton Ward/Cartoon Network)
Hard to solve these simultaneously because of the
max operation
Makes them non-linear
68 / 92
Value iteration
Luckily an iterative approach works.
Start with arbitrary values for states and apply the Bellman
update:
U_{i+1}(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')
simultaneously to all the states.
Continue until the values of states do not change.
After an infinite number of applications, the values are
guaranteed to converge on the optimal values.
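A compact value-iteration sketch along these lines (my own illustration, not the lecture's code; it assumes the dictionary-style transition model sketched earlier and stops once the largest change falls below a threshold rather than after literally infinite updates):

    def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
        # U starts arbitrary (here: all zeros) and is improved by Bellman updates.
        U = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            new_U = {}
            for s in states:
                # Best expected utility of the next state over available actions;
                # terminal states are assumed to have no actions (default 0).
                best = max(
                    (sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in actions(s)),
                    default=0.0,
                )
                new_U[s] = R(s) + gamma * best
                delta = max(delta, abs(new_U[s] - U[s]))
            U = new_U
            if delta < eps:
                return U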
69 / 92
Value iteration
How the values of states change as updates occur.
70 / 92
Value iteration
U(4,3) is pinned to 1.
U(3,3) quickly settles to a value close to 1.
U(1,1) becomes negative, and then grows as positive utility from the goal feeds back to it.
71 / 92
Rewards
The example so far has a negative reward R(s) for each state.
Encouragement for an agent not to stick around.
Can also think of R(s) as being the cost of moving to the next state (where we obtain the utility):

R(s) = -c(s, a)

where a is the action used.
Bellman becomes:

U_{i+1}(s) = \max_{a \in A(s)} \left( \gamma \sum_{s'} P(s' \mid s, a)\, U_i(s') - c(s, a) \right)
Note that the action can be dependent on the state.
72 / 92
Policy iteration
Rather than compute optimal utility values, policy iteration
looks through the space of possible policies.
Starting from some initial policy π_0 we do:
Policy evaluation
Given a policy π_i, calculate U_i(s).
Policy improvement
Given U_i(s), compute π_{i+1}.
We will look at each of these steps in turn.
But not in order.
73 / 92
Policy improvement
Easy
Calculate a new policy π_{i+1} by applying:

\pi_{i+1}(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')
For each state we do a one-step lookahead.
A simple decision.
74 / 92
Policy evaluation
How do we calculate the utility of each state given the policy π_i?
Turns out not to be so hard.
Given a policy, the choice of action in a given state is fixed
(that is what a policy tells us) so:
U_i(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U_i(s')
Again there are lots of simultaneous equations, but now they
are linear (no max) and so standard linear algebra solutions
will work.
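A sketch of that exact evaluation step using numpy to solve the linear system (illustrative; the state indexing and dictionary-style model are assumptions, not the lecture's code):

    import numpy as np

    def evaluate_policy(states, pi, P, R, gamma=0.9):
        # Solve U = R + gamma * P_pi U, i.e. (I - gamma * P_pi) U = R, exactly.
        idx = {s: i for i, s in enumerate(states)}
        n = len(states)
        P_pi = np.zeros((n, n))
        r = np.zeros(n)
        for s in states:
            r[idx[s]] = R(s)
            # Row of the transition matrix under the fixed action pi[s];
            # a terminal state can map to an action with no successors.
            for s2, p in P[(s, pi[s])].items():
                P_pi[idx[s], idx[s2]] = p
        U = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
        return {s: U[idx[s]] for s in states}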
75 / 92
Policy iteration
Put these together to get:
Starting from some initial policy π_0 we do:
1 Policy evaluation
Compute:

U_i(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U_i(s')

for every state.
2 Policy improvement
Calculate a new policy π_{i+1} by applying:

\pi_{i+1}(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

for every state s.
Until convergence.
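Putting the two steps into code, a policy-iteration sketch (illustrative; it assumes every state has at least one action, and takes any policy evaluator with the signature of the exact solver sketched above):

    def policy_iteration(states, actions, P, R, evaluate, gamma=0.9):
        # 'evaluate' is a policy evaluator, e.g. the exact linear solver above.
        pi = {s: next(iter(actions(s))) for s in states}  # arbitrary initial policy
        while True:
            U = evaluate(states, pi, P, R, gamma)          # 1: policy evaluation
            changed = False
            for s in states:                               # 2: policy improvement
                best = max(
                    actions(s),
                    key=lambda a: sum(p * U[s2] for s2, p in P[(s, a)].items()),
                )
                if best != pi[s]:
                    pi[s], changed = best, True
            if not changed:
                return pi, U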
76 / 92
Policy iteration
The iteration will terminate when there is no improvement in
utility from one iteration to the next.
At this point the utility U_i is a fixed point of the Bellman update and so π_i must be optimal.
77 / 92
Policy evaluation
There is a problem with the policy evaluation stage of the
policy iteration approach.
If we have n states, we have n linear equations with n
unknowns in the evaluation stage.
Solution in O(n^3).
For large n, can be a problem.
So, an approximate solution.
78 / 92
Approximate you say?
(Pendleton Ward/Cartoon Network)
79 / 92
Approximate policy evaluation
Run a simplified value iteration.
Policy is fixed, so we know what action to do in each state.
Repeat:
U_{i+1}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U_i(s')
a fixed number of times.
80 / 92
Modified policy iteration
Starting from some initial policy π_0 we do:
1 Approximate policy evaluation
Repeat

U_{i+1}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U_i(s')

a fixed number of times.
2 Policy improvement

\pi_{i+1}(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')

for every state s.
Until convergence.
Often more efficient than policy iteration or
value iteration.
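The only change from the exact version is the evaluation step; a k-sweep approximate evaluator in the same illustrative style as the earlier sketches:

    def approx_evaluate(states, pi, P, R, gamma=0.9, k=10):
        # k sweeps of the fixed-policy Bellman update instead of an exact solve.
        U = {s: 0.0 for s in states}
        for _ in range(k):
            U = {
                s: R(s) + gamma * sum(p * U[s2] for s2, p in P[(s, pi[s])].items())
                for s in states
            }
        return U

Passing approx_evaluate in place of the exact evaluator in the policy-iteration sketch above gives modified policy iteration.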
81 / 92
Solving MDPs
Have covered three methods for solving MDPs
Value iteration
(Exact)
Policy iteration
(Exact)
Modified policy iteration
(Approximate)
Which to use is somewhat problem specific.
82 / 92
Bellman redux
The Bellman equation(s)/update are widely used.
D. Romer, "It's Fourth Down and What Does the Bellman Equation Say? A Dynamic Programming Analysis of Football Strategy", NBER Working Paper No. 9024, June 2002.
83 / 92
Bellman redux
This paper uses play-by-play accounts of virtually all
regular season National Football League games for
1998-2000 to analyze teams choices on fourth
down between trying for a first down and kicking.
Dynamic programming is used to estimate the
values of possessing the ball at different points on
the field. These estimates are combined with data
on the results of kicks and conventional plays to
estimate the average payoffs to kicking and going for
it under different circumstances. Examination of
teams' actual decisions shows systematic,
overwhelmingly statistically significant, and
quantitatively large departures from the decisions
the dynamic-programming analysis implies are
preferable.
84 / 92
Limitations of MDPs?
(Pendleton Ward/Cartoon Network)
85 / 92
Partially observable MDPs
MDPs made the assumption that the environment was fully
observable.
Agent always knows what state it is in.
The optimal policy only depends on the current state.
Not the case in the real world.
We only have a belief about the current state.
POMDPs extend the model to deal with partial observability.
86 / 92
Partially observable MDPs
Basic addition to the MDP model is the sensor model:
P(e | s)
the probability of perceiving e in state s.
As a result of noise in the sensor model, the agent only has a
belief about which state it is in.
Probability distribution over the possible states.
The world is a POMDP
87 / 92
Partially observable MDPs
P(S): P(s_{1,1}) = 0.05, P(s_{1,2}) = 0.01, ...
88 / 92
Partially observable MDPs
The agent can compute its current belief as the conditional
probability distribution over the states given the sequence of
actions and percepts so far.
89 / 92
Partially observable MDPs
The agent can compute its current belief as the conditional
probability distribution over the states given the sequence of
actions and percepts so far.
We will come across this task again in Lecture 7
Filtering
Computing the state that matches best with a stream of
evidence.
90 / 92
Partially observable MDPs
If b(s) was the distribution before an action and an observation, then afterwards the distribution is:

b'(s') = \alpha\, P(e \mid s') \sum_{s} P(s' \mid s, a)\, b(s)

where α is a normalising constant.
Everything in a POMDP hinges on the belief state b.
Including the optimal action.
Indeed, the optimal policy is a mapping π*(b) from beliefs to actions.
"If you think you are next to the wall, turn left."
The agent executes the optimal action given its beliefs,
receives a percept e and then recomputes the belief
state.
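A sketch of this belief update (illustrative; it assumes a sensor model in the form O[s][e] = P(e | s) and the dictionary-style transition model used earlier):

    def update_belief(b, a, e, states, P, O):
        # Predict forward through the transition model, weight by the sensor
        # model P(e | s'), then renormalise so the new belief sums to 1.
        unnormalised = {}
        for s2 in states:
            predicted = sum(P[(s, a)].get(s2, 0.0) * b[s] for s in states)
            unnormalised[s2] = O[s2][e] * predicted
        total = sum(unnormalised.values())
        return {s2: v / total for s2, v in unnormalised.items()}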
91 / 92
Partially observable MDPs
The big issue in solving POMDPs is that beliefs are
continuous.
When we solved MDPs, we could search through the set of
possible actions in each state to find the best.
To solve a POMDP, we need to look through the possible
actions for each belief state.
But belief is continuous, so there are infinitely many belief states.
Exact solutions to POMDPs are intractable for even small
problems (like the example we have been using).
Need (once again) to use approximate techniques.
92 / 92
Mathematical!
(Pendleton Ward/Cartoon Network)
93 / 92
Summary
Today we looked at practical decision making for agents.
Practical in the sense that agents will need this kind of
decision making to do the things they need to do.
This built on the last lecture on probability, and extended that
with expected values.
We looked in detail at solution techniques that work in fully observable worlds:
MDPs
We also briefly mentioned the difficulties of extending this
work to partially observable worlds.
94 / 92