
Lecture 4: Sequential decision making

Simon Parsons

Department of Informatics
King's College London

Version 1

1 / 92
Today

Introduction
Probabilistic Reasoning I
Probabilistic Reasoning II
Sequential Decision Making
Argumentation I
Argumentation II
Temporal Probabilistic Reasoning
Game Theory
(A peek at) Machine Learning
AI & Ethics

2 / 92
What to do?

(40 Acres and a Mule Filmworks/Universal Pictures)

3 / 92
What to do?

(mystorybook.com/books/42485)

4 / 92
What to do?

(Sebastian Thrun & Chris Urmson/Google)

5 / 92
Sequential decision making?

(mystorybook.com/books/42485)
One decision leads to another.
Each decision depends on the ones before, and affects
the ones after.

6 / 92
How to decide what to do

Start simple.
Single decision.
Consider being offered a bet in which you pay 2 if an odd
number is rolled on a die, and win 3 if an even number
appears.
Is this a good bet?

7 / 92
How to decide what to do

Consider being offered a bet in which you pay 2 if an odd number is rolled on a die, and win 3 if an even number appears.
Is this a good bet?
To analyse this, we need the expected value of the bet.

8 / 92
How to decide what to do

We do this in terms of a random variable, which we will call X.
X can take two values:
+3 if the die rolls even
−2 if the die rolls odd
And we can also calculate the probability of these two values:
Pr(X = 3) = 0.5
Pr(X = −2) = 0.5

9 / 92
How to decide what to do

The expected value is then the weighted sum of the values, where the weights are the probabilities.
Formally the expected value of X is defined by:

E(X) = Σ_k k · Pr(X = k)

where the summation is over all values of k for which Pr(X = k) ≠ 0.
Here the expected value is:

E(X) = 0.5 × 3 + 0.5 × (−2)

Thus the expected value of X is 0.5, and we take this to be the value of the bet.
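As a minimal sketch (not part of the original slides), the same calculation in Python, using the payoffs and probabilities of the die bet above:

```python
# Expected value of the die bet: win 3 on an even roll, lose 2 on an odd roll,
# each with probability 0.5.

def expected_value(outcomes):
    """outcomes is a list of (probability, value) pairs."""
    return sum(p * v for p, v in outcomes)

die_bet = [(0.5, 3), (0.5, -2)]
print(expected_value(die_bet))  # 0.5, matching the slide
```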

10 / 92
How to decide what to do

Do you take the bet?


Compare that 0.5 with not taking the bet.
Not taking the bet has (expected) value 0

11 / 92
How to decide what to do

0.5 is not the value you will get.


You can think of it as the long run average if you were offered
the bet many times.

12 / 92
How to decide what to do

(fivethirtyeight.com)
13 / 92
How to decide what to do

Another bet: you get 1 if a 2 or a 3 is rolled, 5 if a six is rolled, and pay 3 otherwise.
The expected value here is:

E(X) = 0.333 × 1 + 0.166 × 5 + 0.5 × (−3)

which is about −0.33.

14 / 92
Example

Pacman is at a T-junction
Based on their knowledge, they estimate that if they go Left:
Probability of 0.3 of getting a payoff of 10
Probability of 0.2 of getting a payoff of 1
Probability of 0.5 of getting a payoff of -5

What is the expected value of Left?

15 / 92
Example

16 / 92
How an agent might decide what to do

Consider an agent with a set of possible actions A.
Each a ∈ A has a set of possible outcomes s_a.
Which action should the agent pick?

17 / 92
How an agent might decide what to do

The action a which a rational agent should choose is that which maximises the agent's utility.
In other words the agent should pick:

a* = argmax_{a ∈ A} u(s_a)

The problem is that in any realistic situation, we don't know which s_a will result from a given a, so we don't know the utility of a given action.
Instead we have to calculate the expected utility of each action and make the choice on the basis of that.

18 / 92
How an agent might decide what to do

In other words, for each action a with a set of outcomes s_a, the agent should calculate:

E(u(a)) = Σ_{s′ ∈ s_a} u(s′) · Pr(s_a = s′)

and pick the best.

[Figure: actions a1 and a2, each leading to several possible outcome states s1, ..., s6]

19 / 92
How an agent might decide what to do

That is, it picks the action that has the greatest expected utility.
The right thing to do.

(40 Acres and a Mule Filmworks/Universal Pictures)


Here rational means rational in the sense of maximising
expected utility.

20 / 92
Non-deterministic

Note that we are dealing with non-deterministic actions here.

[Figure: actions a1 and a2, each leading to several possible outcome states s1, ..., s6]
A given action has several possible outcomes.
We don't know, in advance, which one will happen.

21 / 92
Non-deterministic

(fivethirtyeight.com)
A lot like life.

22 / 92
Other notions of rational

There are other criteria for decision-making than maximising expected utility.
One approach is to look at the option which has the least-bad worst outcome.
This maximin criterion can be formalised in the same framework as MEU, making the rational (in this sense) action:

a* = argmax_{a ∈ A} { min_{s′ ∈ s_a} u(s′) }

Its effect is to ignore the probability of outcomes and concentrate on optimising the worst-case outcome.

23 / 92
Other notions of rational

The opposite attitude, that of the optimistic risk-seeker, is captured by the maximax criterion:

a* = argmax_{a ∈ A} { max_{s′ ∈ s_a} u(s′) }

This will ignore possible bad outcomes and just focus on the best outcome of each action.
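A sketch of the two criteria in Python; the actions and outcome utilities below are made up for illustration (they are not from the slides), and both criteria deliberately ignore the outcome probabilities:

```python
# Maximin picks the action with the least-bad worst outcome;
# maximax picks the action with the best best outcome.
# Each hypothetical action maps to the utilities u(s') of its possible outcomes.

outcomes = {
    "a1": [10, 1, -5],
    "a2": [3, 2, 2],
}

maximin_action = max(outcomes, key=lambda a: min(outcomes[a]))  # best worst case
maximax_action = max(outcomes, key=lambda a: max(outcomes[a]))  # best best case

print(maximin_action)  # a2 (worst case 2 beats worst case -5)
print(maximax_action)  # a1 (best case 10 beats best case 3)
```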

24 / 92
Example

Pacman is at a T-junction
Based on their knowledge, they estimate that if they go Left:
Probability of 0.3 of getting a payoff of 10
Probability of 0.2 of getting a payoff of 1
Probability of 0.5 of getting a payoff of -5

If they go Right:
Probability of 0.5 of getting a payoff of -5
Probability of 0.4 of getting a payoff of 3
Probability of 0.1 of getting a payoff of 15

Should they choose Left or Right (MEU)?
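A sketch of the MEU comparison for this example; the probabilities and payoffs are taken from the slide:

```python
# Expected utility of Left and Right for the Pacman example above.
# Each action is a list of (probability, payoff) pairs.

actions = {
    "Left":  [(0.3, 10), (0.2, 1), (0.5, -5)],
    "Right": [(0.5, -5), (0.4, 3), (0.1, 15)],
}

def expected_utility(outcomes):
    return sum(p * v for p, v in outcomes)

for name, outcomes in actions.items():
    print(name, expected_utility(outcomes))   # Left: 0.7, Right: 0.2

best = max(actions, key=lambda a: expected_utility(actions[a]))
print("MEU choice:", best)                    # Left
```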

25 / 92
Example

26 / 92
Sequential decision problems

These approaches give us a battery of techniques to apply to individual decisions by agents.
However, they aren't really sufficient.
Agents aren't usually in the business of taking single decisions.
Life is a series of decisions.
The best overall result is not necessarily obtained by a greedy approach to a series of decisions.
The current best option isn't necessarily the best thing in the long run.

27 / 92
Sequential decision problems

Otherwise I'd only ever eat cherry pie

(pillsbury.com)
(Damn fine pie.)

28 / 92
An example

The agent has to pick a sequence of actions.

A(s) = {Up, Down, Left, Right}

for all states s.

29 / 92
An example

The world is fully observable.
End states have values +1 or −1.

30 / 92
An example

If the world were deterministic, the choice of actions would be easy here:
Up, Up, Right, Right, Right
But actions are stochastic.

31 / 92
An example

80% of the time the agent moves as intended.


20% of the time the agent moves perpendicular to the
intended direction. Half the time to the left, half the time to the
right.
The agent doesn't move if it hits a wall.

32 / 92
An example

So Up, Up, Right, Right, Right succeeds with probability:

0.8⁵ = 0.32768

33 / 92
An example

Also a small chance of going around the obstacle the other way.

34 / 92
An example

We can write a transition model to describe these actions.
Since the actions are stochastic, the model looks like:

P(s′ | s, a)

where a is the action that takes the agent from s to s′.
Transitions are assumed to be (first-order) Markovian.
They only depend on the current and next states.
So, we could write a large set of probability tables that would
describe all the possible actions executed in all the possible
states.
This would completely specify the actions.

35 / 92
An example

The full description of the problem also has to include the utility function.
This is defined over sequences of states, called runs in the terminology of the first lecture.
We will assume that in each state s the agent receives a reward R(s).
This may be positive or negative.

36 / 92
An example

The reward for non-terminal states is −0.04.
We will assume that the utility of a run is the sum of the utilities of its states, so the −0.04 is an incentive to take fewer steps to get to the terminal state.
(You can also think of it as the cost of an action.)

37 / 92
How do we tackle this?

(Pendleton Ward/Cartoon Network)

38 / 92
Markov decision process

The overall problem the agent faces here is a Markov decision process (MDP).
Mathematically we have:
a set of states s ∈ S with an initial state s0;
a set of actions A(s) in each state;
a transition model P(s′ | s, a); and
a reward function R(s).
Captures any fully observable non-deterministic environment with a Markovian transition model and additive rewards.

Leslie Pack Kaelbling
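As a sketch of how these four components might be bundled together in code (the field names and the NamedTuple representation are illustrative, not from the lecture):

```python
# A minimal sketch of an MDP as a data structure; field names are illustrative.
from typing import Callable, Dict, List, NamedTuple, Tuple

State = Tuple[int, int]   # e.g. grid coordinates
Action = str

class MDP(NamedTuple):
    states: List[State]                                        # s in S
    actions: Callable[[State], List[Action]]                   # A(s), empty for terminals
    transition: Callable[[State, Action], Dict[State, float]]  # P(s' | s, a)
    reward: Callable[[State], float]                           # R(s)
    gamma: float = 1.0   # discount factor (introduced later in the lecture)
```

The value iteration and policy iteration sketches later in these notes assume this kind of interface.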

39 / 92
Markov decision process

What does a solution to an MDP look like?

40 / 92
Markov decision process

A solution is a policy, which we write as π.
This is a choice of action for every state.
That way if we get off track, we still know what to do.
In any state s, π(s) identifies what action to take.

41 / 92
Markov decision process

Naturally we'd prefer not just any policy but the optimum policy π*.
But how to find it?
Need to compare policies by the reward they generate.
Since actions are stochastic, policies won't give the same reward every time.
So compare the expected value.
The optimum policy is the policy with the highest expected value.
At every stage the agent should do π*(s).

42 / 92
Markov decision process

(40 Acres and a Mule Filmworks/Universal Pictures)


π*(s) is the right thing.

43 / 92
An example

(a) Optimal policy for the original problem.
(b) Optimal policies for different values of R(s).

44 / 92
An example

R(s) < −1.6284: life is so painful that the agent heads for the exit, even if it is a bad state.
−0.4278 < R(s) < −0.0850: life is unpleasant, so the agent heads for the +1 state and is prepared to risk falling into the −1 state.
−0.0221 < R(s) < 0: life isn't so bad, and the optimal policy doesn't take any risks.
R(s) > 0: the agent doesn't want to leave.

45 / 92
How utilities are calculated

So far we have assumed that utilities are summed along a run.
Not the only way.
In general we need to compute U_r([s0, s1, . . . , sn]).
Can consider finite and infinite horizons.
Is it game over at some point?
Turns out that infinite horizons are mostly easier to deal with.
That is what we will use.

46 / 92
How utilities are calculated

Also have to consider whether utilities are stationary or non-stationary.
Does the same state always have the same value?
Normally, if we prefer one state to another
(passing the AI module to failing it)
then when we have the exam, today or next week, is irrelevant.
So we can reasonably assume utilities are stationary.

47 / 92
But are they?

Not clear that utilities are always stationary.

In truth, I don't always most want to eat cherry pie.


Despite this, we will assume that utilities are stationary.

48 / 92
How utilities are calculated

With stationary utilities, there are two ways to establish U_r([s0, s1, . . . , sn]) from R(s).
Additive rewards:

U_r([s0, s1, . . . , sn]) = R(s0) + R(s1) + . . . + R(sn)

as above.
Discounted rewards:

U_r([s0, s1, . . . , sn]) = R(s0) + γR(s1) + . . . + γⁿR(sn)

where the discount factor γ is a number between 0 and 1.
The discount factor models the preference of the agent for current over future rewards.

49 / 92
How utilities are calculated

There is an issue with infinite sequences with additive, undiscounted rewards.
What will the utility of a policy be?

50 / 92
How utilities are calculated

There is an issue with infinite sequences with additive, undiscounted rewards.
What will the utility of a policy be?
+∞ or −∞.
This is problematic if we want to compare policies.

51 / 92
How utilities are calculated

Some solutions are:


Proper policies
Average reward
Discounted rewards
As follows . . .

52 / 92
How utilities are calculated

Proper policies always end up in a terminal state eventually.


Thus they have a finite expected utility.

53 / 92
How utilities are calculated

We can compute the average reward per time step.


Even for an infinite policy this will (usually) be finite.

54 / 92
How utilities are calculated

With discounted rewards the utility of an infinite sequence is finite:

U_r([s0, s1, . . .]) = Σ_{t=0}^{∞} γᵗ R(sₜ)
                     ≤ Σ_{t=0}^{∞} γᵗ R_max
                     = R_max / (1 − γ)

where 0 ≤ γ < 1 and rewards are bounded by R_max.

55 / 92
Optimal policies

With discounted rewards we compare policies by computing their expected values.
The expected utility of executing π starting in s is given by:

U^π(s) = E[ Σ_{t=0}^{∞} γᵗ R(Sₜ) ]

where Sₜ is the state the agent gets to at time t.
Sₜ is a random variable and we compute the probability of all its values by looking at all the runs which end up there after t steps.

56 / 92
Optimal policies

The optimal policy is then:

π* = argmax_π U^π(s)

It turns out that this is independent of the state the agent starts in.

57 / 92
Optimal policies

Here we have the values U(s) of the states if the agent executes an optimal policy.

58 / 92
Optimal policies

Here we have the values U(s) of the states if the agent executes an optimal policy.

What should the agent do if it is in (3, 1)?

59 / 92
Example

Wrong!

60 / 92
Example

The answer is Left.
The best action is the one that maximises expected utility.
(You have to calculate the expected utility of all the actions to see why Left is the best choice.)

61 / 92
Optimal policies

If we have these values, the agent has a simple decision process.
It just picks the action a that maximises the expected utility of the next state:

π*(s) = argmax_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) U(s′)

Only have to consider the next step.
The big question is how to compute U(s).
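A sketch of this one-step lookahead, assuming utilities U are held in a dictionary and a transition function like the one sketched earlier is available:

```python
# One-step lookahead: pick the action that maximises the expected utility of
# the next state, given utilities U and transition(s, a) -> {s': P(s' | s, a)}.

def greedy_action(s, U, actions, transition):
    def expected_utility(a):
        return sum(p * U[s2] for s2, p in transition(s, a).items())
    return max(actions(s), key=expected_utility)
```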

62 / 92
Optimal policies

Note that this is specific to the value of the reward R(s) for non-terminal states; different rewards will give different values and policies.

63 / 92
Bellman equation

How do we find the best policy (for a given set of rewards)?
Turns out that there is a neat way to do this, by first computing the utility of each state.
We compute this using the Bellman equation:

U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) U(s′)

γ is a discount factor.

64 / 92
Not this Bellman

"Just the place for a Snark!" the Bellman cried,
As he landed his crew with care;
Supporting each man on the top of the tide
By a finger entwined in his hair.

"Just the place for a Snark! I have said it twice:
That alone should encourage the crew.
Just the place for a Snark! I have said it thrice:
What I tell you three times is true."

Lewis Carroll

(Mervyn Peake's illustrations to The Hunting of the Snark).

65 / 92
Bellman equation

Apply:

U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) U(s′)

and we get:

66 / 92
Bellman equation

U(1,1) = −0.04 + γ max[ 0.8U(1,2) + 0.1U(2,1) + 0.1U(1,1),   (Up)
                        0.9U(1,1) + 0.1U(1,2),               (Left)
                        0.9U(1,1) + 0.1U(2,1),               (Down)
                        0.8U(2,1) + 0.1U(1,2) + 0.1U(1,1) ]  (Right)

67 / 92
Value iteration

In an MDP with n states, we will have n Bellman equations.

(Pendleton Ward/Cartoon Network)


Hard to solve these simultaneously because of the
max operation
Makes them non-linear

68 / 92
Value iteration

Luckily an iterative approach works.
Start with arbitrary values for states and apply the Bellman update:

U_{i+1}(s) ← R(s) + γ max_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) U_i(s′)

simultaneously to all the states.
Continue until the values of states do not change.
After an infinite number of applications, the values are guaranteed to converge on the optimal values.
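A sketch of value iteration over an MDP with the interface assumed in the earlier sketches (states, actions, transition, reward, gamma); in practice the loop stops when the largest change falls below a small tolerance rather than iterating forever:

```python
# Sketch of value iteration. Terminal states are assumed to have an empty
# action set, so their utility is just their reward.

def value_iteration(mdp, eps=1e-6):
    U = {s: 0.0 for s in mdp.states}
    while True:
        U_next = {}
        delta = 0.0
        for s in mdp.states:
            q_values = [
                sum(p * U[s2] for s2, p in mdp.transition(s, a).items())
                for a in mdp.actions(s)
            ]
            U_next[s] = mdp.reward(s) + mdp.gamma * (max(q_values) if q_values else 0.0)
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < eps:
            return U
```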

69 / 92
Value iteration

How the values of states change as updates occur.

70 / 92
Value iteration

U(4,3) is pinned to 1.
U(3,3) quickly settles to a value close to 1.
U(1,1) becomes negative, and then grows as positive utility from the goal feeds back to it.

71 / 92
Rewards

The example so far has a negative reward R(s) for each state.
Encouragement for an agent not to stick around.
Can also think of R(s) as being the cost of moving to the next state (where we obtain the utility):

R(s) = −c(s, a)

where a is the action used.
Bellman becomes:

U_{i+1}(s) ← max_{a ∈ A(s)} [ Σ_{s′} P(s′ | s, a) U_i(s′) − c(s, a) ]

Note that the cost can depend on the action as well as the state.

72 / 92
Policy iteration

Rather than compute optimal utility values, policy iteration looks through the space of possible policies.
Starting from some initial policy π0 we do:
Policy evaluation
Given a policy πi, calculate U_i(s).
Policy improvement
Given U_i(s), compute π_{i+1}.
We will look at each of these steps in turn.
But not in order.

73 / 92
Policy improvement

Easy.
Calculate a new policy π_{i+1} by applying:

π_{i+1}(s) = argmax_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) U_i(s′)

For each state we do a one-step lookahead.
A simple decision.

74 / 92
Policy evaluation

How do we calculate the utility of each state given the policy πi?
Turns out not to be so hard.
Given a policy, the choice of action in a given state is fixed (that is what a policy tells us), so:

U_i(s) = R(s) + γ Σ_{s′} P(s′ | s, πi(s)) U_i(s′)

Again there are lots of simultaneous equations, but now they are linear (no max) and so standard linear algebra solutions will work.

75 / 92
Policy iteration

Put these together to get:
Starting from some initial policy π0 we do:
1 Policy evaluation
Compute:

U_i(s) = R(s) + γ Σ_{s′} P(s′ | s, πi(s)) U_i(s′)

for every state.
2 Policy improvement
Calculate a new policy π_{i+1} by applying:

π_{i+1}(s) = argmax_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) U_i(s′)

for every state s.
Until convergence.
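A sketch of policy iteration, with the exact evaluation step done as a linear solve using numpy; the MDP interface is the same assumption as in the earlier sketches, with empty action sets in terminal states:

```python
# Sketch of policy iteration: evaluation solves U = R + gamma * P_pi U exactly,
# improvement is a one-step lookahead over the evaluated utilities.
import numpy as np

def policy_evaluation(pi, mdp):
    states = list(mdp.states)
    index = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))                      # becomes I - gamma * P_pi
    b = np.array([mdp.reward(s) for s in states])
    for s in states:
        if not mdp.actions(s):                   # terminal: U(s) = R(s)
            continue
        for s2, p in mdp.transition(s, pi[s]).items():
            A[index[s], index[s2]] -= mdp.gamma * p
    U = np.linalg.solve(A, b)
    return {s: U[index[s]] for s in states}

def policy_iteration(mdp):
    pi = {s: (mdp.actions(s)[0] if mdp.actions(s) else None) for s in mdp.states}
    while True:
        U = policy_evaluation(pi, mdp)
        changed = False
        for s in mdp.states:
            if not mdp.actions(s):
                continue
            best = max(mdp.actions(s),
                       key=lambda a: sum(p * U[s2]
                                         for s2, p in mdp.transition(s, a).items()))
            if best != pi[s]:
                pi[s], changed = best, True
        if not changed:
            return pi, U
```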

76 / 92
Policy iteration

The iteration will terminate when there is no improvement in utility from one iteration to the next.
At this point the utility U_i is a fixed point of the Bellman update and so π_i must be optimal.

77 / 92
Policy evaluation

There is a problem with the policy evaluation stage of the policy iteration approach.
If we have n states, we have n linear equations with n unknowns in the evaluation stage.
Solution is O(n³).
For large n, this can be a problem.
So, an approximate solution.

78 / 92
Approximate you say?

(Pendleton Ward/Cartoon Network)

79 / 92
Approximate policy evaluation

Run a simplified value iteration.
The policy is fixed, so we know what action to do in each state.
Repeat:

U_{i+1}(s) ← R(s) + γ Σ_{s′} P(s′ | s, πi(s)) U_i(s′)

a fixed number of times.

80 / 92
Modified policy iteration

Starting from some initial policy π0 we do:
1 Approximate policy evaluation
Repeat:

U_{i+1}(s) ← R(s) + γ Σ_{s′} P(s′ | s, πi(s)) U_i(s′)

a fixed number of times.
2 Policy improvement

π_{i+1}(s) = argmax_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) U_i(s′)

for every state s.
Until convergence.
Often more efficient than policy iteration or value iteration.
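A sketch of the approximate evaluation step: k sweeps of the fixed-policy Bellman update, which could replace the exact linear solve in the policy iteration sketch above; the MDP interface is the same assumption as before:

```python
# Sketch of approximate policy evaluation: k sweeps of the simplified Bellman
# update for a fixed policy, instead of solving the linear system exactly.

def approximate_policy_evaluation(pi, mdp, U, k=20):
    for _ in range(k):
        U_next = {}
        for s in mdp.states:
            if not mdp.actions(s):            # terminal: U(s) = R(s)
                U_next[s] = mdp.reward(s)
                continue
            U_next[s] = mdp.reward(s) + mdp.gamma * sum(
                p * U[s2] for s2, p in mdp.transition(s, pi[s]).items())
        U = U_next
    return U
```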

81 / 92
Solving MDPs

Have covered three methods for solving MDPs:
Value iteration (exact)
Policy iteration (exact)
Modified policy iteration (approximate)
Which to use is somewhat problem-specific.

82 / 92
Bellman redux

The Bellman equation(s)/update are widely used.

D. Romer, "It's Fourth Down and What Does the Bellman Equation Say? A Dynamic Programming Analysis of Football Strategy", NBER Working Paper No. 9024, June 2002.

83 / 92
Bellman redux

This paper uses play-by-play accounts of virtually all


regular season National Football League games for
1998-2000 to analyze teams' choices on fourth
down between trying for a first down and kicking.
Dynamic programming is used to estimate the
values of possessing the ball at different points on
the field. These estimates are combined with data
on the results of kicks and conventional plays to
estimate the average payoffs to kicking and going for
it under different circumstances. Examination of
teams' actual decisions shows systematic,
overwhelmingly statistically significant, and
quantitatively large departures from the decisions
the dynamic-programming analysis implies are
preferable.

84 / 92
Limitations of MDPs?

(Pendleton Ward/Cartoon Network)

85 / 92
Partially observable MDPs

MDPs made the assumption that the environment was fully observable.
The agent always knows what state it is in.
The optimal policy only depends on the current state.
Not the case in the real world.
We only have a belief about the current state.
POMDPs extend the model to deal with partial observability.

86 / 92
Partially observable MDPs

Basic addition to the MDP model is the sensor model:

P(e | s)

the probability of perceiving e in state s.
As a result of noise in the sensor model, the agent only has a belief about which state it is in.
A probability distribution over the possible states.

The world is a POMDP

87 / 92
Partially observable MDPs

P(S): P(s_{1,1}) = 0.05, P(s_{1,2}) = 0.01, . . .

88 / 92
Partially observable MDPs

The agent can compute its current belief as the conditional probability distribution over the states given the sequence of actions and percepts so far.

89 / 92
Partially observable MDPs

The agent can compute its current belief as the conditional probability distribution over the states given the sequence of actions and percepts so far.
We will come across this task again in Lecture 7.
Filtering:
Computing the state that matches best with a stream of evidence.

90 / 92
Partially observable MDPs

If b(s) was the distribution before an action and an observation, then afterwards the distribution is:

b′(s′) = α P(e | s′) Σ_s P(s′ | s, a) b(s)

where α is a normalising constant.
Everything in a POMDP hinges on the belief state b.
Including the optimal action.
Indeed, the optimal policy is a mapping π*(b) from beliefs to actions.
"If you think you are next to the wall, turn left."
The agent executes the optimal action given its beliefs, receives a percept e and then recomputes the belief state.
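A sketch of this belief update in Python; the transition and sensor models are passed in as functions and are assumptions of the sketch rather than code from the lecture:

```python
# Sketch of the POMDP belief update b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) b(s).
# transition(s, a) returns {s': P(s' | s, a)}; sensor(e, s) returns P(e | s).

def update_belief(b, a, e, states, transition, sensor):
    b_next = {}
    for s2 in states:
        predicted = sum(transition(s, a).get(s2, 0.0) * b.get(s, 0.0) for s in states)
        b_next[s2] = sensor(e, s2) * predicted
    total = sum(b_next.values())               # alpha = 1 / total normalises
    return {s2: v / total for s2, v in b_next.items()} if total > 0 else b_next
```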

91 / 92
Partially observable MDPs

The big issue in solving POMDPs is that beliefs are continuous.
When we solved MDPs, we could search through the set of possible actions in each state to find the best.
To solve a POMDP, we need to look through the possible actions for each belief state.
But belief is continuous, so there are a lot of belief states.
Exact solutions to POMDPs are intractable for even small problems (like the example we have been using).
Need (once again) to use approximate techniques.

92 / 92
Mathematical!

(Pendleton Ward/Cartoon Network)

93 / 92
Summary

Today we looked at practical decision making for agents.
Practical in the sense that agents will need this kind of decision making to do the things they need to do.
This built on the last lecture on probability, and extended that with expected values.
We looked in detail at solution techniques that work in fully observable worlds:
MDPs
We also briefly mentioned the difficulties of extending this work to partially observable worlds.

94 / 92
