
CS 747, Autumn 2020: Week 4, Lecture 1

Shivaram Kalyanakrishnan

Department of Computer Science and Engineering


Indian Institute of Technology Bombay

Autumn 2020



Markov Decision Problems

1. Definitions
   - Markov Decision Problem
   - Policy
   - Value Function
2. MDP planning
3. Alternative formulations
4. Applications
5. Policy Evaluation


Markov Decision Problems (MDPs)

[Figure: a three-state MDP with states s1, s2, s3 and two actions (RED and BLUE); each edge is labeled with a "transition probability, reward" pair.]

An MDP M = (S, A, T, R, γ) has these elements.

S: a set of states.
Let us assume S = {s1, s2, ..., sn}, and hence |S| = n.

A: a set of actions.
Let us assume A = {a1, a2, ..., ak}, and hence |A| = k.
Here A = {RED, BLUE}.

T: a transition function.
- For s, s' ∈ S, a ∈ A: T(s, a, s') is the probability of reaching s' by starting at s and taking action a.
- Thus, T(s, a, ·) is a probability distribution over S.

R: a reward function.
- For s, s' ∈ S, a ∈ A: R(s, a, s') is the (numeric) reward for reaching s' by starting at s and taking action a.
- Assume rewards are from [−Rmax, Rmax] for some Rmax ≥ 0.

γ: a discount factor (coming up shortly).
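To make the tuple (S, A, T, R, γ) concrete, here is a minimal sketch of one way a finite MDP could be stored in code. The class and the container layout are my own choices rather than anything from the lecture, and the numerical entries below are illustrative placeholders, not the exact edge labels of the figure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    """A finite MDP M = (S, A, T, R, gamma), stored sparsely.

    trans[(s, a)] lists (s_next, probability, reward) triples, so the
    probabilities give T(s, a, .) and the rewards give R(s, a, s').
    """
    states: List[State]
    actions: List[Action]
    trans: Dict[Tuple[State, Action], List[Tuple[State, float, float]]]
    gamma: float

# An illustrative three-state, two-action MDP in the spirit of the figure;
# these numbers are placeholders, not the exact edge labels from the slide.
example = MDP(
    states=["s1", "s2", "s3"],
    actions=["RED", "BLUE"],
    trans={
        ("s1", "RED"): [("s1", 0.5, 0.0), ("s2", 0.5, -1.0)],
        ("s1", "BLUE"): [("s1", 1.0, 1.0)],
        ("s2", "RED"): [("s2", 0.25, -1.0), ("s3", 0.75, -2.0)],
        ("s2", "BLUE"): [("s1", 1.0, 2.0)],
        ("s3", "RED"): [("s3", 1.0, 1.0)],
        ("s3", "BLUE"): [("s1", 0.5, 3.0), ("s2", 0.5, 3.0)],
    },
    gamma=0.9,
)

# Sanity check: each T(s, a, .) must be a probability distribution over S.
for (s, a), rows in example.trans.items():
    assert abs(sum(p for _, p, _ in rows) - 1.0) < 1e-9
```

Storing T and R together as (next state, probability, reward) triples mirrors the way the diagram labels each edge with a probability and a reward.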


Agent-Environment Interaction

t = 0: Agent is born in some state s^0, takes action a^0.
       Environment generates and provides the agent
       next state s^1 ~ T(s^0, a^0, ·) and reward r^0 = R(s^0, a^0, s^1).

t = 1: Agent is in state s^1, takes action a^1.
       Environment generates and provides the agent
       next state s^2 ~ T(s^1, a^1, ·) and reward r^1 = R(s^1, a^1, s^2).

...

Resulting trajectory: s^0, a^0, r^0, s^1, a^1, r^1, s^2, ....
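A sketch of one environment step under these rules, reusing the hypothetical MDP container from the earlier sketch: the environment samples s^{t+1} from T(s^t, a^t, ·) and returns r^t = R(s^t, a^t, s^{t+1}).

```python
import random
from typing import Tuple

def step(mdp: MDP, s: State, a: Action) -> Tuple[State, float]:
    """Environment step: sample s^{t+1} ~ T(s^t, a^t, .) and return (s^{t+1}, r^t)."""
    rows = mdp.trans[(s, a)]
    idx = random.choices(range(len(rows)), weights=[p for _, p, _ in rows], k=1)[0]
    s_next, _, reward = rows[idx]
    return s_next, reward
```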


Describing the Agent's Behaviour

At each step t, the agent sends action a^t to the environment, and the environment returns reward r^t and next state s^{t+1}.

How does the agent pick a^t?
In principle, it can decide by looking at the preceding history
s^0, a^0, r^0, s^1, a^1, r^1, s^2, ..., s^t.

For now let us assume that a^t is picked based on s^t alone.

In other words, the agent follows a policy π : S → A.
Observe that π is Markovian, deterministic, and stationary.
We will justify this choice in due course!


Illustration: Policy

[Figure: the same three-state MDP, with the chosen action marked at each state: RED at s1, RED at s2, BLUE at s3.]

Illustrated policy π such that
π(s1) = RED; π(s2) = RED; π(s3) = BLUE.

What happens by "following" π, starting at s1? Two possible trajectories:
- s1, RED, s1, RED, s2, RED, s3, BLUE, s1, ....
- s1, RED, s2, RED, s1, RED, s1, RED, s1, ....
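Since π is deterministic, Markovian, and stationary, it can be stored as a plain map from states to actions. Continuing the hypothetical helpers from the earlier sketches (the MDP container and step), here is a sketch of the illustrated policy and of "following" it to generate trajectories like the two listed above.

```python
from typing import Dict, List, Tuple

# The illustrated policy: pi(s1) = RED, pi(s2) = RED, pi(s3) = BLUE.
pi: Dict[State, Action] = {"s1": "RED", "s2": "RED", "s3": "BLUE"}

def rollout(mdp: MDP, policy: Dict[State, Action], s0: State,
            horizon: int) -> List[Tuple[State, Action, float]]:
    """Follow `policy` from s0 for `horizon` steps, recording (s^t, a^t, r^t)."""
    s, trajectory = s0, []
    for _ in range(horizon):
        a = policy[s]
        s_next, r = step(mdp, s, a)  # step() samples s^{t+1} ~ T(s^t, a^t, .)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

# e.g. one run of rollout(example, pi, "s1", horizon=4) could visit
# s1, RED, s1, RED, s2, RED, s3, BLUE, ... as in the first trajectory above.
```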


Illustration: Policy

Let Π denote the set of all policies.

What is |Π|? k^n (each of the n states can independently be assigned any of the k actions).

Which π ∈ Π is a "good" policy?


State Values for Policy π

For s ∈ S, define
V^π(s) = E_π[r^0 + γr^1 + γ^2 r^2 + γ^3 r^3 + ... | s^0 = s],
where γ ∈ [0, 1) is a discount factor.

γ is an element of the MDP. Larger γ, farther "lookahead".

[Figure: the example three-state MDP again, now annotated with γ = 0.9.]

V^π(s) is the value of state s under policy π.

V^π : S → R is the Value Function of π. "Larger is better".
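One way to make the expectation concrete is to average discounted returns over many simulated trajectories, truncated at a finite horizon (a reasonable approximation since γ < 1 shrinks the tail geometrically). This is only an illustrative sketch built on the hypothetical rollout helper above; the exact computation of V^π appears later under policy evaluation.

```python
def mc_value_estimate(mdp: MDP, policy: Dict[State, Action], s: State,
                      episodes: int = 10_000, horizon: int = 200) -> float:
    """Monte Carlo estimate of V^pi(s): average truncated discounted return."""
    total = 0.0
    for _ in range(episodes):
        discount, ret = 1.0, 0.0
        for _, _, r in rollout(mdp, policy, s, horizon):
            ret += discount * r
            discount *= mdp.gamma
        total += ret
    return total / episodes

# e.g. mc_value_estimate(example, pi, "s1") approximates V^pi(s1).
```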




Optimal Policies

Here are the value functions of all policies in our example MDP (each policy is written as its action at s1, s2, s3, with R = RED and B = BLUE).

π      V^π(s1)   V^π(s2)   V^π(s3)
RRR      4.45      6.55     10.82
RRB     -5.61     -5.75     -4.05
RBR      2.76      4.48      9.12
RBB      2.76      4.48      3.48
BRR     10.0       9.34     13.10
BRB     10.0       7.25     10.0
BBR     10.0      11.0      14.45   ← Optimal policy
BBB     10.0      11.0      10.0

Which policy would you prefer?

Every MDP is guaranteed to have an optimal policy π* such that
∀π ∈ Π, ∀s ∈ S : V^{π*}(s) ≥ V^π(s).
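The condition "V^{π*}(s) ≥ V^π(s) for every s" is a pointwise dominance of value functions. A tiny helper, in the same hypothetical style as the earlier sketches, that checks this relation between two value functions stored as dictionaries:

```python
from typing import Dict

def dominates(V1: Dict[State, float], V2: Dict[State, float]) -> bool:
    """True iff V1(s) >= V2(s) for every state s."""
    return all(V1[s] >= V2[s] for s in V1)
```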


MDP Planning

MDP planning problem: given M = (S, A, T, R, γ), find a policy π* from the set of all policies Π such that
∀s ∈ S, ∀π ∈ Π : V^{π*}(s) ≥ V^π(s).

Every MDP is guaranteed to have a deterministic, Markovian, stationary optimal policy.

An MDP can have more than one optimal policy.

However, the value function of every optimal policy is the same, unique "optimal value function" V*.




Reward and Transition Functions

We had assumed
T : S × A × S → [0, 1], R : S × A × S → [−Rmax, Rmax].

You might encounter alternative definitions of R and T.

- Sometimes R(s, a, s') is taken as a random variable bounded in [−Rmax, Rmax].
- Sometimes there is a reward R(s, a) given on taking action a from state s, regardless of the next state s'.
- Sometimes there is a reward R(s') given on reaching next state s', regardless of the start state s and action a.
- Sometimes T and R are combined into a single function P{s', r | s, a} for s' ∈ S, r ∈ [−Rmax, Rmax].
- Some authors minimise cost rather than maximise reward.

It is relatively straightforward to handle all these variations.
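As one example of why these variations are easy to reconcile: a next-state-dependent reward R(s, a, s') can be collapsed into a state-action reward R(s, a) by averaging over next states under T, which leaves every V^π unchanged. A sketch, reusing the hypothetical MDP container from earlier:

```python
def expected_reward(mdp: MDP, s: State, a: Action) -> float:
    """R(s, a) = sum over s' of T(s, a, s') * R(s, a, s')."""
    return sum(p * r for _, p, r in mdp.trans[(s, a)])
```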


Episodic Tasks

We considered continuing tasks, in which trajectories are infinitely long.

Episodic tasks have a special sink/terminal state s⊤ from which there are no outgoing transitions or rewards.

[Figure: a small episodic MDP with states s1, s2 and a terminal state s⊤; each edge is labeled with a "transition probability, reward" pair.]

Additionally, from every non-terminal state and for every policy, there is a non-zero probability of reaching the terminal state in a finite number of steps.

Hence, trajectories or episodes almost surely terminate after a finite number of steps.


Definition of Values

We defined V^π(s) as an infinite discounted reward:
V^π(s) = E_π[r^0 + γr^1 + γ^2 r^2 + ... | s^0 = s].

There are other choices.

Total reward:
V^π(s) = E_π[r^0 + r^1 + r^2 + ... | s^0 = s].
Can only be used on episodic tasks.

Finite-horizon reward:
V^π(s) = E_π[r^0 + r^1 + r^2 + ... + r^{T−1} | s^0 = s].
A horizon T ≥ 1 is specified, rather than γ. Optimal policies for this setting need not be stationary.

Average reward (withholding some technical details):
V^π(s) = E_π[lim_{m→∞} (r^0 + r^1 + ... + r^{m−1}) / m | s^0 = s].
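To see how the four criteria differ on a single sampled reward sequence (the definitions above then take expectations over such sequences), here is a small illustrative sketch; truncating the infinite sum and the limit at the sequence length m is my own approximation, not part of the definitions.

```python
from typing import Dict, List

def returns(rewards: List[float], gamma: float, T: int) -> Dict[str, float]:
    """The four criteria, computed for one sampled reward sequence r^0..r^{m-1}.

    The discounted and average entries truncate the infinite sum / limit at m,
    which only approximates the definitions above.
    """
    m = len(rewards)
    return {
        "discounted (truncated at m)": sum(gamma ** t * r for t, r in enumerate(rewards)),
        "total": sum(rewards),
        "finite horizon (first T rewards)": sum(rewards[:T]),
        "average (truncated at m)": sum(rewards) / m,
    }
```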




Controlling a Helicopter (Ng et al., 2003)
Episodic or continuing task? What are S, A, T, R, γ?
[Image: a police helicopter. Source: https://www.publicdomainpictures.net/pictures/20000/velka/police-helicopter-8712919948643Mk.jpg]


Succeeding at Chess
Episodic or continuing task? What are S, A, T, R, γ?
[Image: a chess board and pieces. Source: https://www.publicdomainpictures.net/pictures/80000/velka/chess-board-and-pieces.jpg]


Preventing Forest Fires (Lauer et al., 2017)
Episodic or continuing task? What are S, A, T, R, γ?
[Image: firemen at work. Source: https://www.publicdomainpictures.net/pictures/270000/velka/firemen-1533752293Zsu.jpg]


A Familiar MDP?

Single state. k actions.
For a ∈ A, treat R(s, a, s') as a random variable.

[Figure: a single-state MDP (state s1) with k self-loop actions, each annotated "probability, reward distribution", e.g. 1, U(2, 4); 1, U(−1, 3); 1, U(0, 1); 1, U(−5, 5); here γ = 0.5.]

Such an MDP is called a multi-armed bandit!




Structure of State Values

Let us investigate state values. For π ∈ Π, s ∈ S:

V^π(s) = E_π[r^0 + γr^1 + γ^2 r^2 + ... | s^0 = s]
       = Σ_{s'∈S} T(s, π(s), s') E_π[r^0 + γr^1 + γ^2 r^2 + ... | s^0 = s, s^1 = s']
       = Σ_{s'∈S} T(s, π(s), s') E_π[r^0 | s^0 = s, s^1 = s']
         + γ Σ_{s'∈S} T(s, π(s), s') E_π[r^1 + γr^2 + ... | s^0 = s, s^1 = s']
       = Σ_{s'∈S} T(s, π(s), s') R(s, π(s), s')
         + γ Σ_{s'∈S} T(s, π(s), s') E_π[r^1 + γr^2 + ... | s^1 = s']
       = Σ_{s'∈S} T(s, π(s), s') {R(s, π(s), s') + γ V^π(s')}.
Bellman's Equations

For π ∈ Π, s ∈ S:

V^π(s) = Σ_{s'∈S} T(s, π(s), s') {R(s, π(s), s') + γ V^π(s')}.

Recall that S = {s1, s2, ..., sn}.

n equations, n unknowns: V^π(s1), V^π(s2), ..., V^π(sn).
Linear!
Guaranteed to have a unique solution if γ < 1.
If the task is episodic, guaranteed to have a unique solution even if γ = 1, after we fix V^π(s⊤) = 0.

Policy evaluation: computing V^π for a given policy π.
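Because the system is linear, V^π can be computed directly: stacking the n equations gives V^π = R_π + γ T_π V^π, i.e. (I − γ T_π) V^π = R_π, where T_π(i, j) = T(si, π(si), sj) and R_π(i) is the expected immediate reward from si under π. A sketch with NumPy, again using the hypothetical MDP container from the earlier sketches (the function name is mine):

```python
import numpy as np
from typing import Dict

def evaluate_policy(mdp: MDP, policy: Dict[State, Action]) -> Dict[State, float]:
    """Solve Bellman's equations for pi exactly: (I - gamma * T_pi) V = R_pi."""
    n = len(mdp.states)
    index = {s: i for i, s in enumerate(mdp.states)}
    T_pi = np.zeros((n, n))  # T_pi[i, j] = T(s_i, pi(s_i), s_j)
    R_pi = np.zeros(n)       # R_pi[i]   = sum_j T(s_i, pi(s_i), s_j) R(s_i, pi(s_i), s_j)
    for s in mdp.states:
        i = index[s]
        for s_next, p, r in mdp.trans[(s, policy[s])]:
            T_pi[i, index[s_next]] += p
            R_pi[i] += p * r
    V = np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R_pi)
    return {s: float(V[index[s]]) for s in mdp.states}

# e.g. evaluate_policy(example, pi) returns {"s1": ..., "s2": ..., "s3": ...}.
```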


Are We Done with this Topic?

We claimed that among all the policies for a given MDP, there must be an optimal policy π*.

Now you know how to compute the value function of any given policy π.

Can you put the two ideas together and construct an algorithm to find π*?

Yes! Evaluate each policy and identify one that has a value function dominating all the others'.

This approach needs poly(n, k) · k^n arithmetic operations. We hope to be more efficient (wait for next week).
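A sketch of this brute-force planner, enumerating all k^n deterministic policies with the hypothetical evaluate_policy and dominates helpers from the earlier sketches; it is meant only to make the poly(n, k) · k^n cost tangible.

```python
from itertools import product
from typing import Dict

def brute_force_plan(mdp: MDP) -> Dict[State, Action]:
    """Evaluate all k^n deterministic policies and return one that dominates."""
    best_policy, best_V = None, None
    for choice in product(mdp.actions, repeat=len(mdp.states)):
        policy = dict(zip(mdp.states, choice))
        V = evaluate_policy(mdp, policy)
        # An optimal policy dominates every other, so it always takes over here.
        if best_V is None or dominates(V, best_V):
            best_policy, best_V = policy, V
    return best_policy
```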


Action Value Function

For π ∈ Π, s ∈ S, a ∈ A:

Q^π(s, a) = E[r^0 + γr^1 + γ^2 r^2 + ... | s^0 = s; a^0 = a; a^t = π(s^t) for t ≥ 1].

Q^π(s, a) is the expected long-term reward from starting at s, taking a at t = 0, and following π for t ≥ 1.

Q^π : S × A → R is called the action value function of π.

Observe that Q^π satisfies, for s ∈ S, a ∈ A:

Q^π(s, a) = Σ_{s'∈S} T(s, a, s') {R(s, a, s') + γ V^π(s')}.

For π ∈ Π, s ∈ S: Q^π(s, π(s)) = V^π(s).

Q^π needs O(n^2 k) operations to compute if V^π is available.

All optimal policies have the same action value function Q*.

We will find use for Q^π and Q* next week.
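Following the identity above, a sketch of computing Q^π from an already-computed V^π, continuing the hypothetical containers and helpers from the earlier sketches; summing over successor states for each of the nk state-action pairs is where a count on the order of n^2 k operations comes from in the dense case.

```python
from typing import Dict, Tuple

def action_values(mdp: MDP, V: Dict[State, float]) -> Dict[Tuple[State, Action], float]:
    """Q^pi(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V^pi(s'))."""
    return {
        (s, a): sum(p * (r + mdp.gamma * V[s_next])
                    for s_next, p, r in mdp.trans[(s, a)])
        for s in mdp.states
        for a in mdp.actions
    }

# Sanity check of Q^pi(s, pi(s)) = V^pi(s):
# V = evaluate_policy(example, pi); Q = action_values(example, V)
# assert all(abs(Q[(s, pi[s])] - V[s]) < 1e-9 for s in example.states)
```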

