
CS 747, Autumn 2020: Week 4, Lecture 1

Shivaram Kalyanakrishnan

Department of Computer Science and Engineering


Indian Institute of Technology Bombay

Autumn 2020



Markov Decision Problems

1. Definitions
   - Markov Decision Problem
   - Policy
   - Value Function
2. MDP planning
3. Alternative formulations
4. Applications
5. Policy Evaluation


Markov Decision Problems (MDPs)

[Figure: a three-state MDP with states s1, s2, s3 and two actions (RED and BLUE); each edge is labeled with a "transition probability, reward" pair.]

An MDP M = (S, A, T, R, γ) has these elements.

S: a set of states.
Let us assume S = {s1, s2, ..., sn}, and hence |S| = n.

A: a set of actions.
Let us assume A = {a1, a2, ..., ak}, and hence |A| = k.
Here A = {RED, BLUE}.

T: a transition function.
- For s, s' ∈ S, a ∈ A: T(s, a, s') is the probability of reaching s' by starting at s and taking action a.
- Thus, T(s, a, ·) is a probability distribution over S.

R: a reward function.
- For s, s' ∈ S, a ∈ A: R(s, a, s') is the (numeric) reward for reaching s' by starting at s and taking action a.
- Assume rewards are from [−Rmax, Rmax] for some Rmax ≥ 0.

γ: a discount factor (coming up shortly).
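To make the tuple (S, A, T, R, γ) concrete, here is a minimal sketch of one way a finite MDP could be stored in code. The class and the container layout are my own choices rather than anything from the lecture, and the numerical entries below are illustrative placeholders, not the exact edge labels of the figure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    """A finite MDP M = (S, A, T, R, gamma), stored sparsely.

    trans[(s, a)] lists (s_next, probability, reward) triples, so the
    probabilities give T(s, a, .) and the rewards give R(s, a, s').
    """
    states: List[State]
    actions: List[Action]
    trans: Dict[Tuple[State, Action], List[Tuple[State, float, float]]]
    gamma: float

# An illustrative three-state, two-action MDP in the spirit of the figure;
# these numbers are placeholders, not the exact edge labels from the slide.
example = MDP(
    states=["s1", "s2", "s3"],
    actions=["RED", "BLUE"],
    trans={
        ("s1", "RED"): [("s1", 0.5, 0.0), ("s2", 0.5, -1.0)],
        ("s1", "BLUE"): [("s1", 1.0, 1.0)],
        ("s2", "RED"): [("s2", 0.25, -1.0), ("s3", 0.75, -2.0)],
        ("s2", "BLUE"): [("s1", 1.0, 2.0)],
        ("s3", "RED"): [("s3", 1.0, 1.0)],
        ("s3", "BLUE"): [("s1", 0.5, 3.0), ("s2", 0.5, 3.0)],
    },
    gamma=0.9,
)

# Sanity check: each T(s, a, .) must be a probability distribution over S.
for (s, a), rows in example.trans.items():
    assert abs(sum(p for _, p, _ in rows) - 1.0) < 1e-9
```

Storing T and R together as (next state, probability, reward) triples mirrors the way the diagram labels each edge with a probability and a reward.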


Agent-Environment Interaction

t = 0: Agent is born in some state s^0, takes action a^0.
       Environment generates and provides the agent
       next state s^1 ~ T(s^0, a^0, ·) and reward r^0 = R(s^0, a^0, s^1).

t = 1: Agent is in state s^1, takes action a^1.
       Environment generates and provides the agent
       next state s^2 ~ T(s^1, a^1, ·) and reward r^1 = R(s^1, a^1, s^2).

...

Resulting trajectory: s^0, a^0, r^0, s^1, a^1, r^1, s^2, ....
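A sketch of one environment step under these rules, reusing the hypothetical MDP container from the earlier sketch: the environment samples s^{t+1} from T(s^t, a^t, ·) and returns r^t = R(s^t, a^t, s^{t+1}).

```python
import random
from typing import Tuple

def step(mdp: MDP, s: State, a: Action) -> Tuple[State, float]:
    """Environment step: sample s^{t+1} ~ T(s^t, a^t, .) and return (s^{t+1}, r^t)."""
    rows = mdp.trans[(s, a)]
    idx = random.choices(range(len(rows)), weights=[p for _, p, _ in rows], k=1)[0]
    s_next, _, reward = rows[idx]
    return s_next, reward
```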


Describing the Agent's Behaviour

At each step t, the agent sends action a^t to the environment, and the environment returns reward r^t and next state s^{t+1}.

How does the agent pick a^t?
In principle, it can decide by looking at the preceding history
s^0, a^0, r^0, s^1, a^1, r^1, s^2, ..., s^t.

For now let us assume that a^t is picked based on s^t alone.

In other words, the agent follows a policy π : S → A.
Observe that π is Markovian, deterministic, and stationary.
We will justify this choice in due course!


Illustration: Policy

[Figure: the same three-state MDP, with the chosen action marked at each state: RED at s1, RED at s2, BLUE at s3.]

Illustrated policy π such that
π(s1) = RED; π(s2) = RED; π(s3) = BLUE.

What happens by "following" π, starting at s1? Two possible trajectories:
- s1, RED, s1, RED, s2, RED, s3, BLUE, s1, ....
- s1, RED, s2, RED, s1, RED, s1, RED, s1, ....
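Since π is deterministic, Markovian, and stationary, it can be stored as a plain map from states to actions. Continuing the hypothetical helpers from the earlier sketches (the MDP container and step), here is a sketch of the illustrated policy and of "following" it to generate trajectories like the two listed above.

```python
from typing import Dict, List, Tuple

# The illustrated policy: pi(s1) = RED, pi(s2) = RED, pi(s3) = BLUE.
pi: Dict[State, Action] = {"s1": "RED", "s2": "RED", "s3": "BLUE"}

def rollout(mdp: MDP, policy: Dict[State, Action], s0: State,
            horizon: int) -> List[Tuple[State, Action, float]]:
    """Follow `policy` from s0 for `horizon` steps, recording (s^t, a^t, r^t)."""
    s, trajectory = s0, []
    for _ in range(horizon):
        a = policy[s]
        s_next, r = step(mdp, s, a)  # step() samples s^{t+1} ~ T(s^t, a^t, .)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

# e.g. one run of rollout(example, pi, "s1", horizon=4) could visit
# s1, RED, s1, RED, s2, RED, s3, BLUE, ... as in the first trajectory above.
```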


Illustration: Policy

Let Π denote the set of all policies.

What is |Π|? k^n (each of the n states can independently be assigned any of the k actions).

Which π ∈ Π is a "good" policy?


State Values for Policy π

For s ∈ S, define
V^π(s) = E_π[r^0 + γr^1 + γ^2 r^2 + γ^3 r^3 + ... | s^0 = s],
where γ ∈ [0, 1) is a discount factor.

γ is an element of the MDP. Larger γ, farther "lookahead".

[Figure: the example three-state MDP again, now annotated with γ = 0.9.]

V^π(s) is the value of state s under policy π.

V^π : S → R is the Value Function of π. "Larger is better".
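One way to make the expectation concrete is to average discounted returns over many simulated trajectories, truncated at a finite horizon (a reasonable approximation since γ < 1 shrinks the tail geometrically). This is only an illustrative sketch built on the hypothetical rollout helper above; the exact computation of V^π appears later under policy evaluation.

```python
def mc_value_estimate(mdp: MDP, policy: Dict[State, Action], s: State,
                      episodes: int = 10_000, horizon: int = 200) -> float:
    """Monte Carlo estimate of V^pi(s): average truncated discounted return."""
    total = 0.0
    for _ in range(episodes):
        discount, ret = 1.0, 0.0
        for _, _, r in rollout(mdp, policy, s, horizon):
            ret += discount * r
            discount *= mdp.gamma
        total += ret
    return total / episodes

# e.g. mc_value_estimate(example, pi, "s1") approximates V^pi(s1).
```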




Optimal Policies

Here are the value functions of all policies in our example MDP (each policy is written as its action at s1, s2, s3, with R = RED and B = BLUE).

π      V^π(s1)   V^π(s2)   V^π(s3)
RRR      4.45      6.55     10.82
RRB     -5.61     -5.75     -4.05
RBR      2.76      4.48      9.12
RBB      2.76      4.48      3.48
BRR     10.0       9.34     13.10
BRB     10.0       7.25     10.0
BBR     10.0      11.0      14.45   ← Optimal policy
BBB     10.0      11.0      10.0

Which policy would you prefer?

Every MDP is guaranteed to have an optimal policy π* such that
∀π ∈ Π, ∀s ∈ S : V^{π*}(s) ≥ V^π(s).
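The condition "V^{π*}(s) ≥ V^π(s) for every s" is a pointwise dominance of value functions. A tiny helper, in the same hypothetical style as the earlier sketches, that checks this relation between two value functions stored as dictionaries:

```python
from typing import Dict

def dominates(V1: Dict[State, float], V2: Dict[State, float]) -> bool:
    """True iff V1(s) >= V2(s) for every state s."""
    return all(V1[s] >= V2[s] for s in V1)
```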


MDP Planning

MDP planning problem: given M = (S, A, T, R, γ), find a policy π* from the set of all policies Π such that
∀s ∈ S, ∀π ∈ Π : V^{π*}(s) ≥ V^π(s).

Every MDP is guaranteed to have a deterministic, Markovian, stationary optimal policy.

An MDP can have more than one optimal policy.

However, the value function of every optimal policy is the same, unique "optimal value function" V*.




Reward and Transition Functions

We had assumed
T : S × A × S → [0, 1], R : S × A × S → [−Rmax, Rmax].

You might encounter alternative definitions of R and T.

- Sometimes R(s, a, s') is taken as a random variable bounded in [−Rmax, Rmax].
- Sometimes there is a reward R(s, a) given on taking action a from state s, regardless of the next state s'.
- Sometimes there is a reward R(s') given on reaching next state s', regardless of the start state s and action a.
- Sometimes T and R are combined into a single function P{s', r | s, a} for s' ∈ S, r ∈ [−Rmax, Rmax].
- Some authors minimise cost rather than maximise reward.

It is relatively straightforward to handle all these variations.
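As one example of why these variations are easy to reconcile: a next-state-dependent reward R(s, a, s') can be collapsed into a state-action reward R(s, a) by averaging over next states under T, which leaves every V^π unchanged. A sketch, reusing the hypothetical MDP container from earlier:

```python
def expected_reward(mdp: MDP, s: State, a: Action) -> float:
    """R(s, a) = sum over s' of T(s, a, s') * R(s, a, s')."""
    return sum(p * r for _, p, r in mdp.trans[(s, a)])
```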


Episodic Tasks

We considered continuing tasks, in which trajectories are infinitely long.

Episodic tasks have a special sink/terminal state s⊤ from which there are no outgoing transitions or rewards.

[Figure: a small episodic MDP with states s1, s2 and a terminal state s⊤; each edge is labeled with a "transition probability, reward" pair.]

Additionally, from every non-terminal state and for every policy, there is a non-zero probability of reaching the terminal state in a finite number of steps.

Hence, trajectories or episodes almost surely terminate after a finite number of steps.


Definition of Values

We defined V^π(s) as an infinite discounted reward:
V^π(s) = E_π[r^0 + γr^1 + γ^2 r^2 + ... | s^0 = s].

There are other choices.

Total reward:
V^π(s) = E_π[r^0 + r^1 + r^2 + ... | s^0 = s].
Can only be used on episodic tasks.

Finite-horizon reward:
V^π(s) = E_π[r^0 + r^1 + r^2 + ... + r^{T−1} | s^0 = s].
A horizon T ≥ 1 is specified, rather than γ. Optimal policies for this setting need not be stationary.

Average reward (withholding some technical details):
V^π(s) = E_π[lim_{m→∞} (r^0 + r^1 + ... + r^{m−1}) / m | s^0 = s].
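To see how the four criteria differ on a single sampled reward sequence (the definitions above then take expectations over such sequences), here is a small illustrative sketch; truncating the infinite sum and the limit at the sequence length m is my own approximation, not part of the definitions.

```python
from typing import Dict, List

def returns(rewards: List[float], gamma: float, T: int) -> Dict[str, float]:
    """The four criteria, computed for one sampled reward sequence r^0..r^{m-1}.

    The discounted and average entries truncate the infinite sum / limit at m,
    which only approximates the definitions above.
    """
    m = len(rewards)
    return {
        "discounted (truncated at m)": sum(gamma ** t * r for t, r in enumerate(rewards)),
        "total": sum(rewards),
        "finite horizon (first T rewards)": sum(rewards[:T]),
        "average (truncated at m)": sum(rewards) / m,
    }
```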




Controlling a Helicopter (Ng et al., 2003)
Episodic or continuing task? What are S, A, T, R, γ?
[Image: a police helicopter. Source: https://www.publicdomainpictures.net/pictures/20000/velka/police-helicopter-8712919948643Mk.jpg]


Succeeding at Chess
Episodic or continuing task? What are S, A, T, R, γ?
[Image: a chess board and pieces. Source: https://www.publicdomainpictures.net/pictures/80000/velka/chess-board-and-pieces.jpg]


Preventing Forest Fires (Lauer et al., 2017)
Episodic or continuing task? What are S, A, T, R, γ?
[Image: firemen at work. Source: https://www.publicdomainpictures.net/pictures/270000/velka/firemen-1533752293Zsu.jpg]


A Familiar MDP?

Single state. k actions.
For a ∈ A, treat R(s, a, s') as a random variable.

[Figure: a single-state MDP (state s1) with k self-loop actions, each annotated "probability, reward distribution", e.g. 1, U(2, 4); 1, U(−1, 3); 1, U(0, 1); 1, U(−5, 5); here γ = 0.5.]

Such an MDP is called a multi-armed bandit!




Structure of State Values

Let us investigate state values. For π ∈ Π, s ∈ S:

V^π(s) = E_π[r^0 + γr^1 + γ^2 r^2 + ... | s^0 = s]
       = Σ_{s'∈S} T(s, π(s), s') E_π[r^0 + γr^1 + γ^2 r^2 + ... | s^0 = s, s^1 = s']
       = Σ_{s'∈S} T(s, π(s), s') E_π[r^0 | s^0 = s, s^1 = s']
         + γ Σ_{s'∈S} T(s, π(s), s') E_π[r^1 + γr^2 + ... | s^0 = s, s^1 = s']
       = Σ_{s'∈S} T(s, π(s), s') R(s, π(s), s')
         + γ Σ_{s'∈S} T(s, π(s), s') E_π[r^1 + γr^2 + ... | s^1 = s']
       = Σ_{s'∈S} T(s, π(s), s') {R(s, π(s), s') + γ V^π(s')}.
Bellman's Equations

For π ∈ Π, s ∈ S:

V^π(s) = Σ_{s'∈S} T(s, π(s), s') {R(s, π(s), s') + γ V^π(s')}.

Recall that S = {s1, s2, ..., sn}.

n equations, n unknowns: V^π(s1), V^π(s2), ..., V^π(sn).
Linear!
Guaranteed to have a unique solution if γ < 1.
If the task is episodic, guaranteed to have a unique solution even if γ = 1, after we fix V^π(s⊤) = 0.

Policy evaluation: computing V^π for a given policy π.
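Because the system is linear, V^π can be computed directly: stacking the n equations gives V^π = R_π + γ T_π V^π, i.e. (I − γ T_π) V^π = R_π, where T_π(i, j) = T(si, π(si), sj) and R_π(i) is the expected immediate reward from si under π. A sketch with NumPy, again using the hypothetical MDP container from the earlier sketches (the function name is mine):

```python
import numpy as np
from typing import Dict

def evaluate_policy(mdp: MDP, policy: Dict[State, Action]) -> Dict[State, float]:
    """Solve Bellman's equations for pi exactly: (I - gamma * T_pi) V = R_pi."""
    n = len(mdp.states)
    index = {s: i for i, s in enumerate(mdp.states)}
    T_pi = np.zeros((n, n))  # T_pi[i, j] = T(s_i, pi(s_i), s_j)
    R_pi = np.zeros(n)       # R_pi[i]   = sum_j T(s_i, pi(s_i), s_j) R(s_i, pi(s_i), s_j)
    for s in mdp.states:
        i = index[s]
        for s_next, p, r in mdp.trans[(s, policy[s])]:
            T_pi[i, index[s_next]] += p
            R_pi[i] += p * r
    V = np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R_pi)
    return {s: float(V[index[s]]) for s in mdp.states}

# e.g. evaluate_policy(example, pi) returns {"s1": ..., "s2": ..., "s3": ...}.
```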


Are We Done with this Topic?

We claimed that among all the policies for a given MDP, there must be an optimal policy π*.

Now you know how to compute the value function of any given policy π.

Can you put the two ideas together and construct an algorithm to find π*?

Yes! Evaluate each policy and identify one that has a value function dominating all the others'.

This approach needs poly(n, k) · k^n arithmetic operations. We hope to be more efficient (wait for next week).
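A sketch of this brute-force planner, enumerating all k^n deterministic policies with the hypothetical evaluate_policy and dominates helpers from the earlier sketches; it is meant only to make the poly(n, k) · k^n cost tangible.

```python
from itertools import product
from typing import Dict

def brute_force_plan(mdp: MDP) -> Dict[State, Action]:
    """Evaluate all k^n deterministic policies and return one that dominates."""
    best_policy, best_V = None, None
    for choice in product(mdp.actions, repeat=len(mdp.states)):
        policy = dict(zip(mdp.states, choice))
        V = evaluate_policy(mdp, policy)
        # An optimal policy dominates every other, so it always takes over here.
        if best_V is None or dominates(V, best_V):
            best_policy, best_V = policy, V
    return best_policy
```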


Action Value Function

For π ∈ Π, s ∈ S, a ∈ A:

Q^π(s, a) = E[r^0 + γr^1 + γ^2 r^2 + ... | s^0 = s; a^0 = a; a^t = π(s^t) for t ≥ 1].

Q^π(s, a) is the expected long-term reward from starting at s, taking a at t = 0, and following π for t ≥ 1.

Q^π : S × A → R is called the action value function of π.

Observe that Q^π satisfies, for s ∈ S, a ∈ A:

Q^π(s, a) = Σ_{s'∈S} T(s, a, s') {R(s, a, s') + γ V^π(s')}.

For π ∈ Π, s ∈ S: Q^π(s, π(s)) = V^π(s).

Q^π needs O(n^2 k) operations to compute if V^π is available.

All optimal policies have the same action value function Q*.

We will find use for Q^π and Q* next week.
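Following the identity above, a sketch of computing Q^π from an already-computed V^π, continuing the hypothetical containers and helpers from the earlier sketches; summing over successor states for each of the nk state-action pairs is where a count on the order of n^2 k operations comes from in the dense case.

```python
from typing import Dict, Tuple

def action_values(mdp: MDP, V: Dict[State, float]) -> Dict[Tuple[State, Action], float]:
    """Q^pi(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V^pi(s'))."""
    return {
        (s, a): sum(p * (r + mdp.gamma * V[s_next])
                    for s_next, p, r in mdp.trans[(s, a)])
        for s in mdp.states
        for a in mdp.actions
    }

# Sanity check of Q^pi(s, pi(s)) = V^pi(s):
# V = evaluate_policy(example, pi); Q = action_values(example, V)
# assert all(abs(Q[(s, pi[s])] - V[s]) < 1e-9 for s in example.states)
```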

