CS 747, Autumn 2020: Week 4, Lecture 1: Shivaram Kalyanakrishnan
2. MDP planning
3. Alternative formulations
4. Applications
5. Policy Evaluation
[Figure: example MDP with states s1, s2, s3. Each transition is labelled with a pair of numbers, apparently (transition probability, reward): 0.5, 0; 0.25, −1; 0.5, −1; 1, 1; 1, 2; 0.75, −2; 1, 1; 0.5, 3; 0.5, 3.]
Resulting trajectory: $s^0, a^0, r^0, s^1, a^1, r^1, s^2, \ldots$
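A minimal sketch of how such a trajectory could be generated by repeated interaction. The two-state MDP, the `policy` function, and the name `sample_trajectory` are hypothetical and chosen only for illustration; they are not the example from the slides.

```python
import random

# Hypothetical 2-state, 2-action MDP (an assumption, not the slide example).
# T[s][a] is a list of (next_state, probability, reward) triples.
T = {
    0: {0: [(0, 0.5, 0.0), (1, 0.5, 1.0)],
        1: [(1, 1.0, 2.0)]},
    1: {0: [(0, 1.0, -1.0)],
        1: [(0, 0.25, 3.0), (1, 0.75, -2.0)]},
}

def policy(state):
    """A fixed deterministic policy, picked arbitrarily for illustration."""
    return 1 if state == 0 else 0

def sample_trajectory(start_state, num_steps, rng):
    """Return the list [s^0, a^0, r^0, s^1, a^1, r^1, ..., s^num_steps]."""
    trajectory = [start_state]
    state = start_state
    for _ in range(num_steps):
        action = policy(state)
        outcomes = T[state][action]
        weights = [p for (_, p, _) in outcomes]
        # Draw the next state (and its reward) according to T.
        next_state, _, reward = rng.choices(outcomes, weights=weights, k=1)[0]
        trajectory += [action, reward, next_state]
        state = next_state
    return trajectory

trajectory = sample_trajectory(start_state=0, num_steps=3, rng=random.Random(0))
print(trajectory)
```

Under this particular policy every transition taken has probability 1, so the printed trajectory is the same on every run; choosing the other action in state 1 would make it random.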
[Figure: the example MDP again (states s1, s2, s3), with the surviving annotation "s3 BLUE" next to the edge label 0.5, 3.]
[Figure: the example MDP again (states s1, s2, s3), annotated with γ = 0.9.]
2. MDP planning
3. Alternative formulations
[Figure: MDP fragment with states s1, s2 and a third state whose label is garbled ("s"). Edge labels: 0.5, −1; 0.25, 2; 1, 2; 0.4, 2; 0.6, 1.]
4. Applications
[Image] Source: https://www.publicdomainpictures.net/pictures/20000/velka/police-helicopter-8712919948643Mk.jpg
[Image] Source: https://www.publicdomainpictures.net/pictures/80000/velka/chess-board-and-pieces.jpg
[Image] Source: https://www.publicdomainpictures.net/pictures/270000/velka/firemen-1533752293Zsu.jpg
[Figure: MDP with state s1 and γ = 0.5. Edge labels: 1, U(2, 4); 1, U(−5, 5); 1, U(−1, 3); 1, U(0, 1).]
5. Policy Evaluation
$$V^{\pi}(s) = \sum_{s' \in S} T(s, \pi(s), s') \left\{ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right\}.$$
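One way to compute V^π numerically is to apply the right-hand side of this equation repeatedly until the values stop changing. The sketch below assumes γ < 1 and uses a hypothetical two-state MDP with the policy already applied; the names `T_pi`, `R_pi`, and `evaluate_policy` are illustrative and do not come from the lecture, which may instead solve the equations directly as a linear system.

```python
# Hypothetical MDP with a fixed policy pi already applied (an assumption):
# T_pi[s][s2] = T(s, pi(s), s2) and R_pi[s][s2] = R(s, pi(s), s2).
T_pi = [[0.0, 1.0],
        [1.0, 0.0]]
R_pi = [[0.0, 2.0],
        [-1.0, 0.0]]
gamma = 0.9

def evaluate_policy(T_pi, R_pi, gamma, tol=1e-10, max_iters=10000):
    """Iterate V <- sum_{s'} T {R + gamma V(s')} until the change is below tol."""
    n = len(T_pi)
    V = [0.0] * n
    for _ in range(max_iters):
        V_new = [
            sum(T_pi[s][s2] * (R_pi[s][s2] + gamma * V[s2]) for s2 in range(n))
            for s in range(n)
        ]
        if max(abs(new - old) for new, old in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V

V = evaluate_policy(T_pi, R_pi, gamma)
print(V)
```

For this toy instance the two equations can also be solved by hand: V(0) = 2 + 0.9 V(1) and V(1) = −1 + 0.9 V(0) give V(0) = 1.1 / 0.19, which the iteration approaches.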