
CS 188 Fall 2020    Introduction to Artificial Intelligence    Exam Prep 4 Solutions


Q1. MDPs: Dice Bonanza
A casino is considering adding a new game to their collection, but needs to analyze it before releasing it on their floor. They have
hired you to execute the analysis. On each round of the game, the player has the option of rolling a fair 6-sided die. That is, the
die lands on values 1 through 6 with equal probability. Each roll costs 1 dollar, and the player must roll the very first round.
Each time the player rolls the die, the player has two possible actions:

1. 𝑆𝑡𝑜𝑝: Stop playing by collecting the dollar value that the die lands on, or
2. 𝑅𝑜𝑙𝑙: Roll again, paying another 1 dollar.

Having taken CS 188, you decide to model this problem using an infinite horizon Markov Decision Process (MDP). The player
initially starts in state 𝑆𝑡𝑎𝑟𝑡, where the player only has one possible action: 𝑅𝑜𝑙𝑙. State 𝑠𝑖 denotes the state where the die lands
on 𝑖. Once a player decides to 𝑆𝑡𝑜𝑝, the game is over, transitioning the player to the 𝐸𝑛𝑑 state.

(a) In solving this problem, you consider using policy iteration. Your initial policy 𝜋 is in the table below. Evaluate the policy
at each state, with 𝛾 = 1.

State 𝑠1 𝑠2 𝑠3 𝑠4 𝑠5 𝑠6

𝜋(𝑠) 𝑅𝑜𝑙𝑙 𝑅𝑜𝑙𝑙 𝑆𝑡𝑜𝑝 𝑆𝑡𝑜𝑝 𝑆𝑡𝑜𝑝 𝑆𝑡𝑜𝑝

𝑉 𝜋 (𝑠) 3 3 3 4 5 6

We have that 𝑉 𝜋 (𝑠𝑖 ) = 𝑖 for 𝑖 ∈ {3, 4, 5, 6}, since the player stops there and is awarded no further rewards under the policy. From
the Bellman equations, we have that 𝑉 𝜋 (𝑠1 ) = −1 + (1∕6)(𝑉 𝜋 (𝑠1 ) + 𝑉 𝜋 (𝑠2 ) + 3 + 4 + 5 + 6) and that 𝑉 𝜋 (𝑠2 ) = −1 + (1∕6)(𝑉 𝜋 (𝑠1 ) +
𝑉 𝜋 (𝑠2 ) + 3 + 4 + 5 + 6). Solving this linear system yields 𝑉 𝜋 (𝑠1 ) = 𝑉 𝜋 (𝑠2 ) = 3.
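As a quick sanity check on these values, the following is a minimal Python sketch (an added illustration under the setup above, not part of the official solution) that evaluates the fixed policy 𝜋 by repeatedly applying its Bellman backups with 𝛾 = 1.

```python
# Minimal sketch: iterative policy evaluation for the dice MDP above.
# Policy pi: Roll in s_1, s_2; Stop in s_3..s_6. Each Roll costs 1 dollar,
# Stopping in s_i collects i dollars, and gamma = 1.
GAMMA = 1.0

def evaluate_policy(policy, sweeps=1000):
    V = {i: 0.0 for i in range(1, 7)}              # V^pi(s_i), initialized to 0
    for _ in range(sweeps):
        expected_next = sum(V.values()) / 6        # uniform die roll
        V = {i: (i if policy[i] == "Stop" else -1 + GAMMA * expected_next)
             for i in range(1, 7)}
    return V

pi = {1: "Roll", 2: "Roll", 3: "Stop", 4: "Stop", 5: "Stop", 6: "Stop"}
print(evaluate_policy(pi))  # converges to {1: 3.0, 2: 3.0, 3: 3.0, 4: 4.0, 5: 5.0, 6: 6.0}
```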

(b) Having determined the values, perform a policy update to find the new policy 𝜋 ′ . The table below shows the old policy
𝜋 and has filled in parts of the updated policy 𝜋 ′ for you. If both 𝑅𝑜𝑙𝑙 and 𝑆𝑡𝑜𝑝 are viable new actions for a state, write
down both 𝑅𝑜𝑙𝑙∕𝑆𝑡𝑜𝑝. In this part as well, we have 𝛾 = 1.

State 𝑠1 𝑠2 𝑠3 𝑠4 𝑠5 𝑠6

𝜋(𝑠) 𝑅𝑜𝑙𝑙 𝑅𝑜𝑙𝑙 𝑆𝑡𝑜𝑝 𝑆𝑡𝑜𝑝 𝑆𝑡𝑜𝑝 𝑆𝑡𝑜𝑝

𝜋 ′ (𝑠) 𝑅𝑜𝑙𝑙 𝑅𝑜𝑙𝑙 𝑅𝑜𝑙𝑙∕𝑆𝑡𝑜𝑝 𝑆𝑡𝑜𝑝 𝑆𝑡𝑜𝑝 𝑆𝑡𝑜𝑝

For each 𝑠𝑖 in part (a), we compare the values obtained via Rolling and Stopping. The value of Rolling from any state is
−1 + (1∕6)(3 + 3 + 3 + 4 + 5 + 6) = 3. The value of Stopping at state 𝑠𝑖 is 𝑖. At each state 𝑠𝑖 , we take the action that
yields the larger value; so for 𝑠1 and 𝑠2 we Roll, and for 𝑠4 , 𝑠5 , and 𝑠6 we Stop. For 𝑠3 , we write Roll∕Stop, since the values from
Rolling and Stopping are equal.
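The comparison in this improvement step is small enough to script directly; the sketch below (again an added illustration, with 𝛾 = 1 and the values from part (a)) prints the greedy action for each state.

```python
# Minimal sketch: the policy-improvement step of part (b).
# Q(s_i, Stop) = i, and Q(s_i, Roll) = -1 + (1/6) * sum_j V^pi(s_j) with gamma = 1.
V_pi = {1: 3, 2: 3, 3: 3, 4: 4, 5: 5, 6: 6}        # values computed in part (a)

q_roll = -1 + sum(V_pi.values()) / 6               # the same for every state: 3.0
for i in range(1, 7):
    q_stop = i
    if q_roll > q_stop:
        new_action = "Roll"
    elif q_roll < q_stop:
        new_action = "Stop"
    else:
        new_action = "Roll/Stop"                   # tie: both actions are greedy
    print(f"s_{i}: Q(Roll) = {q_roll:.1f}, Q(Stop) = {q_stop}, pi'(s_{i}) = {new_action}")
```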

(c) Is 𝜋(𝑠) from part (a) optimal? Explain why or why not.
Yes, the old policy is optimal. In part (b), policy improvement produces a tie between two equally good policies, differing only in the
action at 𝑠3 , and one of these policies is identical to the old policy. This means the updated policy is no better than the old policy,
so policy iteration has converged. Since policy iteration converges to an optimal policy, we can be sure that 𝜋(𝑠) from part (a)
is optimal.

(d) Suppose that we were now working with some 𝛾 ∈ [0, 1) and wanted to run value iteration. Select the one statement that
would hold true at convergence, or write the correct answer next to Other if none of the options are correct.
The statement that holds at convergence is

𝑉 ∗ (𝑠𝑖 ) = max{ 𝑖 , −1 + (𝛾∕6) ∑𝑗 𝑉 ∗ (𝑠𝑗 ) }.

Stopping at 𝑠𝑖 collects exactly 𝑖 and ends the game, while Rolling costs 1 dollar and transitions to each 𝑠𝑗 with probability 1∕6, with the future value 𝑉 ∗ (𝑠𝑗 ) discounted by 𝛾.
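To see the selected equation in action, here is a small added sketch that runs value iteration on the dice MDP for an arbitrary example discount (𝛾 = 0.9 is a choice made purely for illustration) and checks the fixed point at convergence.

```python
# Minimal sketch: value iteration for the dice MDP with a discount gamma in [0, 1).
# At convergence, V*(s_i) = max(i, -1 + (gamma / 6) * sum_j V*(s_j)).
GAMMA = 0.9                                        # illustrative choice of discount

V = {i: 0.0 for i in range(1, 7)}
for _ in range(1000):
    roll_value = -1 + (GAMMA / 6) * sum(V.values())
    V = {i: max(i, roll_value) for i in range(1, 7)}

# Verify the fixed-point equation from part (d).
roll_value = -1 + (GAMMA / 6) * sum(V.values())
assert all(abs(V[i] - max(i, roll_value)) < 1e-9 for i in range(1, 7))
print(V)
```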

Q2. How do you Value It(eration)?
(a) Fill out the following True/False questions.
(i) True # False: Let 𝐴 be the set of all actions and 𝑆 the set of states for some MDP. Assuming that
|𝐴| ≪ |𝑆|, one iteration of value iteration is generally faster than one iteration of policy iteration that solves a linear
system during policy evaluation. One iteration of value iteration is 𝑂(|𝑆|²|𝐴|), whereas one iteration of policy
iteration is 𝑂(|𝑆|³), so value iteration is generally faster when |𝐴| ≪ |𝑆|.
(ii) # True    False: For any MDP, changing the discount factor does not affect the optimal policy for the MDP.
Consider an infinite horizon setting where we have 2 states 𝐴, 𝐵, where we can alternate between 𝐴 and 𝐵 forever,
gaining a reward of 1 each transition, or exit from 𝐵 with a reward of 100. In the case that 𝛾 = 1, the optimal policy
is to oscillate between 𝐴 and 𝐵 forever. If 𝛾 = 1∕2, oscillating forever is worth only 1∕(1 − 𝛾) = 2, so it is optimal to exit.
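A short numerical illustration of this counterexample (the two-state MDP and its rewards are the ones just described; the code itself is an added sketch):

```python
# Minimal sketch: the 2-state counterexample above, for gamma = 1 vs gamma = 1/2.
# Oscillating A <-> B forever collects +1 per transition; exiting from B collects 100.
def oscillation_value(gamma, horizon=10_000):
    # Discounted return of hopping between A and B forever (about 1 / (1 - gamma)).
    return sum(gamma ** t for t in range(horizon))

for gamma in (1.0, 0.5):
    oscillate = float("inf") if gamma == 1.0 else oscillation_value(gamma)
    exit_reward = 100
    best = "oscillate forever" if oscillate > exit_reward else "exit from B"
    print(f"gamma = {gamma}: oscillate = {oscillate}, exit = {exit_reward} -> {best}")
```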

The following problem will take place in various instances of a grid world MDP. Shaded cells represent walls. In all states, the
agent has available actions ↑, ↓, ←, →. Performing an action that would transition to an invalid state (outside the grid or into a
wall) results in the agent remaining in its original state. In states with an arrow coming out, the agent has an additional action
𝐸𝑋𝐼𝑇 . In the event that the 𝐸𝑋𝐼𝑇 action is taken, the agent receives the labeled reward and ends the game in the terminal
state 𝑇 . Unless otherwise stated, all other transitions receive no reward, and all transitions are deterministic.

For all parts of the problem, assume that value iteration begins with all states initialized to zero, i.e., 𝑉0 (𝑠) = 0 ∀𝑠. Let
the discount factor be 𝛾 = 1∕2 for all following parts.

(b) Suppose that we are performing value iteration on the grid world MDP below.

(i) Fill in the optimal values for A and B in the given boxes.

𝑉 ∗ (𝐴) ∶ 25        𝑉 ∗ (𝐵) ∶ 25∕8

(ii) After how many iterations 𝑘 will we have 𝑉𝑘 (𝑠) = 𝑉 ∗ (𝑠) for all states 𝑠? If it never occurs, write “never”. Write
your answer in the given box.

(iii) Suppose that we wanted to re-design the reward function. For which of the following new reward functions would
the optimal policy remain unchanged? Let 𝑅(𝑠, 𝑎, 𝑠′ ) be the original reward function.

■ 𝑅1 (𝑠, 𝑎, 𝑠′ ) = 10𝑅(𝑠, 𝑎, 𝑠′ )
■ 𝑅2 (𝑠, 𝑎, 𝑠′ ) = 1 + 𝑅(𝑠, 𝑎, 𝑠′ )
■ 𝑅3 (𝑠, 𝑎, 𝑠′ ) = 𝑅(𝑠, 𝑎, 𝑠′ )2
□ 𝑅4 (𝑠, 𝑎, 𝑠′ ) = −1
□ None
𝑅1 : Scaling the reward function does not affect the optimal policy, as it scales all Q-values by 10, which retains their ordering.
𝑅2 : Since reward is discounted, the agent would still get more reward from exiting than from infinitely cycling between states.

𝑅3 : The only positive rewards still come from the exit states (+100 and +1), so the optimal policy doesn’t change.
𝑅4 : With a negative reward at every step, the agent would want to exit as soon as possible, which means the agent would
not always exit at the bottom-right square.

(c) For the following problem, we add a new state in which we can take the 𝐸𝑋𝐼𝑇 action with a reward of +𝑥.

(i) For what values of 𝑥 is it guaranteed that our optimal policy 𝜋 ∗ has 𝜋 ∗ (𝐶) = ←? Write ∞ and −∞ if there is no
upper or lower bound, respectively. Write the upper and lower bounds in each respective box.

50 < 𝑥 < ∞

We go left if 𝑄(𝐶, ←) > 𝑄(𝐶, →). 𝑄(𝐶, ←) = 𝑥∕8, and 𝑄(𝐶, →) = 100∕16. Solving for 𝑥, we get 𝑥 > 50.
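As a numerical aside, here is a small added sketch that simply re-evaluates the two Q-values given above, 𝑄(𝐶, ←) = 𝑥∕8 and 𝑄(𝐶, →) = 100∕16, for a few sample values of 𝑥:

```python
# Minimal sketch: checking the bound for part (c)(i), using the Q-values above.
gamma = 0.5

def q_left(x):
    return gamma ** 3 * x          # Q(C, left) = x / 8

def q_right():
    return gamma ** 4 * 100        # Q(C, right) = 100 / 16 = 6.25

for x in (40, 50, 60):
    print(f"x = {x}: Q(C, left) = {q_left(x):.3f}, Q(C, right) = {q_right():.3f}, "
          f"go left? {q_left(x) > q_right()}")
# True only once x exceeds 50, matching the bound 50 < x < infinity.
```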
(ii) For what values of 𝑥 does value iteration take the minimum number of iterations 𝑘 to converge to 𝑉 ∗ for all states?
Write ∞ and −∞ if there is no upper or lower bound, respectively. Write the upper and lower bounds in each re-
spective box.

50 ≤ 𝑥 ≤ 200

The two states whose values take the longest to become non-zero (from either +𝑥 or +100) are 𝐶 and 𝐷, where 𝐷 is defined
as the state to the right of 𝐶. 𝐶 becomes nonzero at iteration 4 from +𝑥, and 𝐷 becomes nonzero at iteration 4 from +100.
We must bound 𝑥 so that the optimal policy at 𝐶 does not prefer the longer path to +100, and so that the optimal policy at 𝐷
does not prefer the longer path to +𝑥; otherwise value iteration would take 5 iterations. This gives the inequalities
𝑥∕8 ≥ 100∕16 (for 𝐶) and 𝑥∕16 ≤ 100∕8 (for 𝐷).
Simplifying, we get the following bound on 𝑥: 50 ≤ 𝑥 ≤ 200
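The same style of check works for this part; the added sketch below directly encodes the two inequalities above:

```python
# Minimal sketch: checking the bounds for part (c)(ii), using the values above.
# C's value via +x is x / 8 versus 100 / 16 via +100; D's value via +x is
# x / 16 versus 100 / 8 via +100.
def converges_in_minimum_iterations(x):
    c_need_not_wait = x / 8 >= 100 / 16     # C should not prefer the longer path to +100
    d_need_not_wait = x / 16 <= 100 / 8     # D should not prefer the longer path to +x
    return c_need_not_wait and d_need_not_wait

for x in (40, 50, 200, 210):
    print(x, converges_in_minimum_iterations(x))   # True exactly for 50 <= x <= 200
```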
(iii) Fill the box with value 𝑘, the minimum number of iterations until 𝑉𝑘 has converged to 𝑉 ∗ for all states.

See the explanation for the part above
