Assignment 7 (Sol.) : Reinforcement Learning
Reinforcement Learning
Prof. B. Ravindran
1. Consider example 7.1 and figure 7.2 in the textbook. Which among the following steps (keeping all other factors unchanged) will result in a decrease in the RMS errors shown in the graphs?
Sol. (b)
Note that the graphs are generated by averaging over the first 10 episodes. If we increase the
number of episodes considered, the error shown in the graphs would reduce as evaluation of
the policy improves.
2. Considering episodic tasks and for λ ∈ (0, 1), is it true that the one-step return always gets assigned the maximum weight in the λ-return?
(a) no
(b) yes
Sol. (a)
This is not necessarily true and depends on the length of the episode (as well as the value of λ). For example, for the return computed from the start of an episode of length 3 with λ = 0.7, the one-step return receives weight (1 − λ) = 0.3, while the final (Monte Carlo) return receives weight λ^2 = 0.49, which is larger.
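As a quick numerical check, here is a minimal Python sketch of the λ-return weights (the weights (1 − λ)λ^(n−1) for the intermediate n-step returns and λ^(T−t−1) for the final return follow the textbook's λ-return definition; the episode length and λ are the values from the example above):

# Weights the lambda-return assigns to each n-step return when T - t steps remain:
# (1 - lam) * lam**(n - 1) for n < T - t, and lam**(T - t - 1) for the final return.
lam = 0.7
remaining = 3  # episode of length 3, evaluated from its first state

weights = {n: (1 - lam) * lam ** (n - 1) for n in range(1, remaining)}
weights[remaining] = lam ** (remaining - 1)  # the terminal (Monte Carlo) return

print(weights)                 # {1: 0.3, 2: 0.21, 3: 0.49} (up to float error)
print(sum(weights.values()))   # ~1.0 -- the weights sum to one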
3. In the TD(λ) algorithm, if λ = 1 and γ = 1, then which among the following are true?
(a) the method behaves like a Monte Carlo method for an undiscounted task
(b) the eligibility traces do not decay
(c) the values of all states are updated by the TD error in each episode
(d) this method is not suitable for continuing tasks
4. Assume you have an MDP with |S| states. You decide to use an n-step truncated corrected return for the evaluation problem on this MDP. Do you think that there is any utility in considering values of n which exceed |S| for this problem?
(a) no
(b) yes
Sol. (b)
Note that the number of steps in an n-step truncated corrected return is related to the length of a trajectory, which can exceed the number of states in the state space of the problem.
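For reference, the n-step truncated corrected return backs up the first n rewards and then corrects with the current value estimate (V_t denotes the value estimate at time t):

R_t^(n) = r_{t+1} + γ r_{t+2} + ... + γ^(n−1) r_{t+n} + γ^n V_t(s_{t+n})

Since n counts reward steps along a trajectory, which may revisit states, it is not bounded by |S|.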
5. Which among the following are reasons to support your answer in the previous question?
(a) only values of n ≤ |S| should be considered as the number of states is only |S|
(b) all implementations with n > |S| will result in the same evaluation at each stage of the
iterative process
(c) the length of each episode may exceed |S|, and hence values of n > |S| should be considered
(d) regardless of the number of states, different values of n will always lead to different
evaluations (at each step of the iterative process) and hence cannot be disregarded
Sol. (c)
6. Consider the textbook's figure 5.1, describing the first-visit MC prediction algorithm, and figure 7.7, describing the TD(λ) algorithm. Will these two algorithms behave identically for λ = 1? If so, what kind of eligibility trace will result in equivalence?
(a) no
(b) yes, accumulating traces
(c) yes, replacing traces
(d) yes, dutch traces, with α = 0.5
Sol. (a)
The two algorithms are not identical since figure 7.7 describes the online version of the TD(λ) algorithm, whereas the MC algorithm of figure 5.1 makes its updates only at the end of each episode, not after each individual reward is observed.
7. Given the following sequence of states observed from the beginning of an episode,
s_2, s_1, s_3, s_2, s_1, s_2, s_1, s_6
what is the eligibility value, e_7(s_1), of state s_1 at time step 7, given trace decay parameter λ, discount rate γ, and initial value e_0(s_1) = 0, when accumulating traces are used?
(a) γ^7 λ^7
(b) (γλ)^7 + (γλ)^6 + (γλ)^3 + γλ
(c) γλ(1 + γ^2 λ^2 + γ^5 λ^5)
(d) γ^7 λ^7 + γ^3 λ^3 + γλ
Sol. (c)
According to the non-recursive expression for the accumulating eligibility trace, we have

e_t(s) = Σ_{k=0}^{t} (γλ)^{t−k} I_{s s_k},

where I_{s s_k} is an indicator equal to 1 if s_k = s and 0 otherwise. State s_1 is visited at time steps 1, 4, and 6, so

e_7(s_1) = (γλ)^6 + (γλ)^3 + γλ = γλ(1 + γ^2 λ^2 + γ^5 λ^5).
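As a sanity check, a minimal Python sketch of this computation, assuming the states are observed at time steps 0 through 7 and using arbitrary test values for γ and λ:

# Accumulating eligibility trace of s1 over the observed state sequence.
# Recursion: e_t(s) = gamma*lam*e_{t-1}(s) + 1 if s is visited at time t,
#            e_t(s) = gamma*lam*e_{t-1}(s)     otherwise.
gamma, lam = 0.9, 0.5                                        # arbitrary test values
episode = ["s2", "s1", "s3", "s2", "s1", "s2", "s1", "s6"]   # time steps 0..7

e = 0.0                                                      # initial trace of s1
for state in episode:
    e = gamma * lam * e + (1.0 if state == "s1" else 0.0)

closed_form = (gamma * lam) ** 6 + (gamma * lam) ** 3 + gamma * lam
print(e, closed_form)   # both ~0.5494, matching option (c)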
8. For the above question, what is the eligibility value if replacing traces are used?
(a) γ^7 λ^7
(b) γλ
(c) γλ + 1
(d) 3γλ
Sol. (b)
We know that when using replacing traces, the eligibility trace of a state is set to 1 if that state is visited and decayed by a factor of γλ otherwise. Thus, the latest occurrence of state s_1, just before state s_6, causes e_6(s_1) to be set to 1, and after s_6 is observed this decays to e_7(s_1) = γλ.
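The same check with replacing traces (same assumed indexing and test values as in the sketch above):

# Replacing eligibility trace of s1: reset to 1 on a visit, decay by gamma*lam otherwise.
gamma, lam = 0.9, 0.5
episode = ["s2", "s1", "s3", "s2", "s1", "s2", "s1", "s6"]

e = 0.0
for state in episode:
    e = 1.0 if state == "s1" else gamma * lam * e

print(e, gamma * lam)   # both 0.45, i.e. e_7(s1) = γλ, option (b)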
9. In solving the control problem, suppose that at the start of an episode the first action taken is not an optimal action according to the current policy. Would an update be made corresponding to this action and the subsequent reward received in Watkins's Q(λ) algorithm?
(a) no
(b) yes
Sol. (b)
This is immediately clear from Watkins's Q(λ) algorithm described in the text: at every step, including the first, the trace of the state-action pair just executed is incremented and the TD error (computed from the reward received and the greedy action value at the next state) is applied to all traced pairs; only after this update may the traces be zeroed, depending on whether the next selected action is greedy.
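A minimal sketch of one step of tabular Watkins's Q(λ) illustrating this ordering (not the full control loop; the array shapes, α, γ, λ and the example indices below are hypothetical placeholders):

import numpy as np

def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One step of tabular Watkins's Q(lambda): the update for (s, a) and the
    reward r is always applied; traces are zeroed afterwards only when the
    action actually selected at s_next is non-greedy."""
    a_star = np.argmax(Q[s_next])                  # greedy action at the next state
    delta = r + gamma * Q[s_next, a_star] - Q[s, a]
    e[s, a] += 1.0                                 # accumulating trace
    Q += alpha * delta * e                         # update every traced pair
    if Q[s_next, a_next] == Q[s_next, a_star]:     # a_next ties with the greedy action
        e *= gamma * lam                           # keep and decay the traces
    else:
        e[:] = 0.0                                 # exploratory action: cut the traces
    return Q, e

# Even if action a was exploratory, Q[s, a] is still updated in this step.
Q, e = np.zeros((4, 2)), np.zeros((4, 2))
Q, e = watkins_q_lambda_step(Q, e, s=0, a=1, r=1.0, s_next=2, a_next=0,
                             alpha=0.1, gamma=0.9, lam=0.8)
print(Q[0, 1])   # 0.1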
10. Suppose that in a particular problem, the agent keeps going back to the same state in a loop.
What is the maximum value that can be taken by the eligibility trace of such a state if we
consider accumulating traces with λ = 0.25 and γ = 0.8?
(a) 1.25
(b) 5.0
(c) ∞
(d) insufficient data
Sol. (a)
For accumulating traces, the maximum increase in eligibility occurs when the state is visited at every time step: e_t(s) = γλ e_{t−1}(s) + 1. At the maximum, e_t(s) = e_{t−1}(s), giving e_t(s) = 1/(1 − γλ) = 1/(1 − 0.8 × 0.25) = 1/0.8 = 1.25.
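A quick sketch confirming the limit, assuming the state is visited at every time step:

# The accumulating trace of a state visited at every step converges to 1/(1 - gamma*lam).
gamma, lam = 0.8, 0.25
e = 0.0
for _ in range(200):             # iterate the recursion e <- gamma*lam*e + 1
    e = gamma * lam * e + 1.0

print(e, 1.0 / (1.0 - gamma * lam))   # both 1.25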