Assignment 3

Reinforcement Learning
Prof. B. Ravindran
1. The baseline in the REINFORCE update should not depend on which of the following (without invalidating any of the steps in the proof of REINFORCE)?
(a) $r_{n-1}$
(b) $r_n$
(c) Action taken ($a_n$)
(d) None of the above
Sol. (c)
The baseline must not depend on the action taken. It can depend on current and past rewards; an example baseline given in the videos is the average of the rewards obtained so far.
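For concreteness, here is a minimal sketch (the step size, reward model, and gradient placeholder are assumptions, not part of the assignment) of a REINFORCE update that uses the average of past rewards as its baseline:

```python
import numpy as np

# Sketch: REINFORCE update with a running-average-of-rewards baseline.
# The baseline depends only on rewards observed so far, never on the
# current action, so the gradient estimator remains unbiased.

def reinforce_step(theta, grad_log_pi, r_t, baseline, alpha=0.1):
    """One update: theta_{t+1} = theta_t + alpha * (r_t - b) * d/dtheta ln pi."""
    return theta + alpha * (r_t - baseline) * grad_log_pi

theta = np.zeros(2)
rewards = []
for t in range(100):
    grad_log_pi = np.random.randn(2)           # stand-in for d/dtheta ln pi(a_t; theta_t)
    r_t = np.random.rand()                     # stand-in for the reward at step t
    b = np.mean(rewards) if rewards else 0.0   # average of rewards obtained so far
    theta = reinforce_step(theta, grad_log_pi, r_t, b)
    rewards.append(r_t)
```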

2. Which of the following statements is true about the RL problem?


(a) Our main aim is to maximize the cumulative reward.
(b) The agent always performs the actions in a deterministic fashion.
(c) We assume that the agent determines the next state based on the current state and action.
(d) It is impossible to have zero rewards.
Sol. (a)
The reward is outside the agent's control, and zero rewards are possible. Our main aim is to maximize the return. The agent can take actions in a stochastic fashion as well, and the next state is determined by the environment's dynamics, not by the agent.

3. Let us say we are taking actions according to a Gaussian distribution with parameters µ and σ. We update the parameters according to REINFORCE; let $a_t$ denote the action taken at step t.
(i) $\mu_{t+1} = \mu_t + \alpha r_t \frac{\mu_t - a_t}{\sigma_t^2}$

(ii) $\sigma_{t+1} = \sigma_t + \alpha r_t \left( \frac{(a_t - \mu_t)^2}{\sigma_t^3} - \frac{1}{\sigma_t} \right)$

(iii) $\sigma_{t+1} = \sigma_t + \alpha r_t \frac{(a_t - \mu_t)^2}{\sigma_t^3}$

(iv) $\mu_{t+1} = \mu_t + \alpha r_t \frac{a_t - \mu_t}{\sigma_t^2}$

Which of the above updates are correct?


(a) (i), (iii)
(b) (i), (iv)
(c) (ii), (iv)
(d) (ii), (iii)

Sol. (c)
The Gaussian policy is $\pi(a_t; \mu_t, \sigma_t) = \frac{1}{\sqrt{2\pi\sigma_t^2}} e^{-\frac{(a_t - \mu_t)^2}{2\sigma_t^2}}$. Deriving the update according to the REINFORCE formula, $\frac{\partial \ln \pi}{\partial \mu_t} = \frac{a_t - \mu_t}{\sigma_t^2}$ and $\frac{\partial \ln \pi}{\partial \sigma_t} = \frac{(a_t - \mu_t)^2}{\sigma_t^3} - \frac{1}{\sigma_t}$, which give updates (iv) and (ii).
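As a sanity check, here is a small sketch of updates (ii) and (iv) in code; the step size and the toy reward function are assumptions made for illustration:

```python
import numpy as np

# Sketch of REINFORCE updates (ii) and (iv) for a 1-D Gaussian policy.
# The step size alpha and the toy reward are illustrative assumptions.
alpha = 0.01
mu, sigma = 0.0, 1.0

for t in range(5000):
    a_t = np.random.normal(mu, sigma)            # sample a_t ~ N(mu, sigma^2)
    r_t = np.exp(-(a_t - 2.0) ** 2)              # toy reward, largest near a = 2

    grad_mu = (a_t - mu) / sigma ** 2                         # d ln pi / d mu
    grad_sigma = (a_t - mu) ** 2 / sigma ** 3 - 1.0 / sigma   # d ln pi / d sigma

    mu += alpha * r_t * grad_mu                  # update (iv)
    sigma += alpha * r_t * grad_sigma            # update (ii)
    sigma = max(sigma, 1e-3)                     # practical guard to keep sigma > 0

print(mu, sigma)                                 # mu should drift towards 2
```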

4. The update in REINFORCE is given by $\theta_{t+1} = \theta_t + \alpha r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}$, where $r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}$ is an unbiased estimator of the true gradient of the performance function. However, there was another variant of REINFORCE, where a baseline $b$, that is independent of the action taken, is subtracted from the obtained reward, i.e., the update is given by $\theta_{t+1} = \theta_t + \alpha (r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}$. How are $E\big[(r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$ and $E\big[r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$ related?

(a) $E\big[(r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big] = E\big[r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$
(b) $E\big[(r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big] < E\big[r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$
(c) $E\big[(r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big] > E\big[r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$
(d) Could be either of (a), (b) or (c), depending on the choice of baseline

Sol. (a)

$E\left[b \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\right] = E\left[b \frac{1}{\pi(a_t; \theta_t)} \frac{\partial \pi(a_t; \theta_t)}{\partial \theta_t}\right]$
$= \sum_a b \frac{1}{\pi(a; \theta_t)} \frac{\partial \pi(a; \theta_t)}{\partial \theta_t} \pi(a; \theta_t)$
$= b \sum_a \frac{\partial \pi(a; \theta_t)}{\partial \theta_t}$ (since $b$ does not depend on the action)
$= b \frac{\partial 1}{\partial \theta_t}$
$= 0$

Thus, $E\big[(r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big] = E\big[r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$.
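The zero-expectation step can also be checked numerically; the following sketch assumes a three-action softmax policy with a single parameter, which is not part of the original solution:

```python
import numpy as np

# Verify E[ b * d/dtheta ln pi(a; theta) ] = 0 for an action-independent baseline b,
# using an assumed softmax policy over 3 actions with one shared parameter theta.
rng = np.random.default_rng(0)
theta = 0.7
c = np.array([1.0, 2.0, 3.0])                       # preference weights per action
prefs = c * theta                                   # preferences h(a) = c_a * theta
pi = np.exp(prefs) / np.exp(prefs).sum()            # pi(a; theta)

# For this parameterisation: d/dtheta ln pi(a) = c_a - sum_a' pi(a') c_a'
grad_log_pi = c - pi @ c

b = 5.0                                             # any action-independent baseline
samples = rng.choice(3, size=200_000, p=pi)
estimate = np.mean(b * grad_log_pi[samples])        # Monte Carlo estimate of E[b * grad]
exact = b * np.sum(pi * grad_log_pi)                # exact expectation (should be 0)
print(estimate, exact)                              # both approximately 0
```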
5. Consider the following policy-search algorithm for a multi-armed binary bandit:

$\forall a, \quad \pi_{t+1}(a) = \pi_t(a)(1 - \alpha) + \alpha\left(\mathbb{1}_{a = a_t} r_t + (1 - \mathbb{1}_{a = a_t})(1 - r_t)\right)$

where $\mathbb{1}_{a = a_t}$ is 1 if $a = a_t$ and 0 otherwise. Which of the following is true for the above
algorithm?
(a) It is the $L_{R-I}$ algorithm.
(b) It is the $L_{R-\epsilon P}$ algorithm.
(c) It would work well if the best arm had probability of 0.9 of resulting in +1 reward and
the next best arm had probability of 0.5 of resulting in +1 reward
(d) It would work well if the best arm had probability of 0.3 of resulting in +1 reward and
the worst arm had probability of 0.25 of resulting in +1 reward

Sol. (c)
The given algorithm is the $L_{R-P}$ algorithm. It would work well for the case described in (c): it gives equal weightage to penalties and rewards, and since the gap between the best arm's and the next-best arm's probability of giving +1 reward is significant, it would easily figure out the best arm.
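A minimal simulation sketch of this update rule on a two-armed binary bandit, with the arm probabilities from option (c) and an assumed step size:

```python
import numpy as np

# Simulation sketch of the update rule from question 5 on a 2-armed binary bandit.
# Success probabilities follow option (c); the step size alpha is an assumption.
rng = np.random.default_rng(1)
p_success = np.array([0.9, 0.5])     # P(reward = 1) for the best and next-best arm
pi = np.array([0.5, 0.5])            # policy over the two arms
alpha = 0.05

for t in range(2000):
    a_t = rng.choice(2, p=pi)
    r_t = float(rng.random() < p_success[a_t])     # binary reward in {0, 1}
    ind = (np.arange(2) == a_t).astype(float)      # indicator 1_{a = a_t}
    pi = pi * (1 - alpha) + alpha * (ind * r_t + (1 - ind) * (1 - r_t))
    # for two arms the update keeps pi summing to 1, so no renormalisation is needed

print(pi)                            # pi should end up strongly favouring arm 0
```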
6. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.
Reason: We can define an MDP with n states where n is the number of bandits. The number
of actions from each state corresponds to the arms in each bandit, with every action leading
to termination of the episode, and giving a reward according to the corresponding bandit and
arm.

(a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
(b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
(c) Assertion is true and Reason is false
(d) Both Assertion and Reason are false

Sol. (a)
The MDP given in the Reason correctly models the contextual bandit problem. The full RL problem is simply an extension of the contextual bandit problem in which the action taken in a state also affects the state transitions of the MDP.
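A small sketch of the MDP described in the Reason, with assumed reward probabilities: each context is a state, each arm is an action, and every action terminates the episode:

```python
import numpy as np

# Sketch of the MDP from the Reason: n states (contexts), each with its own arms;
# any action terminates the episode and yields that context/arm's reward.
# The reward probabilities below are illustrative assumptions.
rng = np.random.default_rng(2)
reward_prob = {                 # reward_prob[state][action] = P(reward = 1)
    0: [0.1, 0.8],
    1: [0.6, 0.3],
}

def step(state, action):
    """One step of the episodic MDP; every action leads to termination."""
    reward = float(rng.random() < reward_prob[state][action])
    done = True                 # the episode ends after a single action
    return reward, done

# usage: each episode is a single (context, action, reward) interaction
context = rng.integers(len(reward_prob))
r, done = step(context, action=1)
```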
7. Let’s assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s where we took action $a_1$. After a few time steps, at time t′, the same state s was reached, where we performed an action $a_2$ ($\neq a_1$). Which of the following statements is true?
(a) π is definitely a Stationary policy
(b) π is definitely a Non-Stationary policy
(c) π can be Stationary or Non-Stationary.

Sol. (c)
A stationary policy can be stochastic, and thus different actions can be chosen in the same state at different time steps. Hence π can be either a stationary or a non-stationary policy.
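A tiny sketch (action probabilities assumed) illustrating the point: a stationary stochastic policy keeps π(a|s) fixed over time, yet repeated visits to the same state can produce different actions:

```python
import numpy as np

# A stationary stochastic policy: pi(a|s) never changes, yet sampling it twice
# in the same state s can give different actions.
rng = np.random.default_rng(3)
pi = {"s": [0.5, 0.5]}          # assumed action probabilities for state "s"

a1 = rng.choice(2, p=pi["s"])   # action at time t
a2 = rng.choice(2, p=pi["s"])   # action at time t' in the same state
print(a1, a2)                   # may differ even though the policy is stationary
```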
8. Stochastic gradient ascent/descent updates occur in the right direction at every step.

(a) True
(b) False
Sol. (b)
Stochastic gradient descent updates need not always move in the “correct” direction (the direction of the gradient). However, stochastic gradient approaches do move in the correct direction in expectation.
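A quick numerical sketch (quadratic objective and noise scale assumed) showing that individual stochastic gradient estimates can point the wrong way while their average matches the true gradient:

```python
import numpy as np

# Sketch: for f(x) = x^2 at x = 1, the true gradient is 2. Noisy estimates
# g = 2x + noise often have the wrong sign, but their mean is the true gradient.
rng = np.random.default_rng(4)
x = 1.0
true_grad = 2 * x
noisy_grads = true_grad + rng.normal(0.0, 5.0, size=100_000)   # assumed noise scale

wrong_direction = np.mean(np.sign(noisy_grads) != np.sign(true_grad))
print(wrong_direction)           # a sizeable fraction of estimates point the wrong way
print(noisy_grads.mean())        # ...but the average is close to the true gradient 2.0
```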
9. Which of the following is true for an MDP?
(a) $\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t) = \Pr(s_{t+1}, r_{t+1})$
(b) $\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, \ldots, s_0, a_0) = \Pr(s_{t+1}, r_{t+1} \mid s_t, a_t)$
(c) $\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t) = \Pr(s_{t+1}, r_{t+1} \mid s_0, a_0)$
(d) $\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t) = \Pr(s_t, r_t \mid s_{t-1}, a_{t-1})$
Sol. (b)
(b) is the Markov property: the next state and reward depend only on the current state and action, not on the earlier history. (a), (c) and (d) are not true in general.
10. Remember for discounted returns,

$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$

where γ is the discount factor. Which of the following best explains what happens when γ > 1 (say γ = 5)?
(a) Nothing, γ > 1 is common for many RL problems
(b) Theoretically nothing can go wrong, but this case does not represent any real world
problems
(c) The agent will learn that delayed rewards will always be beneficial and so will not learn
properly.
(d) None of the above is true.

Sol. (c)
With γ > 1, rewards further in the future are multiplied by larger powers of γ and so have a greater impact on the current return. It is therefore highly probable that the agent learns not to finish the problem/game but simply to extend or continue it.
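A toy calculation (reward sequences assumed) showing why γ = 5 rewards postponement: receiving the same reward later produces a larger return:

```python
# Toy illustration: the same +1 reward received later yields a larger return when gamma > 1.
gamma = 5.0

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 0, 0, 0], gamma))   # reward now:          1.0
print(discounted_return([0, 0, 0, 1], gamma))   # reward 3 steps later: 125.0
```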
