Assignment 3

Reinforcement Learning
Prof. B. Ravindran
1. The baseline in the REINFORCE update should not depend on which of the following (without invalidating any of the steps in the proof of REINFORCE)?
(a) $r_{n-1}$
(b) $r_n$
(c) Action taken ($a_n$)
(d) None of the above
Sol. (c)
The baseline must not depend on the action taken. It can depend on current and past rewards; an example baseline given in the videos is the average of the rewards obtained so far.
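For concreteness, here is a minimal sketch (the step size, reward model, and gradient placeholder are assumptions, not part of the assignment) of a REINFORCE update that uses the average of past rewards as its baseline:

```python
import numpy as np

# Sketch: REINFORCE update with a running-average-of-rewards baseline.
# The baseline depends only on rewards observed so far, never on the
# current action, so the gradient estimator remains unbiased.

def reinforce_step(theta, grad_log_pi, r_t, baseline, alpha=0.1):
    """One update: theta_{t+1} = theta_t + alpha * (r_t - b) * d/dtheta ln pi."""
    return theta + alpha * (r_t - baseline) * grad_log_pi

theta = np.zeros(2)
rewards = []
for t in range(100):
    grad_log_pi = np.random.randn(2)           # stand-in for d/dtheta ln pi(a_t; theta_t)
    r_t = np.random.rand()                     # stand-in for the reward at step t
    b = np.mean(rewards) if rewards else 0.0   # average of rewards obtained so far
    theta = reinforce_step(theta, grad_log_pi, r_t, b)
    rewards.append(r_t)
```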

2. Which of the following statements is true about the RL problem?


(a) Our main aim is to maximize the cumulative reward.
(b) The agent always performs the actions in a deterministic fashion.
(c) We assume that the agent determines the next state based on the current state and action.
(d) It is impossible to have zero rewards.
Sol. (a)
The reward is outside the agent's control, and zero rewards are possible. Our main aim is to maximize the return. The agent can take actions in a stochastic fashion as well, and the next state is determined by the environment's dynamics, not by the agent.

3. Let us say we are taking actions according to a Gaussian distribution with parameters µ and σ. We update the parameters according to REINFORCE; let $a_t$ denote the action taken at step t.
(i) $\mu_{t+1} = \mu_t + \alpha r_t \frac{\mu_t - a_t}{\sigma_t^2}$

(ii) $\sigma_{t+1} = \sigma_t + \alpha r_t \left( \frac{(a_t - \mu_t)^2}{\sigma_t^3} - \frac{1}{\sigma_t} \right)$

(iii) $\sigma_{t+1} = \sigma_t + \alpha r_t \frac{(a_t - \mu_t)^2}{\sigma_t^3}$

(iv) $\mu_{t+1} = \mu_t + \alpha r_t \frac{a_t - \mu_t}{\sigma_t^2}$

Which of the above updates are correct?


(a) (i), (iii)
(b) (i), (iv)
(c) (ii), (iv)
(d) (ii), (iii)

Sol. (c)
The Gaussian policy is $\pi(a_t; \mu_t, \sigma_t) = \frac{1}{\sqrt{2\pi\sigma_t^2}} e^{-\frac{(a_t - \mu_t)^2}{2\sigma_t^2}}$. Deriving the update according to the REINFORCE formula, $\frac{\partial \ln \pi}{\partial \mu_t} = \frac{a_t - \mu_t}{\sigma_t^2}$ and $\frac{\partial \ln \pi}{\partial \sigma_t} = \frac{(a_t - \mu_t)^2}{\sigma_t^3} - \frac{1}{\sigma_t}$, which give updates (iv) and (ii).
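As a sanity check, here is a small sketch of updates (ii) and (iv) in code; the step size and the toy reward function are assumptions made for illustration:

```python
import numpy as np

# Sketch of REINFORCE updates (ii) and (iv) for a 1-D Gaussian policy.
# The step size alpha and the toy reward are illustrative assumptions.
alpha = 0.01
mu, sigma = 0.0, 1.0

for t in range(5000):
    a_t = np.random.normal(mu, sigma)            # sample a_t ~ N(mu, sigma^2)
    r_t = np.exp(-(a_t - 2.0) ** 2)              # toy reward, largest near a = 2

    grad_mu = (a_t - mu) / sigma ** 2                         # d ln pi / d mu
    grad_sigma = (a_t - mu) ** 2 / sigma ** 3 - 1.0 / sigma   # d ln pi / d sigma

    mu += alpha * r_t * grad_mu                  # update (iv)
    sigma += alpha * r_t * grad_sigma            # update (ii)
    sigma = max(sigma, 1e-3)                     # practical guard to keep sigma > 0

print(mu, sigma)                                 # mu should drift towards 2
```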

4. The update in REINFORCE is given by $\theta_{t+1} = \theta_t + \alpha r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}$, where $r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}$ is an unbiased estimator of the true gradient of the performance function. However, there was another variant of REINFORCE, where a baseline $b$, that is independent of the action taken, is subtracted from the obtained reward, i.e., the update is given by $\theta_{t+1} = \theta_t + \alpha (r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}$. How are $E\big[(r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$ and $E\big[r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$ related?

(a) $E\big[(r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big] = E\big[r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$
(b) $E\big[(r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big] < E\big[r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$
(c) $E\big[(r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big] > E\big[r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$
(d) Could be either of (a), (b) or (c), depending on the choice of baseline

Sol. (a)

$E\left[b \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\right] = E\left[b \frac{1}{\pi(a_t; \theta_t)} \frac{\partial \pi(a_t; \theta_t)}{\partial \theta_t}\right]$
$= \sum_a b \frac{1}{\pi(a; \theta_t)} \frac{\partial \pi(a; \theta_t)}{\partial \theta_t} \pi(a; \theta_t)$
$= b \sum_a \frac{\partial \pi(a; \theta_t)}{\partial \theta_t}$ (since $b$ does not depend on the action)
$= b \frac{\partial 1}{\partial \theta_t}$
$= 0$

Thus, $E\big[(r_t - b) \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big] = E\big[r_t \frac{\partial \ln \pi(a_t; \theta_t)}{\partial \theta_t}\big]$.
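The zero-expectation step can also be checked numerically; the following sketch assumes a three-action softmax policy with a single parameter, which is not part of the original solution:

```python
import numpy as np

# Verify E[ b * d/dtheta ln pi(a; theta) ] = 0 for an action-independent baseline b,
# using an assumed softmax policy over 3 actions with one shared parameter theta.
rng = np.random.default_rng(0)
theta = 0.7
c = np.array([1.0, 2.0, 3.0])                       # preference weights per action
prefs = c * theta                                   # preferences h(a) = c_a * theta
pi = np.exp(prefs) / np.exp(prefs).sum()            # pi(a; theta)

# For this parameterisation: d/dtheta ln pi(a) = c_a - sum_a' pi(a') c_a'
grad_log_pi = c - pi @ c

b = 5.0                                             # any action-independent baseline
samples = rng.choice(3, size=200_000, p=pi)
estimate = np.mean(b * grad_log_pi[samples])        # Monte Carlo estimate of E[b * grad]
exact = b * np.sum(pi * grad_log_pi)                # exact expectation (should be 0)
print(estimate, exact)                              # both approximately 0
```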
5. Consider the following policy-search algorithm for a multi-armed binary bandit:

$\forall a, \quad \pi_{t+1}(a) = \pi_t(a)(1 - \alpha) + \alpha\left(\mathbb{1}_{a = a_t} r_t + (1 - \mathbb{1}_{a = a_t})(1 - r_t)\right)$

where $\mathbb{1}_{a = a_t}$ is 1 if $a = a_t$ and 0 otherwise. Which of the following is true for the above
algorithm?
(a) It is the $L_{R-I}$ algorithm.
(b) It is the $L_{R-\epsilon P}$ algorithm.
(c) It would work well if the best arm had probability of 0.9 of resulting in +1 reward and
the next best arm had probability of 0.5 of resulting in +1 reward
(d) It would work well if the best arm had probability of 0.3 of resulting in +1 reward and
the worst arm had probability of 0.25 of resulting in +1 reward

Sol. (c)
The given algorithm is the $L_{R-P}$ algorithm. It would work well for the case described in (c): it gives equal weightage to penalties and rewards, and since the gap between the best arm's and the next-best arm's probability of giving +1 reward is significant, it would easily figure out the best arm.
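A minimal simulation sketch of this update rule on a two-armed binary bandit, with the arm probabilities from option (c) and an assumed step size:

```python
import numpy as np

# Simulation sketch of the update rule from question 5 on a 2-armed binary bandit.
# Success probabilities follow option (c); the step size alpha is an assumption.
rng = np.random.default_rng(1)
p_success = np.array([0.9, 0.5])     # P(reward = 1) for the best and next-best arm
pi = np.array([0.5, 0.5])            # policy over the two arms
alpha = 0.05

for t in range(2000):
    a_t = rng.choice(2, p=pi)
    r_t = float(rng.random() < p_success[a_t])     # binary reward in {0, 1}
    ind = (np.arange(2) == a_t).astype(float)      # indicator 1_{a = a_t}
    pi = pi * (1 - alpha) + alpha * (ind * r_t + (1 - ind) * (1 - r_t))
    # for two arms the update keeps pi summing to 1, so no renormalisation is needed

print(pi)                            # pi should end up strongly favouring arm 0
```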
6. Assertion: Contextual bandits can be modeled as a full reinforcement learning problem.
Reason: We can define an MDP with n states where n is the number of bandits. The number
of actions from each state corresponds to the arms in each bandit, with every action leading
to termination of the episode, and giving a reward according to the corresponding bandit and
arm.

(a) Assertion and Reason are both true and Reason is a correct explanation of Assertion
(b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion
(c) Assertion is true and Reason is false
(d) Both Assertion and Reason are false

Sol. (a)
The MDP given in the Reason correctly models the contextual bandit problem. The full RL problem is simply an extension of the contextual bandit problem in which the action taken in a state also affects the state transitions of the MDP.
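A small sketch of the MDP described in the Reason, with assumed reward probabilities: each context is a state, each arm is an action, and every action terminates the episode:

```python
import numpy as np

# Sketch of the MDP from the Reason: n states (contexts), each with its own arms;
# any action terminates the episode and yields that context/arm's reward.
# The reward probabilities below are illustrative assumptions.
rng = np.random.default_rng(2)
reward_prob = {                 # reward_prob[state][action] = P(reward = 1)
    0: [0.1, 0.8],
    1: [0.6, 0.3],
}

def step(state, action):
    """One step of the episodic MDP; every action leads to termination."""
    reward = float(rng.random() < reward_prob[state][action])
    done = True                 # the episode ends after a single action
    return reward, done

# usage: each episode is a single (context, action, reward) interaction
context = rng.integers(len(reward_prob))
r, done = step(context, action=1)
```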
7. Let’s assume that for some full RL problem we are acting according to a policy π. At some time t, we are in a state s where we took action $a_1$. After a few time steps, at time t′, the same state s was reached, where we performed an action $a_2$ ($\neq a_1$). Which of the following statements is true?
(a) π is definitely a Stationary policy
(b) π is definitely a Non-Stationary policy
(c) π can be Stationary or Non-Stationary.

Sol. (c)
A stationary policy can be stochastic, and thus different actions can be chosen in the same state at different time steps. Hence π can be either a stationary or a non-stationary policy.
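A tiny sketch (action probabilities assumed) illustrating the point: a stationary stochastic policy keeps π(a|s) fixed over time, yet repeated visits to the same state can produce different actions:

```python
import numpy as np

# A stationary stochastic policy: pi(a|s) never changes, yet sampling it twice
# in the same state s can give different actions.
rng = np.random.default_rng(3)
pi = {"s": [0.5, 0.5]}          # assumed action probabilities for state "s"

a1 = rng.choice(2, p=pi["s"])   # action at time t
a2 = rng.choice(2, p=pi["s"])   # action at time t' in the same state
print(a1, a2)                   # may differ even though the policy is stationary
```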
8. Stochastic gradient ascent/descent updates occur in the right direction at every step.

(a) True
(b) False
Sol. (b)
Stochastic gradient descent updates need not always move in the “correct” direction (the direction of the gradient). However, stochastic gradient approaches do move in the correct direction in expectation.
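A quick numerical sketch (quadratic objective and noise scale assumed) showing that individual stochastic gradient estimates can point the wrong way while their average matches the true gradient:

```python
import numpy as np

# Sketch: for f(x) = x^2 at x = 1, the true gradient is 2. Noisy estimates
# g = 2x + noise often have the wrong sign, but their mean is the true gradient.
rng = np.random.default_rng(4)
x = 1.0
true_grad = 2 * x
noisy_grads = true_grad + rng.normal(0.0, 5.0, size=100_000)   # assumed noise scale

wrong_direction = np.mean(np.sign(noisy_grads) != np.sign(true_grad))
print(wrong_direction)           # a sizeable fraction of estimates point the wrong way
print(noisy_grads.mean())        # ...but the average is close to the true gradient 2.0
```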
9. Which of the following is true for an MDP?
(a) $\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t) = \Pr(s_{t+1}, r_{t+1})$
(b) $\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, \ldots, s_0, a_0) = \Pr(s_{t+1}, r_{t+1} \mid s_t, a_t)$
(c) $\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t) = \Pr(s_{t+1}, r_{t+1} \mid s_0, a_0)$
(d) $\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t) = \Pr(s_t, r_t \mid s_{t-1}, a_{t-1})$
Sol. (b)
(b) is the Markov property: the next state and reward depend only on the current state and action, not on the earlier history. (a), (c) and (d) are not true in general.
10. Remember for discounted returns,

$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$

where γ is the discount factor. Which of the following best explains what happens when γ > 1 (say γ = 5)?
(a) Nothing, γ > 1 is common for many RL problems
(b) Theoretically nothing can go wrong, but this case does not represent any real world
problems
(c) The agent will learn that delayed rewards will always be beneficial and so will not learn
properly.
(d) None of the above is true.

Sol. (c)
With γ > 1, rewards further in the future are multiplied by larger powers of γ and so have a greater impact on the current return. It is therefore highly probable that the agent learns not to finish the problem/game but simply to extend or continue it.
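A toy calculation (reward sequences assumed) showing why γ = 5 rewards postponement: receiving the same reward later produces a larger return:

```python
# Toy illustration: the same +1 reward received later yields a larger return when gamma > 1.
gamma = 5.0

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 0, 0, 0], gamma))   # reward now:          1.0
print(discounted_return([0, 0, 0, 1], gamma))   # reward 3 steps later: 125.0
```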
