Chapter 13: Policy Gradient Methods
by Richard Sutton and Andrew Barto
October 9, 2020
Content
Introduction
Policy approximation and its advantages
The policy gradient theorem for episodic problems
REINFORCE: Monte Carlo
REINFORCE with Baseline
Actor-Critic methods
Policy Gradient for continuing problems
Policy parameterization for continuous actions
Summary
Introduction
Policy gradient methods learn a parameterized policy $\pi(a \mid s, \theta)$ and update the policy parameter $\theta$ by approximate gradient ascent on a scalar performance measure $J(\theta)$:
\[
\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)}, \tag{13.1}
\]
where $\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of $J$ with respect to $\theta_t$.
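A minimal sketch of update (13.1), assuming a hypothetical routine `grad_estimate(theta)` that returns a stochastic estimate of $\nabla J(\theta)$; the rest of the chapter is about how such estimates can be formed from experience.

```python
import numpy as np

def ascent_step(theta, grad_estimate, alpha=0.01):
    """One gradient-ascent step on the policy parameter theta, as in (13.1).

    `grad_estimate(theta)` is assumed to return a stochastic estimate whose
    expectation approximates the gradient of the performance measure J.
    """
    return theta + alpha * np.asarray(grad_estimate(theta))
```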
Policy approximation and its advantages
If the action space is discrete and not too large, a natural parameterization forms numerical action preferences $h(s, a, \theta)$ and selects actions according to an exponential softmax distribution:
\[
\pi(a \mid s, \theta) \doteq \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}}. \tag{13.2}
\]
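A minimal sketch of the softmax-in-action-preferences policy (13.2), assuming linear preferences $h(s,a,\theta) = \theta^\top x(s,a)$ built from a hypothetical state–action feature function `features(state, action)`:

```python
import numpy as np

def softmax_policy(theta, features, state, actions):
    """Return the action probabilities pi(.|s, theta) of (13.2)."""
    # Linear action preferences h(s, a, theta) = theta^T x(s, a).
    prefs = np.array([theta @ features(state, a) for a in actions])
    prefs -= prefs.max()                 # shift for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()
```

One advantage of this parameterization is that the approximate policy can approach a deterministic policy and can represent arbitrary stochastic policies, which ε-greedy selection over action values cannot.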
The policy gradient theorem for episodic problems
Instantiating the generic gradient-ascent update (13.1) with the policy gradient theorem yields
\[
\theta_{t+1} \doteq \theta_t + \alpha \sum_a \hat{q}(S_t, a, w)\, \nabla \pi(a \mid S_t, \theta), \tag{13.7}
\]
where $\hat{q}$ is a learned approximation to $q_\pi$. This is called the all-actions algorithm because its update involves all of the actions; a sketch follows below.
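A minimal sketch of the all-actions update (13.7), assuming a learned action-value approximation `q_hat(state, action, w)` and a hypothetical `grad_pi(action, state, theta)` that returns $\nabla \pi(a \mid s, \theta)$:

```python
import numpy as np

def all_actions_update(theta, w, state, actions, q_hat, grad_pi, alpha=0.01):
    """theta <- theta + alpha * sum_a q_hat(S_t, a, w) * grad pi(a|S_t, theta)."""
    total = np.zeros_like(theta, dtype=float)
    for a in actions:
        total += q_hat(state, a, w) * np.asarray(grad_pi(a, state, theta))
    return theta + alpha * total
```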
REINFORCE: Monte Carlo
\[
\theta_{t+1} \doteq \theta_t + \alpha\, G_t\, \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} \tag{13.8}
\]
Note that REINFORCE uses the complete return from time t, which
includes all future rewards up until the end of the episode.
In this sense REINFORCE is a Monte Carlo algorithm and is well
defined only for the episodic case with all updates made in retrospect
after the episode is completed.
Eligibility vector: the ratio in (13.8) can be written compactly as
\[
\frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} = \nabla \ln \pi(A_t \mid S_t, \theta_t),
\]
using the identity $\nabla \ln x = \frac{\nabla x}{x}$.
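A minimal sketch of REINFORCE over one completed episode, assuming a hypothetical `grad_log_pi(action, state, theta)` that returns the eligibility vector $\nabla \ln \pi(a \mid s, \theta)$; the $\gamma^t$ factor follows the book's episodic pseudocode and disappears when $\gamma = 1$, recovering (13.8):

```python
import numpy as np

def reinforce_episode(theta, states, actions, rewards, grad_log_pi,
                      alpha=1e-3, gamma=1.0):
    """Monte Carlo REINFORCE: update theta once per step of a finished episode."""
    T = len(rewards)
    for t in range(T):
        # Return G_t: (discounted) sum of the rewards from time t onward.
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        theta = theta + alpha * gamma ** t * G * np.asarray(
            grad_log_pi(actions[t], states[t], theta))
    return theta
```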
REINFORCE: Monte Carlo
Figure: REINFORCE on the short-corridor gridworld. With a good step size, the
total reward per episode approaches the optimal value of the start state.
REINFORCE with baseline
REINFORCE with baseline generalizes (13.8) by subtracting a baseline $b(S_t)$ from the return, here a learned state-value estimate $\hat{v}(S_t, w)$:
\[
\theta_{t+1} \doteq \theta_t + \alpha \bigl( G_t - b(S_t) \bigr) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}. \tag{13.11}
\]
The algorithm has two step sizes, $\alpha^{\theta}$ and $\alpha^{w}$. Choosing the step size for the state-value weights, $\alpha^{w}$, is relatively easy; it is much less clear how to set the step size for the policy parameters, $\alpha^{\theta}$, whose best value depends on the range of variation of the rewards and on the policy parameterization. A per-step update sketch follows below.
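A minimal per-step sketch of REINFORCE with baseline, assuming a linear state-value baseline $\hat{v}(s, w) = w^\top x(s)$ with a hypothetical feature function `x(state)` and `grad_log_pi` as before; `gamma_t` stands for the $\gamma^t$ factor carried through the episode in the book's pseudocode:

```python
import numpy as np

def reinforce_baseline_step(theta, w, G, state, action, x, grad_log_pi,
                            alpha_theta=1e-3, alpha_w=1e-2, gamma_t=1.0):
    """Update the value weights w and policy parameters theta for one time step."""
    x_s = np.asarray(x(state))
    delta = G - w @ x_s                              # G_t - v_hat(S_t, w)
    w = w + alpha_w * delta * x_s                    # value-weight update
    theta = theta + alpha_theta * gamma_t * delta * np.asarray(
        grad_log_pi(action, state, theta))           # policy-parameter update
    return theta, w
```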
Actor-Critic methods
One-step actor–critic methods replace the full return of REINFORCE
(13.11) with the one-step return (and use a learned state-value
function as the baseline) as follows:
\[
\begin{aligned}
\theta_{t+1} &\doteq \theta_t + \alpha \Bigl( G_{t:t+1} - \hat{v}(S_t, w) \Bigr) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} && \text{(13.12)} \\
&= \theta_t + \alpha \Bigl( R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \Bigr) \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} && \text{(13.13)} \\
&= \theta_t + \alpha\, \delta_t\, \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} && \text{(13.14)}
\end{aligned}
\]
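A minimal sketch of the one-step actor–critic update (13.12)–(13.14), assuming a linear critic $\hat{v}(s, w) = w^\top x(s)$ with a hypothetical feature function `x(state)` and `grad_log_pi` for the actor's eligibility vector; `I` plays the role of the $\gamma^t$ factor maintained online in the book's pseudocode:

```python
import numpy as np

def actor_critic_step(theta, w, state, action, reward, next_state, done,
                      x, grad_log_pi, alpha_theta=1e-3, alpha_w=1e-2,
                      gamma=0.99, I=1.0):
    """One online update of the critic weights w and the actor parameters theta."""
    x_s = np.asarray(x(state))
    v_next = 0.0 if done else w @ np.asarray(x(next_state))
    delta = reward + gamma * v_next - w @ x_s        # TD error delta_t (13.13)
    w = w + alpha_w * delta * x_s                    # critic update
    theta = theta + alpha_theta * I * delta * np.asarray(
        grad_log_pi(action, state, theta))           # actor update (13.14)
    return theta, w
```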
Policy gradient for continuing problems
Policy parameterization for continuous actions
For continuous action spaces, the policy can be defined as the normal probability density over a real-valued action, with mean and standard deviation given by parametric function approximators:
\[
\mu(s, \theta) \doteq \theta_{\mu}^{\top} x_{\mu}(s), \qquad \sigma(s, \theta) \doteq \exp\bigl(\theta_{\sigma}^{\top} x_{\sigma}(s)\bigr),
\]
where $x_{\mu}(s)$ and $x_{\sigma}(s)$ are state feature vectors, perhaps constructed by one of the methods described in Section 9.5.
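A minimal sketch of this Gaussian parameterization for a scalar continuous action, assuming hypothetical feature functions `x_mu(state)` and `x_sigma(state)` and separate parameter vectors for the mean and standard deviation:

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_mu, x_sigma, state, rng=None):
    """Sample an action a ~ N(mu(s, theta), sigma(s, theta)^2)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = theta_mu @ np.asarray(x_mu(state))                   # mu(s, theta)
    sigma = np.exp(theta_sigma @ np.asarray(x_sigma(state)))  # sigma(s, theta) > 0
    return rng.normal(mu, sigma)
```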
Summary
The End