
Chapter 13: Policy Gradient Methods

by Richard Sutton and Andrew Barto

October 9, 2020

Content

Introduction
Policy approximation and its advantages
The policy gradient theorem for episodic problems
REINFORCE: Monte Carlo
REINFORCE with Baseline
Actor-Critic methods
Policy Gradient for continuing problems
Policy parameterization for continuous actions
Summary

Introduction

In this chapter we consider methods that learn a parameterized policy that can select actions without consulting a value function.
A value function may still be used to learn the policy parameter, but is not required for action selection.
Learning the policy parameter is based on the gradient of some scalar performance measure J(θ) with respect to the policy parameter θ ∈ ℝ^{d′}, where d′ is the dimension of the parameter vector.
Maximize performance by updating the parameter using the rule:

    θ_{t+1} = θ_t + α ∇̂J(θ_t)    (13.1)

where ∇̂J(θ_t) is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument θ_t.

Policy approximation and its advantages

Numerical preferences for each state-action pair: h(s, a, θ) ∈ ℝ.
Soft-max in action preferences:

    π(a|s, θ) ≐ e^{h(s,a,θ)} / Σ_b e^{h(s,b,θ)}    (13.2)

h(s, a, θ) can be the output of a neural network (NN), where θ is the weight vector of all connections of the NN.
Or it could be linear in features:

    h(s, a, θ) = θ^⊤ x(s, a)    (13.3)

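To make the parameterization concrete, here is a minimal NumPy sketch of the soft-max policy (13.2) with linear preferences (13.3); the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def softmax_policy(theta, x_sa):
    """pi(.|s, theta) from (13.2) with linear preferences (13.3).
    x_sa: (num_actions, d') matrix whose rows are the feature vectors x(s, a)."""
    prefs = x_sa @ theta              # h(s, a, theta) = theta^T x(s, a), one entry per action
    prefs = prefs - prefs.max()       # shift preferences for numerical stability (probabilities unchanged)
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# usage: one state with 3 actions and d' = 4, random features purely for illustration
rng = np.random.default_rng(0)
theta = rng.normal(size=4)
x_sa = rng.normal(size=(3, 4))
probs = softmax_policy(theta, x_sa)
action = rng.choice(len(probs), p=probs)
```

Sampling the action from the returned probabilities is all that is needed for action selection; no value function is consulted.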
Policy approximation and its advantages

Advantages with respect to action-value (AV) methods:


Policy gradient (PG) methods can approach a deterministic policy. In contrast, with ε-greedy AV methods there is always a probability ε of selecting a random action.
PG methods can learn arbitrary probabilities for action selection. In
problems with significant function approximation, the best
approximate policy may be stochastic.
PG methods are driven to produce optimal stochastic policies. If the
optimal policy is deterministic, then the preferences of the optimal
actions will be driven infinitely higher than all suboptimal actions (if
permitted by the parameterization).

The policy gradient theorem for episodic problems

In particular, it is the continuity of the policy's dependence on the parameters that enables policy-gradient methods to approximate gradient ascent.
Largely because of this, stronger convergence guarantees are available for policy-gradient methods than for action-value methods.
In the episodic case we define performance as the value of the start state of the episode:

    J(θ) ≐ v_{π_θ}(s_0)    (13.4)

The policy gradient theorem for episodic problems

Claim: The gradient of performance with respect to the policy parameter does not involve the derivative of the state distribution:

    ∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s, a) ∇π(a|s, θ)    (13.5)

Why is this remarkable? With function approximation it may seem challenging to change the policy parameter in a way that ensures improvement: performance depends on both the action selections (agent) and the distribution of states in which those selections are made (model/environment), and the effect of the policy parameter on the state distribution is typically unknown. The theorem nevertheless gives an expression for the gradient that involves no such derivatives.

The policy gradient theorem for episodic problems

Figure: Proof of Policy Gradient Theorem Part 1

The policy gradient theorem for episodic problems

Figure: Proof of Policy Gradient Theorem Part 2

For more details, check Lilian Weng's webpage: https://bit.ly/30OxHLj
REINFORCE: Monte Carlo
The right-hand side of the policy gradient theorem is a sum over states weighted by how often the states occur under the target policy π; if π is followed, then states will be encountered in these proportions:

    ∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s, a) ∇π(a|s, θ)
           = E_π[ Σ_a q_π(S_t, a) ∇π(a|S_t, θ) ]    (13.6)

Instantiating our stochastic gradient-ascent algorithm (13.1) with this expectation gives:

    θ_{t+1} ≐ θ_t + α Σ_a q̂(S_t, a, w) ∇π(a|S_t, θ)    (13.7)

This is called the all-actions algorithm because its update involves all of the actions; a small code sketch follows below.

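A rough sketch of the all-actions update (13.7), assuming the soft-max-linear parameterization of (13.2) and (13.3); here q_hat stands for a vector of action-value estimates q̂(S_t, ·, w) obtained elsewhere, and all names are hypothetical.

```python
import numpy as np

def all_actions_update(theta, x_sa, q_hat, alpha):
    """One all-actions step (13.7) for the soft-max-linear policy.
    x_sa: (num_actions, d') feature matrix for the current state S_t;
    q_hat: array of action-value estimates, one entry per action."""
    prefs = x_sa @ theta
    prefs = prefs - prefs.max()
    pi = np.exp(prefs) / np.exp(prefs).sum()
    x_bar = pi @ x_sa                           # expected feature vector under pi
    grad = np.zeros_like(theta)
    for a in range(len(pi)):
        grad_pi_a = pi[a] * (x_sa[a] - x_bar)   # grad of pi(a|s,theta) in the soft-max-linear case
        grad += q_hat[a] * grad_pi_a
    return theta + alpha * grad
```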
REINFORCE: Monte Carlo

The classical REINFORCE algorithm focuses only on the action A_t actually taken at time t:

    ∇J(θ) ∝ E_π[ Σ_a π(a|S_t, θ) q_π(S_t, a) ∇π(a|S_t, θ) / π(a|S_t, θ) ]
           = E_π[ q_π(S_t, A_t) ∇π(A_t|S_t, θ) / π(A_t|S_t, θ) ]    (replacing the sum over a by a sample A_t ~ π)
           = E_π[ G_t ∇π(A_t|S_t, θ) / π(A_t|S_t, θ) ]    (because E_π[G_t | S_t, A_t] = q_π(S_t, A_t))

The final expression in brackets is exactly what is needed: a quantity that can be sampled on each time step and whose expectation is proportional to the gradient.

REINFORCE: Monte Carlo

Using this sample to instantiate our generic stochastic gradient-ascent algorithm (13.1) yields the REINFORCE update:

    θ_{t+1} ≐ θ_t + α G_t ∇π(A_t|S_t, θ_t) / π(A_t|S_t, θ_t)    (13.8)

Note that REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode.
In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case, with all updates made in retrospect after the episode is completed.
The fraction ∇π(A_t|S_t, θ_t) / π(A_t|S_t, θ_t) is called the eligibility vector; using the identity ∇ln x = ∇x / x, it can be written compactly as ∇ln π(A_t|S_t, θ_t). A code sketch of the update follows below.

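A minimal sketch of the REINFORCE update (13.8), again assuming the soft-max-linear policy; the episode data layout (a list of per-step tuples) is an assumption made only for illustration.

```python
import numpy as np

def grad_log_softmax(theta, x_sa, a):
    """Eligibility vector for the soft-max-linear policy (13.2)-(13.3):
    grad ln pi(a|s,theta) = x(s,a) - sum_b pi(b|s,theta) x(s,b)."""
    prefs = x_sa @ theta
    prefs = prefs - prefs.max()
    pi = np.exp(prefs) / np.exp(prefs).sum()
    return x_sa[a] - pi @ x_sa

def reinforce_episode(theta, episode, alpha, gamma=1.0):
    """Apply the REINFORCE update (13.8) once per time step of a completed episode.
    episode: list of (x_sa, a, reward) tuples, where x_sa is the (num_actions, d')
    feature matrix of the visited state and a is the index of the action taken."""
    G, returns = 0.0, []
    for _, _, r in reversed(episode):            # accumulate the returns G_t backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (x_sa, a, _), G_t in zip(episode, returns):
        theta = theta + alpha * G_t * grad_log_softmax(theta, x_sa, a)
    return theta
```

Because the full returns G_t are needed, the updates can only be applied after the episode has terminated, which is what makes this a Monte Carlo method.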
REINFORCE: Monte Carlo

Figure: REINFORCE Algorithm

As a stochastic gradient method, REINFORCE has good theoretical convergence properties for sufficiently small α.
However, as a Monte Carlo method, REINFORCE may be of high variance and thus produce slow learning.

REINFORCE: Monte Carlo

Figure: REINFORCE on the short-corridor gridworld. With a good step size, the
total reward per episode approaches the optimal value of the start state.

REINFORCE with baseline

The policy gradient theorem (13.5) can be generalized to include a comparison of the action value to an arbitrary baseline b(s):

    ∇J(θ) ∝ Σ_s μ(s) Σ_a ( q_π(s, a) − b(s) ) ∇π(a|s, θ)    (13.10)

The baseline can be any function, even a random variable, as long as it does not vary with a; the equation remains valid because the subtracted quantity is zero:

    Σ_a b(s) ∇π(a|s, θ) = b(s) ∇ Σ_a π(a|s, θ) = b(s) ∇1 = 0

REINFORCE with baseline

The update rule that we end up with is a new version of REINFORCE that includes a general baseline:

    θ_{t+1} ≐ θ_t + α ( G_t − b(S_t) ) ∇π(A_t|S_t, θ_t) / π(A_t|S_t, θ_t)    (13.11)

Because the baseline could be uniformly zero, this update is a strict generalization of REINFORCE.
In general, the baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance.
One natural choice for the baseline is an estimate of the state value, v̂(S_t, w), where w ∈ ℝ^d is a weight vector learned by one of the methods presented in previous chapters.

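A sketch of this baselined REINFORCE with a learned linear state-value baseline v̂(s, w) = w⊤x_v(s), following (13.11); the feature layout and step sizes are assumptions made for illustration.

```python
import numpy as np

def reinforce_with_baseline(theta, w, episode, alpha_theta, alpha_w, gamma=1.0):
    """REINFORCE with a learned linear baseline, as in (13.11).
    episode: list of (x_sa, x_v, a, reward) tuples: per-state action-feature matrix,
    state feature vector, action index, and reward."""
    G, returns = 0.0, []
    for *_, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (x_sa, x_v, a, _), G_t in zip(episode, returns):
        delta = G_t - w @ x_v                    # G_t - b(S_t), with b = v_hat
        w = w + alpha_w * delta * x_v            # move the baseline toward the observed return
        prefs = x_sa @ theta
        prefs = prefs - prefs.max()
        pi = np.exp(prefs) / np.exp(prefs).sum()
        grad_ln_pi = x_sa[a] - pi @ x_sa         # eligibility vector of the soft-max-linear policy
        theta = theta + alpha_theta * delta * grad_ln_pi
    return theta, w
```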
REINFORCE with baseline

Choose α^w according to Section 9.6: α^w = 0.1 / E[ ‖∇v̂(S_t, w)‖²_μ ].
It is much less clear how to set the step size for the policy parameters, α^θ, whose best value depends on the range of variation of the rewards and on the policy parameterization.

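For a linear value function v̂(s, w) = w⊤x(s) the gradient is just x(s), so the rule above reduces to averaging squared feature norms over observed states; a small sketch under that assumption:

```python
import numpy as np

def alpha_w_heuristic(state_features, target=0.1):
    """Step-size heuristic alpha_w = 0.1 / E[ ||grad v_hat(S_t, w)||^2 ] for a linear
    v_hat, where grad v_hat(s, w) = x(s). state_features: a batch of observed state
    feature vectors, assumed to be roughly distributed as mu."""
    sq_norms = np.sum(np.asarray(state_features, dtype=float) ** 2, axis=1)
    return target / sq_norms.mean()
```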
REINFORCE with baseline

Figure: Adding a baseline to REINFORCE can make it learn much faster, as illustrated here on the short-corridor gridworld. The step size used here for plain REINFORCE is that at which it performs best (to the nearest power of two).

Actor-Critic methods

In REINFORCE with baseline, the learned state-value function estimates the value of only the first state of each state transition.
This estimate sets a baseline for the subsequent return, but it is made prior to the transition's action and thus cannot be used to assess that action.
In actor–critic methods, on the other hand, the state-value function is applied also to the second state of the transition.
The estimated value of the second state, when discounted and added to the reward, constitutes the one-step return, G_{t:t+1}, which is a useful estimate of the actual return and thus a way of assessing the action.
When the state-value function is used to assess actions in this way it is called a critic, and the overall policy-gradient method is termed an actor–critic method.

Actor-Critic methods
One-step actor–critic methods replace the full return of REINFORCE (13.11) with the one-step return (and use a learned state-value function as the baseline) as follows:

    θ_{t+1} ≐ θ_t + α ( G_{t:t+1} − v̂(S_t, w) ) ∇π(A_t|S_t, θ_t) / π(A_t|S_t, θ_t)    (13.12)
            = θ_t + α ( R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w) ) ∇π(A_t|S_t, θ_t) / π(A_t|S_t, θ_t)    (13.13)
            = θ_t + α δ_t ∇π(A_t|S_t, θ_t) / π(A_t|S_t, θ_t)    (13.14)

where δ_t ≐ R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w) is the TD error.
The natural state-value-function learning method to pair with this is semi-gradient TD(0).

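A sketch of one transition of this one-step actor–critic, combining a linear semi-gradient TD(0) critic with the soft-max-linear actor of (13.2) and (13.3); names, shapes, and step sizes are illustrative.

```python
import numpy as np

def one_step_actor_critic(theta, w, x_sa, x_v, a, R, x_v_next, done,
                          alpha_theta, alpha_w, gamma=0.99):
    """One transition, following (13.12)-(13.14), with a linear critic
    v_hat(s, w) = w @ x_v(s) and a soft-max-linear actor."""
    v_s = w @ x_v
    v_next = 0.0 if done else w @ x_v_next
    delta = R + gamma * v_next - v_s                   # TD error delta_t, as in (13.13)
    w = w + alpha_w * delta * x_v                      # semi-gradient TD(0) critic update
    prefs = x_sa @ theta
    prefs = prefs - prefs.max()
    pi = np.exp(prefs) / np.exp(prefs).sum()
    grad_ln_pi = x_sa[a] - pi @ x_sa                   # eligibility vector
    theta = theta + alpha_theta * delta * grad_ln_pi   # actor update, as in (13.14)
    return theta, w
```

Unlike REINFORCE, this update can be applied online after every transition, at the cost of the bias introduced by bootstrapping on v̂.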
Actor-Critic methods

The generalizations to the forward view of n-step methods and then to a λ-return algorithm are straightforward.
The one-step return in (13.12) is merely replaced by G_{t:t+n} or G_t^λ, respectively.
The backward view of the λ-return algorithm is also straightforward, using separate eligibility traces for the actor and the critic, each following the patterns in Chapter 12.

Policy gradient for continuing problems

For continuing problems without episode boundaries we need to define performance in terms of the average rate of reward per time step:

    J(θ) ≐ r(π) ≐ lim_{h→∞} (1/h) Σ_{t=1}^{h} E[ R_t | S_0, A_{0:t−1} ~ π ]    (13.15)
                = lim_{t→∞} E[ R_t | S_0, A_{0:t−1} ~ π ]
                = Σ_s μ(s) Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) r

where μ is the steady-state distribution under π:

    Σ_s μ(s) Σ_a π(a|s, θ) p(s′ | s, a) = μ(s′),  for all s′ ∈ S    (13.16)

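In this average-reward setting the actor–critic updates of (13.12) to (13.14) use the differential TD error, with the discount replaced by subtracting an estimate R̄ of r(π); a small sketch, with all names assumed for illustration:

```python
import numpy as np

def differential_actor_critic_step(theta, w, r_bar, x_sa, x_v, a, R, x_v_next,
                                   alpha_theta, alpha_w, alpha_rbar):
    """One transition of a differential (average-reward) actor-critic sketch:
    delta = R - r_bar + v_hat(S') - v_hat(S), with a linear critic and a
    soft-max-linear actor."""
    delta = R - r_bar + w @ x_v_next - w @ x_v          # differential TD error
    r_bar = r_bar + alpha_rbar * delta                  # running estimate of r(pi)
    w = w + alpha_w * delta * x_v                       # critic update
    prefs = x_sa @ theta
    prefs = prefs - prefs.max()
    pi = np.exp(prefs) / np.exp(prefs).sum()
    theta = theta + alpha_theta * delta * (x_sa[a] - pi @ x_sa)   # actor update
    return theta, w, r_bar
```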
Policy parameterization for continuous actions

Policy-based methods offer practical ways of dealing with large action spaces, even continuous spaces with an infinite number of actions.
Instead of computing learned probabilities for each of the many actions, we learn statistics of the probability distribution.
For example, the action set might be the real numbers, with actions chosen from a normal (Gaussian) distribution.

Policy parameterization for continuous actions

To produce a policy parameterization, the policy can be defined as the normal probability density over a real-valued scalar action, with mean and standard deviation given by parametric function approximators that depend on the state. That is,

    π(a|s, θ) ≐ (1 / (σ(s, θ) √(2π))) exp( − (a − μ(s, θ))² / (2σ(s, θ)²) )    (13.19)

    μ(s, θ) ≐ θ_μ^⊤ x_μ(s)    (13.20)
    σ(s, θ) ≐ exp( θ_σ^⊤ x_σ(s) )    (13.21)

where x_μ(s) and x_σ(s) are state feature vectors, perhaps constructed by one of the methods described in Section 9.5.

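A minimal sketch of sampling from this Gaussian policy and of the corresponding eligibility vectors ∇ln π(a|s, θ) for the mean and standard-deviation parameters (standard results for the normal density with the parameterization (13.20) and (13.21)); all names are illustrative.

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x_mu, x_sigma, rng=None):
    """Sample a real-valued action from the Gaussian policy (13.19)-(13.21):
    mu(s,theta) = theta_mu @ x_mu(s), sigma(s,theta) = exp(theta_sigma @ x_sigma(s))."""
    rng = np.random.default_rng() if rng is None else rng
    mu = theta_mu @ x_mu
    sigma = np.exp(theta_sigma @ x_sigma)
    return rng.normal(mu, sigma), mu, sigma

def gaussian_log_prob_grads(a, mu, sigma, x_mu, x_sigma):
    """Eligibility vectors grad ln pi(a|s,theta) for the Gaussian policy:
    w.r.t. theta_mu:    ((a - mu) / sigma**2) * x_mu(s)
    w.r.t. theta_sigma: ((a - mu)**2 / sigma**2 - 1) * x_sigma(s)"""
    grad_mu = (a - mu) / sigma**2 * x_mu
    grad_sigma = ((a - mu)**2 / sigma**2 - 1.0) * x_sigma
    return grad_mu, grad_sigma
```

Parameterizing the standard deviation through an exponential keeps σ(s, θ) positive, and exploration can shrink naturally as learning drives σ toward zero.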
Summary

In this chapter, we considered methods that learn a parameterized policy that enables actions to be taken without consulting action-value estimates.
In particular, we have considered policy-gradient methods, meaning methods that update the policy parameter on each step in the direction of an estimate of the gradient of performance with respect to the policy parameter.
Methods that learn and store a policy parameter have many advantages: they can learn specific probabilities for taking the actions, learn appropriate levels of exploration, approach deterministic policies asymptotically, and naturally handle continuous action spaces.

Summary

The policy gradient theorem gives an exact formula for how performance is affected by the policy parameter that does not involve derivatives of the state distribution.
The REINFORCE method follows directly from the policy gradient theorem.
Adding a state-value function as a baseline reduces REINFORCE's variance without introducing bias.
If the state-value function is also used to assess the policy's action selections, then the value function is called a critic and the policy is called an actor; the overall method is called an actor–critic method.
The critic introduces bias into the actor's gradient estimates, but it substantially reduces variance.

The End

