Policy Gradient Methods for Reinforcement Learning
Definition
A policy gradient method is a reinforcement learning approach that directly
optimizes a parametrized control policy by following the gradient of the expected
return. It belongs to the class of policy search techniques, which maximize the
expected return of a policy within a fixed policy class, whereas traditional value
function approximation approaches derive policies from a value function. Policy
gradient approaches have several advantages: they allow the straightforward
incorporation of domain knowledge into the policy parametrization, and often
significantly fewer parameters are needed to represent the optimal policy than
the corresponding value function. They are guaranteed to converge to at least a
locally optimal policy, can handle continuous states and actions, and often cope
even with imperfect state information. Their major drawbacks are that they are
difficult to use in off-policy settings, that they converge slowly in discrete
problems, and that global optima are not attained.
Expected Return
The goal of policy gradient methods is to optimize the parameters θ of a policy
πθ with respect to the expected return
\[
J(\theta) = Z_\gamma \, E\!\left\{ \sum_{k=0}^{H} \gamma^k r_k \right\},
\]
where γ ∈ [0, 1] denotes a discount factor, Zγ a normalization constant, and H
the planning horizon. For finite H, we have an episodic reinforcement learning
scenario in which the truly optimal policy is non-stationary and the normalization
does not matter. For an infinite horizon H = ∞, we choose the normalization
Zγ ≡ 1 − γ for γ < 1, and Z1 ≡ limγ→1 (1 − γ) = 1/H for the average-reward
reinforcement learning problem where γ = 1.
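To make the objective concrete, the following Python sketch estimates J(θ) by
Monte Carlo rollouts of the discounted return. It is only an illustration: the
environment interface (env.reset(), env.step(action) returning a state, reward,
and termination flag) and the policy function are hypothetical placeholders, not
part of the original text.

import numpy as np

def estimate_expected_return(env, policy, gamma=0.99, horizon=200, n_rollouts=100):
    # Monte Carlo estimate of J(theta) = Z_gamma * E[sum_k gamma^k r_k].
    # `env` and `policy` are assumed placeholders: env.reset() -> state,
    # env.step(action) -> (state, reward, done), policy(state) -> action.
    # Normalization Z_gamma: (1 - gamma) for gamma < 1, 1/H in the average-reward case.
    z_gamma = 1.0 - gamma if gamma < 1.0 else 1.0 / horizon
    returns = []
    for _ in range(n_rollouts):
        state = env.reset()
        discounted_return = 0.0
        for k in range(horizon):
            action = policy(state)
            state, reward, done = env.step(action)
            discounted_return += gamma**k * reward
            if done:
                break
        returns.append(z_gamma * discounted_return)
    return float(np.mean(returns))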
Likelihood-Ratio Gradients
Likelihood-ratio gradients rely upon the stochasticity of either the policy (in
model-free approaches) or the system (in the model-based case) and, hence, may
cope better with noise and sensitivity problems.
Assume that you have a path distribution pθ(τ) and accumulated rewards
R(τ) = Zγ Σ_{k=0}^{H} γ^k r_k along a path τ. Thus, you can write the gradient
of the expected return as
\[
\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
= E\{\nabla_\theta \log p_\theta(\tau)\, R(\tau)\}.
\]
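As a self-contained illustration of this likelihood-ratio (score-function)
identity, the following Python sketch estimates the gradient of an expected
"reward" under a one-dimensional Gaussian "path" distribution with mean θ and
checks it against a finite-difference estimate. The distribution and reward are
hypothetical toy choices, not taken from the text.

import numpy as np

rng = np.random.default_rng(0)

def reward(tau):
    # Arbitrary toy reward along a one-dimensional "path" tau.
    return -(tau - 2.0) ** 2

def score_function_gradient(theta, sigma=1.0, n_samples=100_000):
    # Likelihood-ratio estimate of d/dtheta E_{tau ~ N(theta, sigma^2)}[R(tau)],
    # using grad_theta log p_theta(tau) = (tau - theta) / sigma^2.
    tau = rng.normal(theta, sigma, size=n_samples)
    grad_log_p = (tau - theta) / sigma**2
    return np.mean(grad_log_p * reward(tau))

theta = 0.5
lr_grad = score_function_gradient(theta)
# Finite-difference check of the same quantity (illustration only).
eps = 1e-2
fd_grad = (np.mean(reward(rng.normal(theta + eps, 1.0, 100_000)))
           - np.mean(reward(rng.normal(theta - eps, 1.0, 100_000)))) / (2 * eps)
print(lr_grad, fd_grad)  # both should be close to -2 * (theta - 2) = 3.0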
If our system p(s′|s, a) is Markovian, we can use
pθ(τ) = p(s0) ∏_{k=0}^{H} p(s_{k+1}|s_k, a_k) πθ(a_k|s_k) for a stochastic policy
a ∼ πθ(a|s) to obtain the model-free policy gradient estimator known as Episodic
REINFORCE [8]
\[
\nabla_\theta J(\theta) = Z_\gamma \, E\!\left\{ \sum_{h=0}^{H} \gamma^h \,
\nabla_\theta \log \pi_\theta(a_h \mid s_h) \sum_{k=h}^{H} \gamma^{k-h} r_k \right\},
\]
and, for a deterministic policy a = πθ(s), the model-based policy gradient
\[
\nabla_\theta J(\theta) = Z_\gamma \, E\!\left\{ \sum_{h=0}^{H} \gamma^h
\left( \nabla_a \log p(s_{h+1} \mid s_h, a_h) \right)^{T} \nabla_\theta \pi_\theta(s_h)
\sum_{k=h}^{H} \gamma^{k-h} r_k \right\}
\]
follows from pθ(τ) = p(s0) ∏_{k=0}^{H} p(s_{k+1}|s_k, πθ(s_k)). Note that all
rewards preceding an action may be omitted as they cancel out in expectation.
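A minimal sketch of the model-free (Episodic REINFORCE) estimator above is given
below in Python, assuming trajectories have already been sampled by rolling out
πθ. The function grad_log_pi(theta, s, a), returning ∇θ log πθ(a|s), and the
trajectory format are hypothetical placeholders.

import numpy as np

def reinforce_gradient(trajectories, grad_log_pi, theta, gamma=0.99, z_gamma=1.0):
    # Episodic REINFORCE estimate of grad_theta J(theta).
    # Each trajectory is a list of (state, action, reward) tuples; rewards
    # preceding an action are dropped by summing only over k >= h.
    grad = np.zeros_like(theta)
    for traj in trajectories:
        rewards = [r for (_, _, r) in traj]
        H = len(traj)
        for h, (s, a, _) in enumerate(traj):
            # Discounted reward-to-go from step h onwards.
            reward_to_go = sum(gamma ** (k - h) * rewards[k] for k in range(h, H))
            grad += gamma**h * grad_log_pi(theta, s, a) * reward_to_go
    return z_gamma * grad / len(trajectories)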
Using a state-action value function
Q^{πθ}(s, a, h) = E{ Σ_{k=h}^{H} γ^{k−h} r_k | s_h = s, a_h = a, πθ } (see value
function approximation), we can rewrite REINFORCE in its modern form
\[
\nabla_\theta J(\theta) = Z_\gamma \, E\!\left\{ \sum_{h=0}^{H} \gamma^h \,
\nabla_\theta \log \pi_\theta(a_h \mid s_h)
\left( Q^{\pi_\theta}(s_h, a_h, h) - b(s_h, h) \right) \right\},
\]
known as the policy gradient theorem, where the baseline b(s, h) is an arbitrary
function that may be used to reduce the variance.
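The effect of the baseline can be sketched as a small variation of the estimator
above. Here a simple time-indexed baseline b(h), the mean reward-to-go at step h
across rollouts, stands in for b(s, h); this particular choice, the fixed-horizon
assumption, and the grad_log_pi placeholder are illustrative assumptions rather
than part of the original text (optimal baselines are discussed in [4]).

import numpy as np

def reinforce_with_baseline(trajectories, grad_log_pi, theta, gamma=0.99, z_gamma=1.0):
    # REINFORCE with a time-indexed baseline b(h) subtracted from the reward-to-go.
    # Assumes all trajectories have the same length H (fixed-horizon episodes).
    n, H = len(trajectories), len(trajectories[0])
    # Discounted reward-to-go for every (trajectory, step) pair.
    togo = np.zeros((n, H))
    for i, traj in enumerate(trajectories):
        rewards = [r for (_, _, r) in traj]
        for h in range(H):
            togo[i, h] = sum(gamma ** (k - h) * rewards[k] for k in range(h, H))
    baseline = togo.mean(axis=0)  # b(h): mean reward-to-go at step h

    grad = np.zeros_like(theta)
    for i, traj in enumerate(trajectories):
        for h, (s, a, _) in enumerate(traj):
            grad += gamma**h * grad_log_pi(theta, s, a) * (togo[i, h] - baseline[h])
    return z_gamma * grad / n

Subtracting b(h) leaves the estimator unbiased, since the expectation of
∇θ log πθ(a|s) under a ∼ πθ(·|s) is zero, but it can substantially reduce the
variance of the gradient estimate.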
While likelihood-ratio gradients have been known since the late 1980s, they have
recently experienced an upsurge of interest due to progress on variance reduction
using optimal baselines [4], compatible function approximation [7], policy
gradients in reproducing kernel Hilbert spaces [1], as well as faster, more
robust convergence using natural policy gradients; see [1, 5] for these
developments.
See Also
reinforcement learning, policy search, value function approximation
References and Recommended Reading
[1] James Andrew Bagnell. Learning Decisions: Robustness, Uncertainty, and
Approximation. Doctoral dissertation, Robotics Institute, Carnegie Mellon
University, 5000 Forbes Avenue, Pittsburgh, PA 15213, August 2004.
[3] Peter Glynn. Likelihood ratio gradient estimation for stochastic systems.
Communications of the ACM, 33(10):75–84, October 1990.
[4] Greg Lawrence, Noah Cowan, and Stuart Russell. Efficient gradient estimation
for motor control learning. In Proceedings of the International Conference on
Uncertainty in Artificial Intelligence (UAI), Acapulco, Mexico, 2003.
[5] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy
gradients. Neural Networks, 21(4):682–697, 2008.
[7] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient
methods for reinforcement learning with function approximation. In Advances in
Neural Information Processing Systems (NIPS), 2000.
[8] R. J. Williams. Simple statistical gradient-following algorithms for
connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.