

Policy Gradient Methods

Jan Peters, Max Planck Institute for Biological Cybernetics


J. Andrew Bagnell, Carnegie Mellon University

Definition
A policy gradient method is a reinforcement learning approach that directly
optimizes a parametrized control policy by gradient descent. It belongs to the
class of policy search techniques, which maximize the expected return of a
policy in a fixed policy class, whereas traditional value function approximation
approaches derive policies from a value function. Policy gradient approaches
have various advantages: they allow the straightforward incorporation of domain
knowledge into the policy parametrization, and often significantly fewer
parameters are needed to represent the optimal policy than the corresponding
value function. They are guaranteed to converge to at least a locally optimal
policy and can handle continuous states and actions, and often even imperfect
state information. Their major drawbacks are that they are difficult to use in
off-policy settings, that they converge slowly in discrete problems, and that
global optima are not attained.

Structure of the Learning System


Policy gradient methods are centered around a parametrized policy πθ with
parameters θ that selects actions a given the state s, also known as a direct
controller. Such a policy may either be deterministic, a = πθ(s), or stochastic,
a ∼ πθ(a|s). This choice affects the policy gradient approach (e.g., a
deterministic policy requires a model-based formulation when used for
likelihood ratio policy gradients), determines how the exploration-exploitation
dilemma is addressed (e.g., a stochastic policy tries new actions by itself,
while a deterministic policy requires the perturbation of the policy parameters
or sufficient stochasticity in the system), and may affect the optimal solution
(e.g., for a time-invariant or stationary policy, the optimal policy can be
stochastic! [7]). Frequently used policies are Gibbs policies
$\pi_\theta(a|s) = \exp(\phi(s,a)^T\theta)/\sum_b \exp(\phi(s,b)^T\theta)$
for discrete problems [7, 1] and, for continuous problems, Gaussian policies
$\pi_\theta(a|s) = \mathcal{N}(\phi(s,a)^T\theta_1, \theta_2)$ with an
exploration parameter θ2, see [8, 5].
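As a concrete, hedged illustration (not part of the original article), the following Python sketch samples actions from the two policy classes just described; the feature map phi, the parameters, and the discrete action set are assumptions supplied by the caller, and the Gaussian policy is simplified to a state-only feature map.

```python
# Illustrative sketch of the Gibbs and Gaussian policies described above.
# phi, theta, and the action set are assumptions supplied by the caller.
import numpy as np

def gibbs_action(s, actions, theta, phi, rng=np.random):
    """Sample a ~ pi_theta(a|s) = exp(phi(s,a)^T theta) / sum_b exp(phi(s,b)^T theta)."""
    logits = np.array([phi(s, a) @ theta for a in actions])
    logits -= logits.max()                           # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return actions[rng.choice(len(actions), p=probs)]

def gaussian_action(s, theta_mean, theta_var, phi, rng=np.random):
    """Sample a ~ N(phi(s)^T theta_mean, theta_var), with exploration parameter theta_var."""
    return rng.normal(phi(s) @ theta_mean, np.sqrt(theta_var))
```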

Expected Return
The goal of policy gradient methods is to optimize the expected return of a
policy πθ with respect to its parameters θ,

$$J(\theta) = Z_\gamma \, E\!\left[\sum_{k=0}^{H} \gamma^k r_k\right],$$

where γ ∈ [0, 1] denotes a discount factor, Zγ a normalization constant, and H
the planning horizon. For finite H, we have an episodic reinforcement learning
scenario where the truly optimal policy is non-stationary and the normalization
does not matter. For an infinite horizon H = ∞, we choose the normalization to
be Zγ ≡ (1 − γ) for γ < 1 and Z1 ≡ limγ→1 (1 − γ) = 1/H for average-reward
reinforcement learning problems where γ = 1.
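As a brief illustration (an assumption-laden sketch rather than part of the article), the expectation above can be approximated by averaging normalized, discounted reward sums over sampled rollouts of πθ:

```python
# Monte Carlo estimate of J(theta) from sampled reward sequences, using the
# normalizations discussed above (Z_gamma = 1 - gamma for gamma < 1, Z_1 = 1/H).
import numpy as np

def estimate_expected_return(reward_sequences, gamma):
    """reward_sequences: list of arrays [r_0, ..., r_H] collected under pi_theta."""
    returns = []
    for rewards in reward_sequences:
        rewards = np.asarray(rewards, dtype=float)
        H = max(len(rewards) - 1, 1)
        Z = (1.0 - gamma) if gamma < 1.0 else 1.0 / H     # normalization Z_gamma
        discounts = gamma ** np.arange(len(rewards))      # gamma^k for k = 0, ..., H
        returns.append(Z * np.sum(discounts * rewards))
    return np.mean(returns)                               # sample mean approximates E[.]
```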

Gradient Descent in Policy Space


Policy gradient methods follow the gradient of the expected return,

$$\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta J(\pi_\theta)\big|_{\theta=\theta_k},$$

where θk denotes the parameters after update k with initial policy parameters
θ0, and αk denotes a learning rate. If the gradient estimator is unbiased,
$\sum_{k=0}^{\infty} \alpha_k \to \infty$ and
$\sum_{k=0}^{\infty} \alpha_k^2 = \text{const}$, convergence to at least a
locally optimal policy can be guaranteed. In optimal control, model-based
gradient methods have been used for optimizing policies since the late 1960s.
While these are used in the machine learning community (e.g., differential
dynamic programming with learned models), they are numerically very brittle and
rely on accurate, deterministic models. Hence, they may suffer significantly
from optimization biases and are not generally applicable, especially not in
the model-free case. Several model-free alternatives can be found in the
simulation optimization literature [2], i.e., finite-difference gradients,
likelihood ratio approaches, response-surface methods, and mean-valued, "weak"
derivatives. The advantages and disadvantages of these different approaches are
still a fiercely debated topic [2]. In machine learning, the first two
approaches have dominated the field.
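A minimal sketch of this update loop, under the assumption of a black-box routine grad_estimator that returns an (ideally unbiased) estimate of ∇θ J, might look as follows; the step-size schedule is one common choice satisfying the conditions above:

```python
# Gradient ascent on the expected return: theta_{k+1} = theta_k + alpha_k * grad J(theta_k).
# grad_estimator is an assumed black box, e.g. one of the estimators described below.
import numpy as np

def policy_gradient_ascent(theta_0, grad_estimator, num_updates=1000, a=0.1):
    theta = np.array(theta_0, dtype=float)
    for k in range(num_updates):
        alpha_k = a / (k + 1.0)   # sum of alpha_k diverges, sum of alpha_k^2 stays bounded
        theta = theta + alpha_k * grad_estimator(theta)
    return theta
```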

Finite Difference Gradients


The simplest policy gradient approaches, which also have the most practical
applications (see [5] for a list of robotics applications of this method),
estimate the gradient by perturbing the policy parameters. For a current policy
θk with expected return J(θk), this approach creates explorative policies
θ̂i = θk + δθi whose approximated expected returns are given by
J(θ̂i) ≈ J(θk) + δθi^T g with g = ∇θ J(πθ)|θ=θk. In this case, it suffices to
determine the gradient by linear regression, i.e., we obtain

$$g = (\Delta\Theta^T \Delta\Theta)^{-1} \Delta\Theta^T \Delta J,$$

with parameter perturbations ΔΘ = [δθ1, ..., δθn] and the mean-subtracted
rollout returns δJi = J(θ̂i) − J(θk), which form ΔJ = [δJ1, ..., δJn]. The
choice of the parameter perturbation determines the performance of the approach
[6]. Problems of this approach are that the sensitivity of the system with
respect to each parameter may differ by orders of magnitude, that small changes
in a single parameter may render the system unstable, and that it cannot cope
well with stochasticity unless used in simulation with deterministically re-used
random numbers [3, 6].
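The regression above translates almost directly into code. The sketch below is a hedged illustration, assuming a black-box routine estimate_J that returns a (possibly noisy) rollout estimate of the expected return for a given parameter vector:

```python
# Finite-difference policy gradient: perturb the parameters, evaluate the returns,
# and recover g = (dTheta^T dTheta)^{-1} dTheta^T dJ by least squares.
import numpy as np

def finite_difference_gradient(theta_k, estimate_J, num_perturbations=20, scale=0.01,
                               rng=np.random):
    J_ref = estimate_J(theta_k)                                               # J(theta_k)
    dTheta = scale * rng.standard_normal((num_perturbations, theta_k.size))   # delta theta_i
    dJ = np.array([estimate_J(theta_k + d) for d in dTheta]) - J_ref          # delta J_i
    g, *_ = np.linalg.lstsq(dTheta, dJ, rcond=None)   # least-squares solution of dTheta g = dJ
    return g
```

The perturbation scale plays the role of the parameter perturbation discussed above and typically needs tuning per problem.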

Likelihood-Ratio Gradients
Likelihood ratio gradients rely upon the stochasticity of either the policy,
for model-free approaches, or the system, in the model-based case, and hence
they may cope better with noise and with the sensitivity problems. Assume that
you have a path distribution pθ(τ) and rewards
$R(\tau) = Z_\gamma \sum_{k=0}^{H} \gamma^k r_k$ along a path τ. Thus, you can
write the gradient of the expected return as

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau) R(\tau)\, d\tau = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau)\, d\tau = E\!\left[\nabla_\theta \log p_\theta(\tau) R(\tau)\right].$$

If our system p(s′|s, a) is Markovian, we can use
$p_\theta(\tau) = p(s_0) \prod_{k=0}^{H} p(s_{k+1}|s_k, a_k)\, \pi_\theta(a_k|s_k)$
for a stochastic policy a ∼ πθ(a|s) to obtain the model-free policy gradient
estimator known as Episodic REINFORCE [8],

$$\nabla_\theta J(\theta) = Z_\gamma E\!\left[\sum_{h=0}^{H} \gamma^h \nabla_\theta \log \pi_\theta(a_h|s_h) \sum_{k=h}^{H} \gamma^{k-h} r_k\right],$$

and, for the deterministic policy a = πθ(s), the model-based policy gradient

$$\nabla_\theta J(\theta) = Z_\gamma E\!\left[\sum_{h=0}^{H} \gamma^h \big(\nabla_a \log p(s_{h+1}|s_h, a_h)\big)^T \nabla_\theta \pi_\theta(s_h) \sum_{k=h}^{H} \gamma^{k-h} r_k\right],$$

which follows from
$p_\theta(\tau) = p(s_0) \prod_{k=0}^{H} p(s_{k+1}|s_k, \pi_\theta(s_k))$.
Note that all rewards preceding an action may be omitted, as they cancel out in
expectation. Using a state-action value function
$Q^{\pi_\theta}(s, a, h) = E\big[\sum_{k=h}^{H} \gamma^{k-h} r_k \,\big|\, s, a, \pi_\theta\big]$
(see value function approximation), we can rewrite REINFORCE in its modern form,

$$\nabla_\theta J(\theta) = Z_\gamma E\!\left[\sum_{h=0}^{H} \gamma^h \nabla_\theta \log \pi_\theta(a_h|s_h) \left(Q^{\pi_\theta}(s_h, a_h, h) - b(s_h, h)\right)\right],$$

known as the policy gradient theorem, where the baseline b(s, h) is an
arbitrary function that may be used to reduce the variance.

While likelihood ratio gradients have been known since the late 1980s, they
have recently experienced an upsurge of interest due to progress towards
variance reduction using optimal baselines [4], compatible function
approximation [7], and policy gradients in reproducing kernel Hilbert spaces
[1], as well as faster, more robust convergence using natural policy gradients;
see [1, 5] for these developments.
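As a concrete, hedged illustration of the Episodic REINFORCE estimator above (not taken from the article), the sketch below computes the gradient for the Gibbs policy from the Structure section; the feature map phi, the action set, the trajectory format, and a constant baseline are assumptions, and the constant Zγ is omitted since it can be absorbed into the learning rate:

```python
# Episodic REINFORCE for a Gibbs policy pi_theta(a|s) proportional to exp(phi(s,a)^T theta).
# trajectories: list of [(s_0, a_0, r_0), ..., (s_H, a_H, r_H)] collected under pi_theta.
import numpy as np

def grad_log_gibbs(s, a, actions, theta, phi):
    """grad_theta log pi_theta(a|s) = phi(s,a) - sum_b pi_theta(b|s) phi(s,b)."""
    feats = np.array([phi(s, b) for b in actions])
    logits = feats @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return phi(s, a) - probs @ feats

def reinforce_gradient(trajectories, actions, theta, phi, gamma, baseline=0.0):
    """Monte Carlo estimate of grad_theta J(theta), up to the constant Z_gamma."""
    grad = np.zeros_like(theta, dtype=float)
    for traj in trajectories:
        rewards = np.array([r for (_, _, r) in traj], dtype=float)
        for h, (s, a, _) in enumerate(traj):
            # Return to go: sum_{k=h}^H gamma^{k-h} r_k, minus an (optional) baseline.
            ret_to_go = np.sum(gamma ** np.arange(len(traj) - h) * rewards[h:])
            grad += gamma ** h * (ret_to_go - baseline) * grad_log_gibbs(s, a, actions, theta, phi)
    return grad / len(trajectories)
```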

See Also
reinforcement learning, policy search, value function approximation

References and Recommended Reading
[1] James Andrew Bagnell. Learning Decisions: Robustness, Uncertainty, and
Approximation. Doctoral dissertation, Robotics Institute, Carnegie Mellon
University, Pittsburgh, PA, August 2004.

[2] Michael C. Fu. Handbook on Operations Research and Management Science:
Simulation, chapter Stochastic Gradient Estimation, pages 575–616. Number 19.
Elsevier, 2006.

[3] Peter Glynn. Likelihood ratio gradient estimation for stochastic systems.
Communications of the ACM, 33(10):75–84, October 1990.

[4] Greg Lawrence, Noah Cowan, and Stuart Russell. Efficient gradient
estimation for motor control learning. In Proceedings of the International
Conference on Uncertainty in Artificial Intelligence (UAI), Acapulco, Mexico,
2003.

[5] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy
gradients. Neural Networks, 21(4):682–697, 2008.

[6] J. C. Spall. Introduction to Stochastic Search and Optimization:
Estimation, Simulation, and Control. Wiley, Hoboken, NJ, 2003.

[7] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient
methods for reinforcement learning with function approximation. In S. A. Solla,
T. K. Leen, and K.-R. Mueller, editors, Advances in Neural Information
Processing Systems (NIPS), Denver, CO, 2000. MIT Press.

[8] R. J. Williams. Simple statistical gradient-following algorithms for
connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
