13 ML Reinforcement Learning - Policy Search

The document discusses policy gradient methods in reinforcement learning, focusing on the explicit representation of policies and various techniques for optimizing them. It highlights the challenges of parameter search, the advantages of gradient-based methods, and introduces the REINFORCE algorithm along with improvements like temporal-difference estimation. Additionally, it touches on the actor-critic approach and techniques to enhance stability and efficiency in policy gradient algorithms.


Overview

1. Motivation
2. General Idea of Policy Gradient
3. Search in Parameter Space
4. Search in Action Space

Explicit Representation of the Policy


The policy is represented explicitly by a function that maps states to actions. The policy can be either stochastic or deterministic, and can handle both continuous and discrete state-action spaces.
Most commonly, the policy is represented with a neural network that takes the state as input and

outputs the corresponding action (deterministic policy), OR

outputs the parameters of the action distribution (often Gaussian when the action space is continuous, or Boltzmann when the action space is discrete).
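To make the representation concrete, below is a minimal PyTorch-style sketch of a Gaussian policy network; the class name, layer sizes, and the state-independent log-std parameterization are illustrative assumptions, not taken from the slides.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Maps a state to a Gaussian distribution over continuous actions."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        # State-independent log-std: a common, simple parameterization (assumption).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mean_head(h), self.log_std.exp())

# Usage: dist = policy(state); action = dist.sample(); logp = dist.log_prob(action).sum(-1)
```

For a discrete action space, the same network would instead output logits and return a `torch.distributions.Categorical` (the Boltzmann case).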

Explicit Representation of the Policy


Deterministic Policy:

Boltzmann Policy:

Gaussian Policy:
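The slide's formulas did not survive extraction; the standard forms these headings usually refer to are sketched below, with $f_\theta$, $h_\theta$, $\mu_\theta$, $\Sigma_\theta$ denoting (assumed) network outputs.

```latex
% Deterministic policy
a = \pi_\theta(s) = f_\theta(s)

% Boltzmann (softmax) policy, discrete actions
\pi_\theta(a \mid s) = \frac{\exp\big(h_\theta(s, a)\big)}{\sum_{a'} \exp\big(h_\theta(s, a')\big)}

% Gaussian policy, continuous actions
\pi_\theta(a \mid s) = \mathcal{N}\big(a \mid \mu_\theta(s), \Sigma_\theta(s)\big)
```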

Objective
In policy gradient methods, we aim to maximize the expected return:

Parameter Search
Grid Search: One possibility is to organize the possible values of the parameters into a grid, and to perform a Monte-Carlo evaluation for each combination. Grid search is usually infeasible, since the policy is encoded with thousands or millions of parameters. Furthermore, the parameters are typically continuous.
Genetic Algorithms: Genetic algorithms are faster than grid search. However, they are still problematic when the neural network is large.

Furthermore, Monte-Carlo evaluation requires a large number of rollouts, especially in highly stochastic environments.

Policy Gradient in Parameter Space


Gradient algorithms are widely used in machine learning, as they are typically very computationally efficient. Instead of performing a global search (like grid search or genetic algorithms), gradient methods improve the parameters locally. The downside is that gradient-based algorithms can get stuck in local optima.

Gradient Ascent:

where

Unfortunately, we do not have direct access to this gradient. How can we compute it?
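The update rule did not extract; as a sketch, plain gradient ascent on the expected return would read (with an assumed learning rate $\alpha > 0$):

```latex
\theta_{k+1} = \theta_k + \alpha \, \nabla_\theta J(\theta_k)
```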

Policy Gradient in Parameter Space


Idea: We want to estimate the gradient from samples. To do so, we need an expression in the form of an expectation, which we can then replace with an empirical average (see all the estimators we have seen so far!).

As we have seen, writing things in probabilistic terms often helps.
To this end, we introduce a probability distribution over the parameters,

where are the parameters defining this distribution.

Log-Ratio Trick

The log-ratio trick can be derived from the derivative of the logarithm:
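As a sketch of the missing derivation (writing $p_\nu$ for the distribution over parameters, notation assumed), the trick turns the gradient of an expectation into an expectation we can sample:

```latex
\nabla_\nu \log p_\nu(\theta) = \frac{\nabla_\nu p_\nu(\theta)}{p_\nu(\theta)}
\quad\Longrightarrow\quad
\nabla_\nu \, \mathbb{E}_{\theta \sim p_\nu}\big[J(\theta)\big]
= \mathbb{E}_{\theta \sim p_\nu}\big[J(\theta)\, \nabla_\nu \log p_\nu(\theta)\big]
```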

Policy Gradient in Parameter Space


1: Input: a policy , a distribution over parameters , number of episodes ,
number of iterations

2: for do
3: for do
4: Sample .

5: Execute the policy in the environment, collect the return .


6: end for

7:

8: end for
9: Return .
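A minimal NumPy sketch of the listing above, assuming a diagonal-Gaussian search distribution over a flat parameter vector; `evaluate_return` is a hypothetical placeholder for running the policy with the sampled parameters and returning the episode return.

```python
import numpy as np

def parameter_space_pg(evaluate_return, dim, iterations=100, episodes=20, lr=1e-2):
    """Monte-Carlo policy gradient in parameter space (log-ratio estimator)."""
    mu, log_std = np.zeros(dim), np.zeros(dim)      # Gaussian search distribution p_nu
    for _ in range(iterations):
        grad_mu, grad_log_std = np.zeros(dim), np.zeros(dim)
        for _ in range(episodes):
            std = np.exp(log_std)
            eps = np.random.randn(dim)
            theta = mu + std * eps                   # sample theta ~ p_nu
            G = evaluate_return(theta)               # Monte-Carlo return of one episode
            # Gradients of log N(theta; mu, diag(std^2)) w.r.t. mu and log_std.
            grad_mu += G * eps / std
            grad_log_std += G * (eps ** 2 - 1.0)
        # Gradient ascent on the expected return, averaging over the sampled episodes.
        mu += lr * grad_mu / episodes
        log_std += lr * grad_log_std / episodes
    return mu, log_std
```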

Policy Gradient in Parameter Space


Advantages: This estimator is very simple to implement, and it does not require specific assumptions (e.g., it also works for non-Markovian environments).

Disadvantages: Monte-Carlo log-ratio estimators are high-variance. The variance is even higher in stochastic environments, where, for a single sampled parameter vector, the return can vary greatly due to stochastic transitions, stochastic policies, and long horizons. Furthermore, the vector of distribution parameters is often very high-dimensional, since it needs to encode a distribution over all policy parameters. When these are the parameters of a (deep) neural network, the estimator becomes intractable.
Overall, in applications where the policy needs to be deterministic, has few parameters, and the environment is partially observable, this estimator makes a lot of sense.

Baseline Subtraction for Variance Reduction


A simple trick to reduce the variance of the estimator is to subtract a constant (a baseline) from the return. This subtraction keeps the estimator (theoretically) unbiased while reducing its variance.

Baseline Subtraction for Variance Reduction


There is an optimal baseline that minimizes the variance. However, to keep the math simple, the baseline is often taken to be the average return.
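In symbols (notation assumed, with $G_i$ the return of the $i$-th sampled parameter vector), the baseline-subtracted estimator reads:

```latex
\widehat{\nabla_\nu J} = \frac{1}{N} \sum_{i=1}^{N} \big(G_i - b\big)\, \nabla_\nu \log p_\nu(\theta_i),
\qquad
b = \frac{1}{N} \sum_{i=1}^{N} G_i
```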

Policy Gradient in Parameter Space with Variance Reduction

1: Input: a policy , a distribution over parameters , number of episodes ,
number of iterations
2: for do

3: for do
4: Sample .

5: Execute the policy in the environment, collect the return .

6: end for

7:

8:

9: end for

10: Return .

Principle of Policy Gradient in Action Space


Is it possible to avoid a distribution over parameters and use (somehow) only the policy ? Yes!
Once again, we have to look at things through a probabilistic lens.
Consider the probability of a trajectory ,

We can write the return as
Naïve Policy Gradient



1: Input: a policy , a distribution over parameters , number of episodes ,
number of iterations , episode truncation
2: for do

3: for do
4: Sample

5:

6: for do
7: Choose action

8: Execute the action on the environment, observe reward and next state
9:
10: end for
11: end for

12:

13: end for


14: Return .

Towards REINFORCE
What did we gain w.r.t. parameter search? We now use the gradient of the policy directly, without the need to model a distribution over parameters.

However, this estimator still exhibits extremely high variance! Can we do better? Yes! The trick is to take into account that the reward at time depends only on the current and previous transitions: future transitions do not matter.
Let's define the truncated trajectory , and, more in general, .
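A sketch of the causality argument (notation assumed): for $t' > t$, the score of a later action is conditionally zero-mean given its state, while $r_t$ depends only on the trajectory up to time $t$, so the cross terms vanish:

```latex
\mathbb{E}\big[\gamma^{t} r_t \, \nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'})\big] = 0
\quad \text{for } t' > t,
\qquad \text{since } \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'}) \mid s_{t'}\big] = 0
```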

Towards REINFORCE

Let's see what happens to each single term ...

Towards REINFORCE
Plugging it back together,

where is the return from time step . This estimator is called REINFORCE.
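The resulting estimator (a sketch, with the discounting convention assumed) reads:

```latex
\widehat{\nabla_\theta J}
= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1}
\nabla_\theta \log \pi_\theta\big(a_t^i \mid s_t^i\big)\, G_t^i,
\qquad
G_t^i = \sum_{t'=t}^{T-1} \gamma^{\,t'-t}\, r_{t'}^i
```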

REINFORCE
1: Input: a policy , a distribution over parameters , number of episodes ,
number of iterations , episode truncation
2: for do
3: for do

4: Sample

5:

6: for do
7: Choose action

8: Execute the action on the environment, observe reward and next state

9:
10: end for

11: end for

12:

13: end for

14: Return .
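A minimal PyTorch-style sketch of one REINFORCE update; `env` is assumed to expose a simple `reset()`/`step()` interface returning `(state, reward, done)`, and `policy(state)` to return a `torch.distributions` object, as in the earlier policy sketch. These names are assumptions for illustration.

```python
import torch

def reinforce_update(policy, optimizer, env, episodes=10, gamma=0.99, horizon=200):
    """Collect `episodes` rollouts, then ascend sum_t log pi(a_t|s_t) * G_t."""
    loss = 0.0
    for _ in range(episodes):
        state, log_probs, rewards = env.reset(), [], []
        for _ in range(horizon):                      # episode truncation
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action).sum())
            state, reward, done = env.step(action.numpy())
            rewards.append(reward)
            if done:
                break
        # Returns-to-go G_t, computed backwards with discount gamma.
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # REINFORCE objective (negated, since optimizers minimize).
        loss = loss - sum(lp * g for lp, g in zip(log_probs, returns))
    optimizer.zero_grad()
    (loss / episodes).backward()
    optimizer.step()
```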

REINFORCE
REINFORCE has much less variance than the naïve policy gradient, but its variance is still prohibitive. Furthermore, the gradient can be computed only after the execution of many episodes, which makes the improvement of the policy very slow and sample-inefficient.
There are two improvements that we can make:
(1) Instead of a Monte-Carlo estimate of the return , we can use a temporal-difference estimator .

(2) Instead of waiting for the termination of many episodes, we can improve the policy at each step.

Temporal-Difference
The following equality holds:

hence, by estimating it with temporal difference, we can greatly reduce the variance. We can use temporal difference with the function approximation seen in the previous lecture.
Policy gradient algorithms that use a temporal-difference approximation (or alike) of the -function are called actor-critic.
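The missing equality is presumably the standard one (a sketch, notation assumed): the Monte-Carlo return is an unbiased sample of the Q-function, so a learned critic can replace it.

```latex
\mathbb{E}_{\pi_\theta}\big[G_t \mid s_t = s,\, a_t = a\big] = Q^{\pi_\theta}(s, a)
\quad\Longrightarrow\quad
\widehat{\nabla_\theta J} \approx \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_w(s_t, a_t)
```

Here $Q_w$ denotes a temporal-difference approximation of $Q^{\pi_\theta}$ (the critic).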

Bootstrapping
The following equality holds:

where is the discounted state-action visitation. This last expression allows us to write a gradient estimator based on single samples, enabling us to improve the policy at each step.
Online estimator:

where are observed from the interaction of the agent with the environment.
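As a sketch of the online estimator (notation assumed), a single observed pair $(s_t, a_t)$ yields a gradient sample, up to the critic's approximation error:

```latex
\widehat{\nabla_\theta J} = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_w(s_t, a_t)
```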

Policy Gradient
Notice that the equation on the previous slide is precisely the statement of the Policy Gradient Theorem.

Policy Gradient Theorem


The policy gradient can be expressed as

where and
.
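A standard statement of the theorem (a sketch; the slide's exact notation is assumed):

```latex
\nabla_\theta J(\theta)
= \mathbb{E}_{(s, a) \sim d^{\pi_\theta}}
\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big],
\qquad
d^{\pi_\theta}(s, a) \propto \sum_{t=0}^{\infty} \gamma^{t}\, \Pr\big(s_t = s,\, a_t = a \mid \pi_\theta\big)
```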

Online Actor-Critic
1: Input: a policy , a distribution over parameters , number of episodes ,
episode truncation , learning rates for actor and critic

2: for do
3: Sample

4:

5: for do
6: Choose action

7: Execute the action on the environment, observe reward and next state

8:

9:
10: end for

11: end for


12: Return .
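A minimal PyTorch-style sketch of one step of the loop above, with a SARSA-style critic; `policy` and `q_critic` are assumed modules (the policy returning a `torch.distributions` object, the critic a scalar Q-value), and all inputs are assumed to be tensors.

```python
import torch

def actor_critic_step(policy, q_critic, opt_actor, opt_critic,
                      state, action, reward, next_state, next_action, gamma=0.99):
    """One online actor-critic update from a single transition."""
    # Critic: regress Q(s, a) towards the one-step bootstrapped TD target.
    with torch.no_grad():
        td_target = reward + gamma * q_critic(next_state, next_action)
    critic_loss = (q_critic(state, action) - td_target).pow(2)
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # Actor: ascend log pi(a|s) * Q(s, a), with the critic's value detached.
    log_prob = policy(state).log_prob(action).sum()
    actor_loss = -log_prob * q_critic(state, action).detach()
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
```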

Towards State-of-the-art Actor-Critic


The algorithm outlined on the previous slide is efficient, but it carries instabilities due to highly correlated samples, aggressive updates of the critic, collapse of the actor's variance (which makes local optima more likely), etc.
To address these issues, one can use the replay buffer and the target networks introduced in the previous lecture. Additional entropy regularization can prevent getting stuck in local optima. Many other techniques, such as baseline subtraction and off-policy estimation, can help make policy gradients even more efficient.

Nevertheless, the presented estimator captures well the core idea of actor-critic algorithms.
