13 ML Reinforcement Learning - Policy Search
1. Motivation
2. General Idea of Policy Gradient
3. Search in Parameter Space
4. Search in Action Space
Boltzmann Policy: $\pi_\theta(a \mid s) = \dfrac{\exp\left(\theta^\top \phi(s,a)\right)}{\sum_{a'} \exp\left(\theta^\top \phi(s,a')\right)}$ (discrete action spaces)
Gaussian Policy: $\pi_\theta(a \mid s) = \mathcal{N}\!\left(a \mid \mu_\theta(s), \sigma_\theta^2(s)\right)$ (continuous action spaces)
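To make the two parameterizations concrete, here is a minimal NumPy sketch (not from the slides): the feature maps phi(s, a) and psi(s) and the fixed standard deviation are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def boltzmann_policy(theta, phi, s, actions):
    """Boltzmann (softmax) policy over a discrete action set.
    phi(s, a) is an assumed feature map; theta are the policy parameters."""
    logits = np.array([theta @ phi(s, a) for a in actions])
    logits -= logits.max()                         # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = rng.choice(len(actions), p=probs)
    return actions[idx], np.log(probs[idx])

def gaussian_policy(theta, psi, s, sigma=0.5):
    """Gaussian policy for a 1-D continuous action: a ~ N(mu_theta(s), sigma^2),
    with a linear mean mu_theta(s) = theta @ psi(s)."""
    mu = theta @ psi(s)
    a = rng.normal(mu, sigma)
    log_prob = -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return a, log_prob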
Objective
In policy gradient methods, we aim to maximize the expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right],$$
where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ is a trajectory generated by following the policy $\pi_\theta$.
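Since $J(\theta)$ is an expectation over trajectories, it can be estimated by averaging sampled rollouts. A minimal sketch, assuming a gym-style environment interface (reset() returning a state, step(a) returning (next state, reward, done, info)); these interface details are assumptions, not part of the slides.

import numpy as np

def estimate_return(env, policy, gamma=0.99, horizon=200, n_episodes=20):
    """Monte-Carlo estimate of J(theta): average discounted return of `policy`."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()
        G, discount = 0.0, 1.0
        for t in range(horizon):
            a = policy(s)
            s, r, done, _ = env.step(a)     # assumed gym-style step signature
            G += discount * r
            discount *= gamma
            if done:
                break
        returns.append(G)
    return np.mean(returns)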
Parameter Search
Grid Search: One possibility is to organize the possible values of the parameters into a grid and, for each combination, perform a Monte-Carlo evaluation. Grid search is usually unfeasible, since the policy is encoded with thousands or millions of parameters. Furthermore, the parameters are typically continuous.
Genetic Algorithms: Genetic algorithms are faster than grid search. However, they are still problematic when the size of the neural network is large.
Gradient Ascent: $\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta J(\theta_k)$, where $\alpha_k$ is a learning rate (step size). However, $J(\theta)$ can only be evaluated by running episodes, so its gradient is not directly available.
As we have seen, writing things in probabilistic terms often helps (see all the estimators we have seen so far!). To this end, we introduce a probability distribution $p_\nu(\theta)$ over the parameters, and maximize $\mathbb{E}_{\theta \sim p_\nu}[J(\theta)]$ with respect to $\nu$.
Log-Ratio Trick
Using the log-ratio (likelihood-ratio) trick, the gradient of the expected objective with respect to $\nu$ becomes an expectation we can sample from:
$$\nabla_\nu \mathbb{E}_{\theta \sim p_\nu}\left[J(\theta)\right] = \mathbb{E}_{\theta \sim p_\nu}\left[J(\theta)\, \nabla_\nu \log p_\nu(\theta)\right].$$
This leads to the following parameter-space search algorithm.
1: Input: a policy $\pi_\theta$, a distribution over parameters $p_\nu$, number of parameter samples $N$, number of iterations $K$
2: for $k = 1, \dots, K$ do
3:   for $i = 1, \dots, N$ do
4:     Sample $\theta_i \sim p_\nu$.
5:     Estimate $J(\theta_i)$ with a Monte-Carlo evaluation (run one or more episodes with $\pi_{\theta_i}$).
6:   end for
7:   Compute the gradient estimate $\hat{g} = \frac{1}{N} \sum_{i=1}^{N} J(\theta_i)\, \nabla_\nu \log p_\nu(\theta_i)$.
8:   Update $\nu \leftarrow \nu + \alpha \hat{g}$.
9: end for
10: Return $\nu$.
$p_\nu$ is often very high dimensional, since it needs to encode a distribution over parameters. When $\theta$ are the parameters of a (deep) neural network, this estimator becomes intractable.
Overall: in some applications, where the policy needs to be deterministic, $\theta$ has a small dimension, or the environment is partially observable, this estimator makes a lot of sense.
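A minimal sketch of this parameter-space policy gradient, assuming $p_\nu$ is an isotropic Gaussian with mean $\nu$ and fixed standard deviation; the callables make_policy and estimate_j (e.g., the Monte-Carlo evaluator sketched earlier) are illustrative assumptions.

import numpy as np

def parameter_space_pg(make_policy, estimate_j, dim, sigma=0.1,
                       n_samples=16, n_iters=100, lr=1e-2, seed=0):
    """Search in parameter space with the log-ratio trick.
    p_nu(theta) = N(nu, sigma^2 I), so grad_nu log p_nu(theta) = (theta - nu) / sigma^2."""
    rng = np.random.default_rng(seed)
    nu = np.zeros(dim)
    for _ in range(n_iters):
        grad = np.zeros(dim)
        for _ in range(n_samples):
            theta = nu + sigma * rng.standard_normal(dim)   # theta ~ p_nu
            j = estimate_j(make_policy(theta))               # Monte-Carlo estimate of J(theta)
            grad += j * (theta - nu) / sigma**2              # J(theta) * grad log p_nu(theta)
        nu += lr * grad / n_samples                          # gradient ascent on nu
    return nu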
Naïve Policy Gradient
Applying the same log-ratio trick directly to trajectories (search in action space) gives
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right], \qquad R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t,$$
which can be estimated by sampling trajectories with the current policy.
1: Input: a policy $\pi_\theta$, an initial-state distribution $\mu_0$, number of episodes $N$, number of iterations $K$, episode truncation $T$
2: for $k = 1, \dots, K$ do
3:   for $i = 1, \dots, N$ do
4:     Sample $s_0 \sim \mu_0$.
5:     Initialize $R_i \leftarrow 0$ and $g_i \leftarrow 0$.
6:     for $t = 0, \dots, T-1$ do
7:       Choose action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
8:       Execute the action on the environment, observe reward $r_t$ and next state $s_{t+1}$.
9:       $R_i \leftarrow R_i + \gamma^t r_t$; $\quad g_i \leftarrow g_i + \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
10:    end for
11:  end for
12:  $\theta \leftarrow \theta + \alpha \frac{1}{N} \sum_{i=1}^{N} R_i\, g_i$.
13: end for
14: Return $\theta$.
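A minimal sketch of the resulting estimator, assuming each trajectory has already been collected as a list of (reward, grad_log_pi) pairs, where grad_log_pi is $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ as a NumPy array; this data format is an assumption for illustration.

def naive_policy_gradient(trajectories, gamma=0.99):
    """Naive estimator: g = (1/N) sum_i R(tau_i) * sum_t grad log pi(a_t | s_t)."""
    grad = None
    for traj in trajectories:
        R = sum(gamma**t * r for t, (r, _) in enumerate(traj))   # full return R(tau)
        score = sum(g for _, g in traj)                          # sum of grad-log-probs
        grad = score * R if grad is None else grad + score * R
    return grad / len(trajectories)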
Towards REINFORCE
What did we gain w.r.t. parameter search? We now use the gradient of the policy directly, without the need to model a distribution over parameters.
However, this estimator still exhibits extremely high variance! Can we do better? Yes! The trick is to take into account that the reward at time $t$ depends only on the current and previous transitions: future transitions don't matter.
Let's define the truncated trajectory $\tau_{0:t} = (s_0, a_0, r_0, \dots, s_t, a_t, r_t)$ and, more in general, $\tau_{t:t'} = (s_t, a_t, r_t, \dots, s_{t'}, a_{t'}, r_{t'})$.
Towards REINFORCE
Plugging it back together,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T-1} \gamma^t\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],$$
where $G_t = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'}$ is the return from time step $t$. This estimator is called REINFORCE.
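The only change w.r.t. the naive estimator is that each grad-log-prob term is weighted by the return from that time step onward instead of the full-trajectory return. A minimal sketch of the reward-to-go computation:

def rewards_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}, computed backwards in O(T)."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]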
REINFORCE
1: Input: a policy $\pi_\theta$, an initial-state distribution $\mu_0$, number of episodes $N$, number of iterations $K$, episode truncation $T$
2: for $k = 1, \dots, K$ do
3:   for $i = 1, \dots, N$ do
4:     Sample $s_0 \sim \mu_0$.
5:     Initialize an empty episode buffer.
6:     for $t = 0, \dots, T-1$ do
7:       Choose action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
8:       Execute the action on the environment, observe reward $r_t$ and next state $s_{t+1}$.
9:       Store $(s_t, a_t, r_t)$ in the buffer.
10:    end for
11:    Compute the returns $G_t^i = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'}$ for all $t$.
12:  end for
13:  $\theta \leftarrow \theta + \alpha \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \gamma^t\, G_t^i\, \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)$.
14: Return $\theta$.
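A minimal sketch of the corresponding gradient estimate, reusing the same assumed trajectory format as before (lists of (reward, grad_log_pi) pairs); the learning rate in the comment is an illustrative choice.

def reinforce_gradient(trajectories, gamma=0.99):
    """REINFORCE estimator: (1/N) sum_i sum_t gamma^t * G_t * grad log pi(a_t | s_t)."""
    grad = None
    for traj in trajectories:
        rewards = [r for r, _ in traj]
        # rewards-to-go: G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}
        G, rtg = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            rtg.append(G)
        rtg = rtg[::-1]
        for t, (_, glogpi) in enumerate(traj):
            term = (gamma ** t) * rtg[t] * glogpi
            grad = term if grad is None else grad + term
    return grad / len(trajectories)

# One policy-improvement step, for some learning rate alpha:
# theta = theta + alpha * reinforce_gradient(trajectories)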
REINFORCE
REINFORCE has much less variance than the naïve policy gradient, but it is still prohibitively high. Furthermore, the gradient can be computed only after the execution of many episodes, which makes the improvement of the policy very slow and sample inefficient.
There are two improvements that we can make:
(1) Instead of a Monte-Carlo estimate of the return $G_t$, we can use a temporal-difference estimate of the $Q$-function.
(2) Instead of waiting for the termination of many episodes, we can improve the policy at each step.
Temporal-Difference
The following equality holds:
$$\mathbb{E}\left[G_t \mid s_t, a_t\right] = Q^{\pi_\theta}(s_t, a_t), \quad\text{so}\quad \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T-1} \gamma^t\, Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right];$$
hence, by estimating $Q^{\pi_\theta}$ with temporal difference, we can greatly reduce the variance. We can use the temporal difference with function approximation seen in the previous lecture.
Policy gradient algorithms that use a temporal-difference approximation (or alike) of the $Q$-function are called actor-critic.
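A minimal sketch of one TD(0) update for a linear critic $Q_\psi(s,a) = \psi^\top \phi(s,a)$ with a SARSA-style target; the linear parameterization and the feature map phi are illustrative assumptions.

def td_critic_update(psi, phi, transition, beta=0.05, gamma=0.99):
    """One TD(0) update for a linear critic Q_psi(s, a) = psi @ phi(s, a).
    `transition` is (s, a, r, s_next, a_next), with a_next ~ pi_theta(.|s_next)."""
    s, a, r, s_next, a_next = transition
    q_sa = psi @ phi(s, a)
    q_next = psi @ phi(s_next, a_next)
    delta = r + gamma * q_next - q_sa          # TD error
    return psi + beta * delta * phi(s, a)      # grad of Q w.r.t. psi is phi(s, a)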
Bootstrapping
The following equality holds:
$$\nabla_\theta J(\theta) = \mathbb{E}_{(s,a) \sim d^{\pi_\theta}}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right],$$
where $d^{\pi_\theta}(s, a) \propto \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s, a_t = a)$ is a discounted state-action visitation. This last expression allows us to write a gradient estimator based on single samples, allowing us to improve the policy at each step.
Online estimator:
$$\widehat{\nabla_\theta J}(\theta) = Q_\psi(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$
where $(s_t, a_t)$ are observed by the interaction of the agent with the environment.
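As a one-line sketch, the corresponding single-sample actor step could look as follows; including the $\gamma^t$ discount on the step is one possible choice (implementations often drop it), and the argument names are illustrative.

def online_actor_update(theta, alpha, q_value, grad_log_pi, gamma=0.99, t=0):
    """theta <- theta + alpha * gamma^t * Q_psi(s_t, a_t) * grad log pi_theta(a_t | s_t)."""
    return theta + alpha * (gamma ** t) * q_value * grad_log_pi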
Policy Gradient
Notice that the equation in the previous slide forms the Policy Gradient Theorem:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right],$$
where $d^{\pi_\theta}(s) \propto \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s)$ and $Q^{\pi_\theta}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]$.
Online Actor-Critic
1: Input: a policy $\pi_\theta$, a critic $Q_\psi$, an initial-state distribution $\mu_0$, number of episodes $N$, episode truncation $T$, learning rates $\alpha$ for the actor and $\beta$ for the critic
2: for $i = 1, \dots, N$ do
3:   Sample $s_0 \sim \mu_0$.
4:   for $t = 0, \dots, T-1$ do
5:     Choose action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
6:     Execute the action on the environment, observe reward $r_t$ and next state $s_{t+1}$.
7:     Compute the TD error $\delta_t = r_t + \gamma Q_\psi(s_{t+1}, a_{t+1}) - Q_\psi(s_t, a_t)$, with $a_{t+1} \sim \pi_\theta(\cdot \mid s_{t+1})$.
8:     Critic update: $\psi \leftarrow \psi + \beta\, \delta_t\, \nabla_\psi Q_\psi(s_t, a_t)$.
9:     Actor update: $\theta \leftarrow \theta + \alpha\, \gamma^t\, Q_\psi(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
10:   end for
11: end for
12: Return $\theta$ and $\psi$.
Nevertheless, the presented estimator captures well the core idea of actor-critic algorithms.
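To close the section, here is a minimal end-to-end sketch of such an online actor-critic loop, assuming a gym-style environment, a state-feature map phi_s, a linear softmax actor, and a linear critic $Q_\psi(s,a) = \psi_a^\top \phi_s(s)$; all of these modeling choices are illustrative assumptions, not the slides' specific setup.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def online_actor_critic(env, phi_s, n_actions, d, n_episodes=500, horizon=200,
                        alpha=1e-3, beta=1e-2, gamma=0.99, seed=0):
    """Online actor-critic: linear softmax actor pi_theta(a|s) = softmax(theta @ phi_s(s)),
    linear critic Q_psi(s, a) = psi[a] @ phi_s(s). Gym-style env interface is assumed."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_actions, d))       # actor parameters, one row per action
    psi = np.zeros((n_actions, d))         # critic parameters, one row per action
    for _ in range(n_episodes):
        s = env.reset()
        x = phi_s(s)
        pi = softmax(theta @ x)
        a = rng.choice(n_actions, p=pi)
        discount = 1.0
        for t in range(horizon):
            s_next, r, done, _ = env.step(a)
            x_next = phi_s(s_next)
            pi_next = softmax(theta @ x_next)
            a_next = rng.choice(n_actions, p=pi_next)
            # Critic: SARSA-style TD(0) update of Q_psi
            q_sa = psi[a] @ x
            target = r + (0.0 if done else gamma * (psi[a_next] @ x_next))
            psi[a] += beta * (target - q_sa) * x
            # Actor: single-sample policy gradient step, using the critic as Q estimate
            grad_log = -np.outer(pi, x)
            grad_log[a] += x                       # grad log pi_theta(a|s) for a softmax-linear actor
            theta += alpha * discount * q_sa * grad_log
            if done:
                break
            s, x, pi, a = s_next, x_next, pi_next, a_next
            discount *= gamma
    return theta, psi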