
UNIT -5

POLICY BASED REINFORCEMENT LEARNING

SYLLABUS:

Policy Based Reinforcement Learning: Policy Gradient and Actor-Critic Methods—


REINFORCE Algorithm and Stochastic Policy Search, Vanilla Policy Gradient(VPG),
Asynchronous Advantage Actor-Critic (A3C), Generalized Advantage Estimation (GAE),
Advantage Actor-Critic(A2C), Deep Deterministic Policy Gradient (DDPG), Twin-Delayed
DDPG (TD3), Soft Actor-Critic (SAC), proximal policy optimization (PPO). (Chapter-11,12)

Policy Gradient Method:


Policy gradient methods are policy-iteration methods, meaning they model and optimize the policy directly. They are based on the idea that the agent's policy, a function that maps states to actions, can be improved by adjusting its parameters in the direction that maximizes the expected reward. This is done by estimating the gradient of the objective function, i.e. the expected return over all possible trajectories, with respect to the policy parameters. The gradient can be estimated in various ways, such as with Monte Carlo returns, TD targets, or GAE. Policy gradient methods are attractive because they can handle continuous action spaces, stochastic policies, and non-linear function approximation.
In these methods, the agent is modelled as a neural network known as the policy network.
Value-based vs. policy-based vs. policy-gradient vs. actor-critic methods:
Value-based methods: Refers to algorithms that learn value functions and only value
functions. Q-learning, SARSA, DQN, and company are all value-based methods.
Policy-based methods: Refers to a broad range of algorithms that optimize policies,
including black-box optimization methods, such as genetic algorithms.
Policy-gradient methods: Refers to methods that solve an optimization problem on the
gradient of the performance of a parameterized policy, methods you’ll learn in this chapter.
Actor-critic methods: Refers to methods that learn both a policy and a value function,
primarily if the value function is learned with bootstrapping and used as the score for the
stochastic policy gradient.

Algorithms in Policy Gradient and Value Based Methods:


Policy Gradient algorithm
The policy gradient algorithm is a policy-iteration approach in which the policy is directly manipulated to reach the optimal policy that maximises the expected return. Algorithms of this type are model-free reinforcement learning (RL): "model-free" indicates that there is no prior knowledge of the model of the environment. In other words, we do not know the environment dynamics or transition probability. The environment dynamics or transition probability is written as below:
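In standard notation, the dynamics are written as

P(st+1 | st, at) = Pr[ St+1 = st+1 | St = st, At = at ]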

It is read as the probability of reaching the next state st+1 by taking the action at from the current state st. Sometimes the transition probability is confused with the policy: the policy 𝜋 is a distribution over actions given states. In other words, the policy defines the behaviour of the agent.

The transition probability, in contrast, describes the dynamics of the environment, which are not readily available in many practical applications.

Advantages of policy-gradient methods


 Policy-based methods can more easily learn stochastic policies, which in turn has
multiple additional advantages. First, learning stochastic policies means better
performance under partially observable environments. The intuition is that because
we can learn arbitrary probabilities of actions, the agent is less dependent on the
Markov assumption. For example, if the agent can’t distinguish a handful of states
from their emitted observations, the best strategy is often to act randomly with
specific probabilities.
 Another advantage of learning stochastic policies is that it could be more
straightforward for function approximation to represent a policy than a value function.
Sometimes value functions are too much information for what’s truly needed. It could
be that calculating the exact value of a state or state-action pair is complicated or
unnecessary.

 A final advantage to mention is that because policies are parameterized with


continuous values, the action probabilities change smoothly as a function of the
learned parameters. Therefore, policy-based methods often have better convergence
properties. As you remember from previous chapters, value-based methods are prone
to oscillations and even divergence. One of the reasons for this is that tiny changes in
value-function space may imply significant changes in action space. A significant
difference in actions can create entirely unusual new trajectories, and therefore create
instabilities.


Return and reward


We can define our return as the sum of rewards from the current state to the goal state, i.e. the sum of rewards in a trajectory (here we consider a finite, undiscounted horizon).
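In standard notation:

R(τ) = Σ_{t=0}^{T−1} R(st, at)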

Where τ = (s0,a0,…,sT−1,aT−1).

Objective function
In policy gradient, the policy is usually modelled as a parameterized function of θ, πθ(a|s). From a mathematical perspective, an objective function is something we want to minimise or maximise. We consider a stochastic, parameterized policy πθ and aim to maximise the expected return using the objective function J(πθ) [7].
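In its standard form, this objective is

J(πθ) = E_{τ∼πθ} [ R(τ) ]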

Here R(st, at) is the reward obtained at timestep t by performing action at from the state st, and the sum of these rewards over a trajectory is written R(τ).
We maximise the objective function J, and hence the return, by adjusting the policy parameters θ to get the best policy; the best policy is the one that maximises the return. Gradient ascent is the optimisation algorithm that iteratively searches for the parameters that maximise the objective function.

If we can find out the gradient ∇ of the objective function J, as shown below:

The gradient ∇ of the objective function J


Then, we can update the policy parameter θ (for simplicity, we are going to use θ instead of πθ) using the gradient ascent rule. This way, we update the parameters θ in the direction of the gradient (remember that the gradient gives the direction of maximum change, and its magnitude gives the maximum rate of change). The gradient update rule is shown below:
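In standard notation:

θ_{k+1} = θ_k + α ∇_θ J(π_{θ_k})

where α is the learning rate (step size).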
Gradient ascent update rule
Let’s derive the policy gradient expression.
The expectation of a discrete random variable X can be defined as:
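In its usual form:

E[X] = Σ_x x P(x)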

Expectation general equation


where x is the value of random variable X and P(x) is the probability function of x.
Now we can rewrite our gradient as below:
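In standard form (this is the log-derivative trick):

∇_θ J(πθ) = E_{τ∼πθ} [ ∇_θ log P(τ|θ) R(τ) ]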

We can derive this equation as follows:
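A standard derivation, using ∇_θ P(τ|θ) = P(τ|θ) ∇_θ log P(τ|θ), is:

∇_θ J(πθ) = ∇_θ E_{τ∼πθ} [ R(τ) ]
          = ∇_θ Σ_τ P(τ|θ) R(τ)
          = Σ_τ ∇_θ P(τ|θ) R(τ)
          = Σ_τ P(τ|θ) ∇_θ log P(τ|θ) R(τ)
          = E_{τ∼πθ} [ ∇_θ log P(τ|θ) R(τ) ]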

Derivation of the expectation


Probability of trajectory with respect to parameter θ, P(τ|θ) can be expanded as follows:
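In standard notation:

P(τ|θ) = p(s0) ∏_{t=0}^{T−1} πθ(at|st) P(st+1|st, at)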

Where p(s0) is the probability distribution of the starting state and P(st+1|st, at) is the transition probability of reaching the new state st+1 by performing the action at from the state st.
If we take the log-probability of the trajectory, then it can be derived as below:
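In standard form:

log P(τ|θ) = log p(s0) + Σ_{t=0}^{T−1} [ log πθ(at|st) + log P(st+1|st, at) ]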

Taking the gradient of the log-probability of a trajectory thus gives [6][7]:
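∇_θ log P(τ|θ) = ∇_θ log p(s0) + Σ_{t=0}^{T−1} [ ∇_θ log πθ(at|st) + ∇_θ log P(st+1|st, at) ]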


Fig: The gradient of the log-probability of a trajectory

We can now simplify this expression, as shown below: the gradients of log p(s0) and log P(st+1|st, at) vanish because neither the starting-state distribution nor the transition probability depends on θ; this is also why the policy gradient is a model-free algorithm, since the transition probability model is never needed.
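That is:

∇_θ log P(τ|θ) = Σ_{t=0}^{T−1} ∇_θ log πθ(at|st)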

Final derivation of the gradient of the log-probability of a trajectory

We can now go back to the expectation in our derivation and replace the gradient of the log-probability of a trajectory with the expression derived above.

The gradient of the objective function.

Now the policy gradient expression is derived as
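In standard notation:

∇_θ J(πθ) = E_{τ∼πθ} [ Σ_{t=0}^{T−1} ∇_θ log πθ(at|st) R(τ) ]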

Policy gradient expression


The left-hand side of the equation can be replaced as below:
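A common sample-based form of this estimate, assuming N trajectories τ_i sampled from πθ, is

∇_θ J(πθ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T−1} ∇_θ log πθ(a_t^i | s_t^i) R(τ_i)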

Policy gradient expression

Actor Critic Methods


Before we return to baselines, let's first look at the vanilla policy gradient again to see how the Actor-Critic architecture comes in (and what it really is). Recall that:
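In the notation above, and writing G_t for the return obtained from timestep t onwards (a common refinement of R(τ)):

∇_θ J(θ) = E_{τ∼πθ} [ Σ_t ∇_θ log πθ(at|st) G_t ]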

We can then decompose the expectation into:
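One standard way to write the decomposition is:

∇_θ J(θ) = E_{st, at} [ ∇_θ log πθ(at|st) E[ G_t | st, at ] ]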


The second expectation term should be familiar; it is the Q value! (If you did not already know

this, I would suggest that you read up on value iteration and Q learning).

Plugging that in, we can rewrite the update equation as such:
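In standard form, with the Q-function approximated by a network with parameters w:

∇_θ J(θ) = E_{s, a ∼ πθ} [ ∇_θ log πθ(a|s) Q_w(s, a) ]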

As we know, the Q value can be learned by parameterizing the Q function with a neural
network (denoted by subscript w above).

This leads us to Actor Critic Methods, where:

 The “Critic” estimates the value function. This could be the action-value (the Q value)
or state-value (the V value).
 The “Actor” updates the policy distribution in the direction suggested by the Critic
(such as with policy gradients).
and both the Critic and Actor functions are parameterized with neural networks. In the
derivation above, the Critic neural network parameterizes the Q value — so, it is
called Q Actor Critic.
Below is the pseudocode for Q-Actor-Critic:
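A minimal sketch of one such update step is given below. It assumes PyTorch, a discrete action space, and illustrative network sizes, names, and hyperparameters; it is a sketch of the idea rather than a tuned implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):                     # policy network pi_theta(a|s)
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return F.softmax(self.net(s), dim=-1)

class Critic(nn.Module):                    # Q network Q_w(s, .), one output per action
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

def q_actor_critic_step(actor, critic, opt_actor, opt_critic,
                        s, a, r, s2, done, gamma=0.99):
    """One online update: the Critic learns Q by TD; the Actor follows grad log pi * Q."""
    # Critic: one-step TD target using an action sampled from the current policy.
    q_sa = critic(s)[a]
    with torch.no_grad():
        a2 = torch.multinomial(actor(s2), 1).item()
        target = r + gamma * critic(s2)[a2] * (1.0 - done)
    critic_loss = F.mse_loss(q_sa, target)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: score function scaled by the Critic's estimate of Q(s, a).
    log_pi = torch.log(actor(s)[a])
    actor_loss = -log_pi * q_sa.detach()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

Here s and s2 are float tensors of shape (obs_dim,), a is an integer action index, r and done are floats, and opt_actor and opt_critic are optimizers (e.g. torch.optim.Adam) over the respective network parameters.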
REINFORCE Algorithm

The REINFORCE algorithm is a popular policy-based reinforcement learning algorithm that is used to optimize the policy of an agent in an environment so as to maximize the cumulative reward.

 In the REINFORCE algorithm, the agent learns a parameterized policy that maps
from states to actions. The policy is typically represented as a neural network, with
the parameters of the neural network being the policy parameters that the algorithm
updates during training.

 The REINFORCE algorithm uses a technique called policy gradient to update the
policy parameters. The policy gradient is the gradient of the expected total reward
with respect to the policy parameters.

 The REINFORCE algorithm also typically uses a baseline to reduce the variance of
the gradient estimates. The baseline is an estimate of the expected total reward, and it
is subtracted from the total reward of each trajectory to reduce the variance of the
policy gradient estimates. The baseline can be a constant value, a learned value
function, or an estimate based on the current policy.

The name REINFORCE stands for "REward INcrement = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility".

The objective of the policy is to maximize the “Expected reward”.

Each policy generates the probability of taking an action in each state of the environment.

Policy 1 vs Policy 2 — Different Trajectories

The agent samples from these probabilities and selects an action to perform in the environment.

At the end of an episode, we know the total rewards the agent can get if it follows that policy.
We backpropagate the reward through the path the agent took to estimate the “Expected

reward” at each state for a given policy.

Here the discounted reward is the sum of all the rewards the agent receives in the future, discounted by a factor gamma.

The discounted reward at any stage is the reward received at the next step plus a discounted sum of all rewards the agent receives after that.

The discounted reward is calculated for each state by working backwards through the rewards of the episode.
The objective of the policy is to maximize the "Expected reward". As per the original implementation of the REINFORCE algorithm, the expected reward is calculated as the sum of the products of the log action probabilities and the discounted rewards.
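In standard form, writing G_t for the discounted return from timestep t,

G_t = r_{t+1} + γ G_{t+1}

and the quantity being maximised is

Σ_t log πθ(at|st) G_t

(in practice the loss that is minimised is the negative of this sum).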

Algorithm steps

The steps involved in the implementation of REINFORCE would be as follows:

1. Initialize a Random Policy (a NN that takes the state as input and returns the probability

of actions)

2. Use the policy to play N steps of the game, recording the action probabilities (from the policy), the rewards (from the environment), and the actions (sampled by the agent)

3. Calculate the discounted reward for each step by working backwards through the episode

4. Calculate expected reward G

5. Adjust weights of Policy (back-propagate error in NN) to increase G

6. Repeat from 2
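A minimal sketch of these steps, assuming PyTorch, a policy network that outputs logits over discrete actions, and an environment with the classic Gym reset/step interface (all names and hyperparameters are illustrative):

import torch
import torch.nn as nn

def run_episode(env, policy):
    """Step 2: play one episode, recording log-probabilities and rewards."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        s, r, done, _ = env.step(a.item())
        log_probs.append(dist.log_prob(a))
        rewards.append(r)
    return log_probs, rewards

def discounted_returns(rewards, gamma=0.99):
    """Step 3: accumulate rewards backwards through the episode."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return torch.as_tensor(returns, dtype=torch.float32)

def reinforce_update(policy, optimizer, log_probs, returns):
    """Steps 4-5: maximise sum(log pi * G) by minimising its negative."""
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Step 1 (illustrative): policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
#                                               nn.Linear(64, n_actions))
# Step 6: repeat run_episode -> discounted_returns -> reinforce_update in a training loop.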

Stochastic Policy Search:


Stochastic policy search is a type of policy-based reinforcement
learning method that involves learning a stochastic policy directly. Instead of
trying to estimate the optimal value function or the optimal action-value
function, as in value-based reinforcement learning methods, stochastic policy
search algorithms directly optimize the policy by searching for the policy
parameters that maximize the expected total reward.

The basic idea behind stochastic policy search is to maintain a distribution over
policies, and to update this distribution in the direction of better policies by
performing gradient ascent on the expected reward objective. Specifically, given a
current policy parameterized by θ, the algorithm samples a set of perturbations δ_i
from a distribution N(0, σ^2), and then evaluates the resulting policies θ_i = θ + δ_i.
The objective is then to maximize the expected reward of the perturbed policies:

J(θ) = E_π_θ [R]

where R is the total reward obtained by following the policy π_θ
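A minimal sketch of this perturb-and-evaluate loop, written here in the style of an evolution-strategies update, assuming NumPy and a user-supplied evaluate(theta) function that returns the total reward of the policy with parameters theta (all names and constants are illustrative):

import numpy as np

def stochastic_policy_search(evaluate, theta, sigma=0.1, pop_size=20,
                             lr=0.02, iterations=100):
    """Perturb the policy parameters, evaluate each perturbation, and move
    theta toward the perturbations that scored above average (a stochastic
    estimate of the gradient of J(theta))."""
    for _ in range(iterations):
        deltas = np.random.randn(pop_size, theta.size) * sigma      # delta_i ~ N(0, sigma^2)
        rewards = np.array([evaluate(theta + d) for d in deltas])   # R of each perturbed policy
        advantage = rewards - rewards.mean()
        theta = theta + lr / (pop_size * sigma) * (deltas.T @ advantage)
    return theta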


Vanilla Policy Gradient (VPG):

 VPG is an on-policy algorithm.


 VPG can be used for environments with either discrete or continuous action spaces.
 The Spinning Up implementation of VPG supports parallelization with MPI.

Exploration vs. Exploitation


VPG trains a stochastic policy in an on-policy way. This means that it explores by sampling actions
according to the latest version of its stochastic policy. The amount of randomness in action selection
depends on both initial conditions and the training procedure. Over the course of training, the policy
typically becomes progressively less random, as the update rule encourages it to exploit rewards that it
has already found. This may cause the policy to get trapped in local optima.
Asynchronous Advantage Actor-Critic (A3C) Algorithm

The A3C algorithm is one of RL’s state-of-the-art algorithms, and it beats DQN in a few domains. Also, A3C can be beneficial in experiments that involve some global network optimization with different environments in parallel for generalization purposes.

“Asynchronous” marks the principal difference of this algorithm from DQN, where a single neural network interacts with a single environment. Here, on the contrary, we have a global network and multiple agents, each with its own set of parameters. Every agent interacts with its own copy of the environment and harvests a different, unique learning experience that feeds the overall training. This also partially deals with the correlation of RL samples, a big problem for neural networks, which are optimized under the assumption that input samples are independent of each other (an assumption that does not hold for consecutive frames of a game).
A3C (Asynchronous Advantage Actor-critic) is a reinforcement learning algorithm that combines the
actor-critic method with the advantage function. It is an asynchronous version of the A2C (Advantage
actor-critic) algorithm, which means that multiple agents can run in parallel and update the same network
weights independently.

Actor-Critic stands for the two neural networks involved — the Actor and the Critic. The goal of the Actor is to optimize the policy (“How to act?”), and the Critic aims at estimating the value (“How good is the action?”). Together, this creates a complementary situation that lets the agent gain experience and learn fast.

Advantage: think of the advantage as the value that answers the question: "How much better is the reward the agent obtained than what it could have expected?" It is the other factor that improves the overall situation for the agent: in this way, the agent learns which actions were rewarding and which were penalizing. Formally it looks like this:
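A(s, a) = Q(s, a) − V(s)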
Q(s, a) stands for the expected future reward of taking action a in a particular state s, and V(s) stands for the value of being in that state.

GAE: Robust advantage estimation


A3C uses n-step returns for reducing the variance of the targets. Still, as you probably remember
from chapter 5, there’s a more robust method that combines multiple n-step bootstrapping targets in a
single target, creating even more robust targets than a single n-step: the λ-target. Generalized advantage
estimation (GAE) is analogous to the λ-target in TD(λ), but for advantages.
Generalized advantage estimation:
GAE is not an agent on its own, but a way of estimating targets for the advantage function that
most actor-critic methods can leverage. More specifically, GAE uses an exponentially weighted
combination of n-step action-advantage function targets, the same way the λ-target is an exponentially
weighted combination of n-step state-value function targets. This type of target, which we tune in the
same way as the λ-target, can substantially reduce the variance of policy-gradient estimates at the cost of
some bias.
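In standard notation, with the one-step TD error δ_t = r_t + γ V(s_{t+1}) − V(s_t), the GAE target is

A_t^{GAE(γ,λ)} = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}

where λ plays the same mixing role as in the λ-target: λ = 0 recovers the one-step advantage estimate and λ = 1 the Monte Carlo estimate.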
Advantage Actor Critic (A2C):
The Actor-Critic algorithm is a reinforcement learning agent that combines value-optimization and policy-optimization approaches. More specifically, the Actor-Critic combines the Q-learning and Policy Gradient algorithms. At a high level, the resulting algorithm involves a cycle that shares features between:
 Actor: a PG algorithm that decides on an action to take;
 Critic: Q-learning algorithm that critiques the action that the Actor selected, providing
feedback on how to adjust. It can take advantage of efficiency tricks in Q-learning, such
as memory replay.

The advantage of the Actor-Critic algorithm is that it can solve a broader range of problems than DQN (Deep Q-Network), while it has lower variance in performance relative to REINFORCE. That said, because of the presence of the PG algorithm within it, the Actor-Critic is still somewhat sample inefficient.

How Actor-Critic works:


Imagine you play a video game with a friend who provides you with feedback. You're the Actor, and your friend is the Critic:

In the beginning, you don't know how to play, so you try some actions randomly. The Critic observes your actions and provides feedback. Let's take another look at the vanilla policy gradient to see how the Actor-Critic architecture comes in (and what it is):

We update both the Actor (policy) network and the Critic (value) network at each update step.

Intuitively, this measures how much better it is to take a specific action than the average action at the given state. So, using the Value function as the baseline, we subtract the Value from the Q value; we call this quantity the advantage value.
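With this baseline, the policy-gradient update takes the standard form

∇_θ J(θ) = E [ ∇_θ log πθ(a|s) A(s, a) ],   where A(s, a) = Q(s, a) − V(s).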
Restoring order in policy updates:
In A2C, we move the workers from the agent down to the environment. Instead of having
multiple actor-learners, we have multiple actors with a single learner. As it turns out, having
workers rolling out experiences is where the gains are in policy-gradient methods.
NOTE:
For more information about the “cart-pole” problem refer to reference no:6

Deep Deterministic Policy Gradient (DDPG)


Deep deterministic policy gradient (DDPG) can be seen as an approximate DQN, or
better yet, a DQN for continuous action spaces. DDPG uses many of the same techniques found
in DQN: it uses a replay buffer to train an action-value function in an off-policy manner, and
target networks to stabilize training. However, DDPG also trains a policy that approximates the
optimal action. Because of this, DDPG is a deterministic policy-gradient method restricted to
continuous action spaces.
DDPG uses many tricks from DQN:
Start by visualizing DDPG as an algorithm with the same architecture as DQN. The
training process is similar: the agent collects experiences in an online manner and stores these
online experience samples into a replay buffer. On every step, the agent pulls out a mini-batch
from the replay buffer that is commonly sampled uniformly at random. The agent then uses this
mini-batch to calculate a bootstrapped TD target and train a Q-function.
The main difference between DQN and DDPG is that while DQN uses the target Q-
function for getting the greedy action using an argmax, DDPG uses a target deterministic policy
function that is trained to approximate that greedy action. Instead of using the argmax of the Q-
function of the next state to get the greedy action as we do in DQN, in DDPG, we directly
approximate the best action in the next state using a policy function. Then, in both, we use that
action with the Q-function to get the max value.
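In standard form (writing μ for the deterministic policy and primes for the target networks), the TD target and the two losses are:

y = r + γ Q'(s', μ'(s'))

critic loss:  ( Q(s, a) − y )²
actor objective:  maximise Q(s, μ(s)) with respect to the policy parameters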

Network Schematics
DDPG uses four neural networks: a Q network, a deterministic policy network, a target Q
network, and a target policy network.

The Q network and the policy network are very much like those in simple Advantage Actor-Critic, but in DDPG the Actor directly maps states to actions (the output of the network is directly the action) instead of outputting a probability distribution across a discrete action space.
The target networks are time-delayed copies of their original networks that slowly track the learned networks. Using these target networks greatly improves stability in learning.

Here’s why: in methods that do not use target networks, the update equations of the network depend on the values calculated by the network itself, which makes training prone to divergence.

For example:
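without target networks, the critic's target would be

y = r + γ Q_w(s', μ_θ(s'))

which is computed by the very same Q_w (and policy μ_θ) that the update is changing, so the target shifts after every step.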

So, we have the standard Actor & Critic architecture for the deterministic policy
network and the Q network.

Twin Delayed DDPG (TD3PG):


TD3 introduces three main changes to the main DDPG algorithm. First, it adds a double
learning technique, like what you learned in double Q-learning and DDQN, but this time with a
unique “twin” network architecture. Second, it adds noise, not only to the action passed into the
environment but also to the target actions, making the policy network more robust to
approximation error. And third, it delays updates to the policy network, its target network, and
the twin target network, so that the twin network updates more frequently.
I. Double learning in DDPG
In TD3, we use a particular kind of Q-function network with two separate streams that
end on two separate estimates of the state-action pair in question. For the most part, these
two streams are totally independent, so one can think about them as two separate
networks. However, it’d make sense to share feature layers if the environment was
image-based. That way CNN would extract common features and potentially learn faster.
Nevertheless, sharing layers is also usually harder to train, so this is something you’d
have to experiment with and decide by yourself.

II. Smoothing the targets used for policy updates.


To improve exploration in DDPG, we inject Gaussian noise into the action used for the
environment. In TD3, we take this concept further and add noise, not only to the action
used for exploration, but also to the action used to calculate the targets. Training the
policy with noisy targets can be seen as a regularizer because now the network is forced
to generalize over similar actions. This technique prevents the policy network from converging to incorrect actions because, early on during training, the Q-functions can prematurely assign inaccurate values to certain actions. The noise over the actions spreads that value over a broader range of actions than otherwise.
III. Delaying updates:
The final improvement that TD3 applies over DDPG is delaying the updates to the policy network
and target networks so that the online Q-function updates at a higher rate than the rest. Delaying
these networks is beneficial because often, the online Q-function changes shape abruptly early
in the training process. Slowing down the policy so that it updates after a couple of value
function updates allows the value function to settle into more accurate values before we let it
guide the policy. The recommended delay for the policy and target networks is every other
update to the online Q-function. The other thing that you may notice in the policy updates is
that we must use one of the streams of the online value model for getting the estimated Q-value
for the action coming from the policy. In TD3, we use one of the two streams, but the same
stream every time.
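A minimal sketch of the TD3 target computation, assuming PyTorch and target networks named actor_target, critic1_target, and critic2_target (all names, noise scales, and bounds are illustrative):

import torch

def td3_target(r, s2, done, actor_target, critic1_target, critic2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_low=-1.0, act_high=1.0):
    """Clipped double-Q target with target-policy smoothing, as used in TD3."""
    with torch.no_grad():
        a2 = actor_target(s2)
        # Smooth the target action with clipped Gaussian noise.
        noise = (torch.randn_like(a2) * noise_std).clamp(-noise_clip, noise_clip)
        a2 = (a2 + noise).clamp(act_low, act_high)
        # Take the smaller of the twin estimates to limit overestimation.
        q_next = torch.min(critic1_target(s2, a2), critic2_target(s2, a2))
        return r + gamma * (1.0 - done) * q_next

# The policy and the target networks are then updated only every other
# critic update (the "delayed" part of TD3).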
NOTE:
Pseudo Code for TD3 and DDPG are in Reference links no. 8 and 7 respectively.
Soft Actor Critic (SAC):
SAC is defined for RL tasks involving continuous actions. The biggest feature of SAC is
that it uses a modified RL objective function. Instead of only seeking to maximize the lifetime
rewards, SAC seeks to also maximize the entropy of the policy. The term ‘entropy’ has a rather
esoteric definition and many interpretations depending on the application, but I’d like to share an
intuitive explanation here. We can think of entropy as how unpredictable a random variable is. If
a random variable always takes a single value, then it has zero entropy because it’s not
unpredictable at all. If a random variable can be any Real Number with equal probability, then it
has very high entropy as it is very unpredictable. Why do we want our policy to have high
entropy? We want a high entropy in our policy to explicitly encourage exploration, to encourage

the policy to assign equal probabilities to actions that have same or nearly equal Q-values, and to
ensure that it does not collapse into repeatedly selecting a particular action that could exploit
some inconsistency in the approximated Q function. Therefore, SAC overcomes the brittleness
problem by encouraging the policy network to explore and not assign a very high probability to
any one part of the range of actions.
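In standard form, this modified objective is

J(π) = Σ_t E_{(st, at)∼π} [ r(st, at) + α H( π(·|st) ) ]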

Objective Function consisting of both a reward term and an entropy term H weighted by α

Now that we know what we are optimizing for, let us understand how we go about doing the optimization. SAC makes use of three networks: a state value function V parameterized by ψ, a soft Q-function Q parameterized by θ, and a policy function π parameterized by ϕ. While in principle there is no need for separate approximators for the V and Q functions, which are related through the policy, the authors report that in practice having separate function approximators helps convergence. So we train the three function approximators as follows:

1. We train the Value network by minimizing the following error:
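In standard form (following the notation of the SAC paper):

J_V(ψ) = E_{s∼D} [ ½ ( V_ψ(s) − E_{a∼π_ϕ} [ Q_θ(s, a) − log π_ϕ(a|s) ] )² ]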


What the equation says is that across all the states that we sample from our experience replay buffer, we need to decrease the squared difference between the prediction of our value network and the expected prediction of the Q function plus the entropy of the policy function π (measured here by the negative log of the policy function).

2. We train the Q network by minimizing the following error:
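In standard form:

J_Q(θ) = E_{(s,a)∼D} [ ½ ( Q_θ(s, a) − Q̂(s, a) )² ]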

Where,
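Q̂(s, a) = r(s, a) + γ E_{s'∼p} [ V_ψ̄(s') ]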

Minimizing this objective function amounts to the following: For all (state, action) pairs in the
experience replay buffer, we want to minimize the squared difference between the prediction of
our Q function and the immediate (one time-step) reward plus the discounted expected Value of
the next state. Note that the Value comes from a Value function parameterized by ψ with a bar on
top of it. This is an additional Value function called the target value function. We’ll get into why
we need this but for now, don’t worry about it and just think of it as a Value function that we’re

training.

We use the following approximation of the derivative of the above objective to update the parameters of the Q function:
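Following the SAC paper, this estimate is

∇̂_θ J_Q(θ) = ∇_θ Q_θ(s, a) ( Q_θ(s, a) − r(s, a) − γ V_ψ̄(s') )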

3. We train the Policy network π by minimizing the following error:
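In standard form:

J_π(ϕ) = E_{s∼D} [ D_KL ( π_ϕ(·|s) ‖ exp(Q_θ(s, ·)) / Z_θ(s) ) ]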

This objective function looks complex but it’s saying something very simple. The DKL function
that you see inside the expectation is called the Kullback-Leibler Divergence. I highly
recommend that you read up on the KL divergence since it shows up a lot in deep learning
research and applications these days. For the purposes of this tutorial, you can interpret it as how
different the two distributions are. So, this objective function is basically trying to make the
distribution of our Policy function look more like the distribution of the exponentiation of our Q
Function normalized by another function Z.

To minimize this objective, the authors use something called the reparameterization trick. This
trick is used to make sure that sampling from the policy is a differentiable process so that there are
no problems in backpropagating the errors. The policy is now parameterized as follows:
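In one standard form, the action is written as a deterministic function of the state and a noise vector:

a = f_ϕ(ε; s),   ε ∼ N(0, I)

(in practice f_ϕ is often a squashed Gaussian, e.g. tanh(μ_ϕ(s) + σ_ϕ(s) ⊙ ε)).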
The epsilon term is a noise vector sampled from a Gaussian distribution (see reference 10 for the implementation details).

Now, we can express the objective function as follows:
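In standard form:

J_π(ϕ) = E_{s∼D, ε∼N} [ log π_ϕ( f_ϕ(ε; s) | s ) − Q_θ( s, f_ϕ(ε; s) ) ]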

The normalizing function Z is dropped since it does not depend on the parameter ϕ. An unbiased
estimator for the gradient of the above objective is given as follows:
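Following the SAC paper, this estimator is

∇̂_ϕ J_π(ϕ) = ∇_ϕ log π_ϕ(a|s) + ( ∇_a log π_ϕ(a|s) − ∇_a Q_θ(s, a) ) ∇_ϕ f_ϕ(ε; s),   with a = f_ϕ(ε; s).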

NOTE: For Implementation go through reference link – 10


For theory of SAC can also go through textbook- Chapter 12 (Pg No: 391)

Proximal policy optimization (PPO)


PPO is an algorithm with the same underlying architecture as A2C. PPO can reuse much
of the code developed for A2C.
The critical innovation in PPO is a surrogate objective function that allows an on-policy
algorithm to perform multiple gradient steps on the same mini-batch of experiences.
PPO introduces a clipped objective function that prevents the policy from getting too
different after an optimization step. By optimizing the policy conservatively, we not only
prevent performance collapse due to the innate high variance of on-policy policy gradient
methods but also can reuse mini-batches of experiences and perform multiple
optimization steps per mini-batch. The ability to reuse experiences makes PPO a more
sample-efficient method than other on-policy methods.

PPO is an on-policy algorithm.


PPO can be used for environments with either discrete or continuous action spaces.

a) Using the same actor-critic architecture as A2C


Think of PPO as an improvement to A2C. What I mean by that is that even though in this chapter we have learned about DDPG, TD3, and SAC, and all these algorithms have commonality, PPO should not be confused with an improvement to SAC. TD3 is a direct improvement to DDPG. SAC was developed concurrently with TD3; however, the SAC authors published a second version of the SAC paper shortly after the first one, which includes several of the features of TD3. So while SAC isn’t a direct improvement to TD3, it does share several features with it. PPO, on the other hand, is an improvement to A2C, and we reuse part of the A2C code. More specifically, we sample parallel environments to gather the mini-batches of data and use GAE for the policy targets.

b) Batching experiences
One of the features of PPO that A2C didn’t have is that with PPO, we can reuse experience samples. To take advantage of this, we could gather large trajectory batches, as in
NFQ, and “fit” the model to the data, optimizing it repeatedly. However, a better
approach is to create a replay buffer and sample a large mini batch from it on every
optimization step. That gives the effect of stochasticity on each mini-batch because
samples aren’t always the same, yet we likely reuse all samples in the long term.
c) Clipping the policy updates:
The main issue with the regular policy gradient is that even a small change in
parameter space can lead to a big difference in performance. The discrepancy between
parameter space and performance is why we need to use small learning rates in policy-
gradient methods, and even so, the variance of these methods can still be too large. The
whole point of clipped PPO is to put a limit on the objective such that on each training
step, the policy is only allowed to be so far away. Intuitively, you can think of this
clipped objective as a coach preventing overreacting to outcomes. Did the team get a
good score last night with a new tactic? Great, but don’t exaggerate. Don’t throw away
a whole season of results for a new result. Instead, keep improving a little bit at a time.
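In standard form, with the probability ratio r_t(θ) = πθ(at|st) / πθ_old(at|st), the clipped surrogate objective is

L^CLIP(θ) = E_t [ min( r_t(θ) Â_t,  clip( r_t(θ), 1−ε, 1+ε ) Â_t ) ]

where Â_t is the advantage estimate (for example from GAE) and ε is a small clipping coefficient such as 0.1 or 0.2.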
d) Clipping the value function updates:
We can apply a similar clipping strategy to the value function with the same core concept: let the
changes in parameter space change the Q-values only this much, but not more. As you can tell,
this clipping technique keeps the variance of the things we care about smooth, whether changes
in parameter space are smooth or not. We don’t necessarily need small changes in parameter
space; however, we’d like level changes in performance and values.
Some Other References:
1. Policy Gradient Algorithms | Lil'Log (lilianweng.github.io)
2. PyLessons
3. Advantage Actor Critic (A2C) (huggingface.co)
4. Baseline for Policy Gradients that All Deep Learning Enthusists Must Know
(analyticsvidhya.com)
5. Advantage Actor Critic Tutorial: minA2C | by Mike Wang | Towards Data Science
6. Using Q-Learning for OpenAI’s CartPole-v1 | by Ali Fakhry | The Startup | Medium
7. Deep Deterministic Policy Gradient — Spinning Up documentation (openai.com)
8. Twin Delayed DDPG — Spinning Up documentation (openai.com)
9. Soft Actor-Critic — Spinning Up documentation (openai.com)
10. Soft Actor-Critic Demystified. An intuitive explanation of the theory… | by VaishakV.Kumar |
Towards Data Science
11. Proximal Policy Optimization — Spinning Up documentation (openai.com)
12. Proximal Policy Optimization (PPO) (huggingface.co)
