Unit 5 - Policy Based
SYLLABUS:
The policy gradient algorithm is a policy iteration approach in which the policy is directly manipulated to reach the optimal policy that maximises the expected return. This type of algorithm is model-free reinforcement learning (RL). Model-free means that there is no prior knowledge of the model of the environment; in other words, we do not know the environment dynamics or transition probabilities. The environment dynamics, or transition probability, is written as:

P(st+1 | st, at)
It can be read as the probability of reaching the next state st+1 by taking the action at from the current state st. The transition probability is sometimes confused with the policy: the policy π(a|s) is a distribution over actions given states; in other words, the policy defines the behaviour of the agent. The transition probability, in contrast, describes the dynamics of the environment, which is not readily available in many practical applications.
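To make the distinction concrete, here is a tiny Python sketch with made-up numbers: the transition probabilities belong to the environment (and are usually unknown to a model-free agent), while the policy belongs to the agent.

# Toy illustration (hypothetical numbers) of transition probability vs. policy.

# Transition probability P(s' | s, a): a property of the environment.
P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
}

# Policy pi(a | s): a distribution over actions given the state,
# chosen and updated by the agent.
pi = {
    "s0": {"left": 0.3, "right": 0.7},
}

# Both are probability distributions, but over different things.
assert abs(sum(P[("s0", "right")].values()) - 1.0) < 1e-9
assert abs(sum(pi["s0"].values()) - 1.0) < 1e-9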
A trajectory is denoted τ = (s0, a0, …, sT−1, aT−1).
Objective function
In policy gradient, the policy is usually modelled with a parameterized function of θ, πθ(a|s). From a mathematical perspective, an objective function is something to minimise or maximise. We consider a stochastic, parameterized policy πθ and aim to maximise the expected return using the objective function J(πθ) [7]:

J(πθ) = Eτ∼πθ [ Σt R(st, at) ] = Eτ∼πθ [ R(τ) ]
Here R(st, at) is the reward obtained at timestep t by performing the action at from the state st, and the total reward of a trajectory can be written compactly as R(τ).
We can maximise the objective function J, and therefore the return, by adjusting the policy parameters θ to obtain the best policy; the best policy always maximises the return. Gradient ascent is the optimisation algorithm that iteratively searches for the parameters that maximise the objective function:

θ ← θ + α ∇θ J(πθ)
We therefore need the gradient ∇θ of the objective function J. Writing the expectation over trajectories explicitly and applying the log-derivative trick:

∇θ J(πθ) = ∇θ Eτ∼πθ [ R(τ) ] = Eτ∼πθ [ ∇θ log P(τ|θ) R(τ) ]

The probability of a trajectory under the policy is

P(τ|θ) = p(s0) ∏t πθ(at|st) P(st+1|st, at)
where p(s0) is the probability distribution of the starting state and P(st+1|st, at) is the transition probability of reaching the new state st+1 by performing the action at from the state st.
Taking the log-probability of the trajectory, this becomes:

log P(τ|θ) = log p(s0) + Σt [ log πθ(at|st) + log P(st+1|st, at) ]
We can now differentiate with respect to θ. The terms log p(s0) and log P(st+1|st, at) disappear because they do not depend on θ, which is exactly what we need for a model-free policy gradient algorithm where the transition probability model is not available:

∇θ log P(τ|θ) = Σt ∇θ log πθ(at|st)
We can now go back to the expectation and replace the gradient of the log-probability of a trajectory with the expression derived above:

∇θ J(πθ) = Eτ∼πθ [ Σt ∇θ log πθ(at|st) R(τ) ]
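As a rough illustration of how this estimator is used in practice, the sketch below performs one gradient-ascent step from sampled trajectories. The names here are illustrative assumptions, not part of the text above: a `policy` network that returns a torch.distributions.Categorical for a state, trajectories stored as (state, action, reward) tuples, and a torch optimizer.

import torch

def reinforce_gradient_step(policy, trajectories, optimizer):
    """One gradient-ascent step on J(theta) using E[ sum_t grad log pi(a_t|s_t) * R(tau) ]."""
    loss = 0.0
    for traj in trajectories:
        ret = sum(r for _, _, r in traj)  # R(tau): total return of the trajectory
        log_prob = sum(policy(torch.as_tensor(s, dtype=torch.float32)).log_prob(torch.as_tensor(a))
                       for s, a, _ in traj)
        loss = loss - log_prob * ret      # negate because optimizers minimize
    loss = loss / len(trajectories)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()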
A natural refinement is to replace the return R(τ) in this gradient with the Q value of the state-action pair (if you are not familiar with this, I would suggest that you read up on value iteration and Q-learning):

∇θ J(πθ) = Eπθ [ Σt ∇θ log πθ(at|st) Qw(st, at) ]
As we know, the Q value can be learned by parameterizing the Q function with a neural
network (denoted by subscript w above).
The “Critic” estimates the value function. This could be the action-value (the Q value)
or state-value (the V value).
The “Actor” updates the policy distribution in the direction suggested by the Critic
(such as with policy gradients).
Both the Critic and Actor functions are parameterized with neural networks. In the derivation above, the Critic neural network parameterizes the Q value, so it is called the Q Actor-Critic.
Below is the pseudocode for Q-Actor-Critic:
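The original pseudocode is not reproduced here; the following is a minimal Python sketch of the same loop under some assumptions: a Gymnasium-style environment, an `actor(s)` that returns a torch.distributions.Categorical over actions, and a `critic(s, a)` that returns Qw(s, a). Names are illustrative.

import torch
import torch.nn.functional as F

def q_actor_critic_episode(env, actor, critic, actor_opt, critic_opt, gamma=0.99):
    """Run one episode of the Q Actor-Critic loop (illustrative sketch)."""
    state, _ = env.reset()
    done = False
    while not done:
        s = torch.as_tensor(state, dtype=torch.float32)
        dist = actor(s)
        action = dist.sample()
        next_state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated

        q_value = critic(s, action).squeeze()

        # Actor update: ascend grad log pi(a|s) * Q_w(s, a)
        actor_loss = -dist.log_prob(action) * q_value.detach()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Critic update: one-step TD target r + gamma * Q_w(s', a'), with a' ~ pi(.|s')
        s_next = torch.as_tensor(next_state, dtype=torch.float32)
        target = torch.as_tensor(reward, dtype=torch.float32)
        if not done:
            with torch.no_grad():
                next_action = actor(s_next).sample()
                target = target + gamma * critic(s_next, next_action).squeeze()
        critic_loss = F.mse_loss(q_value, target)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        state = next_state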
REINFORCE Algorithm
In the REINFORCE algorithm, the agent learns a parameterized policy that maps
from states to actions. The policy is typically represented as a neural network, with
the parameters of the neural network being the policy parameters that the algorithm
updates during training.
The REINFORCE algorithm uses a technique called policy gradient to update the
policy parameters. The policy gradient is the gradient of the expected total reward
with respect to the policy parameters.
The REINFORCE algorithm also typically uses a baseline to reduce the variance of
the gradient estimates. The baseline is an estimate of the expected total reward, and it
is subtracted from the total reward of each trajectory to reduce the variance of the
policy gradient estimates. The baseline can be a constant value, a learned value
function, or an estimate based on the current policy.
The policy generates the probability of taking each action in each state of the environment. The agent samples from these probabilities and selects an action to perform in the environment. At the end of an episode, we know the total reward the agent can get if it follows that policy.
We backpropagate the reward through the path the agent took to estimate the "Expected reward" at each state. Here the discounted reward is the sum of all the rewards the agent receives from that point onwards, with each future reward reduced by the discount factor γ. The discounted reward at any step is the reward received at the next step plus a discounted sum of all later rewards:

Gt = rt+1 + γ rt+2 + γ² rt+3 + …

The discounted reward is computed for each state by backpropagating the rewards from the end of the episode. The objective of the policy is to maximize the "Expected reward".
Following the original implementation of the REINFORCE algorithm, the Expected reward is estimated as the sum, over the trajectory, of the log-probability of the chosen action multiplied by the discounted reward from that step:

J(θ) ≈ Σt log πθ(at|st) · Gt
Algorithm steps
1. Initialize a Random Policy (a NN that takes the state as input and returns the probability
of actions)
2. Use the policy to play N steps of the game, recording the action probabilities (from the policy), the rewards (from the environment), and the actions (sampled by the agent)
3. Calculate the discounted reward for each step by backpropagating the rewards
4. Calculate the expected reward G
5. Adjust the weights of the policy (backpropagation of the error in the NN) to increase G
6. Repeat from 2
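A minimal sketch of these steps, assuming a `policy` network that returns a torch.distributions.Categorical for a batch of states, and using the mean return as a simple constant baseline to reduce variance:

import numpy as np
import torch

def discounted_returns(rewards, gamma=0.99):
    """Backpropagate rewards to obtain the discounted return G_t for every step."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    """One REINFORCE update on a recorded episode (sketch, names illustrative)."""
    returns = torch.as_tensor(discounted_returns(rewards, gamma))
    baseline = returns.mean()                 # constant baseline reduces variance
    advantages = returns - baseline

    dist = policy(torch.as_tensor(np.array(states), dtype=torch.float32))
    log_probs = dist.log_prob(torch.as_tensor(actions))
    loss = -(log_probs * advantages).mean()   # negated to maximize expected return

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()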
The basic idea behind stochastic policy search is to maintain a distribution over
policies, and to update this distribution in the direction of better policies by
performing gradient ascent on the expected reward objective. Specifically, given a
current policy parameterized by θ, the algorithm samples a set of perturbations δ_i
from a distribution N(0, σ^2), and then evaluates the resulting policies θ_i = θ + δ_i.
The objective is then to maximize the expected reward of the perturbed policies:

maxθ Eδ∼N(0, σ²) [ R(θ + δ) ]
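A numpy sketch of one such update is shown below. The helper `evaluate_return(theta)` is a hypothetical function assumed to run one or more episodes with policy parameters `theta` and report the average episode reward.

import numpy as np

def perturbation_search_step(theta, evaluate_return, sigma=0.1, lr=0.01, n_samples=20):
    """One gradient-ascent step of perturbation-based stochastic policy search (sketch)."""
    deltas = sigma * np.random.randn(n_samples, theta.size)     # delta_i ~ N(0, sigma^2)
    rewards = np.array([evaluate_return(theta + d) for d in deltas])

    # Score-function estimate of the gradient of the expected reward
    # of the perturbed policies (evolution-strategies-style search).
    grad = (deltas.T @ (rewards - rewards.mean())) / (n_samples * sigma ** 2)
    return theta + lr * grad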
The A3C algorithm is one of RL's state-of-the-art algorithms, and it beats DQN in a few domains. Also, A3C can be beneficial in experiments that involve optimizing a global network with several different environments in parallel, for generalization purposes.
Asynchronous refers to the principal difference between this algorithm and DQN, where a single neural network interacts with a single environment. Here, in contrast, we have a global network with multiple agents, each having its own set of parameters. Each agent interacts with its own copy of the environment and harvests a different, unique learning experience for the overall training. This also partially addresses the sample-correlation problem in RL, a big issue for neural networks, which are optimized under the assumption that input samples are independent of each other (which is not the case in games).
A3C (Asynchronous Advantage Actor-critic) is a reinforcement learning algorithm that combines the
actor-critic method with the advantage function. It is an asynchronous version of the A2C (Advantage
actor-critic) algorithm, which means that multiple agents can run in parallel and update the same network
weights independently.
Actor-Critic stands for two neural networks — Actor and Critic.
The goal of the Actor is to optimize the policy ("How to act?"), while the Critic aims to estimate the value ("How good is the action?"). Together, they create a complementary situation in which the agent can learn quickly from its experience.
Advantage: the advantage is the value that answers the question "How much better is the reward for the agent than what could be expected?" It is another factor that improves the overall situation for the agent: through it, the agent learns which actions were rewarding and which were penalizing. Formally it looks like this:

A(s, a) = Q(s, a) − V(s)
Q(s, a) stands for the expected future reward of taking action a in a particular state s, and V(s) stands for the value of being in that state.
The advantage of the Actor-Critic algorithm is that it can solve a broader range of problems than DQN (Deep Q-Network), while having lower variance in performance relative to REINFORCE. That said, because of the policy-gradient algorithm within it, Actor-Critic is still somewhat sample inefficient.
In the beginning, you don't know how to play, so you try some action randomly. The Critic observes your action and provides feedback. Let's first look at the vanilla policy gradient again to see how the Actor-Critic architecture comes in (and what it is):

∇θ J(πθ) = Eπθ [ Σt ∇θ log πθ(at|st) R(τ) ]
We update both the Critic network and the Value network at each update step.
Intuitively, this means how much better it is to take a specific action than the average action at the given state. So, using the Value function as the baseline function, we subtract the Q value by the Value. We call this quantity the advantage value.
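In practice, many implementations approximate the advantage with the one-step TD error so that only a single value network is needed; this is a common shortcut rather than something stated above. The sketch below assumes a hypothetical `critic_v` network that returns V(s) for a batch of states.

import torch

def advantage_estimate(critic_v, states, rewards, next_states, dones, gamma=0.99):
    """Advantage estimate used in Advantage Actor-Critic (sketch).

    The one-step TD error r + gamma * V(s') - V(s) approximates A(s, a) = Q(s, a) - V(s).
    """
    with torch.no_grad():
        v_next = critic_v(next_states).squeeze(-1)
    v = critic_v(states).squeeze(-1)
    targets = rewards + gamma * (1.0 - dones) * v_next
    advantages = targets - v   # detach() before using in the actor loss
    return advantages, targets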
Restoring order in policy updates:
In A2C, we move the workers from the agent down to the environment. Instead of having
multiple actor-learners, we have multiple actors with a single learner. As it turns out, having
workers rolling out experiences is where the gains are in policy-gradient methods.
NOTE:
For more information about the "cart-pole" problem, refer to reference no. 6.
Network Schematics
DDPG uses four neural networks: a Q network, a deterministic policy network, a target Q
network, and a target policy network.
The Q network and the policy network are very much like those in simple Advantage Actor-Critic, but in DDPG the Actor directly maps states to actions (the output of the network is the action itself) instead of outputting a probability distribution across a discrete action space.
The target networks are time-delayed copies of their original networks that slowly track the learned networks. Using these target networks greatly improves stability in learning. Here's why: in methods that do not use target networks, the update equations of the network are interdependent with the values calculated by the network itself, which makes it prone to divergence. For example, without target networks the critic's target

y = r + γ Q(s′, μ(s′))

is computed with the very network being updated, whereas DDPG computes it with the time-delayed copies:

y = r + γ Q′(s′, μ′(s′))
So, we have the standard Actor & Critic architecture for the deterministic policy
network and the Q network.
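A short sketch of the two ingredients described above, under the usual DDPG conventions (the names `target_actor`, `target_critic`, and the soft-update rate `tau` are illustrative): a Polyak soft update that lets the target networks slowly track the learned ones, and the critic target computed with those target networks.

import torch

def soft_update(target_net, net, tau=0.005):
    """Polyak-average the target network toward the learned network:
    theta_target <- tau * theta + (1 - tau) * theta_target."""
    with torch.no_grad():
        for p_targ, p in zip(target_net.parameters(), net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)

def ddpg_q_target(target_actor, target_critic, rewards, next_states, dones, gamma=0.99):
    """Critic target y = r + gamma * Q'(s', mu'(s')) using the time-delayed copies."""
    with torch.no_grad():
        next_actions = target_actor(next_states)
        q_next = target_critic(next_states, next_actions).squeeze(-1)
        return rewards + gamma * (1.0 - dones) * q_next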
Maximizing entropy in SAC encourages the policy to assign equal probabilities to actions that have the same or nearly equal Q-values, and ensures that it does not collapse into repeatedly selecting a particular action that could exploit some inconsistency in the approximated Q function. Therefore, SAC overcomes the brittleness problem by encouraging the policy network to explore and not assign a very high probability to any one part of the range of actions.
The objective function consists of both a reward term and an entropy term H, weighted by the temperature α:

J(π) = Σt E(st, at)∼ρπ [ r(st, at) + α H(π(·|st)) ]
Now that we know what we are optimizing for, let us understand how we go about doing the optimization. SAC makes use of three networks: a state value function V parameterized by ψ, a soft Q-function Q parameterized by θ, and a policy function π parameterized by ϕ. While in principle there is no need for separate approximators for the V and Q functions, which are related through the policy, the authors say that in practice having separate function approximators helps convergence. So we need to train the three function approximators as follows.
The soft Q-function parameters θ are trained to minimize the soft Bellman residual

JQ(θ) = E(st, at)∼D [ ½ ( Qθ(st, at) − ( r(st, at) + γ Est+1 [ Vψ̄(st+1) ] ) )² ]

where D is the experience replay buffer and Vψ̄ is the target value function.
Minimizing this objective function amounts to the following: For all (state, action) pairs in the
experience replay buffer, we want to minimize the squared difference between the prediction of
our Q function and the immediate (one time-step) reward plus the discounted expected Value of
the next state. Note that the Value comes from a Value function parameterized by ψ with a bar on
top of it. This is an additional Value function called the target value function. We’ll get into why
we need this but for now, don’t worry about it and just think of it as a Value function that we’re
training.
We use the following approximation of the gradient of the above objective to update the parameters of the Q function:

∇θ JQ(θ) = ∇θ Qθ(st, at) ( Qθ(st, at) − r(st, at) − γ Vψ̄(st+1) )
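In code, the same objective for a replay-buffer mini-batch might look like the following sketch (the names `q_net` and `target_v_net` are illustrative assumptions):

import torch
import torch.nn.functional as F

def soft_q_loss(q_net, target_v_net, batch, gamma=0.99):
    """Soft Q-function loss for a replay-buffer mini-batch (sketch)."""
    states, actions, rewards, next_states, dones = batch
    q_pred = q_net(states, actions).squeeze(-1)
    with torch.no_grad():
        v_next = target_v_net(next_states).squeeze(-1)   # V_psi_bar(s')
        q_target = rewards + gamma * (1.0 - dones) * v_next
    return 0.5 * F.mse_loss(q_pred, q_target)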
The policy parameters are learned by minimizing the expected KL divergence between the policy and the exponentiated Q-function:

Jπ(ϕ) = Est∼D [ DKL( πϕ(·|st) ‖ exp(Qθ(st, ·)) / Zθ(st) ) ]

This objective function looks complex, but it is saying something very simple. The DKL function inside the expectation is the Kullback-Leibler divergence. I highly recommend that you read up on the KL divergence, since it shows up a lot in deep learning research and applications these days. For the purposes of this tutorial, you can interpret it as a measure of how different two distributions are. So this objective function is basically trying to make the distribution of our policy function look more like the distribution of the exponentiation of our Q function, normalized by another function Z.
To minimize this objective, the authors use the reparameterization trick. This trick is used to make sure that sampling from the policy is a differentiable process, so that there are no problems in backpropagating the errors. The policy is now parameterized as follows:

at = fϕ(εt; st)
The epsilon term is a noise vector sampled from a Gaussian distribution. We will explain it more
in the implementation section.
The normalizing function Z is dropped since it does not depend on the parameter ϕ. An unbiased estimator of the gradient of the above objective is then:

∇ϕ Jπ(ϕ) = ∇ϕ log πϕ(at|st) + ( ∇at log πϕ(at|st) − ∇at Qθ(st, at) ) ∇ϕ fϕ(εt; st)
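A sketch of reparameterized sampling for a Gaussian policy with a tanh squashing function, which is the common choice in SAC implementations (the `policy_net` interface used here is an assumption):

import torch
from torch.distributions import Normal

def sample_action(policy_net, state):
    """Reparameterized action sampling (sketch).

    `policy_net(state)` is assumed to output the mean and log-std of a Gaussian.
    Epsilon is sampled from N(0, I) and transformed deterministically, so
    gradients can flow back into the policy parameters phi.
    """
    mean, log_std = policy_net(state)
    std = log_std.exp()
    eps = torch.randn_like(mean)                  # epsilon ~ N(0, I)
    pre_tanh = mean + std * eps                   # a = f_phi(eps; s)
    action = torch.tanh(pre_tanh)                 # squash into a bounded range

    # Log-probability with the change-of-variables correction for tanh.
    log_prob = Normal(mean, std).log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
    return action, log_prob.sum(-1)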
b) Batching experiences
One of the features of PPO that A2C doesn't have is that PPO can reuse experience samples. To exploit this, we could gather large trajectory batches, as in NFQ, and "fit" the model to the data, optimizing it repeatedly. However, a better approach is to create a replay buffer and sample a large mini-batch from it on every optimization step. That gives the effect of stochasticity on each mini-batch, because the samples aren't always the same, yet we likely reuse all samples in the long run.
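A minimal sketch of such mini-batch sampling over a buffer of stored experiences, using index shuffling (sizes are hypothetical):

import numpy as np

def minibatch_indices(buffer_size, batch_size, n_epochs):
    """Yield shuffled mini-batch indices over a buffer of stored experiences (sketch).

    Each epoch reshuffles the stored samples, so mini-batches differ between
    optimization steps while all samples get reused over time.
    """
    for _ in range(n_epochs):
        order = np.random.permutation(buffer_size)
        for start in range(0, buffer_size, batch_size):
            yield order[start:start + batch_size]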
c) Clipping the policy updates:
The main issue with the regular policy gradient is that even a small change in
parameter space can lead to a big difference in performance. The discrepancy between
parameter space and performance is why we need to use small learning rates in policy-
gradient methods, and even so, the variance of these methods can still be too large. The
whole point of clipped PPO is to put a limit on the objective such that on each training
step, the policy is only allowed to be so far away. Intuitively, you can think of this
clipped objective as a coach preventing overreacting to outcomes. Did the team get a
good score last night with a new tactic? Great, but don’t exaggerate. Don’t throw away
a whole season of results for a new result. Instead, keep improving a little bit at a time.
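The clipped surrogate objective can be written compactly; the sketch below assumes log-probabilities (new and old) and advantages have already been computed for a mini-batch.

import torch

def clipped_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (sketch, returned as a loss to minimize).

    The probability ratio is clipped to [1 - eps, 1 + eps], so a single update
    cannot move the policy too far from the one that collected the data.
    """
    ratio = (new_log_probs - old_log_probs).exp()             # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()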
d) Clipping the value function updates:
We can apply a similar clipping strategy to the value function with the same core concept: let the
changes in parameter space change the Q-values only this much, but not more. As you can tell,
this clipping technique keeps the variance of the things we care about smooth, whether changes
in parameter space are smooth or not. We don’t necessarily need small changes in parameter
space; however, we’d like level changes in performance and values.
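A matching sketch for the clipped value loss, assuming the old value predictions were stored at rollout time (names illustrative):

import torch

def clipped_value_loss(values, old_values, returns, clip_eps=0.2):
    """Clipped value-function loss (sketch, returned as a loss to minimize).

    New value predictions are kept within `clip_eps` of the old predictions,
    and the pessimistic (larger) of the two squared errors is used.
    """
    values_clipped = old_values + torch.clamp(values - old_values, -clip_eps, clip_eps)
    loss_unclipped = (values - returns).pow(2)
    loss_clipped = (values_clipped - returns).pow(2)
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()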
Some Other References:
1. Policy Gradient Algorithms | Lil'Log (lilianweng.github.io)
2. PyLessons
3. Advantage Actor Critic (A2C) (huggingface.co)
4. Baseline for Policy Gradients that All Deep Learning Enthusiasts Must Know (analyticsvidhya.com)
5. Advantage Actor Critic Tutorial: minA2C | by Mike Wang | Towards Data Science
6. Using Q-Learning for OpenAI’s CartPole-v1 | by Ali Fakhry | The Startup | Medium
7. Deep Deterministic Policy Gradient — Spinning Up documentation (openai.com)
8. Twin Delayed DDPG — Spinning Up documentation (openai.com)
9. Soft Actor-Critic — Spinning Up documentation (openai.com)
10. Soft Actor-Critic Demystified. An intuitive explanation of the theory… | by Vaishak V. Kumar | Towards Data Science
11. Proximal Policy Optimization — Spinning Up documentation (openai.com)
12. Proximal Policy Optimization (PPO) (huggingface.co)