Deep Reinforcement Learning
Harry Zhang
December 2019
Contents
Preface
1 Introduction
1.1 Important Concepts
1.2 Value Function and Q Function
1.2.1 Q Function
1.2.2 Value Function
1.3 Reinforcement Learning Anatomy
2 Imitation Learning
2.1 Distribution Mismatch
2.2 Dataset Aggregation
2.3 When Does Imitation Learning Fail?
2.3.1 Non-Markovian Behaviors
2.3.2 Multimodal Behaviors
2.4 Theoretical Analysis of Imitation Learning's Error
2.5 Summary
4 Actor-Critic Algorithms
4.1 Reward-to-Go
4.2 Using Baselines
4.3 Value Function Fitting
4.4 Policy Evaluation
4.4.1 Why Do We Evaluate a Policy
4.4.2 How to Evaluate a Policy
4.4.3 Monte Carlo Evaluation with Function Approximation
4.4.4 Improving the Estimate Using Bootstrap
4.5 Batch Actor-Critic Algorithm
4.6 Aside: Discount Factors
4.7 Online Actor-Critic Algorithm
4.8 Critics as State-Dependent Baselines
4.9 Eligibility Traces and n-Step Returns
6 Q-Function Methods
6.1 Replay Buffers
6.2 Target Networks
6.3 Inaccuracy in Q-Learning
6.3.1 Double Q-Learning
6.3.2 N-Step Return Estimator
6.3.3 Q-Learning with Continuous Actions
11 Control as Inference
11.1 Probabilistic Graphical Model of Decision Making
11.1.1 Inference in the Optimality Model
11.1.2 Inferring the Backward Messages
11.1.3 A Closer Look
11.1.4 Aside: The Action Prior
11.1.5 Inferring the Policy
11.1.6 Inferring the Forward Messages
11.2 The Optimism Problem
13 Transfer Learning
14 Exploration
14.1 Multi-arm Bandits
14.1.1 Defining a Bandit
14.1.2 Optimistic Exploration
14.1.3 Probability Matching
14.1.4 Information Gain
14.2 Exploration in MDPs
14.2.1 Counting the Exploration Bonus
14.3 Exploration with Q-functions
14.4 Revisiting Information Gain in MDP Exploration
14.4.1 Prediction Gain
14.4.2 Variational Information Maximization for Exploration (VIME)
14.5 Improving RL with Imitation
14.5.1 Pretrain and Finetune
14.5.2 Off-policy RL
14.5.3 Q-learning with Demonstrations
14.5.4 Imitation as an Auxiliary Loss Function
15 Offline RL
15.1 Offline RL Performance
Preface
Chapter 1: Introduction
Here we review some of the terminology that frequently appears in the field of Reinforcement Learning.
as possible in the environment. Without loss of generality, we assume that the environment is stochastic. Define a trajectory distribution $p_\theta(\tau)$ by factorizing the joint distribution with the chain rule of probability:
$$p_\theta(\tau) := p(s_1, a_1, \ldots, s_T, a_T) = p(s_1)\prod_{t=1}^{T}\pi_\theta(a_t\mid s_t)\,p(s_{t+1}\mid s_t, a_t)$$
where $T$ is the length of the episode horizon. Given this trajectory distribution, the expected total reward induced by following the policy is $\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$. To optimize this objective, we want to find the parameter $\theta$ that maximizes this expectation:
$$\theta^* = \arg\max_\theta \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
Furthermore, if we take the expectation of the value function over all possible initial states, we essentially recover the objective of reinforcement learning: $\mathbb{E}_{s_1\sim p(s_1)}[V^\pi(s_1)]$, where $p(s_1)$ is the distribution of initial states.
Chapter 2: Imitation Learning
Imitation learning is also called behavioral cloning. The basic idea of imitation learning is
“train to fit the expert behavior”. In other words, given a demonstration, we want to make
the agent follow the demonstration as closely as possible, to best imitate the demonstration’s
behaviors.
would have done based on the observations in Dπ. Take an autonomous car for example: the training data would be images labeled with steering commands. We then let the car collect more data, which consists only of images. We give those images to a human expert and let the human determine, based on each image, what action (steer left, steer right, or go straight) they would have applied had they observed that image.
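To make the procedure concrete, here is a minimal sketch of the DAgger loop just described. This is only an illustration: policy.fit, policy.act, expert.label, and the env interface are hypothetical placeholders, not a specific library.

def dagger(policy, expert, env, expert_data, n_iters=10, steps_per_iter=200):
    # expert_data: list of (observation, expert_action) pairs
    dataset = list(expert_data)
    for _ in range(n_iters):
        policy.fit(dataset)               # 1. behavioral cloning on the aggregated data
        obs_batch = []
        o = env.reset()
        for _ in range(steps_per_iter):   # 2. run the learner, keep only observations
            obs_batch.append(o)
            o, done = env.step(policy.act(o))
            if done:
                o = env.reset()
        # 3. ask the expert (e.g. a human) to relabel the learner's observations
        # 4. aggregate, so the training distribution drifts toward the learner's
        dataset += [(o_i, expert.label(o_i)) for o_i in obs_batch]
    return policy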
It can be proven that DAgger resolves the distribution "drift" issue. However, one problem with DAgger is that humans are error-prone, so the human-labeled data might be flawed. More subtly, humans in most cases do not make decisions in a Markovian way: the current time step's action might depend on a state or observation from some number of time steps ago.
to associate the brake action with the brake light rather than with the red light/obstacle in
front of the car. The causal confusion issue can be alleviated with the use of DAgger because
the human annotator is able to provide the correct causal relation. For more information,
please refer to this paper [2].
2.3.2 Multimodal Behaviors
Another scenario where fitting the expert might fail is when the expert exhibits multimodal behaviors. For example, when you are controlling a drone to dodge a tree ahead, you either steer left or steer right. However, if you choose the wrong parametric form for the distribution over actions (e.g. a single Gaussian), the distribution might average out left and right and choose to go straight, as shown in figure 2.3. Some methods to mitigate this issue include: first, one can use a mixture of Gaussian distributions instead of just one; specifically, a mixture of Gaussians means that the policy distribution is a weighted sum of Gaussians with different means and variances. Second, one can construct a latent variable model, which we will discuss further when we cover variational inference. Third, we can use autoregressive discretization.
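As a concrete illustration of the first option, the sketch below (our own example, not from the text) computes the log-likelihood of an action under a mixture-of-Gaussians policy head with diagonal covariances; maximizing this likelihood on expert data lets the policy keep both the "steer left" and "steer right" modes instead of averaging them.

import numpy as np

def mog_log_prob(action, weights, means, log_stds):
    """Log-likelihood of `action` under a mixture of diagonal Gaussians.
    weights: (K,) mixture weights summing to 1
    means, log_stds: (K, action_dim) per-component parameters
    """
    stds = np.exp(log_stds)
    # Per-component diagonal Gaussian log-densities.
    comp_log_probs = -0.5 * np.sum(
        ((action - means) / stds) ** 2 + 2 * log_stds + np.log(2 * np.pi), axis=1
    )
    # log sum_k w_k N_k(a), computed stably with log-sum-exp.
    weighted = np.log(weights) + comp_log_probs
    m = np.max(weighted)
    return m + np.log(np.sum(np.exp(weighted - m)))

# Example: two modes, "steer left" (-1) and "steer right" (+1).
w = np.array([0.5, 0.5])
mu = np.array([[-1.0], [1.0]])
log_std = np.log(np.array([[0.1], [0.1]]))
print(mog_log_prob(np.array([1.0]), w, mu, log_std))   # high likelihood (a mode)
print(mog_log_prob(np.array([0.0]), w, mu, log_std))   # low likelihood ("go straight")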
To analyze this, let us assume an upper bound $\epsilon$ on the probability of making a mistake: $\pi_\theta(a \neq \pi^*(s)\mid s) \le \epsilon$ for all $s \sim p_{\mathrm{train}}(s)$, where $p_{\mathrm{train}}$ is the training data distribution. The fit distribution of states $p_\theta(s)$ consists of two parts: the first part comes from making no mistakes, and the second part comes from making at least one mistake. We can decompose $p_\theta(s_t)$ as follows:
$$p_\theta(s_t) = (1-\epsilon)^t p_{\mathrm{train}}(s_t) + \left(1-(1-\epsilon)^t\right) p_{\mathrm{mistake}}(s_t)$$
so to measure the divergence of $p_\theta$ from $p_{\mathrm{train}}$, we take the difference of the two distributions (naive, total variation divergence):
$$|p_\theta(s_t) - p_{\mathrm{train}}(s_t)| = \left(1-(1-\epsilon)^t\right)\left|p_{\mathrm{mistake}}(s_t) - p_{\mathrm{train}}(s_t)\right| \le 2\left(1-(1-\epsilon)^t\right) \le 2\epsilon t$$
where we used the identity $(1-\epsilon)^t \ge 1 - \epsilon t$ for $\epsilon \in [0, 1]$. Thus, we can bound the expected total cost (number of mistakes) the agent makes under this scheme by:
$$\sum_t \mathbb{E}_{p_\theta(s_t)}[c_t] = \sum_t \sum_{s_t} p_\theta(s_t) c_t(s_t) \le \sum_t \sum_{s_t} p_{\mathrm{train}}(s_t) c_t(s_t) + |p_\theta(s_t) - p_{\mathrm{train}}(s_t)|\, c_{\max} \le \sum_t \left(\epsilon + 2\epsilon t\right) \in O(\epsilon T^2)$$
Also note that with DAgger, $p_{\mathrm{train}}(s) \to p_\theta(s)$, so the second term inside the summation vanishes. Thus for DAgger, the expected cost is in $O(\epsilon T)$.
As we see, with a longer horizon the errors accumulate, leading to more mistakes; this is one of the most fundamental disadvantages of imitation learning, as discussed in [1].
2.5 Summary
Overall, what are some disadvantages of imitation learning? A human must provide data throughout the entire loop, and that data is inherently finite, while generating a good policy requires learning from a lot of data. Moreover, humans cannot provide all kinds of data; for example, a human may have trouble providing data such as the joint angles or torques of a robotic arm. Therefore, we would like machines to learn automatically, from effectively unlimited data.
Chapter 3: Policy Gradient Methods
With this integral, we can easily take the gradient to perform gradient descent/ascent. A convenient identity for the gradient of $J(\theta)$ is shown below:
$$\pi_\theta(\tau)\nabla_\theta \log \pi_\theta(\tau) = \pi_\theta(\tau)\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \nabla_\theta \pi_\theta(\tau)$$
Using this identity, we can take the gradient of $J(\theta)$ in a cleaner fashion:
$$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau)\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau$$
Now, we want to get rid of the unwieldy $\log \pi_\theta(\tau)$ term in our equation. Recall that a trajectory $\tau$ is a list of states and actions, so $\pi_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1)\prod_{t=1}^{T}\pi_\theta(a_t\mid s_t)\,p(s_{t+1}\mid s_t, a_t)$
by the chain rule of probability. Then we take the log on both sides, and we end up with $\log \pi_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T}\left[\log \pi_\theta(a_t\mid s_t) + \log p(s_{t+1}\mid s_t, a_t)\right]$. Plugging into our original gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\nabla_\theta\left(\log p(s_1) + \sum_{t=1}^{T}\left[\log \pi_\theta(a_t\mid s_t) + \log p(s_{t+1}\mid s_t, a_t)\right]\right) r(\tau)\right] = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right] \quad (3.1)$$
Note that in the above calculation, we drop $\log p(s_1)$ and $\log p(s_{t+1}\mid s_t, a_t)$ because we are taking the gradient with respect to $\theta$, and those two terms do not depend on $\theta$. The first factor in the final expectation is similar to the maximum likelihood gradient.
where the subscripts $i, t$ mean time step $t$ in the $i$-th rollout. With the above gradient, we can do gradient ascent on the parameter $\theta$ by:
$$\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$$
Now we are ready to state a vanilla policy gradient algorithm that performs direct gradient ascent on the Monte Carlo-approximated policy gradient, the REINFORCE algorithm, as shown in Algorithm 2.
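Below is a minimal numpy sketch of the REINFORCE update for a linear-softmax policy over discrete actions. It is only an illustration: the linear-softmax parameterization and the trajectory format are our assumptions, not part of Algorithm 2 itself.

import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / np.sum(e)

def reinforce_update(theta, trajectories, alpha=0.01):
    """REINFORCE: theta <- theta + alpha * (1/N) sum_i (sum_t grad log pi)(sum_t r).
    theta: (obs_dim, n_actions) weights of a linear-softmax policy.
    trajectories: list of lists of (obs, action, reward) tuples; obs are feature vectors.
    """
    grad = np.zeros_like(theta)
    for traj in trajectories:
        total_reward = sum(r for _, _, r in traj)
        for obs, action, _ in traj:
            probs = softmax(obs @ theta)          # pi_theta(a|s)
            dlogp = -np.outer(obs, probs)         # d log pi(a|s) / d theta
            dlogp[:, action] += obs
            grad += dlogp * total_reward          # weight by the trajectory return
    grad /= len(trajectories)
    return theta + alpha * grad                   # gradient ascent step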
$\pi_\theta(a_t\mid s_t) = \mathcal{N}(f_{\text{neural net}}(s_t); \Sigma)$. One advantage of using a Gaussian policy is that it is easy to obtain a closed-form expression for the log-likelihood derivative. We simply write out the quadratic discriminant function of a multivariate Gaussian distribution:
$$\log \pi_\theta(a_t\mid s_t) = -\frac{1}{2}(f(s_t) - a_t)^T \Sigma^{-1}(f(s_t) - a_t) + C = -\frac{1}{2}\left\|f(s_t) - a_t\right\|^2_\Sigma + C$$
Taking the derivative (the factor of $\frac{1}{2}$ cancels), we have:
$$\nabla_\theta \log \pi_\theta(a_t\mid s_t) = -\Sigma^{-1}(f(s_t) - a_t)\frac{df}{d\theta}$$
and we use gradient ascent as discussed above.
As we discussed before, the first term in the policy gradient is exactly the same as the maximum likelihood gradient!
So what are we doing when we take this gradient? Intuitively, we are assigning more weight to more rewarding trajectories by making trajectories with higher rewards more probable; equivalently, higher-reward trajectories become more likely to be chosen. This intuition is crucial to policy gradient methods and is illustrated in Fig. 3.1.
In this expression, the transition function does not even appear. In short, the Markov property is not actually used, so we can apply policy gradient to a POMDP without any modification other than using $o_t$ in place of $s_t$.
Note that we do not care what the state actually is. Any non-Markovian process can be made Markovian by defining the state as the whole history.
and we define the second factor in the summation as the "reward-to-go". Notice that in the reward-to-go term, we start the summation from time $t$ instead of $1$, by causality. The idea is that we are multiplying the likelihood term by smaller numbers because the summation has fewer terms, which reduces the variance to some extent.
3.7.2 Baselines
Another common approach is to use baselines. By baselines, we mean that instead of making all high-reward trajectories more likely, we only make trajectories that are better than average more likely. So naturally, we define a baseline $b$ as the average reward:
$$b = \frac{1}{N}\sum_{i=1}^{N} r(\tau_i)$$
But are we allowed to do that? Yes; in fact, we can show that the expectation of the gradient is unchanged by the baseline $b$. To show this, we can write out the expectation of the baseline term:
$$\mathbb{E}_{\pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, b\right] = \int \pi_\theta(\tau)\nabla_\theta \log \pi_\theta(\tau)\, b\, d\tau = \int \nabla_\theta \pi_\theta(\tau)\, b\, d\tau = b\,\nabla_\theta \int \pi_\theta(\tau)\, d\tau = b\,\nabla_\theta 1 = 0$$
$$\mathrm{Var} = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\left(\nabla_\theta \log \pi_\theta(\tau)\left(r(\tau) - b\right)\right)^2\right] - \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\left(r(\tau) - b\right)\right]^2$$
Note that the second, squared-expectation term of the variance can be equivalently written as $\mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right]^2$ since baselines are unbiased in expectation.
Now that we have an expression for the variance with respect to the baseline $b$, we can calculate the optimal $b$ that minimizes the variance by setting the gradient of the variance to 0. Writing $g(\tau) := \nabla_\theta \log \pi_\theta(\tau)$:
$$\frac{d\mathrm{Var}}{db} = \frac{d}{db}\mathbb{E}\left[g(\tau)^2\left(r(\tau) - b\right)^2\right] = \frac{d}{db}\left(\mathbb{E}\left[g(\tau)^2 r(\tau)^2\right] - 2\mathbb{E}\left[g(\tau)^2 r(\tau)\right] b + b^2\,\mathbb{E}\left[g(\tau)^2\right]\right) = -2\mathbb{E}\left[g(\tau)^2 r(\tau)\right] + 2b\,\mathbb{E}\left[g(\tau)^2\right] = 0$$
$$b_{\mathrm{opt}} = \frac{\mathbb{E}\left[g(\tau)^2 r(\tau)\right]}{\mathbb{E}\left[g(\tau)^2\right]}$$
where $b_{\mathrm{opt}}$ is the optimal baseline value for reducing the variance.
In practice, we just use the average reward as the baseline.
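The sketch below combines the two variance-reduction tricks from this chapter: it computes the reward-to-go for each time step and subtracts the average return as a baseline, yielding the per-step weights that multiply $\nabla_\theta \log \pi_\theta(a_t\mid s_t)$. This is our own minimal illustration, not a reference implementation.

import numpy as np

def reward_to_go(rewards):
    """Sum of rewards from time t to the end, for each t (the causality trick)."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

def pg_weights(reward_batches):
    """Per-step weights (reward-to-go minus average-return baseline) for a batch
    of trajectories, each given as a list of per-step rewards."""
    b = np.mean([np.sum(r) for r in reward_batches])   # baseline: average return
    return [reward_to_go(np.asarray(r)) - b for r in reward_batches]

# Example: two short trajectories.
print(pg_weights([[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]))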
Then we can plug it into the off-policy policy gradient. Say we have a trained policy
πθ (τ ), and we have samples from another policy π̄(τ ), we can use the samples from π̄(τ ) to
and we defined $J(\theta)$ as $J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}[r(\tau)]$. Now if we want to estimate $J$ with some new parameter $\theta'$, we can use importance sampling as discussed above:
$$J(\theta') = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)}\, r(\tau)\right]$$
Now if we take the gradient and estimate it locally, by setting $\theta = \theta'$, then the importance ratio cancels out, and we end up with $\mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\nabla_{\theta'}\log\pi_{\theta'}(\tau)\, r(\tau)\right]$.
Now there is a problem in this equation: the ratio of the two products can be very small or very big if $T$ is big, thus increasing the variance. To alleviate the issue, one can make use of causality as we discussed before:
$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\left(\prod_{t=1}^{T}\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\right)\left(\sum_{t=1}^{T}\nabla_{\theta'}\log\pi_{\theta'}(a_t\mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$$
$$= \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\sum_{t=1}^{T}\nabla_{\theta'}\log\pi_{\theta'}(a_t\mid s_t)\left(\prod_{t'=1}^{t}\frac{\pi_{\theta'}(a_{t'}\mid s_{t'})}{\pi_\theta(a_{t'}\mid s_{t'})}\right)\left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\left(\prod_{t''=t}^{t'}\frac{\pi_{\theta'}(a_{t''}\mid s_{t''})}{\pi_\theta(a_{t''}\mid s_{t''})}\right)\right)\right]$$
Here we used causality: future actions do not affect the current weight. Also note that the last ratio of products can be deleted, and we essentially recover the policy iteration algorithm, which we will discuss in later chapters. When we delete that last weight, we end up with
$$\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau\sim\pi_\theta(\tau)}\left[\sum_{t=1}^{T}\nabla_{\theta'}\log\pi_{\theta'}(a_t\mid s_t)\left(\prod_{t'=1}^{t}\frac{\pi_{\theta'}(a_{t'}\mid s_{t'})}{\pi_\theta(a_{t'}\mid s_{t'})}\right)\left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\right)\right]$$
In later chapters, we will see that we can largely ignore the ratio involving the state marginals.
3.9.1 Advanced Policy Gradients
Recall the policy gradients update rule:
θ ← θ + α∇θ J(θ)
In many cases, some parameters have more impact on the outcome than others. Intuitively, we would therefore like to use a higher learning rate for parameters with less impact and a lower learning rate for parameters with more impact. To do this, we leverage the covariant/natural policy gradient. Let us look at the constrained view of iterative gradient ascent:
$$\theta' \leftarrow \arg\max_{\theta'}\ (\theta' - \theta)^T\nabla_\theta J(\theta) \quad \text{s.t. } \|\theta' - \theta\|^2 \le \epsilon$$
where $\epsilon$ controls how far we should go. But this constraint is defined in parameter space, which means that we do not have much control over individual parameters. To resolve this, we would like to rescale the constraint so that the step size is constrained in policy space, giving us more control over individual parameters. For example, we can use the Fisher-information metric:
$$\|\theta' - \theta\|_F^2 = (\theta' - \theta)^T F (\theta' - \theta), \qquad F = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(a\mid s)\,\nabla_\theta\log\pi_\theta(a\mid s)^T\right]$$
Thus, with $F$, the rescaled constrained optimization problem can be equivalently rewritten as:
$$\theta' \leftarrow \arg\max_{\theta'}\ (\theta' - \theta)^T\nabla_\theta J(\theta) \quad \text{s.t. } \|\theta' - \theta\|_F^2 \le \epsilon \qquad (3.4)$$
Using a Lagrangian, one could solve this optimization problem iteratively as follows:
$$\theta \leftarrow \theta + \alpha F^{-1}\nabla_\theta J(\theta)$$
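A small numpy sketch of this natural gradient step is given below. It is our own illustration under simplifying assumptions: the Fisher matrix is estimated from sampled score vectors $\nabla_\theta\log\pi_\theta(a\mid s)$, and a damping term keeps the sample estimate invertible.

import numpy as np

def natural_gradient_step(theta, grad_j, score_vectors, alpha=0.05, damping=1e-3):
    """theta <- theta + alpha * F^{-1} grad_j, with F estimated from samples.
    grad_j: vanilla policy gradient estimate, shape (d,)
    score_vectors: array (n_samples, d) of grad_theta log pi(a|s) samples
    """
    d = theta.shape[0]
    # Sample estimate of the Fisher information matrix F = E[g g^T].
    F = score_vectors.T @ score_vectors / score_vectors.shape[0]
    # Damping keeps F invertible when the sample estimate is low rank.
    step = np.linalg.solve(F + damping * np.eye(d), grad_j)
    return theta + alpha * step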
Chapter 4: Actor-Critic Algorithms
where we defined the summed reward as the "reward-to-go" function $\hat{Q}_{i,t}$, which represents the estimate of the expected reward if we take action $a_{i,t}$ in state $s_{i,t}$. We have shown that this estimate has very high variance, and we shall see how we can improve policy gradients by using better estimates of the reward-to-go function.
4.1 Reward-to-Go
Let us take a closer look at the reward-to-go. To improve the estimation, one way is to get closer to the precise value of the reward-to-go. We can define the reward-to-go using an expectation:
$$Q(s_t, a_t) = \sum_{t'=t}^{T}\mathbb{E}_{p_\theta}\left[r(s_{t'}, a_{t'})\mid s_t, a_t\right]$$
Similarly, we can use the average reward-to-go as a baseline to reduce the variance. Specifically, we could use the value function $V(s_t)$ as the baseline, thus improving the estimate of the gradient in the following way:
$$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\left(Q(s_{i,t}, a_{i,t}) - V(s_{i,t})\right)$$
and the value function we use is a better approximation of the baseline $b_t = \frac{1}{N}\sum_i Q(s_{i,t}, a_{i,t})$.
What have we done here? What is the intuition behind subtracting the value function from the Q-function? Essentially, we are quantifying how much better an action $a_{i,t}$ is than the average action; in some sense, it measures the advantage of applying an action over the average action. To formalize this intuition, let us define the advantage as follows:
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
which quantitatively measures how much better the action $a_t$ is.
Putting it all together, a better, baseline-backed policy gradient estimate using Monte Carlo samples can be written as:
$$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\, A^\pi(s_{i,t}, a_{i,t})$$
$$Q^\pi(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1}\sim p(s_{t+1}\mid s_t, a_t)}\left[V^\pi(s_{t+1})\right]$$
The expectation over the next state's value is needed because we do not know what the next state actually is. Note that we can be a little crude with this expectation: we simply evaluate the full value function $V^\pi(\cdot)$ on one single sample of the next state and use that value as the expectation, ignoring the fact that there are multiple other possible next states, i.e. $Q^\pi(s_t, a_t) \simeq r(s_t, a_t) + V^\pi(s_{t+1})$. With this estimate, we can plug into the advantage function:
Therefore, it is almost enough to approximate just the value function, which depends only on the state, in order to generate approximations of the other functions. To achieve this, we can use a neural network to fit our value function $V(s)$, and use the fitted value function to approximate our policy gradient, as illustrated in Fig. 4.1.
Having the value function allows us to figure out how good the policy is, because the reinforcement learning objective can be equivalently written as $J(\theta) = \mathbb{E}_{s_1\sim p(s_1)}\left[V^\pi(s_1)\right]$, where we take the expectation of the initial state's value over all possible initial states.
4.4.2 How to Evaluate a Policy
To evaluate a policy, we can use an approach similar to the policy gradient: Monte Carlo approximation. Specifically, we can estimate the value function by summing up the reward collected from time step $t$:
$$V^\pi(s_t) \simeq \sum_{t'=t}^{T} r(s_{t'}, a_{t'})$$
and if we are able to reset the simulator, we could indeed ameliorate this estimate by taking multiple samples ($N$) as follows:
$$V^\pi(s_t) \simeq \frac{1}{N}\sum_{i=1}^{N}\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$$
answer is yes, because we are using a neural net to fit the Monte Carlo targets from a variety of different states, so even though we use a single-sample estimate, the value function does generalize when we visit similar states.
4.4.3 Monte Carlo Evaluation with Function Approximation
To fit our value function, we can use a supervised learning approach. Essentially, we use our single-sample estimate of the value function as the target value, and fit a function that maps states to value function values. Therefore, our training data will be
$$\left\{\left(s_{i,t},\ \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right)\right\}$$
where we denote the labels as $y_{i,t}$, and we define a typical supervised regression loss function to minimize: $L(\phi) = \frac{1}{2}\sum_i \left\|\hat{V}^\pi_\phi(s_i) - y_i\right\|^2$.
, compared with our Monte Carlo targets: $y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$.
Bootstrapping means using our current estimate to construct the targets for the next round of estimation. In our ideal targets, the estimate would be accurate if we knew the actual $V^\pi$. But since the actual value function is not known, we can apply bootstrapping by using the current fitted estimate $\hat{V}^\pi_\phi$ to estimate the next state's value: $\hat{V}^\pi_\phi(s_{i,t+1})$. Such an estimate is biased, but it has low variance.
Consequently, our training data using bootstrapping becomes:
$$\left\{\left(s_{i,t},\ r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})\right)\right\}$$
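As a minimal sketch, the bootstrapped targets can be constructed as follows, assuming a callable v_hat that evaluates the current fitted critic (set gamma below 1 once the discount factor of the next section is introduced):

def bootstrapped_targets(transitions, v_hat, gamma=1.0):
    """Build (state, target) training pairs using the current critic.
    transitions: list of (s, a, r, s_next, done) tuples
    v_hat: callable mapping a state to the current value estimate V_phi(s)
    """
    data = []
    for s, a, r, s_next, done in transitions:
        # Target is r + gamma * V_phi(s'); at terminal states the tail is zero.
        target = r + (0.0 if done else gamma * v_hat(s_next))
        data.append((s, target))
    return data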
Intuitively, the second option assigns less weight to later steps' gradients, so it essentially means that later steps matter less under the discount.
In practice, we can show that option 1 has better variance properties, so it is what we actually use. The full derivation can be found in this paper [3]. Now in our actor-critic algorithm, after we impose the discount factor, we have the following gradient:
$$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\left(r(s_{i,t}, a_{i,t}) + \gamma\hat{V}^\pi_\phi(s_{i,t+1}) - \hat{V}^\pi_\phi(s_{i,t})\right)$$
Now we can incorporate the discount factor with our actor-critic algorithm in Algorithm
4.
every time.
and in the actor-critic algorithm, we estimate the gradient by estimating the advantage function:
$$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\left(r(s_{i,t}, a_{i,t}) + \gamma\hat{V}^\pi_\phi(s_{i,t+1}) - \hat{V}^\pi_\phi(s_{i,t})\right)$$
So what are the pros and cons of the two approaches? In policy gradient with baselines, we have shown that there is no bias in our estimate, but there may be high variance due to our single-sample estimate of the reward-to-go. In the actor-critic algorithm, on the other hand, we have lower variance thanks to the critic, but we end up with a biased estimate because the critic may be imperfect, since we are bootstrapping. So can we somehow keep the estimator unbiased while lowering the variance with the critic $\hat{V}^\pi_\phi$?
The solution is straightforward: we can just use $\hat{V}^\pi_\phi$ in place of $b$:
$$\nabla_\theta J(\theta) \simeq \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\left(\left(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\right) - \hat{V}^\pi_\phi(s_{i,t})\right)$$
the advantage function has lower bias but higher variance. The reason this tradeoff exists is that the further we go into the future along a trajectory, the more the variance increases, because a single-sample approximation becomes less and less representative of the future. Therefore, the Monte Carlo advantage estimate is good for getting accurate values in the near term, but not in the long term. In contrast, in the actor-critic advantage, the bias potentially skews the values in the near term, but because the critic aggregates information from many states, it will likely be a better approximator in the long run. Therefore, it would be better if we could use the actor-critic based advantage for the far future and the Monte Carlo based estimate for the near term, in order to control the bias-variance tradeoff.
As a result, we can cut the trajectory before the variance gets too big. Mathematically, we can estimate the advantage function by combining the two approaches, using the Monte Carlo approach only for the first $n$ steps:
$$\hat{A}^\pi_n(s_t, a_t) = \sum_{t'=t}^{t+n}\gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t) + \gamma^n\hat{V}^\pi_\phi(s_{t+n})$$
Here we applied an n-step estimator, which sums the reward from now to $n$ steps from now, and $n > 1$ often gives us better performance.
Furthermore, if we don't want to commit to just one $n$, we can use a weighted combination of different n-step returns, which we call Generalized Advantage Estimation (GAE):
$$\hat{A}^{GAE}(s_t, a_t) = \sum_{n=1}^{\infty} w_n\,\hat{A}^\pi_n(s_t, a_t)$$
To choose the weights, we prefer cutting earlier, so we can assign the weights accordingly: $w_n \propto \lambda^{n-1}$, where we can interpret $\lambda$ as controlling the chance of getting cut.
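In practice, this weighted combination is usually computed with the recursive form below, built from the TD errors $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$; it is equivalent to the $\lambda^{n-1}$-weighted sum of n-step advantages up to a normalization constant. The sketch is our own illustration.

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.
    rewards: (T,) rewards r_t
    values:  (T+1,) critic estimates V(s_t), including the state after the last step
    Returns (T,) advantage estimates.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error: the one-step actor-critic advantage estimate.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors (weight gamma * lambda).
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

print(gae_advantages(np.array([1.0, 1.0, 1.0]), np.array([0.5, 0.5, 0.5, 0.0])))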
Chapter 5: Value Function Methods
In the last two chapters, we discussed some policy gradient-based algorithms. We have also
seen the fact that the policy gradient methods have high variance. Therefore, it would be
nice if we could completely omit the gradient step. To achieve this, we are going to talk
about the value function methods in this chapter.
which means we take the best action from $s_t$ if we follow $\pi$. Even though we have no knowledge of what the policy $\pi$ actually is, by taking the arg max, we can guarantee that the action produced is at least as good as the action from the unknown policy. Therefore, as long as we have an accurate representation of the advantage function $A^\pi(s_t, a_t)$, we can implicitly generate a parameter-free policy:
$$\pi'(a_t\mid s_t) = \begin{cases}1, & \text{if } a_t = \arg\max_{a_t} A^\pi(s_t, a_t)\\ 0, & \text{otherwise}\end{cases}$$
and as we have shown above, this implicit policy is at least as good as the unknown policy $\pi$.
Therefore, if we can evaluate the value function V π (s) then we can also evaluate Aπ (s, a).
So in the high-level policy iteration algorithm, we can just use the value function in place of
the advantage function.
5.2.2 Dynamic Programming
Now let us make a simple assumption. Suppose we know a priori the transition probability $p(s'\mid s, a)$, and both the state $s$ and the action $a$ are discrete. Then a very natural dynamic programming update is the bootstrapped update, as we have seen before:
$$V^\pi(s) \leftarrow \mathbb{E}_{a\sim\pi(a\mid s)}\left[r(s, a) + \gamma\,\mathbb{E}_{s'\sim p(s'\mid s, a)}\left[V^\pi(s')\right]\right]$$
and we can just use the current estimate of $V^\pi$ inside the nested expectation for simplicity.
According to our definition of the implicit policy function $\pi'$, the policy is actually deterministic. Therefore, we can completely get rid of the outer expectation, and the value function update can be further simplified as:
$$V(s) \leftarrow \max_a\left(r(s, a) + \gamma\,\mathbb{E}_{s'\sim p(s'\mid s, a)}\left[V(s')\right]\right)$$
Having this, we can simplify the policy iteration algorithm further, as illustrated in Alg. 7. Since we skip the explicit policy update step, we call this new, simplified algorithm the "value iteration algorithm".
to construct a tabular expression of the value function and the Q function. Apparently, the
tables are going to explode in dimensions if there are a lot of states. We call this the Curse
of Dimensionality. To resolve this problem, we can use a neural network to approximate the
functions instead of constructing a tabular expression of the function.
5.3.1 Fitted Supervised Value Iteration Algorithm
Since we know that the value function is defined as $\max_a Q^\pi(s, a)$, we can use this definition as the label for the value function in order to define an L2 loss function:
$$L(\phi) = \frac{1}{2}\left\|V_\phi(s) - \max_a Q^\pi(s, a)\right\|^2$$
Then we can sketch out a simple fitted value iteration algorithm using this loss function
in Algorithm 8. Note that when setting the label, the ideal step to take is to enumerate
all the states and find the corresponding label. However, when it is impractical, one could
just use some samples and enumerate all the actions to find the labels. Moreover, when we
take the maximum over all the actions from a state, we implicitly assume that the transition
dynamics are known. Why? Because we want to take an action, record the value of that
action, and then roll back to the previous state in order to check the values of other actions.
Thus, without the transition dynamics, we cannot easily take the maximum.
5.3.2 Fitted Q-Iteration Algorithm
To address this problem, we can apply the same “max” trick in policy iteration. In policy
iteration, we skip the policy update and calculate the values directly. Here in fitted value
iteration, we can get around the transition dynamics by looking up the Q function table,
because Vφ (s) ' maxa Qφ (s, a), and this max operation is merely a table lookup from the Q
value table. Consequently, we are now iterating on the Q values. Such a method works for
off-policy samples (unlike actor-critic), and it only needs one network, so it does not have any
high-variance policy gradient. However, as we shall see in later sections, such methods do not
have convergence guarantees on non-linear functions, which could potentially be problematic.
The full fitted Q-iteration algorithm is shown in Algorithm 9.
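The sketch below illustrates the core of fitted Q-iteration: computing max-over-actions targets from the current Q estimate and regressing the Q-function onto them. It is only an illustration; q_model is a hypothetical regressor (anything with fit/predict), and the exact loop structure of Algorithm 9 may differ.

import numpy as np

def fitted_q_iteration(q_model, transitions, n_actions, gamma=0.99, k_iters=10):
    """Fitted Q-iteration on a fixed batch of off-policy transitions.
    q_model: regressor with .fit(X, y) and .predict(X); input is [state, onehot(a)].
    transitions: list of (s, a, r, s_next, done) with discrete actions.
    """
    def featurize(s, a):
        one_hot = np.zeros(n_actions)
        one_hot[a] = 1.0
        return np.concatenate([s, one_hot])

    X = np.array([featurize(s, a) for s, a, _, _, _ in transitions])
    # Initialize Q by regressing onto immediate rewards (a simple choice).
    q_model.fit(X, np.array([r for _, _, r, _, _ in transitions]))
    for _ in range(k_iters):
        targets = []
        for s, a, r, s_next, done in transitions:
            if done:
                targets.append(r)
            else:
                # y = r + gamma * max_a' Q(s', a'): the "max" trick as a lookup
                # over all discrete actions, no transition model needed.
                q_next = [q_model.predict(featurize(s_next, ap)[None])[0]
                          for ap in range(n_actions)]
                targets.append(r + gamma * max(q_next))
        q_model.fit(X, np.array(targets))   # supervised regression step
    return q_model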
So in this particular step, we are optimizing the Bellman error $\epsilon$, and if $\epsilon = 0$, we have the optimal Q-function, corresponding to the optimal policy $\pi$, which can be recovered by the arg max operation. However, rather ironically, we do not know what we are optimizing in the previous steps; this is a potential problem of the fitted Q-learning algorithm, and most convergence guarantees are lost once we leave the tabular case.
5.3.4 Online Q-Iteration Algorithm
We can also make the samples more efficient by making the Q-iteration algorithm completely
online. By online we mean that we do not store any transition. Instead, we take one transition
and immediately apply the transition to our value update. The online version of Q-Iteration
Algorithm is sketched in Alg. 10. As we see in step 2 of the algorithm, we are taking an
action off-policy, so we have a lot of choices to make.
where $r_a$ is a stacked vector of rewards at all states for action $a$, and $T_a$ is a matrix of transitions for action $a$ such that $T_{a,i,j} = p(s' = i\mid s = j, a)$. We also define the fixed point of the Bellman backup operator $B$, denoted $V^*$:
$$V^*(s) = \max_a r(s, a) + \gamma\,\mathbb{E}\left[V^*(s')\right]$$
Recall the algorithms that we discussed in the last chapter, Alg. 9 and Alg. 10, where we devised a fitted Q-learning algorithm and a fully online version of it. We have also shown that Q-learning is fully off-policy, meaning that we do not care about the trajectory we are taking; we only care about the current transition and the next state we land in. So what is the problem with the above Q-learning algorithms? To see this, let us look carefully at step 4 of Alg. 10. The gradient step that we are taking can be equivalently written as:
$$\phi \leftarrow \phi - \alpha\frac{dQ_\phi}{d\phi}(s_i, a_i)\left(Q_\phi(s_i, a_i) - \left[r(s_i, a_i) + \gamma\max_{a'} Q_\phi(s'_i, a')\right]\right)$$
This is not gradient descent! The "target" value $y_i$ is not constant: it depends on our parameter $\phi$, yet we do not take the gradient through $y_i$. Therefore, this is not the gradient descent step we are used to. Moreover, in step 2, we take only one sample transition. This sampling scheme brings two problems: first, a single sample is hardly enough to train the network (recall that in online actor-critic, we would use parallel workers to obtain multiple online samples); second, the samples we draw are correlated, in that consecutive transitions depend on each other. As you may know, Stochastic Gradient Descent converges only if we take the correct gradient and the samples are IID. We violate both requirements, so the Q-learning algorithms in Alg. 9 and Alg. 10 do not have any convergence guarantees.
Putting it all together, we sketch out the full Q-learning algorithm with a replay buffer in Alg. 11. What has changed in Alg. 11 compared with Alg. 9? In step 2, we not only collect a dataset, but also add it to the replay buffer $\mathcal{B}$. Inside the for loop, we now sample a batch of transitions from $\mathcal{B}$, which gives us lower variance when we take the gradient step on the batch. We also periodically update $\mathcal{B}$.
This solves the correlation problem, but we still need to address the incorrect-gradient problem.
Therefore, to alleviate this "abruptness", we can use Polyak averaging: in step 6 of Alg. 13, instead of copying $\phi$ every $N$ steps, we do $\phi' \leftarrow \tau\phi' + (1-\tau)\phi$. We also call such an update a damped update.
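Putting the two fixes together, here is a minimal sketch of a replay buffer plus a Polyak-averaged target update; phi and phi_target are assumed to be flat parameter arrays, and the Q-network and gradient step themselves are omitted.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)   # old transitions are evicted

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between samples.
        return random.sample(self.buffer, batch_size)

def polyak_update(phi_target, phi, tau=0.999):
    """Damped target update: phi' <- tau * phi' + (1 - tau) * phi."""
    return tau * phi_target + (1.0 - tau) * phi

# Usage sketch: inside the training loop we would sample a batch from the buffer,
# compute targets y = r + gamma * max_a' Q_{phi'}(s', a'), take a gradient step on
# phi, and then call polyak_update(phi_target, phi).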
Now let us view the three different Q-learning algorithms in a more general way. As
shown in Fig. 6.2, there are three different steps in the algorithm. The first step is to collect
data, the second step is to update the target in the target network, and the third step is to
regress onto the Q-function. In the simplest, regression-based fitted Q-learning algorithm,
process 3 is in the inner loop of process 2, which is in the inner loop of process 1. In online
Q-learning, we evict the old transitions immediately, and process 1, 2, and 3 run at the same
speed. In DQN, process 1 and 3 run at the same speed, but process 2 runs slower.
Essentially, we use one network's parameters to evaluate the value, while using the other's to select the action. With two separate networks, we decorrelate the errors in action selection and value evaluation, thus decreasing the overestimation in the Q-values.
In practice, we can just use the current and target networks as the two separate networks. Therefore, instead of setting the target as $y = r + \gamma Q_{\phi'}(s', \arg\max_{a'} Q_{\phi'}(s', a'))$, we use the current network to select the action and the target network to evaluate the value: $y = r + \gamma Q_{\phi'}(s', \arg\max_{a'} Q_{\phi}(s', a'))$.
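In code, the change from the standard target to the double Q-learning target is a one-line swap of which network performs the arg max. The sketch below assumes a hypothetical q_values(params, s) function returning a vector of Q-values over actions.

import numpy as np

def dqn_target(r, s_next, done, q_values, phi, phi_target, gamma=0.99):
    """Standard target: the target network both selects and evaluates."""
    if done:
        return r
    return r + gamma * np.max(q_values(phi_target, s_next))

def double_dqn_target(r, s_next, done, q_values, phi, phi_target, gamma=0.99):
    """Double Q-learning: the current network selects, the target network evaluates."""
    if done:
        return r
    a_star = int(np.argmax(q_values(phi, s_next)))           # select with phi
    return r + gamma * q_values(phi_target, s_next)[a_star]  # evaluate with phi'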
6.3.2 N-Step Return Estimator
In the definition of our target $y$, $y_{i,t} = r_{i,t} + \gamma\max_{a_{i,t+1}} Q_{\phi'}(s_{i,t+1}, a_{i,t+1})$, the Q-value term only matters if it is a good estimate. If the Q-value estimate is bad, the only values that matter come from the reward term, so we are not learning much about the Q-function. To resolve this problem, let us recall the N-step trick from the actor-critic algorithm: to trade off bias and variance, we can cut the trajectory early and only sum rewards up to N steps from now. Specifically, we can define the target as:
$$y_{i,t} = \sum_{t'=t}^{t+N-1}\gamma^{t'-t} r_{i,t'} + \gamma^N\max_{a_{i,t+N}} Q_{\phi'}(s_{i,t+N}, a_{i,t+N})$$
One subtle problem with this solution is that the learning process suddenly becomes on-
policy, so we cannot efficiently make use of the off-policy data. Why is it on-policy? If we
look at the summation of the rewards, we are collecting the rewards data using a certain
trajectory, which is generated by a specific policy. Therefore, we end up having less biased
target values when the Q-values are inaccurate, and in practice, it is faster in early stages
of learning. However, it is only correct when we are learning on-policy. To fix this problem,
one could ignore this mismatch, which somehow works very well in practice. Or one could
cut the trace by dynamically adapting N to get only on-policy data. Also, one could use
importance sampling as we discussed before. For more details, please refer to this paper by
Munos et al. [5].
6.3.3 Q-Learning with Continuous Actions
Recall the implicit policy that we define using Q-learning:
$$\pi'(a_t\mid s_t) = \begin{cases}1, & \text{if } a_t = \arg\max_{a_t} A^\pi(s_t, a_t)\\ 0, & \text{otherwise}\end{cases}$$
One problem with this definition is that the arg max operation cannot be easily applied if the actions are continuous. How are we going to address this issue?
One option is to use various optimization techniques, as one may have seen in UC
Berkeley’s EE 127. Specifically, one could use SGD on the action space to produce an optimal
at by solving an optimization problem. Another simple approach is to stochastically optimize
the Q-values by using some samples of the values from some pre-defined distribution (e.g.
uniform): maxa Q(s, a) ' max{Q(s, a1 ), ..., Q(s, aN )}. One could also improve the accuracy
by using some more sophisticated optimization techniques such as Cross-Entropy Method
(CEM).
Option 2 is to use function classes that are easy to maximize. For example, one could use the Normalized Advantage Functions (NAF) proposed by Gu et al. in [6].
Another, fancier option is to learn an approximate maximizer, originally proposed by Lillicrap et al. in [7]. The idea of Deep Deterministic Policy Gradient (DDPG, which is actually Q-learning in disguise) is to train another network $\mu_\theta(s)$ such that $\mu_\theta(s) \simeq \arg\max_a Q_\phi(s, a)$. To train this network, note that the optimization with respect to $\theta$ can be written as $\theta \leftarrow \arg\max_\theta Q_\phi(s, \mu_\theta(s))$, whose gradient follows from the chain rule: $\frac{dQ_\phi}{d\theta} = \frac{dQ_\phi}{da}\frac{d\mu_\theta}{d\theta}$.
Then the new target becomes:
$$y_j = r_j + \gamma Q_{\phi'}(s'_j, \mu_\theta(s'_j)) \simeq r_j + \gamma Q_{\phi'}(s'_j, \arg\max_{a'} Q_{\phi'}(s'_j, a'))$$
The sketch of DDPG is in Alg. 14. In step 5, we update the Q-function, and in step 6, we update the argmax-er. Therefore, DDPG is essentially DQN with a learned argmax-er.
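A compact PyTorch-style sketch of the two DDPG updates is shown below. It assumes q_net(s, a) and actor(s) are torch modules, the batch tensors are floats with done in {0, 1}, and it omits the replay buffer and the Polyak updates of the target networks; it is an illustration, not the exact Algorithm 14.

import torch

def ddpg_update(q_net, q_target, actor, actor_target, batch, q_opt, actor_opt, gamma=0.99):
    """One DDPG update on a replay batch (s, a, r, s_next, done) of torch tensors."""
    s, a, r, s_next, done = batch
    # Step 5: critic update with target y = r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_target(s_next, actor_target(s_next)).squeeze(-1)
    q_loss = ((q_net(s, a).squeeze(-1) - y) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()
    # Step 6: actor ("argmax-er") update: maximize Q(s, mu(s)) via the chain rule.
    actor_loss = -q_net(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()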
Chapter 7: Policy Gradients Theory and Advanced Policy Gradients
Why does the policy gradient algorithm work? Recall our generic policy gradient algorithm: we loop to repeatedly estimate the advantage function $\hat{A}^\pi(s_t, a_t)$ for the current policy $\pi$, and then use this estimate to improve the policy by taking a gradient step on the policy parameter $\theta$, as shown in Alg. 2. This is very similar to the policy iteration algorithm that we discussed in the last chapter; the idea of policy iteration is to repeatedly evaluate the advantage function $A^\pi(s, a)$ and update the policy accordingly using the arg max implicit policy. In this chapter, we dive deeper into the policy gradient algorithm and show that it can be reduced to our policy iteration algorithm, which we will prove mathematically.
In the first two steps, we swapped out the initial state distribution in the expectation. This might seem odd at first sight, but the intuition is that the initial state marginal is the same for any policy. Therefore, the expectation taken under the initial state marginal can be equivalently written under any policy's trajectory distribution, and for simplicity, we choose the policy of interest $\pi'$, with corresponding parameter $\theta'$.
Now we have proved our claim, but the result has a distribution mismatch: the expectation is taken under $\pi_{\theta'}$, but the advantage function $A$ is under $\pi_\theta$. It would be nice if we could make the two distributions the same. Therefore, we make use of our powerful statistical tool, importance sampling:
$$\mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi_\theta}(s_t, a_t)\right] = \sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[\mathbb{E}_{a_t\sim\pi_{\theta'}(a_t\mid s_t)}\left[\gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right] = \sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right]$$
Now the outer expectation is still under the $\theta'$ state marginal. Can we simply ignore the distribution mismatch and say that it is approximately equal to $\sum_t\mathbb{E}_{s_t\sim p_\theta(s_t)}\left[\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right]$, which we define as $\bar{A}(\theta')$? We would be all set if the approximation held, because then $J(\theta') - J(\theta) \simeq \bar{A}(\theta')$, which means we could calculate $\nabla_{\theta'}\bar{A}(\theta')$ without generating new samples or computing any new advantage functions, since the only term in $\bar{A}(\theta')$ that depends on $\theta'$ is the policy in the numerator of the importance sampling ratio. Thus, we could just use the current samples from $\pi_\theta$.
Now let us focus on the more general case, where $\pi_\theta$ is an arbitrary distribution. We can quantify the notion of "close" by saying $\pi_\theta$ is close to $\pi_{\theta'}$ if
$$|\pi_{\theta'}(a_t\mid s_t) - \pi_\theta(a_t\mid s_t)| \le \epsilon \quad \forall s_t$$
Here is a useful lemma that we will use later: if $|p_X(x) - p_Y(y)| = \epsilon$, then there exists a joint distribution $p(x, y)$ such that $p(x) = p_X(x)$, $p(y) = p_Y(y)$, and $p(x = y) = 1 - \epsilon$. Equivalently, this means that under these circumstances $p_X(x)$ disagrees with $p_Y(y)$ with probability $\epsilon$. If we plug in our $\pi_\theta$ and $\pi_{\theta'}$, then $\pi_{\theta'}(a_t\mid s_t)$ takes a different action than $\pi_\theta(a_t\mid s_t)$ with probability at most $\epsilon$. Using this lemma, we have the same bound as in the deterministic case:
$$|p_{\theta'}(s_t) - p_\theta(s_t)| = \left(1-(1-\epsilon)^t\right)\left|p_{\mathrm{mistake}}(s_t) - p_\theta(s_t)\right| \le 2\left(1-(1-\epsilon)^t\right) \le 2\epsilon t$$
Now let us first consider a more general case with a generic function of state $f(s_t)$:
$$\mathbb{E}_{p_{\theta'}(s_t)}\left[f(s_t)\right] = \sum_{s_t} p_{\theta'}(s_t) f(s_t) = \sum_{s_t}\left(p_\theta(s_t) - p_\theta(s_t) + p_{\theta'}(s_t)\right) f(s_t) = \sum_{s_t} p_\theta(s_t) f(s_t) - \sum_{s_t}\left(p_\theta(s_t) - p_{\theta'}(s_t)\right) f(s_t)$$
$$\ge \sum_{s_t} p_\theta(s_t) f(s_t) - \sum_{s_t}\left|p_\theta(s_t) - p_{\theta'}(s_t)\right| f(s_t) \ge \sum_{s_t} p_\theta(s_t) f(s_t) - \sum_{s_t}\left|p_\theta(s_t) - p_{\theta'}(s_t)\right|\cdot\max_{s_t} f(s_t)$$
Now, putting it all together, let us plug this into the term inside the expectation taken under the mismatched distribution:
$$\sum_t\mathbb{E}_{s_t\sim p_{\theta'}(s_t)}\left[\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right] \ge \sum_t\mathbb{E}_{s_t\sim p_\theta(s_t)}\left[\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right] - \sum_t 2\epsilon t\, C \quad (7.1)$$
where $C$ is a constant depending on the maximum reward: in the finite-horizon case it is of $O(T r_{\max})$, and in the infinite-horizon case it is of $O(r_{\max}\gamma^t)$, whose sum can be bounded, by convergence of the geometric series, by $O\left(\frac{r_{\max}}{1-\gamma}\right)$. Therefore, for small $\epsilon$, we can simply ignore the mismatch.
What have we proved? We have proved that we can update the policy parameter $\theta'$ by
$$\theta' \leftarrow \arg\max_{\theta'}\sum_t\mathbb{E}_{s_t\sim p_\theta(s_t)}\left[\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right]$$
subject to $|\pi_{\theta'}(a_t\mid s_t) - \pi_\theta(a_t\mid s_t)| \le \epsilon$, provided $\epsilon$ is small.
such that $D_{KL}\left(\pi_{\theta'}(a_t\mid s_t)\,\|\,\pi_\theta(a_t\mid s_t)\right) \le \epsilon$. We have guaranteed improvement if $\epsilon$ is small.
7.2.3 Enforcing the Distribution Mismatch Constraint
Now how do we incorporate the constraint on the distribution mismatch into our objective? One way is to introduce a Lagrangian, since we have the following optimization problem:
$$\theta' \leftarrow \arg\max_{\theta'}\sum_t\mathbb{E}_{s_t\sim p_\theta(s_t)}\left[\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\gamma^t A^{\pi_\theta}(s_t, a_t)\right]\right] \quad \text{s.t. } D_{KL}\left(\pi_{\theta'}(a_t\mid s_t)\,\|\,\pi_\theta(a_t\mid s_t)\right) \le \epsilon \qquad (7.2)$$
whose Lagrangian is $\mathcal{L}(\theta', \lambda) = \bar{A}(\theta') - \lambda\left(D_{KL}\left(\pi_{\theta'}(a_t\mid s_t)\,\|\,\pi_\theta(a_t\mid s_t)\right) - \epsilon\right)$. We then optimize the Lagrangian by first maximizing $\mathcal{L}(\theta', \lambda)$ with respect to $\theta'$, which we can do incompletely with just a few gradient steps, and then updating the dual variable by $\lambda \leftarrow \lambda + \alpha\left(D_{KL}\left(\pi_{\theta'}(a_t\mid s_t)\,\|\,\pi_\theta(a_t\mid s_t)\right) - \epsilon\right)$. This technique is an instance of dual gradient descent, which we will discuss in more depth in a later chapter. Essentially, the intuition is that we raise $\lambda$ if the constraint is violated too much, and lower it otherwise. Note that one could also solve this optimization problem by treating $\lambda$ as a regularization coefficient for the original optimization program.
7.2.4 Other Optimization Techniques
There are also other ways to optimize subject to the distribution mismatch bound. One way is to use a first-order Taylor expansion. Since $\theta' \leftarrow \arg\max_{\theta'}\bar{A}(\theta')$, we can apply
From what we have learned in policy gradients, we can derive the gradient of $\bar{A}$ as follows:
$$\nabla_{\theta'}\bar{A}(\theta') = \sum_t\mathbb{E}_{s_t\sim p_\theta(s_t)}\left[\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\left[\frac{\pi_{\theta'}(a_t\mid s_t)}{\pi_\theta(a_t\mid s_t)}\gamma^t\nabla_{\theta'}\log\pi_{\theta'}(a_t\mid s_t)\, A^{\pi_\theta}(s_t, a_t)\right]\right]$$
and if we have $\pi_\theta \simeq \pi_{\theta'}$, then we can effectively cancel out the importance ratio:
$$\nabla_\theta\bar{A}(\theta) = \sum_t\mathbb{E}_{s_t\sim p_\theta(s_t)}\left[\mathbb{E}_{a_t\sim\pi_\theta(a_t\mid s_t)}\left[\gamma^t\nabla_{\theta'}\log\pi_{\theta'}(a_t\mid s_t)\, A^{\pi_\theta}(s_t, a_t)\right]\right] = \nabla_\theta J(\theta)$$
We now have the RL objective in our optimization, so can we just use gradient ascent like we did in policy gradient? It turns out gradient ascent enforces a different constraint:
$$\theta' \leftarrow \arg\max_{\theta'}\nabla_\theta J(\theta)^T(\theta' - \theta) \quad \text{s.t. } \|\theta' - \theta\|^2 \le \epsilon \qquad (7.5)$$
whose update rule can be written as
$$\theta' \leftarrow \theta + \sqrt{\frac{\epsilon}{\|\nabla_\theta J(\theta)\|^2}}\,\nabla_\theta J(\theta)$$
where the square root term is just our learning rate, which depends on $\epsilon$. We do not want this constraint, because it optimizes in the parameter space $\theta$, constrained to a bounded $\epsilon$-ball, whereas we want to optimize in policy space with an ellipsoidal constraint, so that more important parameters take smaller steps and less important parameters take bigger steps.
Since the two optimization problems are not the same, we tweak the KL-divergence constraint using a Taylor expansion one more time. If the two policies are very similar to each other, one can approximate the KL-divergence by
$$D_{KL}(\pi_{\theta'}\,\|\,\pi_\theta) \simeq \frac{1}{2}(\theta' - \theta)^T F(\theta' - \theta)$$
where $F$ is the Fisher-information matrix, defined as
$$F = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta\log\pi_\theta(a\mid s)\,\nabla_\theta\log\pi_\theta(a\mid s)^T\right]$$
This matrix $F$ can be estimated using samples, and it gives us a convenient quadratic bound. Using a technique similar to Newton-Raphson, we can update the parameter by
$$\theta' \leftarrow \theta + \alpha F^{-1}\nabla_\theta J(\theta)$$
Now our update rule looks a lot like gradient ascent, except that in plain gradient ascent the $\ell_2$ norm constrains the update step to a circle, while under our second-order approximation of the KL-divergence the step is constrained to an ellipse. In practice, if we want to solve this natural gradient problem with the Fisher information matrix efficiently, there are nice techniques for approximating the inverse of $F$, as suggested in the TRPO paper [8].
Chapter 8: Model-Based Reinforcement Learning
What we have covered so far can be categorized as "model-free" reinforcement learning. The reason it is called model-free is that the transition probabilities are unknown and we did not even attempt to learn them. Recall the RL objective:
$$\pi_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1)\prod_{t=1}^{T}\pi_\theta(a_t\mid s_t)\,p(s_{t+1}\mid s_t, a_t)$$
$$\theta^* = \arg\max_\theta\mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
The transition probabilities $p(s_{t+1}\mid s_t, a_t)$ are not known in any of the model-free RL algorithms that we have covered, such as Q-learning and policy gradients. But what if we know the transition dynamics? Recall that at the very beginning of these notes we drew an analogy between RL and control theory; in many cases, we do know the system's internal transitions. For
example, in games, easily modeled systems, and simulated environments, the transitions are
given to us. Moreover, it is not uncommon to learn the transition models: in classic robotics,
system identification fits unknown parameters of a known model to learn how the system
evolves, and one could also imagine a deep learning approach where we could potentially fit
a general-purpose model to observed transition data for later use. In fact, the latter case is
the essence of Model-based RL, where we learn the transition dynamics first, and then figure
out how to choose actions. To learn about model-based RL, we shall start from a simpler
case, where we know the transitions and determine how we control the system optimally
based on the transitions. After this, we can apply our optimal control theory to the more
general case, where we actually learn the transitions first.
Note that we roll out all the actions to apply based only on the initial state marginal, so we do not consider any state feedback in this case.
In a closed-loop controller, however, we keep interacting with the world, so we need a policy function that tells us the action to apply given the current state: $a_t \sim \pi(a_t\mid s_t)$, which we call state feedback. We choose our policy function as follows:
$$p(s_1, a_1, \ldots, s_T, a_T) = p(s_1)\prod_{t=1}^{T}\pi(a_t\mid s_t)\,p(s_{t+1}\mid s_t, a_t)$$
$$\pi = \arg\max_\pi\mathbb{E}_{\tau\sim p(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
Generally, $\pi$ could take many forms, such as a neural net or a time-varying linear controller $K_t s_t + k_t$.
If the current node $s_t$ is not fully expanded, meaning there is an action we have never taken before, then we choose a new $a_t$; else, we choose the child with the best score $S(s_{t+1})$.
More details about MCTS can be found in [9] and [10].
8.2.5 Using Derivatives
Let us consider the control theory counterpart of the RL objective. Essentially, we have a constrained optimization problem defined as follows:
$$\min_{u_1,\ldots,u_T}\sum_{t=1}^{T} c(x_t, u_t) \quad \text{s.t. } x_t = f(x_{t-1}, u_{t-1})$$
In the collocation method, however, we optimize over both actions and states, with constraints, and the optimization problem is written as:
$$\min_{u_1,\ldots,u_T,\,x_1,\ldots,x_T}\sum_{t=1}^{T} c(x_t, u_t) \quad \text{s.t. } x_t = f(x_{t-1}, u_{t-1})$$
What we are doing now is solving for a closed-form optimal LQR controller. The idea is to use backward recursion. Since we are using the shooting method, we have
$$\min_{u_1,\ldots,u_T} c(x_1, u_1) + c(f(x_1, u_1), u_2) + \cdots + c(f(f(\ldots)\ldots), u_T)$$
and the last term is the only one that depends on $u_T$. Therefore, as a base case, we can solve for $u_T$ first. To simplify the computation, let us define some blocks in the matrices defined above. Specifically, assume that
$$C_T = \begin{bmatrix} C_{x_T,x_T} & C_{x_T,u_T} \\ C_{u_T,x_T} & C_{u_T,u_T} \end{bmatrix}, \qquad c_T = \begin{bmatrix} c_{x_T} \\ c_{u_T} \end{bmatrix}$$
Since our cost function is
$$Q(x_T, u_T) = \text{const} + \frac{1}{2}\begin{bmatrix} x_T \\ u_T \end{bmatrix}^T C_T\begin{bmatrix} x_T \\ u_T \end{bmatrix} + \begin{bmatrix} x_T \\ u_T \end{bmatrix}^T c_T$$
setting the gradient with respect to $u_T$ to zero gives
$$u_T = -C_{u_T,u_T}^{-1}\left(C_{u_T,x_T} x_T + c_{u_T}\right) = K_T x_T + k_T, \qquad K_T = -C_{u_T,u_T}^{-1} C_{u_T,x_T}, \quad k_T = -C_{u_T,u_T}^{-1} c_{u_T}$$
Having solved for our terminal control input $u_T$, which is fully determined by the terminal state $x_T$, we can eliminate $u_T$ from $Q(x_T, u_T)$. Plugging in, we have
$$V(x_T) = \text{const} + \frac{1}{2}\begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix}^T C_T\begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix} + \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix}^T c_T$$
$$= \frac{1}{2}x_T^T C_{x_T,x_T} x_T + \frac{1}{2}x_T^T C_{x_T,u_T} K_T x_T + \frac{1}{2}x_T^T K_T^T C_{u_T,x_T} x_T + \frac{1}{2}x_T^T K_T^T C_{u_T,u_T} K_T x_T$$
$$\quad + x_T^T K_T^T C_{u_T,u_T} k_T + x_T^T C_{x_T,u_T} k_T + x_T^T c_{x_T} + x_T^T K_T^T c_{u_T} + \text{const}$$
$$= \text{const} + \frac{1}{2}x_T^T V_T x_T + x_T^T v_T$$
where we define $V_T$ and $v_T$ to make the notation more compact:
$$V_T = C_{x_T,x_T} + C_{x_T,u_T} K_T + K_T^T C_{u_T,x_T} + K_T^T C_{u_T,u_T} K_T$$
$$v_T = c_{x_T} + C_{x_T,u_T} k_T + K_T^T c_{u_T} + K_T^T C_{u_T,u_T} k_T$$
Having solved the base case, we solve for the other optimal control inputs backwards. Let us proceed to solve for $u_{T-1}$ in terms of $x_{T-1}$. Note that $u_{T-1}$ not only affects the cost at time $T-1$, but also affects $x_T$ through the system dynamics:
$$f(x_{T-1}, u_{T-1}) = x_T = F_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + f_{T-1}$$
Therefore, the cost from $T-1$ onwards can be calculated as:
$$Q(x_{T-1}, u_{T-1}) = \text{const} + \frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T C_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T c_{T-1} + V(f(x_{T-1}, u_{T-1}))$$
and if we plug the transition dynamics into $V(x_T)$, we have:
$$V(x_T) = \text{const} + \frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T F_{T-1}^T V_T F_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T F_{T-1}^T V_T f_{T-1} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T F_{T-1}^T v_T$$
More compactly, we write the cost function as:
$$Q(x_{T-1}, u_{T-1}) = \text{const} + \frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T Q_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^T q_{T-1}$$
$$u_{T-1} = K_{T-1} x_{T-1} + k_{T-1}, \qquad K_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} Q_{u_{T-1},x_{T-1}}, \quad k_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} q_{u_{T-1}}$$
Applying the same technique backwards, we can solve for the optimal control at each time step, as illustrated in Alg. 17. In step 5 of Alg. 17, the Q-function represents the total cost from now until the end if we take $u_t$ from state $x_t$, and in step 11, the V-function represents the total cost from now until the end starting from state $x_t$, so $V(x_t) = \min_{u_t} Q(x_t, u_t)$, which we call the cost-to-go function. The above derivation is one of many derivations of the Riccati equation.
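The backward recursion above translates almost line by line into code. The numpy sketch below is our own illustration: it assumes the quadratic cost blocks C_t, c_t and the linearized dynamics F_t, f_t are given, that the value beyond the horizon is zero, and that Q_{u_t,u_t} is invertible.

import numpy as np

def lqr_backward(C, c, F, f, T, n_x, n_u):
    """Backward recursion for the finite-horizon LQR controller u_t = K_t x_t + k_t.
    C[t]: (n_x+n_u, n_x+n_u) cost Hessian, c[t]: (n_x+n_u,) cost gradient,
    F[t]: (n_x, n_x+n_u), f[t]: (n_x,) linear dynamics x_{t+1} = F_t [x;u] + f_t.
    """
    Ks, ks = [None] * T, [None] * T
    V = np.zeros((n_x, n_x))   # cost-to-go beyond the horizon assumed zero
    v = np.zeros(n_x)
    for t in reversed(range(T)):
        # Q_t = C_t + F_t^T V_{t+1} F_t ;  q_t = c_t + F_t^T V_{t+1} f_t + F_t^T v_{t+1}
        Q = C[t] + F[t].T @ V @ F[t]
        q = c[t] + F[t].T @ V @ f[t] + F[t].T @ v
        Q_xx, Q_xu = Q[:n_x, :n_x], Q[:n_x, n_x:]
        Q_ux, Q_uu = Q[n_x:, :n_x], Q[n_x:, n_x:]
        q_x, q_u = q[:n_x], q[n_x:]
        # u_t = K_t x_t + k_t with K_t = -Q_uu^{-1} Q_ux, k_t = -Q_uu^{-1} q_u.
        K = -np.linalg.solve(Q_uu, Q_ux)
        k = -np.linalg.solve(Q_uu, q_u)
        Ks[t], ks[t] = K, k
        # Plug the optimal u_t back in to get V_t, v_t (the cost-to-go).
        V = Q_xx + Q_xu @ K + K.T @ Q_ux + K.T @ Q_uu @ K
        v = q_x + Q_xu @ k + K.T @ q_u + K.T @ Q_uu @ k
    return Ks, ks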
What we have analyzed above is based on deterministic dynamics. What if the transition dynamics are stochastic? Specifically, consider the following setup:
$$x_{t+1} \sim p(x_{t+1}\mid x_t, u_t), \qquad p(x_{t+1}\mid x_t, u_t) = \mathcal{N}\left(F_t\begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t,\ \Sigma_t\right)$$
where the transition is a Gaussian distribution with constant covariance. It turns out we can apply the exact same algorithm, choosing actions according to $u_t = K_t x_t + k_t$, and we can ignore $\Sigma_t$ due to the symmetry of Gaussians.
8.2.8 Iterative LQR (iLQR)
In LQR, we assumed that the dynamics are linear. In the non-linear case, however, we can apply a similar approach called iterative LQR. Specifically, we iteratively apply Jacobian linearization to locally linearize the system around a reference point. Consequently, we approximate the non-linear system as a linear-quadratic system:
$$f(x_t, u_t) \simeq f(\hat{x}_t, \hat{u}_t) + \nabla_{x_t,u_t} f(\hat{x}_t, \hat{u}_t)\begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}$$
$$c(x_t, u_t) \simeq c(\hat{x}_t, \hat{u}_t) + \nabla_{x_t,u_t} c(\hat{x}_t, \hat{u}_t)\begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix} + \frac{1}{2}\begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}^T\nabla^2_{x_t,u_t} c(\hat{x}_t, \hat{u}_t)\begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}$$
Now we have an LQR system in terms of the deviations from the reference states and actions, $\delta x_t = x_t - \hat{x}_t$ and $\delta u_t = u_t - \hat{u}_t$:
$$\bar{f}(\delta x_t, \delta u_t) = F_t\begin{bmatrix}\delta x_t \\ \delta u_t\end{bmatrix}, \qquad \bar{c}(\delta x_t, \delta u_t) = \frac{1}{2}\begin{bmatrix}\delta x_t \\ \delta u_t\end{bmatrix}^T C_t\begin{bmatrix}\delta x_t \\ \delta u_t\end{bmatrix} + \begin{bmatrix}\delta x_t \\ \delta u_t\end{bmatrix}^T c_t$$
where
$$F_t = \nabla_{x_t,u_t} f(\hat{x}_t, \hat{u}_t), \qquad C_t = \nabla^2_{x_t,u_t} c(\hat{x}_t, \hat{u}_t), \qquad c_t = \nabla_{x_t,u_t} c(\hat{x}_t, \hat{u}_t)$$
Then we can iteratively run LQR with dynamics $\bar{f}$, cost $\bar{c}$, state $\delta x_t$, and action $\delta u_t$. A sketch of iLQR is shown in Alg. 18. In essence, iLQR is an approximation of Newton's method for solving $\min_{u_1,\ldots,u_T} c(x_1, u_1) + c(f(x_1, u_1), u_2) + \cdots + c(f(f(\ldots)\ldots), u_T)$.
8.3 Model-based RL
In this section, we are going to cover a simpler case of model-based RL. Specifically, we will talk about learning a model of the system first and then using the optimal control techniques we covered earlier, and about iteratively improving that model. Furthermore, we will learn to address uncertainty in the model, such as model mismatch and imperfection.
8.3.1 Basics
Why do we learn the model? When the model is unknown, we can learn it so that we know $f(s_t, a_t) = s_{t+1}$, or $p(s_{t+1}\mid s_t, a_t)$ in the stochastic case, and then use the tools from optimal control to maximize our rewards.
Our first attempt is naive: we learn $f(s_t, a_t)$ from data and then plan through it. We call this approach model-based RL version 0.5, or vanilla model-based RL, as shown in Algorithm 19. This is essentially what people do in system identification, a technique from classic robotics, and it is effective when we can hand-engineer a dynamics representation using our knowledge of physics and fit just a few parameters. However, it does not work in general because of distribution mismatch: when the model is imperfect, we might suffer from erroneous learning. Furthermore, since we are blindly following a trajectory, the mismatch is exacerbated as we use more expressive model classes, because $p_{\pi_0}(s_t) \neq p_{\pi_f}(s_t)$.
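For concreteness, here is the vanilla pipeline as Python-style pseudocode; base_policy, fit_dynamics, and plan are hypothetical callables (for example, plan could be the iLQR routine sketched earlier), and the environment interface is simplified.

def model_based_rl_v05(env, base_policy, fit_dynamics, plan, n_steps=1000, horizon=20):
    """Vanilla model-based RL (version 0.5): fit f(s, a) once, then plan through it."""
    # 1. Run the base (e.g. random) policy to collect D = {(s, a, s')}.
    data = []
    s = env.reset()
    for _ in range(n_steps):
        a = base_policy(s)
        s_next = env.step(a)
        data.append((s, a, s_next))
        s = s_next
    # 2. Fit the dynamics model to minimize sum_i ||f(s_i, a_i) - s'_i||^2.
    dynamics = fit_dynamics(data)
    # 3. Plan a whole action sequence through f and execute it blindly
    #    (this open-loop execution is what causes the distribution mismatch).
    s = env.reset()
    for a in plan(dynamics, s, horizon):
        s = env.step(a)
    return dynamics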
Instead, if we estimate the posterior $p(\theta\mid\mathcal{D})$ rather than a point (argmax) estimate, the entropy of that distribution gives us the model uncertainty given the data. Moreover, we can predict using $\int p(s_{t+1}\mid s_t, a_t, \theta)\, p(\theta\mid\mathcal{D})\, d\theta$.
To learn the posterior distribution, we can apply bootstrap ensembles, where we use multiple networks to learn the same distribution. Formally, say we have $N$ networks, each with parameters $\theta_i$, trained to model $p(s_{t+1}\mid s_t, a_t)$; we can then estimate the posterior by:
$$p(\theta\mid\mathcal{D}) \simeq \frac{1}{N}\sum_i \delta(\theta_i)$$
where $\delta(\cdot)$ is the Dirac delta function. To train it, we need to generate independent datasets to get independent models. One way to do this is to train $\theta_i$ on $\mathcal{D}_i$, sampled with replacement from $\mathcal{D}$. This method is simple, but it is a very crude approximation.
With this ensemble of networks, we choose actions a little differently. Before, we chose actions by
$$J(a_1, \ldots, a_H) = \sum_{t=1}^{H} r(s_t, a_t), \quad \text{where } s_{t+1} = f(s_t, a_t)$$
and now we average over the ensemble:
$$J(a_1, \ldots, a_H) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H} r(s_{t,i}, a_{t,i}), \quad \text{where } s_{t+1,i} = f_i(s_{t,i}, a_{t,i})$$
In general, for a candidate action sequence $a_1, \ldots, a_H$, we first sample $\theta \sim p(\theta\mid\mathcal{D})$; then at each time step $t$, we sample $s_{t+1} \sim p(s_{t+1}\mid s_t, a_t, \theta)$; then we calculate the reward $R = \sum_t r(s_t, a_t)$; and we repeat these steps and accumulate the average reward.
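The sketch below scores a candidate action sequence by averaging the predicted return over the bootstrap ensemble, as in the procedure just described; models is a list of fitted dynamics models and reward_fn is assumed known. It is our own illustration, to be combined with any sampling-based planner (random shooting, CEM, etc.).

import numpy as np

def ensemble_return(models, reward_fn, s0, actions):
    """Average predicted return of an action sequence over a bootstrap ensemble.
    models: list of dynamics models, each with .predict(s, a) -> next state
            (optionally stochastic, i.e. sampling s' ~ p(s'|s, a, theta_i)).
    """
    returns = []
    for model in models:                 # one ensemble member per theta_i
        s, total = s0, 0.0
        for a in actions:
            total += reward_fn(s, a)
            s = model.predict(s, a)      # roll the candidate plan through model i
        returns.append(total)
    return float(np.mean(returns))

# Planning sketch: evaluate many candidate sequences with ensemble_return and
# execute the best one (replanning at every step if used inside MPC).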
then with latent models, we are not sure about the actual state, so we take the expected
value:
N T
1 XX
max E [log pφ (st+1,i |st,i , at,i ) + log pφ (ot,i |st,i )]
φ N
i=1 t=1
where the expectation is with respect to the distribution of (st , st+1 ) ∼ p(st , st+1 |o1:T , a1:T )
However, the posterior distribution p(st , st+1 |o1:T , a1:T ) is usually intractable if we have
very complex dynamics. As a result, we could instead try to learn an approximate posterior,
which we call qψ (st |o1:t , a1;t ). We could also learn qψ (st , st+1 |o1:t , a1;t ) and qψ (st |ot ). We call
this technique learning an encoder. Learning the distribution qψ (st |ot ) is crude, but it is
the simplest to implement. If we just decide to learn this distribution for now, then the
expectation becomes:
max_φ (1/N) Σ_{i=1}^N Σ_{t=1}^T E[log p_φ(s_{t+1,i}|s_{t,i}, a_{t,i}) + log p_φ(o_{t,i}|s_{t,i})]
where the expectation is with respect to s_t ∼ q_ψ(s_t|o_t) and s_{t+1} ∼ q_ψ(s_{t+1}|o_{t+1}).
For now, let us focus on the simple case where q_ψ(s_t|o_t) is deterministic, because the stochastic case requires variational inference, which will be covered in depth in a later chapter. In the deterministic case, we train a neural network g_ψ(o_t) = s_t and use a Dirac delta function such that q_ψ(s_t|o_t) = δ(s_t − g_ψ(o_t)). Then the objective simplifies to
max_φ (1/N) Σ_{i=1}^N Σ_{t=1}^T [log p_φ(g_ψ(o_{t+1,i})|g_ψ(o_{t,i}), a_{t,i}) + log p_φ(o_{t,i}|g_ψ(o_{t,i}))]
Interested readers can refer to [11] and [12] for more information on learning latent-state models from pixel-based observations.
Chapter 9: Model-based Policy Learning
So far we have covered the basics of model-based RL that we first learn a model and use a
model for control. We have seen that this approach does not work well in general because
of the effect of distributional shift in model-based RL. We have also seen how to quantify uncertainty in our model in order to alleviate this issue. The methods we covered so
far do not involve learning policies. In this chapter, we will cover model-based reinforcement
learning of policies. Specifically, we will learn global policies and local policies, and combine
local policies into global policies using guided policy search and policy distillation. We shall
understand how and why we should use models to learn policies, global and local policy
learning, and how local policies can be merged via supervised learning into a global policy.
We have seen the difference between a closed-loop and an open-loop controller. We also discussed why an open-loop controller is suboptimal: we roll out a whole sequence of actions based solely on the first state observation. Therefore, it would be better to design a closed-loop controller in which state feedback can help us correct the mistakes we make. Recall that in a stochastic environment, we optimize over the policy as follows:
p(s_1, a_1, ..., s_T, a_T) = p(s_1) ∏_{t=1}^T π(a_t|s_t) p(s_{t+1}|s_t, a_t)
π = argmax_π E_{τ∼p(τ)}[Σ_t r(s_t, a_t)]
and π could take several forms: π can be a neural net, which we call a global policy, and
it can also be a time-varying linear controller Kt st + kt as we saw in LQR, which we call a
local policy.
If we directly backpropagate through the learned dynamics and the policy over many time steps, the gradients can become extremely big (exploding) or extremely small (vanishing), making optimization a lot harder. Furthermore, we have parameter sensitivity problems similar to shooting methods, but we no longer have a convenient second-order LQR-like method, because the policy function is extremely complicated and the policy parameters couple all the time steps, so dynamic programming is not applicable.
So what can we do about it? First, we can use model-free RL algorithms with synthetic
samples generated by the model. Essentially, we are using models to accelerate model-free
RL. Second, we can use simpler policies than neural nets such as LQR, and train local policies
to solve simple tasks, and then combine them into global policies via supervised learning.
Note that we do not do any backpropagation through time in policy gradient, because we compute the gradient of an expectation: we only differentiate the log-probability of the samples rather than the actual dynamics function.
If we instead look at the regular backpropagation (pathwise) gradient, we see a more chain-rule-like expression:
∇_θ J(θ) = Σ_{t=1}^T (dr_t/ds_t) ∏_{t'=2}^t (ds_{t'}/da_{t'-1})(da_{t'-1}/ds_{t'-1})
The two gradients are different because the policy gradient is for stochastic systems while the pathwise (backprop) gradient is for deterministic systems. However, one can show that they compute the same gradient in different ways and thus have different tradeoffs. We will talk about variational inference in more depth in the next chapter.
Algorithm 24 Dyna
Require: Some exploration policy for data collection π0
1: Given state s, pick action a using exploration policy
2: Observe s' and r, to get transition (s, a, s', r)
3: Update model p̂(s'|s, a) and r̂(s, a) using (s, a, s', r)
4: Q-update: Q(s, a) ← Q(s, a) + α E_{s',r}[r + max_{a'} Q(s', a') − Q(s, a)]
5: for K times do
6:   Sample (s, a) ∼ B from buffer of past states and actions
7:   Q-update: Q(s, a) ← Q(s, a) + α E_{s',r}[r + max_{a'} Q(s', a') − Q(s, a)]
Actually, given enough samples to reduce variance, policy gradient is more stable because it does not require multiplying many Jacobians together. However, if our model is inaccurate, the samples we generate from the wrong model will be incorrect, and the errors compound over time. So it would be nice to use such a model-free optimizer while keeping the rolled-out trajectories short. This is essentially what the Dyna algorithm does.
9.2.1 Dyna
Dyna is an online Q-learning algorithm that performs model-free RL with a model. In step 3 of Alg. 24, we update the model and reward function using the observed transition. Then in step 6, we sample some old state-action pairs and apply the learned model to them, so the s' in step 7 are synthetic next states. Intuitively, as the model gets better, the expectation estimate in step 7 also gets more accurate. This algorithm may seem arbitrary in many aspects, but the gist is to keep improving the model and to use the model to improve the Q-function estimate by taking expectations. A minimal sketch is given below.
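Here is a minimal tabular Dyna-Q sketch on a made-up chain MDP. Unlike Alg. 24 it uses a deterministic tabular model and adds a discount factor γ, so treat it as an illustrative variant under those assumptions rather than the exact algorithm above.

import numpy as np

n_states, n_actions, K, alpha, gamma, eps = 6, 2, 10, 0.5, 0.95, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))
model = {}                                    # (s, a) -> (s', r), deterministic model

def env_step(s, a):                           # toy chain: a=1 moves right, a=0 moves left
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == n_states - 1)

s = 0
for _ in range(2000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = env_step(s, a)
    model[(s, a)] = (s_next, r)               # update the (deterministic) model
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])       # real Q-update
    for _ in range(K):                        # planning: K synthetic one-step updates
        ps, pa = list(model.keys())[rng.integers(len(model))]        # sample past (s, a)
        ps_next, pr = model[(ps, pa)]                                # synthetic next state
        Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])
    s = 0 if s_next == n_states - 1 else s_next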
We can also generalize Dyna to see how this kind of general Dyna-style model-based
RL algorithms work. The generalized algorithm is shown in Alg. 25. As shown in Fig.
9.2, we choose some states (orange dots) from the buffer, simulate the next states using the
learned model, and then train model-free RL on synthetic data (s, a, s', r), where s comes from the experience buffer and s' comes from the learned model. One could also take more than one step
if one believes that the model is good enough for more steps.
This algorithm only requires very short (as few as one step) rollouts from model, so the
mistakes will not exacerbate and accumulate much. Moreover, we explore well with a lot of
samples because we still see diverse states.
However, one problem is that the local policies might not be able to be reproduced using
a single neural net. Therefore, after training the global policy with supervised learning, we
need to reoptimize the local policies using the global policy so that the policies are consistent
with each other. The sketch of guided policy search is shown in Alg. 26. Note that the cost
function c̃k,i is the modified cost function to keep πLQR close to πθ .
In Divide-and-Conquer RL, the idea is similar, except that we replace the local LQR controllers with local neural network policies.
9.3.3 Distillation
In RL, we borrow some ideas from supervised learning to achieve the task of learning a global
policy from a bunch of local policies.
Recall that in supervised learning we can use a model ensemble to make our predictions more robust and accurate. However, keeping many models around is expensive at test time. Is there a way to train just one model that behaves as well as the whole ensemble?
The idea, proposed by Hinton in [13], is to train a model on the ensemble’s predictions
as “soft” targets using:
p_i = exp(z_i/T) / Σ_j exp(z_j/T)
where T is called the temperature. The new labels can be intuitively explained using the MNIST dataset. For example, a particular handwritten digit "2" may look like a 2 but also somewhat like a backward 6. Therefore, the soft labels that we use to train the distilled model might be "80% chance of being a 2 and 20% chance of being a 6".
In RL, to achieve multi-task global policy learning, we can use something similar called
policy distillation. The idea is to train a single global policy π_AMN on a collection of local tasks, each with its own expert policy π_{E_i}, by maximizing
L = Σ_a π_{E_i}(a|s) log π_{AMN}(a|s)
A small sketch of this objective follows.
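The snippet below sketches the distillation objective: soften the expert logits with a temperature T and score the student policy π_AMN by the cross-entropy Σ_a π_E(a|s) log π_AMN(a|s). The logits are random placeholders, and the student is scored at T = 1 for simplicity; this is an illustrative assumption, not the exact recipe from [13] or [15].

import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(32, 4))      # expert/local policy over 4 actions
student_logits = rng.normal(size=(32, 4))      # global policy pi_AMN

pi_E = softmax(teacher_logits, T=2.0)          # "soft" targets (temperature > 1)
log_pi_student = np.log(softmax(student_logits))

# Negative of L = sum_a pi_E(a|s) log pi_AMN(a|s), averaged over states;
# minimizing this cross-entropy trains the student to match the soft targets.
loss = -np.mean(np.sum(pi_E * log_pi_student, axis=-1))
print(loss)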
Chapter 10: Variational Inference and Generative Models
In this chapter we are going to explore techniques that allow us to infer latent variables in probabilistic models. We will try to understand the role of latent probabilistic models in deep learning and how to use them.
In RL, we are mostly concerned with conditional distributions p(x|y) because we are
trying to fit a policy function πθ (a|s) which is a probabilistic model of action conditioned on
state.
So what are latent variable models? Suppose we have a very complicated distribution p(x) that cannot be easily modeled by, say, a mixture of Gaussians. By marginalizing over a latent variable z, this complicated distribution can be expressed in terms of two simpler ones:
p(x) = ∫ p(x|z) p(z) dz
p(x|z) and p(z) could be modeled by a conditional Gaussian and a Gaussian, respectively. Since a big enough neural network can represent any function to arbitrary precision, we can use a neural net to represent p(x|z) as p(x|z) = N(µ_nn(z), σ_nn(z)). This is a simple distribution with complicated parameters. Often in practice we do not even learn p(z); we just fix it to a standard Gaussian and let the conditional p(x|z) transform it into an arbitrarily complex distribution through the integral above. The challenge of this approach, however, is that efficiently approximating the integral is quite hard. A small sketch of sampling from such a model is given below.
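This sketch draws ancestral samples from a latent variable model: z ∼ p(z) = N(0, I), then x ∼ p(x|z) = N(µ(z), σ(z)). A tiny random two-layer network stands in for the neural nets µ_nn and σ_nn; all weights here are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 16))
W2_mu, W2_sig = rng.normal(size=(16, 3)), rng.normal(size=(16, 3))

def mu_sigma(z):
    h = np.tanh(z @ W1)
    return h @ W2_mu, np.exp(0.5 * (h @ W2_sig))     # sigma > 0 via exp

# Even though p(z) and p(x|z) are simple Gaussians, the marginal
# p(x) = \int p(x|z) p(z) dz can be highly complex (e.g., multimodal).
z = rng.normal(size=(1000, 2))                        # samples from the prior
mu, sigma = mu_sigma(z)
x = mu + sigma * rng.normal(size=mu.shape)            # samples from p(x)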
In RL, we mainly use latent variable models in the following scenarios. First, we can use conditional latent variable models to represent multimodal policies, as we discussed in imitation learning: feeding latent noise into the policy network lets it express multiple distinct action modes. Second, we can use latent variable models for model-based RL with latent states: essentially, we learn an observation model p(o_t|x_t) and a prior p(x_t) over the latent state.
Note that we eliminated the expectation E_{z∼q_i(z)}[log p(x_i)] because p(x_i) does not depend on z. Since D_KL(q_i(z)||p(z|x_i)) = −L_i(p, q_i) + log p(x_i), maximizing L_i(p, q_i) with respect to q_i minimizes the KL divergence. Now, in our maximum likelihood training, instead of doing θ ← argmax_θ (1/N) Σ_i log p_θ(x_i), we can use the lower bound and do θ ← argmax_θ (1/N) Σ_i L_i(p, q_i) to approximate it. To optimize, for each x_i we estimate ∇_θ L_i(p, q_i) by sampling z ∼ q_i(z); the gradient can be approximated as ∇_θ L_i(p, q_i) ≈ ∇_θ log p_θ(x_i|z), because log p_θ(x_i|z) is the only term in the lower bound that depends on θ. Then we apply gradient ascent on the parameters: θ ← θ + α∇_θ L_i(p, q_i).
However, we also need to update q_i to maximize L_i(p, q_i), since L_i depends on q_i as well (through both the expectation and the entropy H(q_i)). Say q_i(z) = N(µ_i, σ_i); then we can apply gradient ascent on both parameters µ_i and σ_i to update this distribution. The problem is that this update rule is per data point, so the total number of parameters is |θ| + (|µ_i| + |σ_i|)N, where N is the number of data points. Thus, we can change what we learn: use a single, more general neural network to approximate q(z|x_i), such that q(z|x_i) = q_i(z) ≈ p(z|x_i). Now the number of network parameters does not scale with the number of data points.
10.1.2 Amortized Variational Inference
The above idea is called amortized variational inference. When we maximize the likelihood,
instead of using qi for each data point, we use a general neural net qφ , parameterized by φ.
Then when we update qφ , we can just apply gradient ascent on φ by φ ← φ + α∇φ L. The
likelihood can be denoted as Li (pθ (xi |z), qφ (z|xi )).
How do we calculate ∇_φ L? Note that to calculate the gradient of the lower bound with respect to φ, the entropy term's gradient can be computed easily using textbook formulas. However, the first term is harder, because the expectation is taken under a distribution that depends on φ while the term inside the expectation does not depend on φ. Where have we seen this before? We have seen the same type of gradient in policy gradient, and by applying the same convenient identity we get the same form of gradient. If we denote log p_θ(x_i|z) + log p(z) by r(x_i, z) and E_{z∼q_φ(z|x_i)}[r(x_i, z)] by J(φ), then applying the same trick as in policy gradient, we can estimate ∇_φ J(φ) as:
∇_φ J(φ) ≈ (1/M) Σ_j ∇_φ log q_φ(z_j|x_i) r(x_i, z_j)
where the z_j are samples drawn from q_φ(z|x_i).
This estimator, however, tends to have high variance. An alternative is the reparameterization trick: write z = µ_φ(x_i) + ε σ_φ(x_i) with ε ∼ N(0, 1), so that the stochasticity is moved into a noise variable that does not depend on φ. To estimate ∇_φ J(φ), we can then just sample M values of ε from a Gaussian N(0, 1) and differentiate through µ_φ and σ_φ directly.
Using this reparameterization trick, we can derive the expression of Li in another way:
L_i = E_{z∼q_φ(z|x_i)}[log p_θ(x_i|z) + log p(z)] + H(q_φ(z|x_i))
    = E_{z∼q_φ(z|x_i)}[log p_θ(x_i|z)] + E_{z∼q_φ(z|x_i)}[log p(z)] + H(q_φ(z|x_i))
    = E_{z∼q_φ(z|x_i)}[log p_θ(x_i|z)] − D_KL(q_φ(z|x_i)||p(z))
    = E_{ε∼N(0,1)}[log p_θ(x_i|µ_φ(x_i) + ε σ_φ(x_i))] − D_KL(q_φ(z|x_i)||p(z))
    ≈ log p_θ(x_i|µ_φ(x_i) + ε σ_φ(x_i)) − D_KL(q_φ(z|x_i)||p(z))
The complete computational graph for variational inference is shown in Fig. 10.1.
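As a concrete illustration, here is a minimal sketch of the single-sample reparameterized ELBO estimate, assuming toy linear encoder/decoder maps and a unit-variance Gaussian decoder; in practice these are neural networks trained with automatic differentiation, which the reparameterization makes possible.

import numpy as np

rng = np.random.default_rng(0)
x_i = rng.normal(size=4)                              # one data point
W_enc_mu, W_enc_sig = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
W_dec = rng.normal(size=(2, 4))

mu_phi, log_sig_phi = x_i @ W_enc_mu, x_i @ W_enc_sig
sigma_phi = np.exp(log_sig_phi)

eps = rng.normal(size=2)                              # noise sampled outside the network
z = mu_phi + sigma_phi * eps                          # reparameterized latent sample

x_mean = z @ W_dec                                    # decoder p_theta(x|z) = N(x_mean, I)
log_p_x_given_z = -0.5 * np.sum((x_i - x_mean) ** 2) - 0.5 * len(x_i) * np.log(2 * np.pi)

# KL(N(mu, sigma^2) || N(0, 1)) for a diagonal Gaussian, in closed form.
kl = 0.5 * np.sum(sigma_phi ** 2 + mu_phi ** 2 - 1.0 - 2.0 * np.log(sigma_phi))

elbo_estimate = log_p_x_given_z - kl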
Compared with the policy-gradient-style estimator, the reparameterization trick is easy to implement and has low variance, but it only works for continuous latent variables. The policy-gradient-style estimator can handle both discrete and continuous latent variables, but it is subject to high variance and requires multiple samples and small learning rates.
Chapter 11: Control as Inference
In this chapter, we will talk about how to derive optimal control, reinforcement learning, and planning as probabilistic inference. In a lot of scenarios, say those involving biological behaviors, the data is not optimal: the behavior of the agent might be stochastic, but good behaviors are still more likely than bad ones. To capture this, we introduce binary optimality variables O_t and model the probability of acting optimally at time t as p(O_t|s_t, a_t) = exp(r(s_t, a_t)).
This might seem like an arbitrary choice at first sight, but we shall see later that it gives us an elegant mathematical expression in our derivation. We also assume for now that the reward function is always negative (so that the exponential is a valid probability), but we can always take any reward function and subtract its maximum value to make this hold.
Conditioning on optimality, the trajectory distribution becomes p(τ|O_{1:T}) ∝ p(τ) exp(Σ_t r(s_t, a_t)). What does this expression imply? Let us pretend that the dynamics are deterministic; then the first term p(τ) simply indicates whether the trajectory is feasible at all (if not, the probability is 0). If the trajectory is feasible, then since we multiply by the exponential of the sum of rewards, the probability of a trajectory given that the agent is acting optimally is large for high-reward trajectories and small for low-reward ones.
Let us take a look at the optimality model in Fig. 11.1. Why is this model important?
Because the model is able to model suboptimal behavior, which is important for inverse
RL that will be covered later. We then can apply inference algorithms to solve control and
planning problems. It also provides an explanation for why stochastic behavior might be
preferred, which is useful for exploration and transfer learning.
11.1.1 Inference in the Optimality Model
The first inference we will do is to compute the backward message βt (st , at ) = p(Ot:T |st , at ),
which means the probability of the agent being optimal from the current time step to the
end given state and action. Another inference we will do is the policy p(at |st , O1:T ). Note
that we are inferring the possible actions taken given optimality. The last inference we do is
the forward message αt (st ) = p(st |O1:t−1 ), which is the probability of landing in a particular
state given that the agent is acting optimally up to the current time step.
11.1.2 Inferring the Backward Messages
The backward message we are inferring is β_t(s_t, a_t) = p(O_{t:T}|s_t, a_t), which we will express in terms of the transition probability p(s_{t+1}|s_t, a_t) and the optimality probability p(O_t|s_t, a_t). Mathematically, we can calculate β_t(s_t, a_t) as:
β_t(s_t, a_t) = p(O_{t:T}|s_t, a_t) = ∫ p(O_{t+1:T}|s_{t+1}) p(s_{t+1}|s_t, a_t) p(O_t|s_t, a_t) ds_{t+1}
The second and the third terms in the product are known, so let us now focus on the first term:
p(O_{t+1:T}|s_{t+1}) = ∫ p(O_{t+1:T}|s_{t+1}, a_{t+1}) p(a_{t+1}|s_{t+1}) da_{t+1}
                    = ∫ β_{t+1}(s_{t+1}, a_{t+1}) da_{t+1}
Here we dropped p(a_{t+1}|s_{t+1}), which describes which actions are likely a priori; we assume it is uniform (and hence constant) for now.
Therefore, to calculate the backward message, we have a recursive relation. For t =
T − 1 to 1:
βt (st , at ) = p(Ot |st , at )Est+1 ∼p(st+1 |st ,at ) [βt+1 (st+1 )]
βt (st ) = Eat ∼p(at |st ) [βt (st , at )]
Let us define V_t(s_t) = log β_t(s_t) and Q_t(s_t, a_t) = log β_t(s_t, a_t), so that V_t(s_t) = log ∫ exp(Q_t(s_t, a_t)) da_t. This is a "soft" maximum: as Q_t(s_t, a_t) gets bigger, V_t(s_t) → max_{a_t} Q_t(s_t, a_t). Using the expression of β_t(s_t, a_t), we will have
Q_t(s_t, a_t) = r(s_t, a_t) + log E[exp(V_{t+1}(s_{t+1}))]
Recall that in value iteration we set Q(s, a) ← r(s, a) + γE[V(s')]. When the transition is deterministic, we have Q_t(s_t, a_t) = r(s_t, a_t) + V_{t+1}(s_{t+1}), which is similar to value iteration. However, when the transition is stochastic, the log-expectation-of-exponential term acts like a maximum, so we get a biased, optimistic estimate of the Q-function. A numerical sketch of this soft backup is given below.
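The following sketch runs the "soft" backward pass on a small random MDP: Q_t(s, a) = r(s, a) + log E_{s'}[exp(V_{t+1}(s'))] and V_t(s) = log Σ_a exp(Q_t(s, a)). The MDP and horizon are arbitrary stand-ins; this is a direct numerical illustration, not the author's code.

import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))

rng = np.random.default_rng(0)
S, A, T = 5, 3, 10
r = rng.normal(size=(S, A))                           # reward r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))            # transitions p(s'|s, a)

V = np.zeros(S)                                       # V_{T+1} = 0
for t in reversed(range(T)):
    # log E[exp(V(s'))]: note this behaves like an optimistic (soft max) backup
    soft_next = logsumexp(np.log(P) + V[None, None, :], axis=2)
    Q = r + soft_next
    V = logsumexp(Q, axis=1)

policy = np.exp(Q - V[:, None])                       # soft optimal policy pi(a|s) = exp(Q - V)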
11.1.4 Aside: The Action Prior
Recall that we assumed p(a_t|s_t) to be uniform, so it became a constant in our integral. However, not much changes if the action prior is not uniform. Our V function now becomes V_t(s_t) = log ∫ exp(Q_t(s_t, a_t) + log p(a_t|s_t)) da_t, and our Q-function becomes Q_t(s_t, a_t) = r(s_t, a_t) + log p(a_t|s_t) + log E[exp(V_{t+1}(s_{t+1}))]. We can fold the extra log p(a_t|s_t) into the reward term and recover exactly the same expressions for the Q-function and hence the V function. Therefore, a uniform action prior can be assumed without loss of generality, because any prior can always be folded into the reward.
11.1.5 Inferring the Policy
Now with backward messages available to us, we can then proceed to infer the policy
p(at |st , O1:T ). We derive the policy as follows:
π(a_t|s_t) = p(a_t|s_t, O_{1:T})
           = p(a_t|s_t, O_{t:T})
           = p(a_t, s_t|O_{t:T}) / p(s_t|O_{t:T})
           = [p(O_{t:T}|a_t, s_t) p(a_t, s_t) / p(O_{t:T})] / [p(O_{t:T}|s_t) p(s_t) / p(O_{t:T})]
           = [p(O_{t:T}|a_t, s_t) / p(O_{t:T}|s_t)] · [p(a_t, s_t) / p(s_t)]
           = [β_t(s_t, a_t) / β_t(s_t)] · p(a_t|s_t)
Recall that the forward message can be written recursively as α_t(s_t) = ∫ p(s_t|s_{t-1}, a_{t-1}) p(a_{t-1}|s_{t-1}, O_{t-1}) p(s_{t-1}|O_{1:t-1}) ds_{t-1} da_{t-1}. Here we used the fact that the current state is conditionally independent of the past optimality variables given the previous state and action, and that the current action is conditionally independent of the past optimality variables given the current state. The first term is just the dynamics, so we need to figure out what the second and the third terms are, using Bayes' rule:
p(a_{t-1}|s_{t-1}, O_{t-1}) p(s_{t-1}|O_{1:t-1}) = [p(O_{t-1}|s_{t-1}, a_{t-1}) p(a_{t-1}|s_{t-1}) / p(O_{t-1}|s_{t-1})] · [p(O_{t-1}|s_{t-1}) p(s_{t-1}|O_{1:t-2}) / p(O_{t-1}|O_{1:t-2})]
= [p(O_{t-1}|s_{t-1}, a_{t-1}) p(a_{t-1}|s_{t-1}) / p(O_{t-1}|O_{1:t-2})] · α_{t-1}(s_{t-1})
so now we have a recursive relation, and α_1(s_1) = p(s_1) is usually known.
Another byproduct of having this forward message is that we can combine it with the
backward message to calculate the probability of landing in a particular state given optimality
variables:
p(s_t|O_{1:T}) = p(s_t, O_{1:T}) / p(O_{1:T}) = p(O_{t:T}|s_t) p(s_t, O_{1:t-1}) / p(O_{1:T}) ∝ β_t(s_t) p(s_t|O_{1:t-1}) p(O_{1:t-1}) ∝ β_t(s_t) α_t(s_t)
Geometrically, the relation between the state marginal and the product of backward and
forward messages is shown in Fig. 11.2. Here the backward message forms a backward cone,
and the forward message is a forward cone. When we take the product of the two, we are
essentially finding the intersection of the two cones. Intuitively, for a state in a trajectory,
the state marginals are tighter near the beginning and the end, but looser near the center
because the state marginals need to close in at the beginning and the end of a trajectory.
Therefore, to maximize the lower bound, we maximize both the reward and the entropy. Using dynamic programming, we can also get rid of the optimistic maximum in the Bellman backup term.
Chapter 12: Inverse Reinforcement Learning
So far in our RL algorithms, we have been assuming that the reward function is known a
priori, or it is manually designed to define a task. What if we want to learn the reward
function from observing an expert, and then use reinforcement learning? This is the idea of
inverse RL, where we first figure out the reward function and then apply RL.
Why should we worry about learning rewards at all? From the imitation learning perspective, the agent learns by copying the actions performed by the expert, without any reasoning about the outcomes of those actions. However, the natural way that humans learn through imitation is to copy the intent of the expert, which may result in very different actions. Moreover, in RL it is often the case that the reward function is ambiguous or hard to specify; for example, it is hard to hand-design a reward function for autonomous driving.
The inverse RL problem definition is as follows: we try to infer the reward functions
from demonstrations, and then learn to maximize the inferred reward using any RL algorithm
that was covered so far. Formally, in inverse RL, we learn rψ (s, a), and then use it to learn
π*(a|s). However, this is an underspecified problem, because many reward functions can explain the same behavior. The reward function can take many forms. One potential form is a linear reward function, which is a weighted sum of features:
r_ψ(s, a) = Σ_i ψ_i f_i(s, a) = ψ^T f(s, a)
With a linear reward, a natural goal is feature matching: choose ψ so that E_{π^{r_ψ}}[f(s, a)] = E_{π*}[f(s, a)]. The right-hand-side expectation can be estimated using samples from the expert: take N samples of the features and average them. The left-hand-side expectation is a little more involved. One way is to use any RL algorithm to maximize r_ψ, producing π^{r_ψ}, and then use this policy to generate samples; another way is to use dynamic programming if we are given the transitions. To ensure the equality holds in a meaningful way, we borrow some ideas from the support vector machine classifier,
where we maximize the margin between the optimal policy’s rewards and that of any other
policy:
max_{ψ,m} m   s.t.   ψ^T E_{π*}[f(s, a)] ≥ max_{π∈Π} ψ^T E_π[f(s, a)] + m
but we also need to address the similarity between π and π ∗ so that similar policies do not
need to abide by the m margin requirement.
Using the SVM trick (with the use of Lagrangian dual), we can transform the above
optimization into the following which also contains a function that measures the similarity
between policies:
min_ψ (1/2)||ψ||²   s.t.   ψ^T E_{π*}[f(s, a)] ≥ max_{π∈Π} ψ^T E_π[f(s, a)] + D(π, π*)
where D(π, π*) measures the difference in feature expectations. However, such approaches have some issues: maximizing the margin is somewhat arbitrary, and there is no clear model of expert suboptimality (though slack variables can be added). Furthermore, we end up with a messy constrained optimization problem, which is not great for deep learning.
Note that we can ignore p(τ) in our optimization since it does not depend on ψ. We are given sample trajectories {τ_i} drawn from the expert policy π*(τ), so maximum likelihood training can be done using:
max_ψ (1/N) Σ_{i=1}^N log p(τ_i|O_{1:T}, ψ) = max_ψ (1/N) Σ_{i=1}^N r_ψ(τ_i) − log Z
where Z is the partition function needed to make the probability over τ integrate to 1.
12.2.1 Inverse RL Partition Function
In our maximum likelihood training, to make the probability with respect to τ sum to 1, we
introduced the IRL partition function Z. Mathematically, Z is an integral over all possible trajectories:
Z = ∫ p(τ) exp(r_ψ(τ)) dτ
Then we take the gradient of the likelihood with respect to ψ after plugging in Z:
∇_ψ L = (1/N) Σ_{i=1}^N ∇_ψ r_ψ(τ_i) − (1/Z) ∫ p(τ) exp(r_ψ(τ)) ∇_ψ r_ψ(τ) dτ
      = E_{τ∼π*(τ)}[∇_ψ r_ψ(τ)] − E_{τ∼p(τ|O_{1:T},ψ)}[∇_ψ r_ψ(τ)]
The first expectation is estimated with expert samples, and the second expectation is taken under the soft optimal policy for the current reward. Intuitively, the gradient increases the reward of expert trajectories and decreases the reward of trajectories from the current soft optimal policy.
12.2.2 Estimating the Expectation
In the above derivation of the gradient of the likelihood, the first expectation is easy to compute, but the second one is hard. To calculate the second expectation, we need to do some massaging:
E_{τ∼p(τ|O_{1:T},ψ)}[∇_ψ r_ψ(τ)] = E_{τ∼p(τ|O_{1:T},ψ)}[Σ_{t=1}^T ∇_ψ r_ψ(s_t, a_t)]
                                = Σ_{t=1}^T E_{(s_t,a_t)∼p(s_t,a_t|O_{1:T},ψ)}[∇_ψ r_ψ(s_t, a_t)]
Note that the distribution p(st , at |O1:T , ψ) can be rewritten using chain rule as:
p(st , at |O1:T , ψ) = p(at |st , O1:T , ψ)p(st |O1:T , ψ)
where
p(a_t|s_t, O_{1:T}, ψ) = β(s_t, a_t) / β(s_t)
p(s_t|O_{1:T}, ψ) ∝ α(s_t) β(s_t)
Therefore, the distribution is directly proportional to the product of the backward message
and the forward message:
p(at |st , O1:T , ψ)p(st |O1:T , ψ) ∝ β(st , at )α(st )
If we let µt (st , at ) ∝ β(st , at )α(st ), then the second expectation can be written as:
E_{τ∼p(τ|O_{1:T},ψ)}[∇_ψ r_ψ(τ)] = Σ_{t=1}^T ∫∫ µ_t(s_t, a_t) ∇_ψ r_ψ(s_t, a_t) ds_t da_t = Σ_{t=1}^T µ_t^T ∇_ψ r_ψ
where in the last expression µ_t and ∇_ψ r_ψ are treated as vectors indexed by state-action pairs.
We know that the first expectation is easy to calculate by sampling expert data, but the
second expectation which is taken under the soft optimal policy under current reward is hard
to calculate. One idea to calculate it is to learn the entire soft optimal policy p(at |st , O1:T , ψ)
using any max-ent RL algorithm and then run this policy to sample {τj } such that:
∇_ψ L ≈ (1/N) Σ_{i=1}^N ∇_ψ r_ψ(τ_i) − (1/M) Σ_{j=1}^M ∇_ψ r_ψ(τ_j)
where we estimate the second expectation using the current policy samples. However, this is
highly impractical because this requires us to run an RL algorithm to convergence in every
gradient step.
12.3.1 More Efficient Updates
As mentioned above, learning p(a_t|s_t, O_{1:T}, ψ) to convergence in the inner loop of every gradient step is expensive. Therefore, we can relax this objective a little to make it more efficient: instead of fully solving for the policy at each gradient step, we improve the policy a little at each step, so that if the policy keeps getting better, we eventually generate good samples. Now sampling
from this improved distribution is not actually sampling from the distribution we want, which
is p(τ |O1:T , ψ), we are actually getting a biased estimate of the distribution. Therefore, to
resolve this issue, we use importance sampling:
∇_ψ L ≈ (1/N) Σ_{i=1}^N ∇_ψ r_ψ(τ_i) − (1/Σ_j w_j) Σ_{j=1}^M w_j ∇_ψ r_ψ(τ_j)
where the importance weight is w_j = p(τ_j) exp(r_ψ(τ_j)) / π_θ(τ_j). With the importance ratio, each policy update with respect to r_ψ brings us closer to the target distribution.
In this reward update, the demonstrations are made more likely and the policy samples are made less likely. Then we update the initial policy π_θ with respect to r_ψ:
∇_θ L ≈ (1/M) Σ_{j=1}^M ∇_θ log π_θ(τ_j) r_ψ(τ_j)
which in turn changes the policy to make it harder to distinguish from the demos.
This looks a lot like a GAN. In a GAN, we have a generator that takes in some noise z,
and outputs a distribution pθ (x|z). We sample from the generator distribution pθ (x). There is
also demonstration data, for example, the real images, which we sample from its distribution
p∗ (x). There is a discriminator parameterized by ψ that determines if the data generated
by the generator is real: D(x) = pψ (real|x). We update the discriminator parameter by
maximizing the binary log likelihood:
ψ = argmax_ψ (1/N) Σ_{x∼p*} log D_ψ(x) + (1/M) Σ_{x∼p_θ} log(1 − D_ψ(x))
so that the log likelihood of data from the demonstrations is maximized and that of data from the generator is minimized. We also update the generator parameter θ to make its samples look real to the discriminator, e.g. θ = argmax_θ (1/M) Σ_{x∼p_θ} log D_ψ(x).
Therefore, interestingly, we can frame the IRL problem as a GAN. In a GAN, the
optimal discriminator can be defined as:
D*(x) = p*(x) / (p_θ(x) + p*(x))
For inverse RL, the optimal policy approaches πθ (τ ) ∝ p(τ ) exp(rψ (τ )). Choosing the above
optimal parameterization for the discriminator:
D_ψ(τ) = p(τ) (1/Z) exp(r(τ)) / [p_θ(τ) + p(τ) (1/Z) exp(r(τ))]
       = p(τ) (1/Z) exp(r(τ)) / [p(τ) ∏_t π_θ(a_t|s_t) + p(τ) (1/Z) exp(r(τ))]
       = (1/Z) exp(r(τ)) / [∏_t π_θ(a_t|s_t) + (1/Z) exp(r(τ))]
Now we don’t need the importance ratio anymore, because it is subsumed into Z.
We could also use a general discriminator, where D_ψ is just an ordinary binary neural net classifier. This is often simpler to set up and optimize, because there are fewer moving parts. However, the discriminator knows nothing at convergence (it outputs 1/2 everywhere), so we generally cannot recover or re-optimize the reward afterwards. A small sketch of the discriminator update is given below.
Chapter 13: Transfer Learning
This chapter is a high-level overview of transfer learning and multi-task learning techniques.
More to be filled in later.
Chapter 14: Exploration
Consider the multi-armed bandit setting: we have n arms, and we assume there is a true reward distribution for each arm, which we do not know a priori:
r(a_n) ∼ p(r|a_n)
For example, with binary rewards, p(r_i = 1) = θ_i and p(r_i = 0) = 1 − θ_i. We also assume θ_i ∼ p(θ), but we do not know anything else about this distribution.
This actually defines a meta-level POMDP. Our latent state is actually s = [θ1 , . . . , θn ],
which is the true parameterization of the arms’ reward distribution. We also have a belief
state, which is our observation in some sense. The belief state is an estimate of the probability
of getting high reward of each arm:
p̂(θ1 , . . . , θn )
The regret is defined as Reg(T) = T E[r(a*)] − Σ_{t=1}^T r(a_t), where the first term is the reward we could have obtained in hindsight by taking the best action all the way until the end, and the second term is the reward we actually obtained. The difference between the two gives us the regret.
14.1.2 Optimistic Exploration
One simple way to explore is to use an optimistic exploration strategy, where we keep track
of average reward µ̂a for each action a. Naturally, one way to pick an action is the greedy
exploitation:
a = arg max µ̂a
To explore better, we add a bonus to actions we have tried less often, for example
a = argmax_a [µ̂_a + sqrt(2 ln T / N(a))]
where N(a) counts the number of times we have applied action a. Using this strategy, we can obtain a regret bound of O(log T). A small sketch is given below.
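The snippet below runs UCB-style optimistic exploration on a Bernoulli bandit; the true arm probabilities, horizon, and the exact bonus form sqrt(2 ln t / N(a)) are assumptions chosen for illustration.

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.5, 0.7])             # true (unknown) success probabilities
n_arms, T = len(theta), 5000
counts, sums = np.ones(n_arms), np.zeros(n_arms)   # initialize N(a)=1 to avoid /0

for t in range(1, T + 1):
    mu_hat = sums / counts
    bonus = np.sqrt(2.0 * np.log(t) / counts)
    a = int(np.argmax(mu_hat + bonus))        # optimism in the face of uncertainty
    r = float(rng.random() < theta[a])        # Bernoulli reward
    counts[a] += 1.0
    sums[a] += r

print(counts / counts.sum())                  # most pulls should go to the best arm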
14.1.3 Probability Matching
Recall we have a belief state model that represents our own estimate of each arm’s reward
parameterization:
p̂(θ1 , . . . , θn )
We improve the belief state by continually updating it. The idea is to sample (θ_1, ..., θ_n) from the belief distribution, pretend that the sampled model (θ_1, ..., θ_n) is correct, take the corresponding optimal action a = argmax_a E_{θ_a}[r(a)], and then update the belief model with the observed outcome. This is called posterior sampling or Thompson sampling. It is harder to analyze theoretically, but it can work very well empirically; a small sketch follows.
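This sketch implements Thompson sampling for Bernoulli arms with Beta beliefs; the Beta(1, 1) priors and the true arm probabilities are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, 0.5, 0.7])             # true (unknown) success probabilities
alpha, beta = np.ones(3), np.ones(3)          # Beta(1, 1) belief for each arm

for _ in range(5000):
    theta_hat = rng.beta(alpha, beta)         # one sample from the belief state
    a = int(np.argmax(theta_hat))             # pretend the sample is the truth
    r = float(rng.random() < theta[a])
    alpha[a] += r                             # Bayesian update of the belief
    beta[a] += 1.0 - r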
The information gain IG(z, y) = E_y[H(p(z)) − H(p(z|y))] is the expected decrease in the entropy of our belief about z after observing y. Note that we do not know what y will actually be, but we have some knowledge of what it might be, which is why we take the expected value. Typically the information gain also depends on the action, so we write IG(z, y|a): it measures how much we learn about z from taking action a, given the current beliefs.
In our exploration setting, the observation is the observed reward:
y = r(a)
z = θa
We also define another quantity ∆(a) = E[r(a*) − r(a)], which measures the expected suboptimality of a, and we pick the action minimizing ∆(a)²/g(a), where g(a) denotes the information gain of action a. This rule intuitively means that we do not take an action if we are sure that it is suboptimal (large ∆(a)) or if we cannot learn anything from applying it (small g(a)).
We discussed bandit models because bandits are easier to analyze and understand. From them we can derive the foundations for exploration methods and then apply these methods to more complex MDPs. Most exploration strategies require some kind of uncertainty estimation (even if it is naïve), and we usually assign some value to new information: for example, we assume unknown means good (optimism), that a sample is the truth (posterior sampling), or that information gain is good.
A simple recipe for exploration in MDPs is to add a count-based bonus to the reward, r^+(s, a) = r(s, a) + B(N(s)), where the bonus B(N(s)) decreases as the visitation count N(s) increases; we then use r^+(s, a) instead of r(s, a) in any model-free algorithm. This is a simple addition to any RL algorithm, but we need to tune the bonus weight.
14.2.1 Counting the Exploration Bonus
We count the number of times that we have encountered the state s using N (s). However,
in many situations such as video games or autonomous driving, we never actually see the
exact same state twice. Therefore, we need to take the notion of similarity into account: we
count the number of times we have encountered similar states, instead of the same states.
The idea is to fit a density model p_θ(s) (or p_θ(s, a)) to the states we have seen. p_θ(s) is low for very novel states and high for states similar to previously visited ones, even if that exact state has never been seen before. To design this density model, we can take some inspiration
from a simple small MDP. If we have a small MDP, then the density of visiting a state s is
modeled as:
P(s) = N(s) / n
and if we see the same state again, this density becomes:
P'(s) = (N(s) + 1) / (n + 1)
We design our neural net density model to obey the same rule.
We devise a deep pseudo-count procedure to count the states, as shown in Alg. 28. In this procedure, we require the density model to satisfy the same relations as the small-MDP counts:
p_θ(s_i) = N̂(s_i) / n̂
p_θ'(s_i) = (N̂(s_i) + 1) / (n̂ + 1)
These are two equations with two unknowns, and we solve for N̂ and n̂ as follows:
N̂(s_i) = n̂ p_θ(s_i)
n̂ = p_θ(s_i) (1 − p_θ'(s_i)) / (p_θ'(s_i) − p_θ(s_i))
These pseudo-counts are able to count similar states; a small sketch of the computation follows.
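The snippet below converts density-model outputs into a pseudo-count using the two equations above; the density values and the 1/sqrt(N) bonus shape are placeholder assumptions for illustration.

import numpy as np

def pseudo_count(p_before, p_after):
    # p_before = p_theta(s) before updating on s, p_after = p_theta'(s) after.
    n_hat = p_before * (1.0 - p_after) / (p_after - p_before)
    return n_hat * p_before                   # N_hat(s) = n_hat * p_theta(s)

N_hat = pseudo_count(p_before=0.010, p_after=0.012)
bonus = 1.0 / np.sqrt(N_hat + 1e-8)           # one common choice of count-based bonus
print(N_hat, bonus)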
Unfortunately, the information gain in the above settings generally cannot be calculated exactly. Thus, we need a proxy to estimate the information gain.
14.4.1 Prediction Gain
One way to approximate the information gain is to use the prediction gain:
log p_θ'(s) − log p_θ(s)
which is used on state densities. This quantity takes the difference between the density
before and after seeing the state s. Therefore, if the prediction gain is big, then the state s
is novel.
14.4.2 Variational Information Maximization for Exploration
(VIME)
This method to approximate the information gain was first introduced in [16] by Houthooft
et al.. Mathematically, the information gain can be equivalently written in terms of KL
divergence as:
DKL (p(z|y)||p(z))
There is some quantity about the MDP that we want to learn about; in this case, without loss of generality, it is the transition model
p_θ(s_{t+1}|s_t, a_t)
In the parameterization of the information gain, what we want to learn about is the parameter θ of this quantity of interest. z could also be some other distribution that involves θ, such as p_θ(s) or p_θ(r|s, a). The evidence y we observe is the transition. Therefore:
z = θ
y = (s_t, a_t, s_{t+1})
Our information gain in terms of KL divergence can be set up as
D_KL(p(θ|h, s_{t+1}, s_t, a_t) || p(θ|h))
where h is the history of all prior transitions. Intuitively, a transition is more informative if observing it causes the belief over θ to change.
The idea of VIME is to use variational inference to approximate p(θ|h) since maintaining
the whole history h is not feasible. So we use a distribution to approximate the history:
q(θ|φ) ' p(θ|h)
The new distribution is parameterized by φ, so when we observe a new transition, we update φ to get φ'.
As you may recall, we update the parameters by optimizing the variational lower bound, i.e., by minimizing D_KL(q(θ|φ) || p(h|θ)p(θ)), and we represent q(θ|φ) as a product of independent Gaussians over the parameters, with mean φ.
After updating φ to φ', we use D_KL(q(θ|φ')||q(θ|φ)) as the approximate information gain.
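Since q(θ|φ) is a product of independent Gaussians, this KL has a closed form. The sketch below computes it for diagonal Gaussians; the particular means and standard deviations are placeholder values standing in for the variational parameters before and after an update.

import numpy as np

def kl_diag_gaussians(mu_new, sig_new, mu_old, sig_old):
    # D_KL( N(mu_new, sig_new^2) || N(mu_old, sig_old^2) ), summed over dimensions.
    return np.sum(np.log(sig_old / sig_new)
                  + (sig_new ** 2 + (mu_new - mu_old) ** 2) / (2.0 * sig_old ** 2)
                  - 0.5)

mu_old, sig_old = np.zeros(10), np.ones(10)               # belief before the transition
mu_new, sig_new = 0.05 * np.ones(10), 0.95 * np.ones(10)  # belief after updating phi
bonus = kl_diag_gaussians(mu_new, sig_new, mu_old, sig_old)   # exploration bonus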
The trick here is that when we form the sum over samples, we include both our own experience and the demonstrations. This may seem a little odd, because in policy gradients we actually want on-policy data, so why would we include off-policy data? To answer this question, let us build some intuition by looking at the optimal importance sampling distribution. Say we want to estimate E_{p(x)}[f(x)] using importance sampling:
E_{p(x)}[f(x)] ≈ (1/N) Σ_i (p(x_i)/q(x_i)) f(x_i)
It can be shown that the choice q(x) ∝ p(x)|f(x)| gives us the smallest variance. Therefore, by including off-policy demonstration samples, we nudge the importance sampling distribution toward trajectories with higher reward than the current policy, which is closer to the optimal distribution. To construct the sampling distribution, we first need to figure out which distribution the demonstrations come from: we could use supervised behavioral cloning to learn π_demo. If the demonstrations come from multiple distributions, we can instead use a fusion distribution:
q(x) = (1/M) Σ_i q_i(x)
RL is fundamentally an “active” learning paradigm: the agent needs to collect its own dataset
to learn meaningful policies. However, this might be unsafe or expensive in real world
problems (e.g., autonomous driving). Therefore it would be more data-efficient to learn
from a previously collected static dataset, which we call Offline (Batch) RL.
Bibliography
[1] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured
prediction to no-regret online learning,” in Proceedings of the fourteenth international
conference on artificial intelligence and statistics, 2011, pp. 627–635.
[2] P. de Haan, D. Jayaraman, and S. Levine, “Causal confusion in imitation learning,”
arXiv preprint arXiv:1905.11979, 2019.
[3] P. Thomas, “Bias in natural actor-critic algorithms,” in International conference on
machine learning, 2014, pp. 441–448.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and
M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint
arXiv:1312.5602, 2013.
[5] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare, “Safe and efficient off-
policy reinforcement learning,” in Advances in Neural Information Processing Systems,
2016, pp. 1054–1062.
[6] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-learning with model-
based acceleration,” in International Conference on Machine Learning, 2016, pp. 2829–
2838.
[7] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and
D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint
arXiv:1509.02971, 2015.
[8] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy
optimization,” in International conference on machine learning, 2015, pp. 1889–1897.
[9] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen,
S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of monte carlo tree
search methods,” IEEE Transactions on Computational Intelligence and AI in games,
vol. 4, no. 1, pp. 1–43, 2012.
[10] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang, “Deep learning for real-time
atari game play using offline monte-carlo tree search planning,” in Advances in neural
information processing systems, 2014, pp. 3338–3346.
[11] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embed to control: A
locally linear latent dynamics model for control from raw images,” in Advances in neural
information processing systems, 2015, pp. 2746–2754.
[13] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,”
arXiv preprint arXiv:1503.02531, 2015.
[15] E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Actor-mimic: Deep multitask and transfer
reinforcement learning,” arXiv preprint arXiv:1511.06342, 2015.