CS229
Reinforcement learning
• Psa are the state transition probabilities. For each state s ∈ S and
action a ∈ A, Psa is a distribution over the state space. We’ll say more
about this later, but briefly, Psa gives the distribution over what states
we will transition to if we take action a in state s.
Or, when we are writing rewards as a function of the states only, this becomes

$$\mathrm{E}\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots\right].$$

For most of our development, we will use the simpler state-rewards R(s), though the generalization to state-action rewards R(s, a) offers no special difficulties.
This says that the expected sum of discounted rewards V^π(s) for starting in s consists of two terms: first, the immediate reward R(s) that we get right away simply for starting in state s, and second, the expected sum of future discounted rewards. Examining the second term in more detail, we see that the summation term above can be rewritten $\mathrm{E}_{s' \sim P_{s\pi(s)}}[V^\pi(s')]$. This is the expected sum of discounted rewards for starting in state s', where s' is distributed according to $P_{s\pi(s)}$, which is the distribution over where we will end up after taking the first action π(s) in the MDP from state s. Thus, the second term above gives the expected sum of discounted rewards obtained after the first step in the MDP.
Bellman’s equations can be used to efficiently solve for V π . Specifically,
in a finite-state MDP (|S| < ∞), we can write down one such equation for
V π (s) for every state s. This gives us a set of |S| linear equations in |S|
variables (the unknown V π (s)’s, one for each state), which can be efficiently
solved for the V π (s)’s.
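Since V^π(s) = R(s) + γ Σ_{s'} P_{sπ(s)}(s') V^π(s') is linear in the unknowns, the system can be solved in one shot. A minimal sketch in Python, where `P_pi` is the |S|×|S| transition matrix induced by the policy (P_pi[s, s'] = P_{sπ(s)}(s')); the numbers below are illustrative only:

```python
import numpy as np

def evaluate_policy(P_pi, R, gamma):
    """Solve Bellman's equations (I - gamma * P_pi) V = R for V^pi."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

# Tiny 2-state example (made-up transition probabilities and rewards).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
R = np.array([0.0, 1.0])
V = evaluate_policy(P_pi, R, gamma=0.9)
```

The returned vector V satisfies the |S| Bellman equations simultaneously, which is exactly the policy-evaluation step used inside policy iteration.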
¹This notation in which we condition on π isn't technically correct because π isn't a random variable, but this is quite standard in the literature.
In other words, this is the best possible expected sum of discounted rewards
that can be attained using any policy. There is also a version of Bellman’s
equations for the optimal value function:
$$V^*(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^*(s'). \tag{15.2}$$
The first term above is the immediate reward as before. The second term is the maximum over all actions a of the expected future sum of discounted rewards we'll get after taking action a. You should make sure you understand this equation and see why it makes sense.
We also define a policy π∗ : S → A as follows:
$$\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s'). \tag{15.3}$$
Note that π ∗ (s) gives the action a that attains the maximum in the “max”
in Equation (15.2).
It is a fact that for every state s and every policy π, we have

$$V^*(s) = V^{\pi^*}(s) \geq V^\pi(s).$$

The first equality says that $V^{\pi^*}$, the value function for π∗, is equal to the optimal value function V∗ for every state s. Further, the inequality above says that π∗'s value is at least as large as the value of any other policy. In other words, π∗ as defined in Equation (15.3) is the optimal policy.
Note that π∗ has the interesting property that it is the optimal policy for all states s. Specifically, it is not the case that if we were starting in some state s then there'd be some optimal policy for that state, and if we were starting in some other state s' then there'd be some other policy that's optimal for s'. The same policy π∗ attains the maximum in Equation (15.1) for all states s. This means that we can use the same policy π∗ no matter what the initial state of our MDP is.
For now, we consider only MDPs with finite state and action spaces (|S| < ∞, |A| < ∞). In this section, we will also assume that we know the state transition probabilities {Psa} and the reward function R.
The first algorithm, value iteration, is as follows:

1. Initialize V(s) := 0 for every state s.
2. Repeat until convergence {
       For every state, update $V(s) := R(s) + \gamma \max_{a \in A} \sum_{s'} P_{sa}(s') V(s')$.
   }
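The value iteration update can be sketched in code as follows. This is a minimal illustration (not the notes' own pseudocode): `P[a]` is assumed to be the |S|×|S| matrix whose (s, s') entry is P_sa(s'), and all numbers are hypothetical.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Synchronous value iteration for a finite MDP.

    P: list of |S| x |S| transition matrices, one per action.
    R: length-|S| reward vector. Returns the (approximate) V*.
    """
    n = len(R)
    V = np.zeros(n)  # initialize V(s) := 0 for every state
    while True:
        # Bellman backup: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
        Q = np.array([R + gamma * P[a] @ V for a in range(len(P))])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Because the Bellman backup is a γ-contraction, this loop converges to V* regardless of the initialization.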
Both value iteration and policy iteration are standard algorithms for solving MDPs, and there isn't currently universal agreement over which algorithm is better. For small MDPs, policy iteration is often very fast and converges with very few iterations. However, for MDPs with large state
spaces, solving for V π explicitly would involve solving a large system of lin-
ear equations, and could be difficult (and note that one has to solve the
linear system multiple times in policy iteration). In these problems, value
iteration may be preferred. For this reason, in practice value iteration seems
to be used more often than policy iteration. For some more discussions on
the comparison and connection of value iteration and policy iteration, please
see Section 15.5.
As in problem set 4, suppose we had a number of trials in the MDP that proceeded as follows:
$$s_0^{(1)} \xrightarrow{a_0^{(1)}} s_1^{(1)} \xrightarrow{a_1^{(1)}} s_2^{(1)} \xrightarrow{a_2^{(1)}} s_3^{(1)} \xrightarrow{a_3^{(1)}} \cdots$$
$$s_0^{(2)} \xrightarrow{a_0^{(2)}} s_1^{(2)} \xrightarrow{a_1^{(2)}} s_2^{(2)} \xrightarrow{a_2^{(2)}} s_3^{(2)} \xrightarrow{a_3^{(2)}} \cdots$$
$$\cdots$$
Here, $s_i^{(j)}$ is the state we were in at time i of trial j, and $a_i^{(j)}$ is the corresponding action that was taken from that state. In practice, each of the
trials above might be run until the MDP terminates (such as if the pole falls
over in the inverted pendulum problem), or it might be run for some large
but finite number of timesteps.
Given this “experience” in the MDP consisting of a number of trials,
we can then easily derive the maximum likelihood estimates for the state
transition probabilities:
$$P_{sa}(s') = \frac{\#\text{times we took action } a \text{ in state } s \text{ and got to } s'}{\#\text{times we took action } a \text{ in state } s} \tag{15.5}$$
Or, if the ratio above is "0/0"—corresponding to the case of never having taken action a in state s before—then we might simply estimate P_{sa}(s') to be 1/|S|. (I.e., estimate P_{sa} to be the uniform distribution over all states.)
Note that, if we gain more experience (observe more trials) in the MDP,
there is an efficient way to update our estimated state transition probabilities
using the new experience. Specifically, if we keep around the counts for both
the numerator and denominator terms of (15.5), then as we observe more
trials, we can simply keep accumulating those counts. Computing the ratio of these counts then gives our estimate of P_{sa}.
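Keeping the numerator and denominator counts explicitly makes this incremental update trivial. A small sketch, where states and actions are arbitrary hashable labels (the class and its names are illustrative, not from the notes):

```python
from collections import defaultdict

class TransitionModel:
    """Maximum-likelihood estimate of P_sa(s') via counts, per Eq. (15.5)."""

    def __init__(self, states):
        self.states = list(states)
        self.counts = defaultdict(int)  # (s, a, s') -> numerator count
        self.totals = defaultdict(int)  # (s, a)     -> denominator count

    def observe(self, s, a, s_next):
        """Fold one observed transition into the running counts."""
        self.counts[(s, a, s_next)] += 1
        self.totals[(s, a)] += 1

    def P(self, s, a, s_next):
        """Estimated P_sa(s'); uniform over states in the "0/0" case."""
        if self.totals[(s, a)] == 0:
            return 1.0 / len(self.states)
        return self.counts[(s, a, s_next)] / self.totals[(s, a)]
```

Observing a new trial simply means calling `observe` on each of its transitions; no recomputation from scratch is needed.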
Using a similar procedure, if R is unknown, we can also pick our estimate
of the expected immediate reward R(s) in state s to be the average reward
observed in state s.
Having learned a model for the MDP, we can then use either value it-
eration or policy iteration to solve the MDP using the estimated transition
probabilities and rewards. For example, putting together model learning and
value iteration, here is one possible algorithm for learning in an MDP with
unknown state transition probabilities:
1. Initialize π randomly.
2. Repeat {
(a) Execute π in the MDP for some number of trials.
(b) Using the accumulated experience in the MDP, update our esti-
mates for Psa (and R, if applicable).
(c) Apply value iteration with the estimated state transition probabil-
ities and rewards to get a new estimated value function V .
(d) Update π to be the greedy policy with respect to V.

}
We note that, for this particular algorithm, there is one simple optimiza-
tion that can make it run much more quickly. Specifically, in the inner loop
of the algorithm where we apply value iteration, if instead of initializing value
iteration with V = 0, we initialize it with the solution found during the pre-
vious iteration of our algorithm, then that will provide value iteration with
a much better initial starting point and make it converge more quickly.
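The warm-start optimization can be sketched as follows. Here `value_iteration_warm` is a hypothetical helper that takes the previous outer iteration's V as its starting point; the model estimates `P_hat` (one matrix per action) and `R_hat` are assumed to come from the counts of step (b).

```python
import numpy as np

def value_iteration_warm(P_hat, R_hat, gamma, V_init, tol=1e-8):
    """Value iteration initialized at V_init instead of zero.

    Starting from the previous solution typically needs far fewer
    Bellman backups to reach the tolerance than starting from V = 0.
    """
    V = V_init.copy()  # warm start from the previous iteration's solution
    while True:
        V_new = R_hat + gamma * np.max(
            [P_hat[a] @ V for a in range(len(P_hat))], axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Since value iteration converges to the same fixed point from any initialization, the warm start changes only the number of inner iterations, not the answer.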
15.4.1 Discretization
Perhaps the simplest way to solve a continuous-state MDP is to discretize
the state space, and then to use an algorithm like value iteration or policy
iteration, as described previously.
For example, if we have a two-dimensional state (s1, s2), we can use a grid to discretize
the state space:
³Technically, θ is an orientation and so the range of θ is better written θ ∈ [−π, π) than θ ∈ R; but for our purposes, this distinction is not important.
[Figure: a grid discretizing the two-dimensional state space.]
Here, each grid cell represents a separate discrete state s̄. We can
then approximate the continuous-state MDP via a discrete-state one
(S̄, A, {Ps̄a }, γ, R), where S̄ is the set of discrete states, {Ps̄a } are our state
transition probabilities over the discrete states, and so on. We can then use
value iteration or policy iteration to solve for the V ∗ (s̄) and π ∗ (s̄) in the
discrete state MDP (S̄, A, {Ps̄a }, γ, R). When our actual system is in some
continuous-valued state s ∈ S and we need to pick an action to execute, we
compute the corresponding discretized state s̄, and execute action π ∗ (s̄).
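Executing the discretized policy only requires mapping a continuous state to its grid cell. A minimal sketch for a 2d state space; the grid bounds `lo`, `hi` and resolution `k` are illustrative assumptions, not values from the notes:

```python
import numpy as np

# Hypothetical grid over [0,1] x [0,1] with k cells per dimension.
lo, hi, k = np.array([0.0, 0.0]), np.array([1.0, 1.0]), 10

def discretize(s):
    """Return the index s-bar of the grid cell containing continuous state s."""
    # Scale into cell coordinates, clipping states that fall outside the grid.
    idx = np.clip(((s - lo) / (hi - lo) * k).astype(int), 0, k - 1)
    return idx[0] * k + idx[1]  # flatten (i, j) to a single discrete state id
```

At run time one would look up the precomputed π∗(s̄) for the returned cell index and execute that action.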
This discretization approach can work well for many problems. However,
there are two downsides. First, it uses a fairly naive representation for V ∗
(and π∗). Specifically, it assumes that the value function takes a constant value over each of the discretization intervals (i.e., that the value function is piecewise constant in each of the gridcells).
To better understand the limitations of such a representation, consider a
supervised learning problem of fitting a function to this dataset:
[Figure: a dataset of (x, y) points, x ∈ [1, 8], to which we wish to fit a function.]

[Figure: a function fit to the same dataset.]
There are several ways that one can get such a model. One is to use
physics simulation. For example, the simulator for the inverted pendulum
in PS4 was obtained by using the laws of physics to calculate what position
and orientation the cart/pole will be in at time t + 1, given the current state
at time t and the action a taken, assuming that we know all the parameters
of the system such as the length of the pole, the mass of the pole, and so
on. Alternatively, one can also use an off-the-shelf physics simulation software
package which takes as input a complete physical description of a mechanical
system, the current state st and action at , and computes the state st+1 of the
system a small fraction of a second into the future.4
An alternative way to get a model is to learn one from data collected in
the MDP. For example, suppose we execute n trials in which we repeatedly
take actions in an MDP, each trial for T timesteps. This can be done by picking
actions at random, executing some specific policy, or via some other way of choosing actions. We would then observe n state sequences like the following:

⁴Open Dynamics Engine (http://www.ode.com) is one example of a free/open-source physics simulator that can be used to simulate systems like the inverted pendulum, and that has been a reasonably popular choice among RL researchers.
$$s_0^{(1)} \xrightarrow{a_0^{(1)}} s_1^{(1)} \xrightarrow{a_1^{(1)}} s_2^{(1)} \xrightarrow{a_2^{(1)}} \cdots \xrightarrow{a_{T-1}^{(1)}} s_T^{(1)}$$
$$s_0^{(2)} \xrightarrow{a_0^{(2)}} s_1^{(2)} \xrightarrow{a_1^{(2)}} s_2^{(2)} \xrightarrow{a_2^{(2)}} \cdots \xrightarrow{a_{T-1}^{(2)}} s_T^{(2)}$$
$$\cdots$$
$$s_0^{(n)} \xrightarrow{a_0^{(n)}} s_1^{(n)} \xrightarrow{a_1^{(n)}} s_2^{(n)} \xrightarrow{a_2^{(n)}} \cdots \xrightarrow{a_{T-1}^{(n)}} s_T^{(n)}$$
We can then apply a learning algorithm to predict st+1 as a function of st
and at .
For example, one may choose to learn a linear model of the form
st+1 = Ast + Bat , (15.6)
using an algorithm similar to linear regression. Here, the parameters of the
model are the matrices A and B, and we can estimate them using the data
collected from our n trials, by picking
$$\arg\min_{A,B} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \left\| s_{t+1}^{(i)} - \left( A s_t^{(i)} + B a_t^{(i)} \right) \right\|_2^2.$$
We could also potentially use other loss functions for learning the model. For example, it has been found in recent work [Luo et al., 2018] that using the $\|\cdot\|_2$ norm (without the square) may be helpful in certain cases.
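The squared-loss problem above has a closed-form least-squares solution. A sketch using `np.linalg.lstsq`, stacking each [s_t, a_t] pair into one design matrix; the function name and data dimensions are illustrative:

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Fit s_{t+1} ~ A s_t + B a_t by least squares.

    states:      (N, d_s) array of s_t across all trials/timesteps
    actions:     (N, d_a) array of a_t
    next_states: (N, d_s) array of s_{t+1}
    """
    X = np.hstack([states, actions])  # each row is [s_t, a_t]
    # Solve min ||X W - next_states||^2; W stacks [A; B] transposed.
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    d_s = states.shape[1]
    A, B = W[:d_s].T, W[d_s:].T
    return A, B
```

Fitting A and B jointly this way is exactly linear regression with the concatenated state-action vector as the input features.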
Having learned A and B, one option is to build a deterministic model,
in which given an input st and at , the output st+1 is exactly determined.
Specifically, we always compute st+1 according to Equation (15.6). Alter-
natively, we may also build a stochastic model, in which st+1 is a random
function of the inputs, by modeling it as
$$s_{t+1} = A s_t + B a_t + \epsilon_t,$$

where here $\epsilon_t$ is a noise term, usually modeled as $\epsilon_t \sim \mathcal{N}(0, \Sigma)$. (The covariance matrix Σ can also be estimated from data in a straightforward way.)
Here, we’ve written the next-state st+1 as a linear function of the current
state and action; but of course, non-linear functions are also possible. Specif-
ically, one can learn a model st+1 = Aφs (st ) + Bφa (at ), where φs and φa are
some non-linear feature mappings of the states and actions. Alternatively,
one can also use non-linear learning algorithms, such as locally weighted lin-
ear regression, to learn to estimate st+1 as a function of st and at . These
approaches can also be used to build either deterministic or stochastic sim-
ulators of an MDP.
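For the stochastic variant, one straightforward way to estimate Σ is from the residuals of the fitted linear model; a new state is then sampled by adding Gaussian noise to the deterministic prediction. A sketch, assuming A and B have already been fit (e.g. by least squares); all names are illustrative:

```python
import numpy as np

def estimate_sigma(A, B, states, actions, next_states):
    """Empirical covariance of the residuals eps_t = s_{t+1} - A s_t - B a_t."""
    resid = next_states - states @ A.T - actions @ B.T
    return np.cov(resid.T)

def sample_next_state(A, B, Sigma, s, a, rng):
    """Stochastic model: s_{t+1} = A s_t + B a_t + eps, eps ~ N(0, Sigma)."""
    eps = rng.multivariate_normal(np.zeros(len(s)), Sigma)
    return A @ s + B @ a + eps
```

Setting Σ = 0 (or equivalently skipping the noise draw) recovers the deterministic simulator.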
(In Section 15.2, we had written the value iteration update with a summation $V(s) := R(s) + \gamma \max_a \sum_{s'} P_{sa}(s') V(s')$ rather than an integral over states; the new notation reflects that we are now working in continuous states rather than discrete states.)
The main idea of fitted value iteration is that we are going to approxi-
mately carry out this step, over a finite sample of states s(1) , . . . , s(n) . Specif-
ically, we will use a supervised learning algorithm—linear regression in our
description below—to approximate the value function as a linear or non-linear
function of the states:
$$V(s) = \theta^T \phi(s).$$
Here, φ is some appropriate feature mapping of the states.
For each state $s^{(i)}$ in our finite sample of n states, fitted value iteration will first compute a quantity $y^{(i)}$, which will be our approximation to $R(s^{(i)}) + \gamma \max_a \mathrm{E}_{s' \sim P_{s^{(i)}a}}[V(s')]$ (the right hand side of Equation 15.8). Then, it will apply a supervised learning algorithm to try to get $V(s^{(i)})$ close to $R(s^{(i)}) + \gamma \max_a \mathrm{E}_{s' \sim P_{s^{(i)}a}}[V(s')]$ (or, in other words, to try to get $V(s^{(i)})$ close to $y^{(i)}$).
In detail, the algorithm is as follows:
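The algorithm box itself is not reproduced in this excerpt; a heavily simplified sketch is given below. Here `simulate(s, a, rng)` is an assumed model that samples s' ∼ P_sa, and `states`, `actions`, `phi`, and `R` are hypothetical stand-ins for the sampled states, action set, feature mapping, and reward function from the text.

```python
import numpy as np

def fitted_value_iteration(states, actions, phi, R, simulate, gamma,
                           k=10, iters=50, rng=None):
    """Fitted value iteration with linear approximation V(s) = theta^T phi(s)."""
    rng = rng or np.random.default_rng(0)
    Phi = np.array([phi(s) for s in states])  # feature matrix over the sample
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = []
        for s in states:
            # q(a) approximates R(s) + gamma * E_{s'~P_sa}[V(s')] via k samples
            q = [R(s) + gamma * np.mean([phi(simulate(s, a, rng)) @ theta
                                         for _ in range(k)])
                 for a in actions]
            y.append(max(q))  # y^(i) = max_a q(a)
        # Supervised (regression) step: fit theta so V(s^(i)) is close to y^(i).
        theta, *_ = np.linalg.lstsq(Phi, np.array(y), rcond=None)
    return theta
```

With a deterministic simulator, k = 1 suffices, since a single sample computes the inner expectation exactly.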
Above, we had written out fitted value iteration using linear regression
as the algorithm to try to make V (s(i) ) close to y (i) . That step of the algo-
rithm is completely analogous to a standard supervised learning (regression)
problem in which we have a training set (x(1) , y (1) ), (x(2) , y (2) ), . . . , (x(n) , y (n) ),
and want to learn a function mapping from x to y; the only difference is that
here s plays the role of x. Even though our description above used linear re-
gression, clearly other regression algorithms (such as locally weighted linear
regression) can also be used.
Unlike value iteration over a discrete set of states, fitted value iteration cannot, in general, be proved to converge. However, in practice, it often does converge (or approximately converge), and works well for many problems.
Note also that if we are using a deterministic simulator/model of the MDP,
then fitted value iteration can be simplified by setting k = 1 in the algorithm.
This is because the expectation in Equation (15.8) becomes an expectation
over a deterministic distribution, and so a single example is sufficient to
exactly compute that expectation. Otherwise, in the algorithm above, we
had to draw k samples, and average to try to approximate that expectation
(see the definition of q(a), in the algorithm pseudo-code).
In other words, here we are just setting $\epsilon_t = 0$ (i.e., ignoring the noise in the simulator), and setting k = 1. Equivalently, this can be derived from Equation (15.9) using the approximation

$$\mathrm{E}_{s'}[V(s')] \approx V(\mathrm{E}_{s'}[s']) = V(f(s, a)),$$

where here the expectation is over the random $s' \sim P_{sa}$. So long as the noise terms $\epsilon_t$ are small, this will usually be a reasonable approximation.
However, for problems that don’t lend themselves to such approximations,
having to sample k|A| states using the model, in order to approximate the
expectation above, can be computationally expensive.
5:     return V

Require: hyperparameter k.
6: Initialize π randomly.
7: Repeat until convergence {
8:     Let V = VE(π, k).
9:     For each state s, let
$$\pi(s) := \arg\max_{a \in A} \sum_{s'} P_{sa}(s') V(s'). \tag{15.13}$$
}