CS229 Lecture notes

Tengyu Ma

Part XV
Policy Gradient
(REINFORCE)
We will present a model-free algorithm called REINFORCE that does not
require the notion of value functions or Q-functions. It turns out to be more
convenient to introduce REINFORCE in the finite-horizon case, which will
be assumed throughout this note: we use τ = (s_0, a_0, ..., s_{T−1}, a_{T−1}, s_T) to
denote a trajectory, where T < ∞ is the length of the trajectory. Moreover,
REINFORCE only applies to learning a randomized policy. We use π_θ(a|s)
to denote the probability of the policy π_θ outputting the action a at state s.
The other notation is the same as in the previous lecture notes.
The advantage of applying REINFORCE is that we only need to assume
that we can sample from the transition probabilities {P_sa} and can query the
reward function R(s, a) at state s and action a,¹ but we do not need to know
the analytical form of the transition probabilities or the reward function.
We do not explicitly learn the transition probabilities or the reward function
either.
Let s_0 be sampled from some distribution μ. We consider optimizing the
expected total payoff of the policy π_θ over the parameter θ, defined as

    η(θ) ≜ E[ ∑_{t=0}^{T−1} γ^t R(s_t, a_t) ]                                   (1)

Recall that s_t ∼ P_{s_{t−1} a_{t−1}} and a_t ∼ π_θ(·|s_t). Also note that
η(θ) = E_{s_0∼μ}[V^{π_θ}(s_0)] if we ignore the difference between the finite and
infinite horizon.
¹ In these notes we work with the general setting where the reward depends on both
the state and the action.

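To make the sampling assumptions above concrete, here is a minimal Python sketch of estimating η(θ) by Monte Carlo. The environment interface (env_reset for sampling s_0 ∼ μ, env_step for sampling P_{sa} and querying R(s, a)) and the policy sampler sample_action are hypothetical names introduced only for illustration; they are not defined in these notes.

    import numpy as np

    def rollout(theta, env_reset, env_step, sample_action, T, gamma):
        """Run pi_theta for T steps and return the discounted payoff sum_t gamma^t R(s_t, a_t)."""
        s = env_reset()                      # s_0 ~ mu
        payoff = 0.0
        for t in range(T):
            a = sample_action(theta, s)      # a_t ~ pi_theta(. | s_t)
            s_next, r = env_step(s, a)       # s_{t+1} ~ P_{s_t a_t}, r = R(s_t, a_t)
            payoff += gamma**t * r
            s = s_next
        return payoff

    def estimate_eta(theta, env_reset, env_step, sample_action, T, gamma, n=1000):
        """Monte Carlo estimate of eta(theta) from n independent rollouts."""
        return np.mean([rollout(theta, env_reset, env_step, sample_action, T, gamma)
                        for _ in range(n)])

Note that this only requires the ability to sample transitions and query rewards, exactly the access assumed above.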

We aim to use gradient ascent to maximize η(θ). The main challenge
we face here is to compute (or estimate) the gradient of η(θ) without
knowledge of the form of the reward function and the transition probabilities.

Let P_θ(τ) denote the distribution of τ (generated by the policy π_θ), and
let f(τ) = ∑_{t=0}^{T−1} γ^t R(s_t, a_t). We can rewrite η(θ) as

    η(θ) = E_{τ∼P_θ}[f(τ)]                                                      (2)

We faced a similar situation in the variational auto-encoder (VAE) setting
covered in the previous lectures, where we needed to take the gradient with
respect to a variable that shows up under the expectation: the distribution P_θ
depends on θ. Recall that in the VAE setting we used the re-parametrization
technique to address this problem. However, it does not apply here because we
do not know how to compute the gradient of the function f. (We only have
an efficient way to evaluate the function f by taking a weighted sum of the
observed rewards, but we do not necessarily know the reward function itself,
so we cannot compute its gradient.)
The REINFORCE algorithm uses another approach to estimate the
gradient of η(θ). We start with the following derivation:

    ∇_θ E_{τ∼P_θ}[f(τ)] = ∇_θ ∫ P_θ(τ) f(τ) dτ
                        = ∫ ∇_θ (P_θ(τ) f(τ)) dτ              (swap integration with gradient)
                        = ∫ (∇_θ P_θ(τ)) f(τ) dτ              (because f does not depend on θ)
                        = ∫ P_θ(τ) (∇_θ log P_θ(τ)) f(τ) dτ   (because ∇ log P_θ(τ) = ∇P_θ(τ) / P_θ(τ))
                        = E_{τ∼P_θ}[(∇_θ log P_θ(τ)) f(τ)]                       (3)

Now we have a sample-based estimator for ∇_θ E_{τ∼P_θ}[f(τ)]. Let τ^(1), ..., τ^(n)
be n empirical samples from P_θ (which are obtained by running the policy
π_θ n times, with T steps for each run). We can estimate the gradient of
η(θ) by

    ∇_θ E_{τ∼P_θ}[f(τ)] = E_{τ∼P_θ}[(∇_θ log P_θ(τ)) f(τ)]                       (4)
                        ≈ (1/n) ∑_{i=1}^{n} (∇_θ log P_θ(τ^(i))) f(τ^(i))        (5)
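The identity (3) is the general score-function (likelihood-ratio) trick and can be sanity-checked outside the RL setting. The sketch below uses a one-dimensional Gaussian family p_θ = N(θ, 1) and an arbitrary test function f (both are illustrative choices, not part of these notes), and compares the sample average of (∇_θ log p_θ(x)) f(x) against a finite-difference estimate of ∇_θ E[f(x)].

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        return x**2 + np.sin(x)             # any function we can only evaluate pointwise

    def score_function_estimate(theta, n=200_000):
        """Estimate d/dtheta E_{x ~ N(theta,1)}[f(x)] via E[(d/dtheta log p_theta(x)) f(x)]."""
        x = rng.normal(theta, 1.0, size=n)
        score = x - theta                   # d/dtheta log N(x; theta, 1) = x - theta
        return np.mean(score * f(x))

    def finite_difference(theta, eps=1e-3, n=200_000):
        z = rng.normal(0.0, 1.0, size=n)    # common random numbers for both evaluations
        return (np.mean(f(theta + eps + z)) - np.mean(f(theta - eps + z))) / (2 * eps)

    print(score_function_estimate(1.0), finite_difference(1.0))   # the two estimates should roughly agree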

The next question is how to compute log P_θ(τ). We derive an analytical
formula for log P_θ(τ) and compute its gradient w.r.t. θ (using auto-
differentiation). Using the definition of τ, we have

    P_θ(τ) = μ(s_0) π_θ(a_0|s_0) P_{s_0 a_0}(s_1) π_θ(a_1|s_1) P_{s_1 a_1}(s_2) ⋯ P_{s_{T−1} a_{T−1}}(s_T)    (6)

Here recall that μ is used to denote the density of the distribution of s_0. It
follows that

    log P_θ(τ) = log μ(s_0) + log π_θ(a_0|s_0) + log P_{s_0 a_0}(s_1) + log π_θ(a_1|s_1)
                 + log P_{s_1 a_1}(s_2) + ⋯ + log P_{s_{T−1} a_{T−1}}(s_T)       (7)

Taking the gradient w.r.t. θ, we obtain

    ∇_θ log P_θ(τ) = ∇_θ log π_θ(a_0|s_0) + ∇_θ log π_θ(a_1|s_1) + ⋯ + ∇_θ log π_θ(a_{T−1}|s_{T−1})

Note that many of the terms disappear because they do not depend on θ and
thus have zero gradient. (This is somewhat important: we do not know how
to evaluate terms such as log P_{s_0 a_0}(s_1) because we do not have access to
the transition probabilities, but luckily those terms have zero gradient!)
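For a concrete policy class, ∇_θ log π_θ(a_t|s_t) can also be written in closed form. The sketch below assumes a softmax policy π_θ(a|s) ∝ exp(θ_a · φ(s)) over a finite action set with some feature map φ; this particular parametrization and the names used are illustrative choices, not something fixed by these notes.

    import numpy as np

    def softmax(z):
        z = z - np.max(z)                   # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def grad_log_pi(theta, phi_s, a):
        """Gradient of log pi_theta(a|s) for a softmax policy with logits theta @ phi(s).

        theta: (num_actions, d) parameter matrix; phi_s: (d,) feature vector of s.
        Returns the (num_actions, d) matrix (e_a - pi_theta(.|s)) outer phi(s).
        """
        probs = softmax(theta @ phi_s)
        indicator = np.zeros_like(probs)
        indicator[a] = 1.0
        return np.outer(indicator - probs, phi_s)

    def grad_log_P_tau(theta, trajectory, phi):
        """Sum of grad log pi_theta(a_t|s_t) over a trajectory [(s_0, a_0), ..., (s_{T-1}, a_{T-1})].

        The mu and transition terms in log P_theta(tau) drop out because they do not depend on theta."""
        return sum(grad_log_pi(theta, phi(s), a) for s, a in trajectory)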
Plugging this expression for ∇_θ log P_θ(τ) into equation (4), we conclude that

    ∇_θ η(θ) = ∇_θ E_{τ∼P_θ}[f(τ)] = E_{τ∼P_θ}[ ( ∑_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) ) · f(τ) ]
             = E_{τ∼P_θ}[ ( ∑_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) ) · ( ∑_{t=0}^{T−1} γ^t R(s_t, a_t) ) ]    (8)

We estimate the RHS of the equation above by empirical sample trajectories,
and the estimate is unbiased. The vanilla REINFORCE algorithm iteratively
updates the parameter by gradient ascent using the estimated gradients.
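Combining the empirical estimator (5) with formula (8), one gradient-ascent iteration of vanilla REINFORCE might look like the following sketch. It reuses grad_log_pi from the previous snippet and assumes a hypothetical sample_trajectory(theta) that runs π_θ for T steps and returns a list of (s_t, a_t, r_t) triples with r_t = R(s_t, a_t); the values of n and lr are arbitrary illustrative choices.

    import numpy as np

    def reinforce_step(theta, sample_trajectory, phi, gamma, n=50, lr=1e-2):
        """One gradient-ascent step on eta(theta) using the estimator in (5) and (8)."""
        grad = np.zeros_like(theta)
        for _ in range(n):
            traj = sample_trajectory(theta)                                   # one rollout of pi_theta
            f_tau = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))     # f(tau) = sum_t gamma^t R(s_t, a_t)
            score = sum(grad_log_pi(theta, phi(s), a) for s, a, _ in traj)    # grad_theta log P_theta(tau)
            grad += score * f_tau
        grad /= n                                                             # Monte Carlo average, as in (5)
        return theta + lr * grad                                              # gradient ascent on eta(theta)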

Interpretation of the policy gradient formula (8). The quantity
∇_θ log P_θ(τ) = ∑_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) is intuitively the direction of the change
of θ that will make the trajectory τ more likely to occur (or increase the
probability of choosing the actions a_0, ..., a_{T−1}), and f(τ) is the total payoff of
this trajectory. Thus, by taking a gradient step, intuitively we are trying to
improve the likelihood of all the trajectories, but with a different emphasis
or weight for each τ (or for each set of actions a_0, a_1, ..., a_{T−1}). If τ is very
rewarding (that is, f(τ) is large), we try very hard to move in the direction
that can increase the probability of the trajectory τ (or the direction that
increases the probability of choosing a_0, ..., a_{T−1}), and if τ has low payoff,
we try less hard, with a smaller weight.
An interesting fact that follows from formula (3) is that

    E_{τ∼P_θ}[ ∑_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) ] = 0                          (9)

To see this, take the reward to be identically 1, so that the payoff f(τ) is always
the fixed constant ∑_{t=0}^{T−1} γ^t. Then the LHS of (8) is zero because η(θ) does
not depend on θ, so the RHS of (8) is also zero, which implies (9).
In fact, one can verify that E_{a_t∼π_θ(·|s_t)}[∇_θ log π_θ(a_t|s_t)] = 0 for any fixed t
and s_t.² This fact has two consequences. First, we can simplify formula (8)
to
    ∇_θ η(θ) = ∑_{t=0}^{T−1} E_{τ∼P_θ}[ ∇_θ log π_θ(a_t|s_t) · ( ∑_{j=0}^{T−1} γ^j R(s_j, a_j) ) ]
             = ∑_{t=0}^{T−1} E_{τ∼P_θ}[ ∇_θ log π_θ(a_t|s_t) · ( ∑_{j≥t}^{T−1} γ^j R(s_j, a_j) ) ]    (10)

where the second equality follows from

    E_{τ∼P_θ}[ ∇_θ log π_θ(a_t|s_t) · ( ∑_{0≤j<t} γ^j R(s_j, a_j) ) ]
      = E[ E[ ∇_θ log π_θ(a_t|s_t) | s_0, a_0, ..., s_{t−1}, a_{t−1}, s_t ] · ( ∑_{0≤j<t} γ^j R(s_j, a_j) ) ]
      = 0        (because E[ ∇_θ log π_θ(a_t|s_t) | s_0, a_0, ..., s_{t−1}, a_{t−1}, s_t ] = 0)

Note that here we used the law of total expectation. The outer expectation
in the second line above is over the randomness of s_0, a_0, ..., a_{t−1}, s_t,
whereas the inner expectation is over the randomness of a_t (conditioned on
s_0, a_0, ..., a_{t−1}, s_t). We see that we have made the estimator slightly simpler.
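In code, the simplification (10) only changes the weight attached to each score term: instead of the full payoff f(τ), the term for time t is weighted by the reward accumulated from time t onward. A per-trajectory sketch, with the same hypothetical (s_t, a_t, r_t) trajectory format and grad_log_pi as in the earlier snippets:

    import numpy as np

    def trajectory_gradient_reward_to_go(theta, traj, phi, gamma):
        """Per-trajectory gradient estimate from equation (10): the term for time t is
        grad log pi_theta(a_t|s_t) weighted by sum_{j>=t} gamma^j R(s_j, a_j)."""
        T = len(traj)
        future = [0.0] * (T + 1)                       # future[t] = sum_{j>=t} gamma^j * r_j
        for t in range(T - 1, -1, -1):
            _, _, r = traj[t]
            future[t] = gamma**t * r + future[t + 1]   # backward accumulation
        grad = np.zeros_like(theta)
        for t, (s, a, _) in enumerate(traj):
            grad += grad_log_pi(theta, phi(s), a) * future[t]
        return grad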
The second consequence of E_{a_t∼π_θ(·|s_t)}[∇_θ log π_θ(a_t|s_t)] = 0 is the following: for
any value B(s_t) that only depends on s_t, it holds that

    E_{τ∼P_θ}[ ∇_θ log π_θ(a_t|s_t) · B(s_t) ]
      = E[ E[ ∇_θ log π_θ(a_t|s_t) | s_0, a_0, ..., s_{t−1}, a_{t−1}, s_t ] · B(s_t) ]
      = 0        (because E[ ∇_θ log π_θ(a_t|s_t) | s_0, a_0, ..., s_{t−1}, a_{t−1}, s_t ] = 0)
² In general, it is true that E_{x∼p_θ}[∇ log p_θ(x)] = 0.

Again, here we used the law of total expectation. The outer expectation
in the second line above is over the randomness of s_0, a_0, ..., a_{t−1}, s_t,
whereas the inner expectation is over the randomness of a_t (conditioned on
s_0, a_0, ..., a_{t−1}, s_t). It follows from equation (10) and the equation above that

    ∇_θ η(θ) = ∑_{t=0}^{T−1} E_{τ∼P_θ}[ ∇_θ log π_θ(a_t|s_t) · ( ∑_{j≥t}^{T−1} γ^j R(s_j, a_j) − γ^t B(s_t) ) ]
             = ∑_{t=0}^{T−1} E_{τ∼P_θ}[ ∇_θ log π_θ(a_t|s_t) · γ^t ( ∑_{j≥t}^{T−1} γ^{j−t} R(s_j, a_j) − B(s_t) ) ]    (11)

Therefore, we get a different estimator for ∇_θ η(θ) with each different choice
of B(·). The benefit of introducing a proper B(·), which is often referred to
as a baseline, is that it helps reduce the variance of the estimator.³ It turns
out that a near-optimal choice would be the expected future payoff
E[ ∑_{j≥t}^{T−1} γ^{j−t} R(s_j, a_j) | s_t ], which is pretty much the same as the
value function V^{π_θ}(s_t) (if we ignore the difference between the finite and infinite
horizon). Here one could estimate the value function V^{π_θ}(·) in a crude way,
because its precise value does not influence the mean of the estimator but only
the variance. This leads to a policy gradient algorithm with baselines, stated
in Algorithm 1.⁴

³ As a heuristic but illustrative example, suppose for a fixed t the future reward
∑_{j≥t}^{T−1} γ^{j−t} R(s_j, a_j) randomly takes the two values 1000 + 1 and 1000 − 2 with equal proba-
bility, and the corresponding values of ∇_θ log π_θ(a_t|s_t) are the vectors z and −z. (Note that
because E[∇_θ log π_θ(a_t|s_t)] = 0, if ∇_θ log π_θ(a_t|s_t) can only take two values uniformly,
then the two values have to be vectors in opposite directions.) In this case, without
subtracting the baseline, the estimator takes the two values (1000 + 1)z and −(1000 − 2)z,
whereas after subtracting a baseline of 1000, the estimator takes the two values z and 2z. The
latter estimator has much lower variance than the original estimator.
⁴ We note that the gradient estimator in the algorithm does not exactly match
equation (11): if we multiply the summand of equation (13) by γ^t, then they match
exactly. Removing such discount factors works well empirically because it gives larger
updates.

Algorithm 1 Vanilla policy gradient with baseline

for i = 1, 2, ⋯ do
    Collect a set of trajectories by executing the current policy. Use R_{≥t}
    as shorthand for ∑_{j≥t}^{T−1} γ^{j−t} R(s_j, a_j).
    Fit the baseline by finding a function B that minimizes

        ∑_τ ∑_t (R_{≥t} − B(s_t))²                                              (12)

    Update the policy parameter θ with the gradient estimator

        ∑_τ ∑_t ∇_θ log π_θ(a_t|s_t) · (R_{≥t} − B(s_t))                        (13)
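A minimal sketch of one iteration of Algorithm 1, continuing with the hypothetical softmax policy, feature map φ, and (s_t, a_t, r_t) trajectory format from the earlier snippets. Here the baseline is fit as a linear least-squares regression of R_{≥t} onto φ(s_t); that is only one convenient way to minimize (12), not the method prescribed by the notes.

    import numpy as np

    def policy_gradient_with_baseline_step(theta, sample_trajectory, phi, gamma, n=50, lr=1e-2):
        """One iteration of Algorithm 1 with a linear baseline B(s) = w . phi(s)."""
        trajs = [sample_trajectory(theta) for _ in range(n)]      # run the current policy n times

        # Compute R_{>=t} = sum_{j>=t} gamma^{j-t} R(s_j, a_j) for every visited time step.
        rewards_to_go = []
        for traj in trajs:
            rs, R = [0.0] * len(traj), 0.0
            for t in range(len(traj) - 1, -1, -1):
                R = traj[t][2] + gamma * R                        # R_{>=t} = r_t + gamma * R_{>=t+1}
                rs[t] = R
            rewards_to_go.append(rs)

        # Fit the baseline by least squares, i.e. minimize (12) over linear functions of phi(s).
        X = np.array([phi(s) for traj in trajs for (s, _, _) in traj])
        y = np.array([r for rs in rewards_to_go for r in rs])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)

        # Gradient estimator (13): sum over tau and t of grad log pi_theta(a_t|s_t) * (R_{>=t} - B(s_t)).
        grad = np.zeros_like(theta)
        for traj, rs in zip(trajs, rewards_to_go):
            for (s, a, _), r_to_go in zip(traj, rs):
                grad += grad_log_pi(theta, phi(s), a) * (r_to_go - w @ phi(s))
        return theta + lr * grad / n                              # dividing by n only rescales the step size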
