Reinforcement Learning Cheat Sheet: Return
...the returns following first visits to s, whereas the every-visit MC method averages the returns following all visits to s. The every-visit MC Prediction is derived from the first-visit version by removing the "if" condition. In other words, we move backward from step T, compute G incrementally, associate each value of G with the current state, and take the average.

MC Estimation of Action Values
To determine a policy when a model is not available, the state values are not sufficient and we have to estimate the values of state–action pairs. The MC methods are essentially the same as just presented for state values, but now we have state–action pairs.
The only complication is that many state–action pairs may never be visited. We need to estimate the value of all the actions from each state, not just the one we currently favor. We can specify that the episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start (assumption of exploring starts).

Off-policy methods require that every action taken under π is also taken, at least occasionally, under b. That is, we require that π(a|s) > 0 implies b(a|s) > 0. This is called the assumption of coverage.

Off-policy Every-visit MC Prediction
Inputs: π - the policy to be evaluated
Initialize: V(s) ∈ R for all s ∈ S
            Return(s) ← an empty list, for all s ∈ S
while forever - for each episode do
    Generate an episode following b: S_0, A_0, R_1, S_1, A_1, ..., S_{T−1}, A_{T−1}, R_T
    G ← 0
    W ← 1
    foreach step of episode, t = T−1, T−2, ..., 0 do
        G ← γWG + R_{t+1}
        Append G to Return(S_t)
        V(S_t) ← average(Return(S_t))
        W ← W π(A_t|S_t) / b(A_t|S_t)
    end
end
Algorithm 6: Off-policy Every-visit Monte Carlo prediction - estimating V ∼ v_π [Course2-Week2]

Algorithm 7: Off-policy MC Control - estimating π ∼ π∗ [§5.7]
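A minimal Python sketch of Algorithm 6 as written above, assuming episodes are given as lists of (state, action, reward) tuples generated under b, and that the target and behavior policies are represented as nested dictionaries of action probabilities (these representations are assumptions for illustration, not part of the sheet):

from collections import defaultdict

def off_policy_every_visit_mc_prediction(episodes, target_pi, behavior_b, gamma=1.0):
    """Off-policy every-visit MC prediction, following Algorithm 6 above.

    episodes   : list of episodes, each a list of (S_t, A_t, R_{t+1}) tuples collected under b
    target_pi  : target_pi[s][a]  -> pi(a|s)   (assumed representation)
    behavior_b : behavior_b[s][a] -> b(a|s), with coverage b(a|s) > 0 whenever pi(a|s) > 0
    Returns V, the estimated state-value function of the target policy.
    """
    returns = defaultdict(list)   # Return(s): all returns recorded for s
    V = defaultdict(float)

    for episode in episodes:
        G, W = 0.0, 1.0
        # Move backward from step T-1 to 0, building G incrementally.
        for (s, a, r) in reversed(episode):
            G = gamma * W * G + r                 # G <- gamma * W * G + R_{t+1}
            returns[s].append(G)                  # every-visit: record G for every occurrence of s
            V[s] = sum(returns[s]) / len(returns[s])
            W = W * target_pi[s][a] / behavior_b[s][a]   # W <- W * pi(A_t|S_t) / b(A_t|S_t)
    return dict(V)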
Temporal-Difference Learning

TD Prediction
Starting from Eq. 26, we can consider a generic update rule for V(S_t):

V(S_t) ← V(S_t) + α(G_t − V(S_t)) [6.1] (27)

α is a constant step size and we call this method constant-α MC. MC has to wait until the end of an episode to determine the increment to V(S_t). Differently from MC, TD updates the value at each step of the episode following the equation below:

V(S_t) ← V(S_t) + α(R_{t+1} + γV(S_{t+1}) − V(S_t)) [6.2] (28)

Inputs: π - the policy to be evaluated
Params: step size α ∈ ]0, 1]
Initialize: V(s) ∈ R for all s ∈ S+, except V(terminal) = 0
foreach episode do
    Initialize S
    foreach step of episode - until S is terminal do
        A ← action given by π for S
        Take action A, observe R, S′
        V(S) ← V(S) + α(R + γV(S′) − V(S))
        S ← S′
    end
end
Algorithm 8: Tabular TD(0) - estimating v_π [§6.1]

Recall that:

v_π(s) = E_π[G_t | S_t = s] [3.12|6.3] (29)
       = E_π[R_{t+1} + γG_{t+1} | S_t = s] [by 3.9] (30)
       = E_π[R_{t+1} + γv_π(S_{t+1}) | S_t = s] [6.4] (31)

MC methods use an estimate of Eq. 29 as a target. The MC target is an estimate because the expected value in Eq. 29 is not known; a sample return is used in place of the real expected return.
DP and TD methods use an estimate of Eq. 31 as a target. The DP target is an estimate because v_π(S_{t+1}) is not known and the current estimate, V(S_{t+1}), is used instead.
The TD target is an estimate because it samples the expected values in Eq. 31 and it uses the current estimate V instead of the true v_π.
TD methods update their estimates based in part on other estimates. They learn a guess from a guess, i.e. they bootstrap. TD and MC methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions.
The most obvious advantage of TD methods over MC methods is that they are naturally implemented in an online, fully incremental fashion. With MC methods one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step. In practice, TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.
The error, available at time t + 1, between V(S_t) and the better estimate R_{t+1} + γV(S_{t+1}) is called the TD error:

δ_t ≐ R_{t+1} + γV(S_{t+1}) − V(S_t) [6.5] (32)

Sarsa - On-policy TD Control
Sarsa (State-action-reward-state-action) is an on-policy TD control method. Sarsa is a sample-based version of policy iteration which uses the Bellman equations for action values. The update rule:

Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γQ(S_{t+1}, A_{t+1}) − Q(S_t, A_t)] [6.7] (33)

Params: step size α ∈ ]0, 1], small ε > 0
Initialize Q(s, a) for all s ∈ S+ and a ∈ A(s), arbitrarily except that Q(terminal-state, ·) = 0
foreach episode do
    Initialize S
    Choose A from S using policy derived from Q (e.g. ε-greedy)
    foreach step of episode - until S is terminal do
        Take action A, observe R, S′
        Choose A′ from S′ using policy derived from Q (e.g. ε-greedy)
        Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]
        S ← S′
        A ← A′
    end
end
Algorithm 9: Sarsa - On-policy TD Control - estimating Q ∼ q∗ [§6.4]

Q-Learning - Off-policy TD Control
Q-Learning is an off-policy TD control method. Q-learning is a sample-based version of value iteration which iteratively applies the Bellman optimality equation. The update rule:

Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)] [6.8] (34)

Params: step size α ∈ ]0, 1], small ε > 0
Initialize Q(s, a) for all s ∈ S+ and a ∈ A(s), arbitrarily except that Q(terminal-state, ·) = 0
foreach episode do
    Initialize S
    foreach step of episode - until S is terminal do
        Choose A from S using policy derived from Q (e.g. ε-greedy)
        Take action A, observe R, S′
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
        S ← S′
    end
end
Algorithm 10: Q-Learning - Off-policy TD Control - estimating π ∼ π∗ [§6.5]

Expected Sarsa
Similar to Q-Learning, the update rule of Expected Sarsa takes the expected value over the next state's actions instead of the maximum:

Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ E_π[Q(S_{t+1}, A_{t+1}) | S_{t+1}] − Q(S_t, A_t)]
            = Q(S_t, A_t) + α[R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q(S_t, A_t)] [6.9] (35)

The next action is sampled from π. However, the expectation over actions is computed independently of the action actually selected in the next state. In fact, it is not necessary that π be equal to the behavior policy. This means that Expected Sarsa, like Q-learning, can be used to learn off-policy without importance sampling.
If the target policy is greedy with respect to its action-value estimates, we obtain Q-Learning. Hence Q-Learning is a special case of Expected Sarsa.
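A minimal sketch of tabular TD(0) (Algorithm 8 above). The environment interface (reset() returning a state, step(a) returning reward, next state, and a terminal flag) and the policy-as-function representation are assumptions for illustration, not part of the sheet:

from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation, following Algorithm 8 above.

    env    : assumed interface: env.reset() -> S, env.step(A) -> (R, S_next, terminal)
    policy : function S -> A, the policy pi to be evaluated.
    """
    V = defaultdict(float)   # unvisited states (including the terminal state) default to 0
    for _ in range(num_episodes):
        s = env.reset()
        terminal = False
        while not terminal:
            a = policy(s)
            r, s_next, terminal = env.step(a)
            # V(S) <- V(S) + alpha * (R + gamma * V(S') - V(S));
            # V(terminal) is never updated, so it stays 0 as required.
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return dict(V)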
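To make the difference between the three one-step control targets concrete, here is a sketch of the Sarsa, Q-learning, and Expected Sarsa updates side by side. Q is assumed to be a mapping keyed by (state, action) with missing entries equal to 0, `actions` is the list of available actions, and the ε-greedy helper is a hypothetical illustration:

import random

def epsilon_greedy(Q, s, actions, eps):
    """Hypothetical helper: act greedily w.r.t. Q with probability 1 - eps."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy target R + gamma * Q(S', A'), with A' actually chosen by the policy [6.7]."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy target R + gamma * max_a Q(S', a): the greedy target policy [6.8]."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def expected_sarsa_update(Q, s, a, r, s_next, actions, pi, alpha, gamma):
    """Expected target R + gamma * sum_a pi(a|S') Q(S', a) [6.9]; pi[s][a] gives pi(a|s)."""
    expected_q = sum(pi[s_next][a2] * Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * expected_q - Q[(s, a)])

If pi in expected_sarsa_update puts probability 1 on the argmax action, the expected target reduces to the Q-learning target, which is the special-case relation noted above.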
n-step Bootstrapping

n-step TD Prediction
MC methods update the estimate of v_π(S_t) for each state based on the entire sequence of observed rewards from that state until the end of the episode, using:

G_t ≐ R_{t+1} + γR_{t+2} + ... + γ^{T−t−1} R_T (36)

In one-step TD, instead, the update is based on just the one next reward, bootstrapping from the value of the state one step later as a proxy for the remaining rewards, using the one-step return:

G_{t:t+1} ≐ R_{t+1} + γV_t(S_{t+1}) (37)

In n-step TD, the n-step return is used:

G_{t:t+n} ≐ R_{t+1} + γR_{t+2} + ... + γ^{n−1}R_{t+n} + γ^n V_{t+n−1}(S_{t+n}) (38)
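A small sketch of the returns in Eqs. 36-38, assuming the rewards of one episode are stored in a list with rewards[k] = R_{k+1}, states[k] = S_k, and V given as a dict of current state-value estimates (these containers are assumptions for illustration):

def n_step_return(rewards, states, values, t, n, gamma):
    """n-step return G_{t:t+n} of Eq. 38.

    rewards : list with rewards[k] = R_{k+1}, k = 0..T-1
    states  : list with states[k] = S_k, k = 0..T
    values  : dict mapping states to the current estimates V(s)
    If t + n >= T the bootstrap term is dropped, so the result reduces to the
    full Monte Carlo return of Eq. 36; n = 1 gives the one-step return of Eq. 37.
    """
    T = len(rewards)
    horizon = min(t + n, T)
    # R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n}  (truncated at T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:
        G += gamma ** n * values[states[t + n]]   # bootstrap from V(S_{t+n})
    return G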
Planning and Learning
Planning methods use simulated experience generated by a model; learning methods use real experience generated by the environment. Many ideas and algorithms can be transferred between planning and learning.

Params: step size α ∈ ]0, 1], small ε > 0
Initialize Q(s, a) for all s ∈ S+ and a ∈ A(s), arbitrarily except that Q(terminal-state, ·) = 0
while forever do
    1. Select a state S ∈ S and an action A ∈ A(S) at random
    2. From a Sample Model obtain the sample reward and next state: R, S′ = model(S, A)
    3. Apply one-step tabular Q-Learning to S, A, R, S′:
       Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
end
Algorithm 11: Random-sample one-step tabular Q-planning [§8.1]

Dyna
Within a planning agent, there are at least two roles for real experience:
(1) it can be used to improve the model (to match the real environment more accurately): model learning;
(2) it can be used to directly improve the value function and policy using the kinds of reinforcement learning methods discussed before: direct RL [§8.2].
The experience can improve value functions and policies either directly or indirectly via the model (indirect RL). The real experience obtained through interaction with the environment can be used to improve the policy/value function directly (direct RL), or indirectly through the model learning and the planning that use simulated experience (indirect RL) [§8.2].

Initialize Q(s, a) and Model(S, A) for all s ∈ S+ and a ∈ A(s)
while forever do
    (a) S ← current (nonterminal) state
    (b) A ← ε-greedy(S, Q)
    (c) Take action A; observe resultant reward, R, and state, S′
    (d) Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
    (e) Model(S, A) ← R, S′ (assuming deterministic environment)
    (f) foreach n times do
        S ← random previously observed state
        A ← random action previously taken in S
        R, S′ ← Model(S, A)
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
    end
end
Algorithm 12: Dyna-Q [§8.2]

(d) Direct reinforcement learning
(e) Model-learning
(f) Planning

If (e) and (f) were omitted, the remaining algorithm would be one-step tabular Q-learning. The agent responds instantly to the latest sensory information and yet is always planning in the background. The model-learning process is also in the background. As new information is gained, the model is updated to better match reality. As the model changes, the ongoing planning process will gradually compute a different way of behaving to match the new model.
Models may be incorrect for many reasons: the environment is stochastic and only a limited number of samples have been observed, the model was learned using function approximation that has generalized imperfectly, or the environment has changed and its new behavior has not yet been observed. When the model is incorrect, the planning process is likely to compute a suboptimal policy. In some cases, the suboptimal policy computed by planning quickly leads to the discovery and correction of the modeling error. This happens when the model is optimistic, predicting greater reward or better state transitions than are actually possible. It is more difficult to correct a model when the environment becomes better than it was before.
The figure below represents the relations among the algorithms presented in the course on Coursera [Course3-Week1].
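A compact sketch of the Dyna-Q loop (Algorithm 12 above), assuming a deterministic environment (as in step (e)) with the same hypothetical reset()/step() interface used in the earlier sketches, and n planning updates per real step:

import random
from collections import defaultdict

def dyna_q(env, actions, num_episodes, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q: direct RL + model learning + planning with simulated experience.

    env is assumed to expose reset() -> S and step(A) -> (R, S', terminal);
    the learned model assumes a deterministic environment.
    """
    Q = defaultdict(float)   # Q(s, a), unseen pairs default to 0
    model = {}               # Model(S, A) -> (R, S')

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        terminal = False
        while not terminal:
            a = eps_greedy(s)                                  # (b) epsilon-greedy action
            r, s_next, terminal = env.step(a)                  # (c) real experience
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # (d) direct RL (Q-learning)
            model[(s, a)] = (r, s_next)                        # (e) model learning
            for _ in range(n_planning):                        # (f) planning on simulated experience
                ps, pa = random.choice(list(model.keys()))     # previously observed state-action pair
                pr, ps_next = model[(ps, pa)]
                best = max(Q[(ps_next, a2)] for a2 in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
            s = s_next
    return Q, model

Setting n_planning = 0 removes steps (e) and (f) in effect, recovering plain one-step tabular Q-learning, which matches the remark above.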
On-policy Prediction with Function Approximation
We can approximate the value function not as a table but as a parametrized functional form: v̂(s, w) ≈ v_π(s), where w ∈ R^d and the number of weights is much smaller than the number of states (d << |S|). The value estimation can be framed as a Supervised Learning problem.
The Monte Carlo methods estimate the value function using samples of the return, so the input is the state and the targets are the returns (pairs (S_i, G_i)).
For TD methods the targets are the one-step bootstrap returns (pairs (S_i, R_{i+1} + γv̂(S_{i+1}, w))).
In the RL setting, the data is temporally correlated and the full dataset is not fixed and available from the beginning. Moreover, due to the bootstrapping methods (TD, DP), the target labels change.
In the tabular case the learned values at each state were decoupled - an update at one state affected no other. Now making one state's estimate more accurate means making others' less accurate.

VE(w) ≐ Σ_{s∈S} µ(s) [v_π(s) − v̂(s, w)]² [9.1] (39)

where µ(s) is a state distribution (µ(s) ≥ 0 and Σ_s µ(s) = 1) representing how much we care about the error in each state s.
To minimize Eq. 39, Stochastic Gradient Descent (SGD) is usually used:

w_{t+1} ≐ w_t − (1/2) α ∇[v_π(S_t) − v̂(S_t, w_t)]² [9.4] (40)
        = w_t + α [v_π(S_t) − v̂(S_t, w_t)] ∇v̂(S_t, w_t) [9.5] (41)

Usually we have only an approximation U_t of v_π(S_t), but if U_t is an unbiased estimate of v_π(S_t), that is E[U_t | S_t = s] = v_π(s) for each t, then w_t is guaranteed to converge to a local optimum (under the stochastic approximation conditions for decreasing α).
For MC, U_t = G_t and hence U_t is an unbiased estimate of v_π(S_t).

Inputs: π - the policy to be evaluated, a differentiable function v̂ : S × R^d → R
Parameters: step size α > 0
Initialize: w ∈ R^d arbitrarily (e.g. w = 0)
while forever - for each episode do
    Generate an episode following π: S_0, A_0, R_1, S_1, A_1, ..., S_{T−1}, A_{T−1}, R_T
    foreach step of episode, t = 0, 1, ..., T − 1 do
        w ← w + α [G_t − v̂(S_t, w)] ∇v̂(S_t, w)
    end
end
Algorithm 13: Gradient MC - Estimating v ∼ v_π [§9.3]

https://github.com/linker81/Reinforcement-Learning-CheatSheet
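A sketch of Gradient MC (Algorithm 13 above) for the linear case v̂(s, w) = w · x(s), where ∇v̂(s, w) = x(s); the feature map and the episode format are assumptions for illustration:

import numpy as np

def gradient_mc(episodes, feature_fn, d, alpha=0.01, gamma=1.0):
    """Gradient Monte Carlo prediction with a linear approximator v_hat(s, w) = w . x(s).

    episodes   : list of episodes, each a list of (S_t, R_{t+1}) pairs generated under pi
    feature_fn : s -> np.ndarray of shape (d,), the feature vector x(s)   (assumed)
    """
    w = np.zeros(d)
    for episode in episodes:
        # Backward sweep to compute the return G_t for every step of the episode.
        G = 0.0
        returns = []
        for (_, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (s, _), G_t in zip(episode, returns):
            x = feature_fn(s)                     # equals grad of v_hat w.r.t. w in the linear case
            w += alpha * (G_t - w.dot(x)) * x     # w <- w + alpha [G_t - v_hat(S_t, w)] grad v_hat
    return w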