Reinforcement Learning Cheat Sheet

This document provides an overview of key concepts in reinforcement learning, including returns, value functions, and the agent-environment interface. It defines returns as the cumulative discounted reward over time, and distinguishes between episodic and continuing tasks. Value functions map states or state-action pairs to expected returns, and optimal value functions aim to maximize expected returns. The agent-environment interaction is framed as a Markov decision process, where the agent selects actions based on the current state and receives rewards and a new state from the environment.

Recap

E[X] = Σ_{x_i} x_i · Pr{X = x_i}

E[X | Y = y_j] = Σ_{x_i} x_i · Pr{X = x_i | Y = y_j}

E[X | Y = y_j] = Σ_{z_k} Pr{Z = z_k | Y = y_j} · E[X | Y = y_j, Z = z_k]
Agent-Environment Interface

At each step t the agent receives a representation of the environment's state S_t ∈ S and selects an action A_t ∈ A(s). One time step later, as a consequence of its action, the agent receives a reward R_{t+1} ∈ R ⊆ ℝ and moves to the new state S_{t+1}.
Markov Decision Process

A finite Markov Decision Process (MDP) is defined by:

- a finite set of states: s ∈ S
- a finite set of actions: a ∈ A
- dynamics:
  p(s', r | s, a) = Pr{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a}   [3.2] (1)
- state transition probabilities:
  p(s' | s, a) = Pr{S_t = s' | S_{t-1} = s, A_{t-1} = a} = Σ_{r ∈ R} p(s', r | s, a)   [3.4] (2)
- expected reward for a state-action pair:
  r(s, a) = E[R_t | S_{t-1} = s, A_{t-1} = a] = Σ_{r ∈ R} r Σ_{s' ∈ S} p(s', r | s, a)   [3.5] (3)
- expected reward for a state-action-next-state triple:
  r(s', s, a) = E[R_t | S_{t-1} = s, A_{t-1} = a, S_t = s'] = Σ_{r ∈ R} r · p(s', r | s, a) / p(s' | s, a)   [3.6] (4)

The MDP and agent together give rise to a sequence or trajectory that begins like this:
S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, ...
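Below is a minimal Python sketch (not part of the original cheat sheet) of one way to store the dynamics p(s', r | s, a) and derive the quantities of Eqs. (2) and (3) from them; the two-state MDP used here is hypothetical, chosen only to make the example runnable.

# dynamics p(s', r | s, a): for each (s, a), a list of (next_state, reward, probability)
dynamics = {
    ("s0", "a0"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "a1"): [("s1", 1.0, 1.0)],
    ("s1", "a0"): [("s0", 0.0, 1.0)],
    ("s1", "a1"): [("s1", 2.0, 0.5), ("s0", 0.0, 0.5)],
}

def transition_prob(s_next, s, a):
    """p(s'|s,a) = sum over rewards of p(s',r|s,a) -- Eq. (2)."""
    return sum(p for (sn, r, p) in dynamics[(s, a)] if sn == s_next)

def expected_reward(s, a):
    """r(s,a) = sum over (s',r) of r * p(s',r|s,a) -- Eq. (3)."""
    return sum(r * p for (sn, r, p) in dynamics[(s, a)])

print(transition_prob("s1", "s0", "a0"))  # 0.3
print(expected_reward("s0", "a0"))        # 0.3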
Return

In RL the goal of the agent is not the maximization of the immediate reward, but of the cumulative reward in the long run. The return is a specific function of the reward sequence. In the simplest case the return is the sum of the rewards:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T   [3.7] (5)

where T is a final time step. When there is a natural notion of a final time step T, the agent-environment interaction breaks naturally into sub-sequences (episodes) and the next episode begins independently of how the previous one ended. Tasks with episodes are called episodic tasks. Each episode ends in a special state called the terminal state, possibly with different rewards for the different outcomes. S^+ is the set of all states plus the terminal state.

When the agent-environment interaction does not break naturally into episodes but goes on continually without limit, we call these continuing tasks. The previous formulation of the return (Eq. 5) is problematic because T = ∞. We therefore introduce the total discounted return, expressed as the sum of rewards suitably discounted using the discount rate 0 ≤ γ ≤ 1:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}   [3.8] (6)
    = R_{t+1} + γ G_{t+1}   [3.9] (7)

To unify the notation for episodic and continuing tasks we use:

G_t = Σ_{k=t+1}^{T} γ^{k-t-1} R_k   [3.11] (8)

including the possibility that T = ∞ or γ = 1, but not both.
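A small sketch (not from the cheat sheet) showing how the discounted return of Eqs. (6)-(7) can be computed from a finite reward sequence, both from the definition and with the backward recursion G_t = R_{t+1} + γ G_{t+1}; the reward sequence is hypothetical.

gamma = 0.9
rewards = [1.0, 0.0, 0.0, 2.0]  # R_1, ..., R_T for a hypothetical episode

# Direct definition: G_0 = sum_k gamma^k R_{k+1}
g0 = sum(gamma**k * r for k, r in enumerate(rewards))

# Backward recursion, producing G_t for every t (with G_T = 0)
returns = [0.0] * (len(rewards) + 1)
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

print(g0, returns[0])  # both 2.458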
Policy

A policy is a mapping from a state to the probabilities of selecting each possible action:

π(a|s)   (9)

That is, π(a|s) is the probability of selecting action A_t = a when S_t = s.
Value Functions

The State-Value function describes how good it is to be in a specific state s under a certain policy π. Informally, it is the expected return (expected cumulative discounted reward) when starting from s and following π. For any policy π and ∀s ∈ S:

v_π(s) = E_π[G_t | S_t = s]   [3.12] (10)
       = E_π[R_{t+1} + γ G_{t+1} | S_t = s]   [by 3.9] (11)
       = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [r + γ v_π(s')]   [3.14] (12)

The last one is the Bellman equation for v_π.

The Action-Value function (Q-Function) describes how good it is to perform a given action a in a given state s under a certain policy π. Informally, it is the expected return when starting from s, taking action a, and thereafter following π:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]   [3.13] (13)
          = Σ_{s',r} p(s', r | s, a) [r + γ Σ_{a'} π(a'|s') q_π(s', a')]   [Ex 3.17] (14)

The last one is the Bellman equation for q_π.
Relation between Value Functions

v_π(s) = Σ_a π(a|s) · q_π(s, a)   [Ex 3.12] (15)
       = E_π[q_π(S_t, A_t) | S_t = s]   [Ex 3.18] (16)

q_π(s, a) = Σ_{s',r} p(s', r | s, a) [r + γ v_π(s')]   [Ex 3.13] (17)
          = E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a]   [Ex 3.19] (18)

Optimal Value Functions

v_*(s) = max_π v_π(s)   [3.15] (19)
       = max_a E[R_{t+1} + γ v_*(S_{t+1}) | S_t = s, A_t = a]   [3.18]
       = max_a Σ_{s',r} p(s', r | s, a) [r + γ v_*(s')]   [3.19]

q_*(s, a) = max_π q_π(s, a)   [3.16] (20)
          = E[R_{t+1} + γ max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a]
          = Σ_{s',r} p(s', r | s, a) [r + γ max_{a'} q_*(s', a')]   [3.20]

v_*(s) = max_a q_{π_*}(s, a)   (21)

Intuitively, the above equation expresses the fact that the value of a state under the optimal policy must equal the expected return from the best action in that state.

Relation between Optimal Value Functions

v_*(s) = max_a Σ_{s',r} p(s', r | s, a) [r + γ Σ_{a'} π_*(a'|s') q_*(s', a')]   [Ex 3.25] (22)

q_*(s, a) = Σ_{s',r} p(s', r | s, a) [r + γ v_*(s')]   [Ex 3.26] (23)
Dynamic Programming

Dynamic Programming (DP) is a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.

Policy Evaluation [Prediction]

If the environment's dynamics are completely known, Eq. 12 is a system of |S| equations in |S| unknowns (v_π(s), s ∈ S). We can also use an iterative solution, ∀s ∈ S:

v_{k+1}(s) = E_π[R_{t+1} + γ v_k(S_{t+1}) | S_t = s]
           = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [r + γ v_k(s')]   [4.5] (24)

We can either compute the new values v_{k+1}(s) from the old values v_k(s) without changing the old values, or update the values in place.

Algorithm 1: Iterative Policy Evaluation (in-place version) - estimating V ≈ v_π - [§4.1]
  Inputs: π - the policy to be evaluated
  Params: θ - a small positive threshold determining the accuracy of the estimation
  Initialize V(s), for all s ∈ S^+, arbitrarily, except V(terminal) = 0
  repeat
      Δ ← 0
      foreach s ∈ S do
          v ← V(s)
          V(s) ← Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
          Δ ← max(Δ, |v − V(s)|)
      end
  until Δ < θ

The algorithm tests the quantity max_{s∈S} |v_{k+1}(s) − v_k(s)| after each sweep and stops when it is sufficiently small (Δ < θ).
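The following is a minimal sketch (not from the cheat sheet) of policy evaluation for a hypothetical, randomly generated 3-state MDP, represented with transition probabilities P[s, a, s'] and expected rewards R[s, a] (equivalent here to the p(s', r | s, a) form). It solves Eq. (12) exactly as a linear system and also runs the in-place sweeps of Algorithm 1; the two results should agree.

import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# P[s, a, s'] = p(s'|s,a), R[s, a] = r(s,a) (expected reward)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # equiprobable policy

# Exact solution of Eq. (12): v = (I - gamma * P_pi)^-1 r_pi
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = (pi * R).sum(axis=1)
v_exact = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Iterative, in-place policy evaluation (Algorithm 1)
V, theta = np.zeros(n_states), 1e-10
while True:
    delta = 0.0
    for s in range(n_states):
        v = V[s]
        V[s] = sum(pi[s, a] * (R[s, a] + gamma * P[s, a] @ V) for a in range(n_actions))
        delta = max(delta, abs(v - V[s]))
    if delta < theta:
        break

print(np.allclose(V, v_exact, atol=1e-6))  # True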
Policy Iteration

Policy iteration consists of two simultaneous, interacting processes: one making the value function consistent with the current policy (policy evaluation), and the other making the policy greedy with respect to the current value function (policy improvement).

Algorithm 2: Policy Iteration - estimating π ≈ π_* (deterministic policy) - [§4.3]
  1. Initialization
     Assign arbitrarily V(s) ∈ ℝ and π(s) ∈ A(s) for all s ∈ S
  2. Policy Evaluation
     repeat
         Δ ← 0
         foreach s ∈ S do
             v ← V(s)
             V(s) ← Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
             Δ ← max(Δ, |v − V(s)|)
         end
     until Δ < θ
  3. Policy Improvement
     policy-stable ← true
     foreach s ∈ S do
         old-action ← π(s)
         π(s) ← argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
         if old-action ≠ π(s) then policy-stable ← false
     end
     if policy-stable then return V ≈ v_* and π ≈ π_*
     else go to 2
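A compact sketch of Algorithm 2 (policy iteration), again on a hypothetical randomly generated MDP in the P[s, a, s'], R[s, a] representation used above.

import numpy as np

n_states, n_actions, gamma, theta = 3, 2, 0.9, 1e-8
rng = np.random.default_rng(1)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
pi = np.zeros(n_states, dtype=int)       # deterministic policy: pi[s] = action

while True:
    # 2. Policy evaluation: in-place sweeps for the current deterministic pi
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            V[s] = R[s, pi[s]] + gamma * P[s, pi[s]] @ V
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # 3. Policy improvement: greedy with respect to V
    q = R + gamma * P @ V                # q[s, a] = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
    new_pi = q.argmax(axis=1)
    if np.array_equal(new_pi, pi):       # policy stable -> pi ~ pi*, V ~ v*
        break
    pi = new_pi

print("greedy policy:", pi, "values:", np.round(V, 3))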
Value Iteration

Instead of waiting for the convergence of V(s) (the policy evaluation loop), we can perform only one step of policy evaluation which, combined with policy improvement, leads to the following formulation:

v_{k+1}(s) = max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s, A_t = a]
           = max_a Σ_{s',r} p(s', r | s, a) [r + γ v_k(s')]   [4.10] (25)

Algorithm 3: Value Iteration - estimating π ≈ π_* - [§4.4]
  Params: θ - a small positive threshold determining the accuracy of the estimation
  Initialize V(s), for all s ∈ S^+, arbitrarily, except V(terminal) = 0
  repeat
      Δ ← 0
      foreach s ∈ S do
          v ← V(s)
          V(s) ← max_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
          Δ ← max(Δ, |v − V(s)|)
      end
  until Δ < θ
  output: Deterministic policy π ≈ π_* such that
      π(s) = argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]

One sweep is one update of each state. In value iteration only a single iteration of policy evaluation is performed between each policy improvement. Value iteration combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement. Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement sweep. The entire class of truncated policy iteration algorithms can be thought of as sequences of sweeps, some of which use policy evaluation updates and some of which use value iteration updates.
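A short sketch of value iteration (Eq. 25) on a hypothetical random MDP; unlike Algorithm 3 it uses a synchronous (two-array) sweep rather than in-place updates, and extracts the greedy policy at the end.

import numpy as np

n_states, n_actions, gamma, theta = 3, 2, 0.9, 1e-8
rng = np.random.default_rng(2)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
while True:
    q = R + gamma * P @ V               # q[s, a] from Eq. (25)
    V_new = q.max(axis=1)
    done = np.max(np.abs(V_new - V)) < theta
    V = V_new
    if done:
        break

pi = (R + gamma * P @ V).argmax(axis=1)  # deterministic greedy policy
print("pi* ~", pi, "v* ~", np.round(V, 3))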
Generalized Policy Iteration

Generalized Policy Iteration (GPI) refers to the general idea of letting policy-evaluation and policy-improvement processes interact, independently of the granularity and other details of the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have identifiable policies and value functions, with the policy always being improved with respect to the value function and the value function always being driven toward the value function for the policy, as suggested by the GPI diagram in [§4.6].
Monte Carlo Methods

Monte Carlo (MC) methods require only experience from an actual or simulated environment.

MC Prediction

Algorithm 4: On-policy First-visit Monte Carlo prediction - estimating V ≈ v_π - [§5.1]
  Inputs: π - the policy to be evaluated
  Initialize: V(s) ∈ ℝ for all s ∈ S
              Return(s) ← an empty list, for all s ∈ S
  while forever - for each episode do
      Generate an episode following π: S_0, A_0, R_1, S_1, A_1, ..., S_{T-1}, A_{T-1}, R_T
      G ← 0
      foreach step of episode, t = T−1, T−2, ..., 0 do
          G ← γG + R_{t+1}
          if S_t is not in the sequence S_0, S_1, ..., S_{t−1} (i.e. it is the first visit to S_t) then
              Append G to Return(S_t)
              V(S_t) ← average(Return(S_t))
          end
      end
  end

The first visit is the first time a particular state is observed in an episode. The first-visit MC method estimates v_π(s) as the average of the returns following first visits to s, whereas the every-visit MC method averages the returns following all visits to s. Every-visit MC prediction is obtained from the first-visit version by removing the "if" condition. In other words, we move backward from step T, compute G incrementally, associate the values of G with the corresponding states, and take the average.
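A sketch of Algorithm 4 (first-visit MC prediction) on a hypothetical 5-state random walk (uniform random moves, reward +1 only when exiting to the right); the environment and episode generator are assumptions made to keep the example self-contained.

import random
from collections import defaultdict

GAMMA = 1.0
N_STATES = 5                      # non-terminal states 0..4; -1 and 5 are terminal

def generate_episode():
    """Return [(state, reward), ...] where reward is R_{t+1} for that step."""
    s, episode = 2, []
    while 0 <= s < N_STATES:
        s_next = s + random.choice([-1, 1])
        r = 1.0 if s_next == N_STATES else 0.0
        episode.append((s, r))
        s = s_next
    return episode

returns = defaultdict(list)
V = defaultdict(float)
for _ in range(5000):
    episode = generate_episode()
    first = {}                    # first index at which each state occurs
    for i, (s, _) in enumerate(episode):
        first.setdefault(s, i)
    g = 0.0
    for t in reversed(range(len(episode))):
        s, r = episode[t]
        g = GAMMA * g + r
        if first[s] == t:         # first visit to s in this episode
            returns[s].append(g)
            V[s] = sum(returns[s]) / len(returns[s])

print([round(V[s], 2) for s in range(N_STATES)])  # approx [1/6, 2/6, 3/6, 4/6, 5/6]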
MC Estimation of Action Values

To determine a policy when a model is not available, state values are not sufficient and we have to estimate the values of state-action pairs. The MC methods are essentially the same as just presented for state values, but now we have state-action pairs. The only complication is that many state-action pairs may never be visited: we need to estimate the value of all the actions from each state, not just the one we currently favor. We can specify that the episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start (assumption of exploring starts).

MC Control

Algorithm 5: First-visit Monte Carlo Control (Exploring Starts) - estimating π ≈ π_* - [§5.3]
  Initialize: π(s) ∈ A(s) arbitrarily, for all s ∈ S
              Q(s, a) ∈ ℝ arbitrarily, for all s ∈ S, a ∈ A(s)
              Returns(s, a) ← an empty list, for all s ∈ S, a ∈ A(s)
  while forever do
      Choose S_0 ∈ S and A_0 ∈ A(S_0) randomly, such that all pairs have probability > 0
      Generate an episode from S_0, A_0 following π: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
      G ← 0
      foreach step of episode, t = T−1, T−2, ..., 0 do
          G ← γG + R_{t+1}
          if the pair S_t, A_t is not in the sequence S_0, A_0, S_1, A_1, ..., S_{t−1}, A_{t−1} then
              Append G to Returns(S_t, A_t)
              Q(S_t, A_t) ← average(Returns(S_t, A_t))
              π(S_t) ← argmax_a Q(S_t, a)
          end
      end
  end

To remove the exploring starts assumption, we can use an ε-soft policy: most of the time it selects the greedy action, but with probability ε it instead selects an action at random. Another approach is to use off-policy methods, which learn about the optimal policy while behaving according to a different exploratory policy. The policy being learned about is called the target policy, π, and the policy used to generate behavior is called the behavior policy, b (usually an exploratory policy, e.g. a random policy).

In order to use episodes from b to estimate values for π, we require that every action taken under π is also taken, at least occasionally, under b. That is, we require that π(a|s) > 0 implies b(a|s) > 0. This is called the assumption of coverage.
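As an illustration, here is a sketch of the ε-soft on-policy variant of MC control mentioned above (used instead of exploring starts so that episodes always terminate), on a hypothetical 4-state corridor: action 1 moves right, action 0 moves left, and the reward is +1 only when stepping off the right end. The environment is an assumption, not part of the cheat sheet.

import random
from collections import defaultdict

GAMMA, EPSILON, N, ACTIONS = 0.9, 0.1, 4, [0, 1]
Q = defaultdict(float)                       # Q[(s, a)]
returns = defaultdict(list)

def eps_greedy(s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for _ in range(2000):
    s, episode = random.randrange(N), []
    while 0 <= s < N:                        # -1 and N are terminal
        a = eps_greedy(s)
        s_next = s + (1 if a == 1 else -1)
        r = 1.0 if s_next == N else 0.0
        episode.append((s, a, r))
        s = s_next
    first = {}
    for i, (s, a, _) in enumerate(episode):
        first.setdefault((s, a), i)
    g = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        g = GAMMA * g + r
        if first[(s, a)] == t:               # first visit to (s, a)
            returns[(s, a)].append(g)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])

print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N)])  # expect [1, 1, 1, 1]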
Off-policy Every-visit MC Prediction

Algorithm 6: Off-policy Every-visit Monte Carlo prediction - estimating V ≈ v_π - [Course2-Week2]
  Inputs: π - the policy to be evaluated
  Initialize: V(s) ∈ ℝ for all s ∈ S
              Return(s) ← an empty list, for all s ∈ S
  while forever - for each episode do
      Generate an episode following b: S_0, A_0, R_1, S_1, A_1, ..., S_{T-1}, A_{T-1}, R_T
      G ← 0
      W ← 1
      foreach step of episode, t = T−1, T−2, ..., 0 do
          G ← γWG + R_{t+1}
          Append G to Return(S_t)
          V(S_t) ← average(Return(S_t))
          W ← W · π(A_t|S_t) / b(A_t|S_t)
      end
  end

Incremental Implementation

The average used to compute V(S_t) can be performed incrementally:

V_n(S_t) = (1/n) Σ_{i=1}^{n} G_i(t) = V_{n−1}(S_t) + (1/n) (G_n(t) − V_{n−1}(S_t))   (26)
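A tiny sketch of the incremental mean of Eq. (26): updating an average with each new return instead of storing the whole list. The return values are hypothetical.

returns = [2.0, 0.0, 1.0, 3.0]   # hypothetical returns G_1..G_4 for one state

v, n = 0.0, 0
for g in returns:
    n += 1
    v += (g - v) / n             # V_n = V_{n-1} + (G_n - V_{n-1}) / n

print(v, sum(returns) / len(returns))  # both 1.5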
Off-policy MC Control

The policy used to generate behavior, called the behavior policy, may in fact be unrelated to the policy that is evaluated and improved, called the target policy. An advantage of this separation is that the target policy may be deterministic (e.g., greedy), while the behavior policy can continue to sample all possible actions.

Algorithm 7: Off-policy MC Control - estimating π ≈ π_* - [§5.7]
  Initialize, for all s ∈ S, a ∈ A(s):
      Q(s, a) ∈ ℝ (arbitrarily)
      C(s, a) ← 0
      π(s) ← argmax_a Q(s, a)
  while forever - for each episode do
      b ← any soft policy
      Generate an episode following b: S_0, A_0, R_1, S_1, A_1, ..., S_{T-1}, A_{T-1}, R_T
      G ← 0
      W ← 1
      foreach step of episode, t = T−1, T−2, ..., 0 do
          G ← γG + R_{t+1}
          C(S_t, A_t) ← C(S_t, A_t) + W
          Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
          π(S_t) ← argmax_a Q(S_t, a)
          if A_t ≠ π(S_t) then exit the inner loop (proceed to the next episode)
          W ← W · 1 / b(A_t|S_t)
      end
  end
Temporal-Difference Learning

TD Prediction

Starting from Eq. 26, we can consider a generic update rule for V(S_t):

V(S_t) ← V(S_t) + α (G_t − V(S_t))   [6.1] (27)

Here α is a constant step size and we call this method constant-α MC. MC has to wait until the end of an episode to determine the increment to V(S_t). Differently from MC, TD updates the value at each step of the episode following the equation below:

V(S_t) ← V(S_t) + α (R_{t+1} + γ V(S_{t+1}) − V(S_t))   [6.2] (28)

Algorithm 8: Tabular TD(0) - estimating v_π - [§6.1]
  Inputs: π - the policy to be evaluated
  Params: step size α ∈ (0, 1]
  Initialize: V(s) ∈ ℝ for all s ∈ S^+, except V(terminal) = 0
  foreach episode do
      Initialize S
      foreach step of episode - until S is terminal do
          A ← action given by π for S
          Take action A, observe R, S'
          V(S) ← V(S) + α (R + γ V(S') − V(S))
          S ← S'
      end
  end

Recall that:

v_π(s) = E_π[G_t | S_t = s]   [3.12 | 6.3] (29)
       = E_π[R_{t+1} + γ G_{t+1} | S_t = s]   [by 3.9] (30)
       = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]   [6.4] (31)

MC methods use an estimate of Eq. 29 as a target. The MC target is an estimate because the expected value in Eq. 29 is not known; a sample return is used in place of the real expected return. DP and TD methods use an estimate of Eq. 31 as a target. The DP target is an estimate because v_π(S_{t+1}) is not known and the current estimate, V(S_{t+1}), is used instead. The TD target is an estimate because it samples the expected value in Eq. 31 and it uses the current estimate V instead of the true v_π.

TD methods update their estimates based in part on other estimates: they learn a guess from a guess, i.e. they bootstrap. TD and MC methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions. The most obvious advantage of TD methods over MC methods is that they are naturally implemented in an online, fully incremental fashion. With MC methods one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step. In practice, TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.

The error, available at time t+1, between V(S_t) and the better estimate R_{t+1} + γ V(S_{t+1}) is called the TD error:

δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)   [6.5] (32)
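A sketch of Algorithm 8 (tabular TD(0)) on the same hypothetical 5-state random walk used earlier; with a uniform random policy the estimates should approach [1/6, ..., 5/6].

import random

ALPHA, GAMMA, N = 0.05, 1.0, 5
V = [0.0] * (N + 2)                  # indices 0 and N+1 are the terminal states

for _ in range(5000):
    s = 3                            # start in the middle (non-terminal states 1..N)
    while 1 <= s <= N:
        s_next = s + random.choice([-1, 1])
        r = 1.0 if s_next == N + 1 else 0.0
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])   # TD(0) update, Eq. (28)
        s = s_next

print([round(v, 2) for v in V[1:N + 1]])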
end
Expected Sarsa
Params: step size α ∈]0, 1], small  > 0 Similar to Q-Learning, the update rule of Expected Sarsa,
Algorithm 8: Tabular TD(0) - estimating vπ [§6.1] Initialize Q(s, a) for all s ∈ S + and a ∈ A(s), takes the expected value instead of the maximum over the
Recall that: arbitrarily except that Q(terminal − state, ·) = 0 next state:
. foreach episode do
vπ (s) = Eπ [Gt |St = s] [3.12|6.3] (29) Initialize S Q(St , At ) ←
= Eπ [Rt+1 + γGt+1 |St = s] [by 3.9] (30) Choose A from S using policy derived from Q (e.g. Q(St , At ) + α[Rt+1 + γEπ [Q(St+1 , At+1 )|St+1 ] − Q(St , At )]
= Eπ [Rt+1 + γvπ (St+1 )|St = s] [6.4] (31) -greedy) X
foreach step of episode - until S is terminal do Q(St , At ) + α[Rt+1 + γ π(a|St+1 )Q(St+1 , a) − Q(St , At )] [6.9]
Take action A, observe R, S 0 a
Choose A0 from S 0 using policy derived from Q (35)
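A sketch of Algorithm 9 (Sarsa) on the hypothetical 4-state corridor used above (action 1 = right, action 0 = left, reward +1 for exiting right); the environment is an assumption for the sake of a runnable example.

import random
from collections import defaultdict

ALPHA, GAMMA, EPS, N = 0.1, 0.9, 0.1, 4
Q = defaultdict(float)

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice([0, 1])
    return max([0, 1], key=lambda a: Q[(s, a)])

def step(s, a):
    s_next = s + (1 if a == 1 else -1)
    return s_next, (1.0 if s_next == N else 0.0)

for _ in range(2000):
    s = random.randrange(N)
    a = eps_greedy(s)
    while 0 <= s < N:
        s_next, r = step(s, a)
        a_next = eps_greedy(s_next) if 0 <= s_next < N else 0
        target = r + GAMMA * (Q[(s_next, a_next)] if 0 <= s_next < N else 0.0)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])       # Eq. (33)
        s, a = s_next, a_next

print([max([0, 1], key=lambda a: Q[(s, a)]) for s in range(N)])  # expect [1, 1, 1, 1]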
Q-Learning - Off-policy TD Control

Q-Learning is an off-policy TD control method. Q-learning is a sample-based version of value iteration which iteratively applies the Bellman optimality equation. The update rule:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]   [6.8] (34)

Algorithm 10: Q-Learning - Off-policy TD Control - estimating π ≈ π_* - [§6.5]
  Params: step size α ∈ (0, 1], small ε > 0
  Initialize Q(s, a) for all s ∈ S^+ and a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
  foreach episode do
      Initialize S
      foreach step of episode - until S is terminal do
          Choose A from S using the policy derived from Q (e.g. ε-greedy)
          Take action A, observe R, S'
          Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
          S ← S'
      end
  end
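A sketch of Algorithm 10 (Q-learning) on the same hypothetical corridor; compared with the Sarsa sketch, only the target changes, using max_a Q(S', a) as in Eq. (34).

import random
from collections import defaultdict

ALPHA, GAMMA, EPS, N = 0.1, 0.9, 0.1, 4
Q = defaultdict(float)

for _ in range(2000):
    s = random.randrange(N)
    while 0 <= s < N:
        if random.random() < EPS:                       # epsilon-greedy behavior policy
            a = random.choice([0, 1])
        else:
            a = max([0, 1], key=lambda x: Q[(s, x)])
        s_next = s + (1 if a == 1 else -1)
        r = 1.0 if s_next == N else 0.0
        q_next = max(Q[(s_next, 0)], Q[(s_next, 1)]) if 0 <= s_next < N else 0.0
        Q[(s, a)] += ALPHA * (r + GAMMA * q_next - Q[(s, a)])   # Eq. (34)
        s = s_next

print([max([0, 1], key=lambda x: Q[(s, x)]) for s in range(N)])  # expect [1, 1, 1, 1]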
Expected Sarsa

Similar to Q-Learning, the update rule of Expected Sarsa takes the expected value over the next state's actions instead of the maximum:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ E_π[Q(S_{t+1}, A_{t+1}) | S_{t+1}] − Q(S_t, A_t)]
            = Q(S_t, A_t) + α [R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q(S_t, A_t)]   [6.9] (35)

The next action is sampled from π. However, the expectation over actions is computed independently of the action actually selected in the next state. In fact, it is not necessary that π equals the behavior policy. This means that Expected Sarsa, like Q-learning, can be used to learn off-policy without importance sampling. If the target policy is greedy with respect to its action-value estimates, we obtain Q-Learning. Hence Q-Learning is a special case of Expected Sarsa.
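A small sketch of a single Expected Sarsa update (Eq. 35), assuming a hypothetical ε-greedy target policy over two actions and made-up action values.

EPS, GAMMA, ALPHA = 0.1, 0.9, 0.1
q_next = {0: 0.2, 1: 0.5}                 # Q(S_{t+1}, a) for a hypothetical next state
greedy = max(q_next, key=q_next.get)

# pi(a|S_{t+1}) for an epsilon-greedy target policy
pi = {a: EPS / len(q_next) + (1 - EPS) * (a == greedy) for a in q_next}

expected_q = sum(pi[a] * q_next[a] for a in q_next)
r, q_sa = 1.0, 0.3
q_sa += ALPHA * (r + GAMMA * expected_q - q_sa)   # Expected Sarsa update
print(round(q_sa, 4))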
n-step Bootstrapping

n-step TD Prediction

MC methods update the estimate of v_π(S_t) for each state based on the entire sequence of observed rewards from that state until the end of the episode, using:

G_t = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T   (36)

In one-step TD, instead, the update is based on just the one next reward, bootstrapping from the value of the state one step later as a proxy for the remaining rewards, using the one-step return:

G_{t:t+1} = R_{t+1} + γ V_t(S_{t+1})   (37)

In n-step TD, the n-step return is used:

G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n V_{t+n−1}(S_{t+n})   (38)
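A short sketch computing the n-step return of Eq. (38) from stored rewards and a current value estimate; the numbers, and the way the bootstrap state is indexed, are hypothetical.

GAMMA, n, t = 0.9, 3, 0
rewards = [1.0, 0.0, 2.0, 0.5]        # R_1, R_2, R_3, R_4
V = {3: 1.5}                          # value estimate for the state reached after n steps

g = sum(GAMMA**k * rewards[t + k] for k in range(n))   # R_{t+1} + ... + gamma^{n-1} R_{t+n}
g += GAMMA**n * V[t + n]                                # bootstrap term gamma^n V(S_{t+n})
print(g)                                                # approx 3.71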
Planning and Learning

Planning methods use simulated experience generated by a model; learning methods use real experience generated by the environment. Many ideas and algorithms can be transferred between planning and learning.

Algorithm 11: Random-sample one-step tabular Q-planning - [§8.1]
  Params: step size α ∈ (0, 1], small ε > 0
  Initialize Q(s, a) for all s ∈ S^+ and a ∈ A(s), arbitrarily except that Q(terminal, ·) = 0
  while forever do
      1. Select a state S ∈ S and an action A ∈ A(S) at random
      2. From a sample model obtain the sample reward and next state: R, S' = model(S, A)
      3. Apply one-step tabular Q-learning to S, A, R, S':
         Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
  end
Dyna

Within a planning agent, there are at least two roles for real experience:
(1) it can be used to improve the model (to match the real environment more accurately): model learning;
(2) it can be used to directly improve the value function and policy using the kinds of reinforcement learning methods discussed before: direct RL [§8.2].

Experience can therefore improve value functions and policies either directly (direct RL) or indirectly via the model: the real experience obtained from interaction with the environment is used for model learning, and planning then uses the simulated experience generated by the model (indirect RL) [§8.2].

Algorithm 12: Dyna-Q - [§8.2]
  Initialize Q(s, a) and Model(s, a) for all s ∈ S^+ and a ∈ A(s)
  while forever do
      (a) S ← current (nonterminal) state
      (b) A ← ε-greedy(S, Q)
      (c) Take action A; observe the resultant reward R and state S'
      (d) Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
      (e) Model(S, A) ← R, S'   (assuming a deterministic environment)
      (f) repeat n times:
              S ← random previously observed state
              A ← random action previously taken in S
              R, S' ← Model(S, A)
              Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
          end
  end

Step (d) is direct reinforcement learning, step (e) is model learning, and step (f) is planning.
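A sketch of Algorithm 12 (Dyna-Q) on the hypothetical corridor environment used in the earlier examples, with n planning updates from a deterministic learned model after each real step.

import random
from collections import defaultdict

ALPHA, GAMMA, EPS, N, N_PLAN = 0.1, 0.9, 0.1, 4, 10
Q = defaultdict(float)
model = {}                                   # Model: (S, A) -> (R, S')

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice([0, 1])
    return max([0, 1], key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next):
    q_next = max(Q[(s_next, 0)], Q[(s_next, 1)]) if 0 <= s_next < N else 0.0
    Q[(s, a)] += ALPHA * (r + GAMMA * q_next - Q[(s, a)])

for _ in range(500):
    s = random.randrange(N)
    while 0 <= s < N:
        a = eps_greedy(s)                                    # (b)
        s_next = s + (1 if a == 1 else -1)                   # (c)
        r = 1.0 if s_next == N else 0.0
        q_update(s, a, r, s_next)                            # (d) direct RL
        model[(s, a)] = (r, s_next)                          # (e) model learning
        for _ in range(N_PLAN):                              # (f) planning
            ps, pa = random.choice(list(model))
            pr, ps_next = model[(ps, pa)]
            q_update(ps, pa, pr, ps_next)
        s = s_next

print([max([0, 1], key=lambda a: Q[(s, a)]) for s in range(N)])  # expect [1, 1, 1, 1]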
If steps (e) and (f) were omitted, the remaining algorithm would be one-step tabular Q-learning. The agent responds instantly to the latest sensory information and yet is always planning in the background. The model-learning process is also in the background: as new information is gained, the model is updated to better match reality, and as the model changes, the ongoing planning process gradually computes a different way of behaving that matches the new model.

Models may be incorrect for many reasons: the environment is stochastic and only a limited number of samples have been observed; the model was learned using function approximation that has generalized imperfectly; the environment has changed and its new behavior has not yet been observed. When the model is incorrect, the planning process is likely to compute a suboptimal policy. In some cases, the suboptimal policy computed by planning quickly leads to the discovery and correction of the modeling error. This happens when the model is optimistic, predicting greater reward or better state transitions than are actually possible. It is more difficult to correct the model when the environment becomes better than it was before.

[Figure omitted: a diagram relating the algorithms presented in the Coursera course, Course3-Week1.]
On-policy Prediction with Function Approximation

We can approximate the value function not as a table but as a parametrized functional form: v̂(s, w) ≈ v_π(s), where w ∈ ℝ^d and the number of weights is much smaller than the number of states (d << |S|). Value estimation can then be framed as a supervised learning problem. Monte Carlo methods estimate the value function using samples of the return, so the input is the state and the targets are the returns (pairs (S_i, G_i)). For TD methods the targets are the one-step bootstrap returns (pairs (S_i, R_{i+1} + γ v̂(S_{i+1}, w))).

In the RL setting, the data is temporally correlated and the full dataset is not fixed and available from the beginning. Moreover, due to bootstrapping (TD, DP), the target labels change over time. In the tabular case the learned values at each state were decoupled: an update at one state affected no other. Now making one state's estimate more accurate means making others' less accurate.

VE(w) = Σ_{s∈S} µ(s) [v_π(s) − v̂(s, w)]^2   [9.1] (39)

where µ(s) is a state distribution (µ(s) ≥ 0 and Σ_s µ(s) = 1) representing how much we care about the error in each state s. To minimize Eq. 39, Stochastic Gradient Descent (SGD) is usually used:

w_{t+1} = w_t − (1/2) α ∇[v_π(S_t) − v̂(S_t, w_t)]^2   [9.4] (40)
        = w_t + α [v_π(S_t) − v̂(S_t, w_t)] ∇v̂(S_t, w_t)   [9.5] (41)

Usually we have only an approximation U_t of v_π(S_t), but if U_t is an unbiased estimate of v_π(S_t), that is E[U_t | S_t = s] = v_π(S_t) for each t, then w_t is guaranteed to converge to a local optimum (under the stochastic approximation conditions for a decreasing α). For MC, U_t = G_t, and hence U_t is an unbiased estimate of v_π(S_t).

Algorithm 13: Gradient MC - estimating v ≈ v_π - [§9.3]
  Inputs: π - the policy to be evaluated
          a differentiable function v̂ : S × ℝ^d → ℝ
  Params: step size α > 0
  Initialize: w ∈ ℝ^d arbitrarily (e.g. w = 0)
  while forever - for each episode do
      Generate an episode following π: S_0, A_0, R_1, S_1, A_1, ..., S_{T-1}, A_{T-1}, R_T
      foreach step of episode, t = 0, 1, ..., T−1 do
          w ← w + α [G_t − v̂(S_t, w)] ∇v̂(S_t, w)
      end
  end
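A sketch of Algorithm 13 (gradient Monte Carlo) with linear state aggregation on the hypothetical 5-state random walk used earlier: v̂(s, w) = w[group(s)], so the gradient is a one-hot vector over groups. The feature scheme and hyperparameters are assumptions made for the example.

import random
import numpy as np

GAMMA, ALPHA, N, D = 1.0, 0.01, 5, 3      # 5 states aggregated into 3 groups
w = np.zeros(D)

def features(s):
    x = np.zeros(D)
    x[s * D // N] = 1.0                   # state aggregation: one-hot group indicator
    return x

def generate_episode():
    s, ep = 2, []
    while 0 <= s < N:
        s_next = s + random.choice([-1, 1])
        r = 1.0 if s_next == N else 0.0
        ep.append((s, r))
        s = s_next
    return ep

for _ in range(5000):
    ep = generate_episode()
    g = 0.0
    for s, r in reversed(ep):             # returns G_t computed backward
        g = GAMMA * g + r
        x = features(s)
        w += ALPHA * (g - w @ x) * x      # w <- w + alpha [G_t - v_hat] grad v_hat
print([round(float(w @ features(s)), 2) for s in range(N)])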
https://github.com/linker81/Reinforcement-Learning-CheatSheet
