Reinforcement Learning: Foundations
November 2024
This book is still work in progress. In particular, references to the
literature are not complete. We would be grateful for comments,
suggestions, and reports of omissions and errors of any kind, at
[email protected].
Please cite as
@book{MannorMT-RLbook,
url = {https://sites.google.com/view/rlfoundations/home},
author = {Mannor, Shie and Mansour, Yishay and Tamar, Aviv},
title = {Reinforcement Learning: Foundations},
year = {2023},
publisher = {-}
}
Contents
3.4.6 From Dijkstra’s Algorithm to A∗ . . . . . . . . . . . . . . . . 40
3.5 Average cost criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Continuous Optimal Control . . . . . . . . . . . . . . . . . . . . . . 44
3.6.1 Linear Quadratic Regulator . . . . . . . . . . . . . . . . . . . 45
3.6.2 Iterative LQR . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 Bibliography notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Markov Chains 49
4.1 State Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Recurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Invariant Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Reversible Markov Chains . . . . . . . . . . . . . . . . . . . . 58
4.3.2 Mixing Time . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.7 Policy Iteration (PI) . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.8 A Comparison between VI and PI Algorithms . . . . . . . . . . . . . 92
6.9 Bibliography notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
11 Reinforcement Learning: Model Free 141
11.1 Model Free Learning – the Situated Agent Setting . . . . . . . . . . . 141
11.2 Q-learning: Deterministic Decision Process . . . . . . . . . . . . . . 142
11.3 Monte-Carlo Policy Evaluation . . . . . . . . . . . . . . . . . . . . . 145
11.3.1 Generating the samples . . . . . . . . . . . . . . . . . . . . . . 146
11.3.2 First visit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.3.3 Every visit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
11.3.4 Monte-Carlo control . . . . . . . . . . . . . . . . . . . . . . . 152
11.3.5 Monte-Carlo: pros and cons . . . . . . . . . . . . . . . . . . . 153
11.4 Stochastic Approximation . . . . . . . . . . . . . . . . . . . . . . . . 154
11.4.1 Convergence via Contraction . . . . . . . . . . . . . . . . . . . 155
11.4.2 Convergence via the ODE method . . . . . . . . . . . . . . . . 156
11.4.3 Comparison between the two convergence proof techniques . . 159
11.5 Temporal Difference algorithms . . . . . . . . . . . . . . . . . . . . . 161
11.5.1 TD(0) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
11.5.2 Q-learning: Markov Decision Process . . . . . . . . . . . . . . 165
11.5.3 Q-learning as a stochastic approximation . . . . . . . . . . . . 166
11.5.4 Step size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
11.5.5 SARSA: on-policy Q-learning . . . . . . . . . . . . . . . . . . 168
11.5.6 TD: Multiple look-ahead . . . . . . . . . . . . . . . . . . . . . 173
11.5.7 The equivalence of the forward and backward view . . . . . . 176
11.5.8 SARSA(λ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
11.6 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
11.6.1 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . 178
11.6.2 Algorithms for Episodic MDPs . . . . . . . . . . . . . . . . . 180
11.7 Bibliography Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 180
12.4 Approximate Policy Optimization . . . . . . . . . . . . . . . . . . . . 199
12.4.1 Approximate Policy Iteration . . . . . . . . . . . . . . . . . . 200
12.4.2 Approximate Policy Iteration Algorithms . . . . . . . . . . . . 200
12.4.3 Approximate Value Iteration . . . . . . . . . . . . . . . . . . . 202
12.5 Off-Policy Learning with Function Approximation . . . . . . . . . . . 203
B Ordinary Differential Equations 259
B.1 Definitions and Fundamental Results . . . . . . . . . . . . . . . . . . 259
B.1.1 Systems of Linear Differential Equations . . . . . . . . . . . . 261
B.2 Asymptotic Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
Chapter 1
1.2 Motivation for RL
In recent years there has been renewed interest in RL. The new interest is grounded
in emerging applications of RL, as well as in the progress of deep learning, which has
been applied impressively to solving challenging RL tasks. But for us, the interest comes
from the promise of RL and its potential to be an effective tool for control and
behavior in dynamic environments.
Over the years, reinforcement learning has proven to be highly successful for play-
ing board games that require long horizon planning. As early as 1962, Arthur Samuel
[96] developed a checkers-playing program that played at the level of the best humans. His original
framework included many of the ingredients which later contributed to RL, as well
as search heuristics for large domains. In 1992, Gerald Tesauro developed TD-
Gammon [120], which used a two-layer neural network to achieve a high-performance
agent for playing the game of backgammon. The network was trained from scratch,
by playing against itself in simulation and using a temporal difference learning rule.
One of the striking features of TD-Gammon was that even on the first move, it played
a different opening move than the one typically used by backgammon grandmasters.
Indeed, this move was later adopted by the backgammon community [121]. More
recently, DeepMind developed AlphaGo, a deep neural-network based agent
for playing Go, which was able to beat the best Go players in the world, solving a
long-standing challenge for artificial intelligence [103].
To complete the picture of computer board games, we should mention Deep Blue,
which in 1997 was able to beat the then world champion, Garry Kasparov [18]. This
program was built mainly on heuristic search, and new hardware was developed to sup-
port it. Recently, DeepMind's AlphaZero matched the best chess programs (which
are already much better than any human player), using a reinforcement learning
approach [104].
Another domain, popularized by DeepMind, is playing Atari video games [83],
which were popular in the 1980s. DeepMind was able to show that deep neural
networks can achieve human-level performance, using only the raw video image and
the game score as input (and having no additional information about the goal of
the game). Importantly, this result reignited interest in RL within the robotics
community, where acting based on raw sensor measurements (a.k.a. 'end-to-end') is
a promising alternative to the conventional practice of separating decision making
into perception, planning, and control components [68].
More recently, interest in RL was sparked yet again, as it proved to be an important
component in fine-tuning large language models to match user preferences, or to
accomplish certain tasks [88, 134]. One can think of the sequence of words in a
conversation as individual decisions made with some higher-level goal in mind, and
RL fits naturally with this view of language generation.
While the RL implementations in each of the different applications mentioned
above were very different, the fundamental models and algorithmic ideas were sur-
prisingly similar. These foundations are the topic of this book.
1.5 Book Organization
The book is thematically comprised of two main parts – planning and learning.
Planning: The planning theme develops the fundamentals of optimal decision mak-
ing in the face of uncertainty, under the Markov decision process model. The basic
assumption in planning is that the MDP model is known (yet, as the model is stochas-
tic, uncertainty must still be accounted for in making decisions). In a preface to the
planning section, Chapter 2, we motivate the MDP model and relate it to other mod-
els in the planning and control literature. In Chapter 3 we introduce the problem and
basic algorithmic ideas under the deterministic setting. In Chapter 4 we review the
topic of Markov chains, which the Markov decision process model is based on, and
then, in Chapter 5 we introduce the finite horizon MDP model and a fundamental
dynamic programming approach. Chapter 6 covers the infinite horizon discounted
setting, and Chapter 7 covers the episodic setting. Chapter 8 covers an alternative
approach for solving MDPs using a linear programming formulation.
Learning: The learning theme covers decision making when the MDP model is not
known in advance. In a preface to the learning section, Chapter 9, we motivate
this learning problem and relate it to other learning problems in decision making.
Chapter 10 introduces the model-based approach, where the agent explicitly learns
an MDP model from its experience and uses it for planning decisions. Chapter
11 covers an alternative model-free approach, where decisions are learned without
explicitly building a model. Chapters 12 and 13 address learning of approximately
optimal solutions in large problems, that is, problems where the underlying MDP
model is intractable to solve. Chapter 12 approaches this topic using approximation
of the value function, while Chapter 13 considers policy approximations. In Chapter
14 we consider the special case of Multi-Armed Bandits, which can be viewed as an MDP
with a single state and unknown rewards, and study the online nature of decision
making in more detail.
The book of Howard [42], building on his PhD thesis, introduced the policy iteration
algorithm as well as a clear algorithmic definition of value iteration. A precursor
work by Shapley [100] introduced a discounted MDP model for stochastic games.
There is a variety of books addressing Markov Decision Processes and Reinforce-
ment Learning. Puterman's book [92] gives an extensive exposition of the mathematical
properties of MDPs, including planning algorithms. Bertsekas and Tsitsiklis [12]
give a stochastic processes approach to reinforcement learning. Bertsekas [13] gives
a detailed exposition of stochastic shortest paths.
Sutton and Barto [112] give a general exposition of modern reinforcement learn-
ing, which is more focused on implementation issues and less on mathematical
issues. Szepesvari's monograph [115] gives an outline of basic reinforcement learning
algorithms. Bertsekas and Tsitsiklis provide a thorough treatment of RL algorithms
and theory in [12].
Chapter 2
In the following chapters, we discuss the planning problem, where a model is known.
Before diving in, however, we shall spend some time defining the various ap-
proaches to modeling a sequential decision problem, and motivating our choice to focus
on some of them. In the next chapters, we will rigorously cover selected approaches
and their implications. This chapter is quite different from the rest of the book, as
it discusses epistemological and philosophical issues more than anything else.
We are interested in sequential decision problems, in which a sequence of decisions
needs to be taken in order to achieve a goal or optimize some performance measure.
Some examples include:
Example 2.1 (Board games). An agent playing a board game such as Tic-Tac-Toe,
chess, or backgammon. Board games are typically played against an opponent, and
may involve external randomness such as the dice in backgammon. The goal is to
play a sequence of moves that lead to winning the game.
Example 2.2 (Robot Control). A robot needs to be controlled to perform some task,
for example, picking up an object and placing it in a bin, or folding up a piece of
cloth. The robot is controlled by applying voltages to its motors, and the goal is to
find a sequence of controls that perform the desired task within some time limits.
Example 2.3 (Inventory Control). Inventory control represents a classical and prac-
tical application of sequential decision making under uncertainty. In its simplest
form, a decision maker must determine how much inventory to order at each time
period to meet uncertain future demand while balancing ordering costs, holding costs,
and stockout penalties. The uncertainty in demand requires a good policy to adapt
to the stochastic nature of customer behavior while accounting for both immediate
costs and future implications of current decisions. The (s, S) policy, also known as
a reorder point-order-up-to policy ([97]), is an elegantly simple yet often optimal ap-
proach to inventory control. Under this policy, whenever the inventory level drops to
or below a reorder point s, an order is placed to bring the inventory position up to a
target level S. While finding the optimal values for s and S is non-trivial, this pol-
icy structure has been proven optimal for many important inventory problems under
reasonable assumptions. The (s, S) framework provides an excellent example of how
constraining the policy space, in this case to just two parameters, can make learning
more efficient while still achieving strong performance.
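To make the (s, S) structure concrete, here is a minimal simulation sketch of such a policy. The demand distribution and the ordering, holding, and stockout cost parameters are illustrative assumptions and not part of the example above; finding the best (s, S) pair is a separate optimization problem.

```python
import random

def simulate_sS_policy(s=20, S=60, horizon=52, seed=0,
                       order_cost=50.0, unit_cost=2.0,
                       holding_cost=0.1, stockout_cost=5.0):
    """Simulate an (s, S) inventory policy under random demand.

    Whenever the inventory drops to or below the reorder point s, we order
    up to the target level S. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    inventory = S
    total_cost = 0.0
    for _ in range(horizon):
        # Ordering decision: the (s, S) rule.
        if inventory <= s:
            order = S - inventory
            total_cost += order_cost + unit_cost * order
            inventory = S
        # Random demand for this period (illustrative distribution).
        demand = rng.randint(0, 15)
        sold = min(inventory, demand)
        unmet = demand - sold
        inventory -= sold
        # Holding cost on leftover stock, penalty on unmet demand.
        total_cost += holding_cost * inventory + stockout_cost * unmet
    return total_cost

if __name__ == "__main__":
    # Compare a few reorder points; in a real application s and S would be
    # optimized, e.g., by dynamic programming.
    for s in (5, 20, 40):
        print(s, round(simulate_sS_policy(s=s), 1))
```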
When we are given a sequential decision problem, we have to model it from a
mathematical perspective. In this book, and in much of the literature, the focus is
mostly on the celebrated Markov Decision Process (MDP) model. It should be clear
that this is merely a model, i.e., one should not view it as a precise reflection of
reality. To quote Box: "all models are wrong, but some are useful". Our goal is
to have useful models, and as such the Markov decision model is a prime example.
The MDP model has the following components, which we discuss here and define
formally in later chapters (a minimal code sketch of these components follows the
list below). We will use the agent-centric view, assuming an agent
interacts with an environment. This agent is sometimes called a "decision maker",
especially in the operations research community.
1. States: A state is the atomic entity that represents all the information needed
to predict future rewards of the system. The agent in an MDP can fully observe
the state.
2. Actions: The actions are the choices available to the agent; at each decision
epoch the agent selects one of them, and this choice influences both the reward
and the next state.
3. Rewards: The rewards represent some numerical measurement that the decision
maker wishes to maximize. The reward is assumed to be a function of the
current state and the action.
4. Dynamics: The state changes (or transitions) according to the dynamics. This
evolution depends only on the current state and the action chosen, but not on
past states or actions.
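As a deliberately minimal illustration of these four components, the sketch below packages a tiny MDP as plain Python data. The particular states, actions, rewards, and transition probabilities are made up for illustration and carry no special meaning.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                          # the state space
    actions: List[str]                         # the action space
    reward: Callable[[str, str], float]        # reward(state, action)
    # dynamics: transition[(state, action)] is a distribution over next states
    transition: Dict[Tuple[str, str], Dict[str, float]]

# A made-up two-state, two-action example.
toy = MDP(
    states=["low", "high"],
    actions=["wait", "work"],
    reward=lambda s, a: 1.0 if (s == "high" and a == "work") else 0.0,
    transition={
        ("low", "wait"): {"low": 0.9, "high": 0.1},
        ("low", "work"): {"low": 0.5, "high": 0.5},
        ("high", "wait"): {"low": 0.3, "high": 0.7},
        ("high", "work"): {"low": 0.6, "high": 0.4},
    },
)

print(toy.transition[("low", "work")], toy.reward("high", "work"))
```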
In planning, it is assumed that all the components are known. The objective of the
decision maker is to find a policy, i.e., a mapping from histories of state observations
to actions, that maximizes some objective function of the reward. We will adopt the
following standard assumptions concerning the planning model:
1. Time is discrete and regular: decisions are made in some predefined decision
epochs. For example, every second/month/year. While continuous time is
especially common in robotic applications, we will adhere for simplicity to dis-
crete regular times. In principle, this is not a particularly limiting assumption,
as most digital systems inherently discretize the time measurement. However,
it may be unnecessary to apply a different control at every time step; the
semi-MDP model is a common framework to use when the decision epochs
are irregular [92], and there is an extensive literature on optimal control in
continuous time [58], which we will not consider here.
2. Action space is finite. We will mostly assume that the available actions a
decision maker can choose from belong to a finite set. While this assumption
may appear natural in board games, or any digital system that is discretized,
in some domains such as robotics it is more natural to consider a continuous
control setting. For continuous actions, the structure of the action space is
critical for effective decision making – we will discuss some specific examples,
such as a linear dynamical system here. More general continuous and hybrid
discrete-continuous models are often studied in the control literature [11] and
in operations research [91].
3. State space is finite. The set of possible system states is also assumed to be
finite and unstructured. The finiteness assumption is mostly a convenience, as
any bounded continuous space can be finely discretized to a finite, but very
large set. Indeed, in the second part of this book, we shall study learning-
based methods that can handle very large state spaces. For problems where
the state space has a known and convenient structure, a model that takes
this structure into account can be more appropriate. For example, in a linear
controlled dynamical system, which we discuss in Section 3.6, the state space
is continuous, and its evolution with respect to the control is linear, leading
to a closed form optimal solution when the reward has a particular quadratic
structure. In the classical STRIPS and PDDL planning models, which we do
not cover here, the state space is a list of binary variables (e.g., a robot
setting a table may be described by [robot gripper closed = False, cup
on table = True, plate on table = False,. . . ]), and planning algorithms that try
to find actions that lead to certain goal states being ‘true’ can take account of
this special structure [95].
4. Rewards are all given in a single currency. We assume that the agent has a single
reward stream it tries to optimize. Specifically the agent tries to maximize the
long term sum of rewards. In some cases, a user may be interested in other
statistics of the reward, such as its variance [76], or to balance multiple types
of rewards [75]; we do not cover these cases here.
5. Markov state evolution. We shall assume that the environment’s reaction to the
agent’s actions is fixed (it may be stochastic, but with a fixed distribution), and
depends only on the current state. This assumption precludes environments
that are adversarial to the agent, or systems with multiple independent agents
that learn together with our agent [19, 133].
As should be clear from the points above, the MDP model is agnostic to structure
that certain problems may possess, and that more specialized models may exploit. The
reader may question, therefore, why study such a model for planning. As it turns
out, the simplicity and generality of the MDP is actually a blessing when using it
for learning, which is the main focus of this book. The reason is that structure of
a specific problem may be implicitly picked up by the learning algorithm, which is
designed to identify patterns in data. This strategy has proved to be very valuable in
computing decision making policies for problems where structure exists, but is hard to
define manually, which is often the case in practice. Indeed, many recent RL success
stories, such as mastering the game of Go, managing resources in the complex game
of StarCraft, and state-of-the-art continuous control of racing drones, have all used
the simple MDP model combined with powerful deep learning methods [102, 128, 50].
There are two other strong modelling assumptions in the MDP: (1) all uncertainty
in decision making is limited to the randomness of the (Markov) state transitions,
and (2) the objective can only be specified using rewards. We next discuss these two
design choices in more detail.
For example, in the board game backgammon, the probability of a move is given by
the throw of two dice.
Epistemic uncertainty: Dealing with lack of knowledge about the model param-
eters. Sometimes we do not know what the exact model parameters are because
more interaction is needed to learn them (this is addressed in the learning part of
this book). Sometimes we have a nominal model and the true parameters are only
revealed at runtime (this is addressed within the robust MDP framework; see [87]).
Sometimes our model is too coarse or simply incorrect – this is known as model
misspecification.
Partial observability: Reasoning with incomplete information concerning the true
state of the system. There are many problems where we just do not have an accurate
measurement that can help us predict the future and instead we get to observe partial
information concerning the true state of the system. Some would argue that all real
problems have some elements of partial observability in them.
We emphasize that for planning and for learning, a model could combine all types
of uncertainty. The choice of which types of uncertainty to model is an important design choice.
The MDP model that we focus on in the planning chapter only accommodates
aleatoric uncertainty, through the stochastic state transitions. While this may appear
to be a strong limitation, MDPs have proven useful for dealing with more general
forms of uncertainty. For example, in the learning chapters, we will ask how to up-
date an MDP model from interaction with the environment, to potentially reduce
epistemic uncertainty. For board games, even though MDPs cannot model an ad-
versary, assuming that the opponent is stochastic helps find a robust policy against
various opponent strategies. Moreover, by using the concept of self play – an agent
that learns to play against itself and continually improve – RL has produced the
most advanced AI agents for several games, including Chess and Go. For partially
observable systems, a fundamental result shows that taking the full observation his-
tory as the ‘state’, results in an MDP model for the problem (albeit with a huge
state space).
Nevertheless, in many applications, much of the problem is to engineer a “right”
reward function. This may be done by understanding the specifications of the prob-
lem, or from data of desired behavior, a problem known as Inverse Reinforcement
Learning [86].
Specifically, the mere existence of a reward function implies that every aspect
of the decision problem can be converted into a single currency. For example, in a
communication network, minimizing power and maximizing bit rate may be hard to
combine into a single reward function. Moreover, even when all aspects of a problem
can be amortized, in expectation, into a single reward function, the decision maker may
have other risk aspects in mind, such as resilience to rare events. We emphasize that
the reward function is a design choice made by the decision maker.
In some cases, the reward stream is very sparse. For example, in board games
the reward is often obtained only at the end of the game, in the form of a victory
or a loss. While this does not pose a conceptual problem, it may lead to practical
problems, as we will discuss later in the book. A conceptual solution here is to use
"proxy rewards".
A limitation of the Markov decision process planning model is the underlying
assumption that preferences can be succinctly represented through reward functions.
While in principle any preference among trajectories can be represented using a re-
ward function, by extending the state space to include all history, this may be cum-
bersome and may require a much larger state space. Specifically, the discount factor,
which is often assumed to be part of the problem specification, represents a preference
between short-term and long-term objectives. Such preferences are often arbitrary.
We finally comment that the assumption that there exists a scalar reward we
optimize (through a long term objective) does not hold in many problems. Often,
we have several potentially contradicting objectives. For example, we may want
to minimize power consumption while maximizing throughput in communication
networks. In general, part of the reward function engineering pertains to balancing
different objectives, even if they are not measured in the same way ("adding apples
and oranges"). A different approach is to embrace the multi-objective nature of the
decision problem through constrained Markov decision processes [3], or using other
approaches [e.g., 75].
Nevertheless, MDPs with their single reward function have proven useful in many
practical domains, as the availability of strong algorithms for solving MDPs effec-
tively allows the system engineer to tweak the reward function manually to fit some
hard-to-quantify desired behavior.
2.3 Importance of Small (Finite) Models
The next few chapters, and indeed much of the literature, explicitly assume that the
models are finite (in terms of actions and states) and even practically small. While
this is certainly justified from a pedagogical perspective, there are additional reasons
that make small models relevant.
Small models are more interpretable than large ones: it is often the case that dif-
ferent states capture particular meanings and hence lead to more explainable policies.
For example, in inventory control problems, the dynamic programming techniques
that we will study can show that for certain simplified problem instances, an op-
timal strategy has the structure of a threshold policy – if the inventory is below
some certain threshold then replenish, otherwise do not. Such observations about
the structure of optimal policies often inform the design of policies for more complex
scenarios.
The language and some fundamental concepts we shall develop for small models,
such as the value function, value iteration and policy iteration algorithms, and con-
vergence of stochastic approximation, will also carry over to the learning chapters,
which deal with large state spaces and approximations.
Chapter 3
In this chapter we introduce the dynamic system viewpoint of the optimal planning
problem, where given a complete model we characterize and compute the optimal
policy. We restrict the discussion here to deterministic (rather than stochastic) sys-
tems. We consider two basic settings: (1) the finite-horizon decision problem and its
recursive solution via finite-horizon Dynamic Programming, and (2) the average cost
and its related minimum average weight cycle.
$$s_{t+1} = f_t(s_t, a_t), \qquad t = 0, 1, 2, \ldots, T-1,$$
where
Remark 3.1. More generally, the set At of available actions may depend on the state
at time t, namely: at ∈ At (st ) ⊂ At .
Remark 3.2. The system is, in general, time-varying. It is called time invariant if
ft , St , At do not depend on the time t. In that case we write
ot = Ot (st , at ),
where ot is the system observation, or the output. In most of this book we implicitly
assume that ot = st , namely, the current state st is fully observed.
Graphical description: Finite models (over finite time horizons) can be represented
by a corresponding decision graph, as specified in the following example.
• A0 (1) = {1, 2}, A0 (2) = {1, 3}, A1 (b) = {α}, A1 (c) = {1, 4}, A1 (d) = {β}
Figure 3.1: Graphical description of a finite model
The standard definition of the cost $C_T$ is through the following cumulative cost
functional:
$$C_T(h_T) = \sum_{t=0}^{T-1} c_t(s_t, a_t) + c_T(s_T).$$
Here:
• $c_t(s_t, a_t)$ is the instantaneous cost or single-stage cost at stage $t$, and $c_t$ is the
instantaneous cost function.
• $c_T(s_T)$ is the terminal cost, and $c_T$ is the terminal cost function.
We shall refer to $C_T$ as the cumulative $T$-stage cost, or just the cumulative cost.
Our objective is to minimize the cumulative cost $C_T$, by a proper choice of actions.
We will define that goal more formally in the next section.
Remark 3.4. The cost functional defined above is additive in time. Other cost func-
tionals are possible, for example the max cost, but additive cost is by far the most
common and useful.
3.2.3 Control Policies
In general we will consider a few classes of control policies. The two basic dimensions
along which we will characterize the control policies are their dependence on the history
and their use of randomization.
Definition 3.5 (Stationary deterministic policy). For stationary models, we may de-
fine stationary control policies that depend only on the current state. A stationary
policy is defined by a single mapping π : S → A, so that at = π(st ) for all t ∈ T.
We denote the set of stationary policies by ΠS .
Evidently, ΠH ⊃ ΠM ⊃ ΠS .
Randomized (Stochastic) Control policies The control policies defined above spec-
ify deterministically the action to be taken at each stage. In some cases we want to
allow for a random choice of action.
Definition 3.7 (Markov stochastic policy). Define the set ΠM S of Markov randomized
(stochastic) control policies, where πt (·|ht ) is replaced by πt (·|st ).
Definition 3.8 (Stationary stochastic policy). Define the set ΠSS of stationary ran-
domized (stochastic) control policies, where πt (·|st ) is replaced by π(·|st ).
Note that the set ΠHS includes all other policy sets as special cases. For stochastic
control policies, we similarly have ΠHS ⊃ ΠM S ⊃ ΠSS .
Control policies and paths: As mentioned, a deterministic control policy specifies
an action for each state, whereas a path specifies an action only for states along the
path. The definition of a policy allows us to consider counter-factual events, namely,
what would have been the path if we considered a different action. This distinction
is illustrated in the following figure.
Remark 3.5. Suppose that for each state st , each action at ∈ At (st ) leads to a
different state st+1 (i.e., at most one edge connects any two states). We can then
identify each action at ∈ At (st ) with the next state st+1 = ft (st , at ) it induces. In
that case a path may be uniquely specified by the state sequence (s0 , s1 , . . . , sT ).
For a given policy $\pi$, define the probability of visiting state $s$ and taking action $a$ at time $t$ as
$$\rho^{\pi}_t(s, a) = \Pr_{h'_{t-1}}\left[a_t = a, s_t = s\right] = \mathbb{E}_{h'_{t-1}}\left[\mathbb{I}[s_t = s, a_t = a]\right],$$
where $h'_{t-1} = (s_0, a_0, \ldots, s_{t-1}, a_{t-1})$ is the history of the first $t-1$ time steps gen-
erated using $\pi$, and the probability and expectation are taken with respect to the
randomness of the policy $\pi$. Now we can rewrite the expected cost to go as
$$\mathbb{E}[C^{\pi}(s_0)] = \sum_{t=0}^{T-1} \sum_{s \in S_t,\, a \in A_t} c_t(s, a)\, \rho^{\pi}_t(s, a),$$
where $C^{\pi}(s_0)$ is the random variable of the cost when starting at state $s_0$ and following
policy $\pi$.
This implies that any two policies $\pi$ and $\pi'$ for which $\rho^{\pi}_t(s, a) = \rho^{\pi'}_t(s, a)$, for any
time $t$, state $s$ and action $a$, would have the same expected cumulative cost for any
cost function, i.e., $\mathbb{E}[C^{\pi}(s_0)] = \mathbb{E}[C^{\pi'}(s_0)]$.
Theorem 3.1. For any policy $\pi \in \Pi_{HS}$, there is a policy $\pi' \in \Pi_{MS}$, such that for
every state $s$ and action $a$ we have $\rho^{\pi}_t(s, a) = \rho^{\pi'}_t(s, a)$. This implies that
$$\mathbb{E}[C^{\pi}(s_0)] = \mathbb{E}[C^{\pi'}(s_0)].$$
Proof. Given the policy $\pi \in \Pi_{HS}$, we define $\pi' \in \Pi_{MS}$ as follows. For every state
$s \in S_t$ we define
$$\pi'_t(a|s) = \Pr_{h'_{t-1}}\left[a_t = a \mid s_t = s\right] = \frac{\rho^{\pi}_t(s, a)}{\sum_{a' \in A_t} \rho^{\pi}_t(s, a')}.$$
By definition $\pi'$ is Markovian (it depends only on the time $t$ and the realized state $s$).
We now claim that $\rho^{\pi'}_t(s, a) = \rho^{\pi}_t(s, a)$. To see this, let us denote $\rho^{\pi}_t(s) =
\Pr_{h'_{t-1}}[s_t = s]$. By construction, we have that $\rho^{\pi'}_t(s, a) = \rho^{\pi'}_t(s)\, \pi'_t(a|s) =
\rho^{\pi'}_t(s)\, \frac{\rho^{\pi}_t(s, a)}{\rho^{\pi}_t(s)}$.
We now show by induction that $\rho^{\pi'}_t(s) = \rho^{\pi}_t(s)$. For the base of the induction, by
definition we have that $\rho^{\pi'}_0(s) = \rho^{\pi}_0(s)$. Assume that $\rho^{\pi'}_t(s) = \rho^{\pi}_t(s)$. Then, by the
above, we have that $\rho^{\pi'}_t(s, a) = \rho^{\pi}_t(s, a)$. Then,
$$\rho^{\pi'}_{t+1}(s) = \sum_{a', s'} \Pr[s_{t+1} = s \mid a_t = a', s_t = s']\, \rho^{\pi'}_t(s', a')
= \sum_{a', s'} \Pr[s_{t+1} = s \mid a_t = a', s_t = s']\, \rho^{\pi}_t(s', a') = \rho^{\pi}_{t+1}(s).$$
Finally, we obtain that $\rho^{\pi'}_t(s, a) = \rho^{\pi}_t(s, a)$ for all $t, s, a$, and therefore $\mathbb{E}[C^{\pi}(s_0)] =
\mathbb{E}[C^{\pi'}(s_0)]$.
Next we show that for any stochastic Markovian policy there is a deterministic
Markovian policy with at most the same cumulative cost.
Proof. The proof is by backward induction on the steps. The inductive claim is:
For any policy $\pi \in \Pi_{MS}$ which is deterministic in $[t+1, T]$, there is a policy $\pi' \in \Pi_{MS}$
which is deterministic in $[t, T]$ and $\mathbb{E}[C^{\pi}(s_0)] \geq \mathbb{E}[C^{\pi'}(s_0)]$.
Clearly, the theorem follows from the case of $t = 0$.
For the base of the induction we can take $t = T$, which holds trivially.
For the inductive step, assume that $\pi \in \Pi_{MS}$ is deterministic in $[t + 1, T]$.
For every $s_t \in S_t$ define
$$\pi'_t(s_t) = \arg\min_{a \in A_t}\; c_t(s_t, a) + C_{t+1}(f_t(s_t, a)). \qquad (3.1)$$
Recall that since we have a Deterministic Decision Process, $f_t(s_t, a) \in S_{t+1}$ is the
next state if we take action $a$ in $s_t$.
For the analysis, note that $\pi$ and $\pi'$ are identical until time $t$, so they generate
exactly the same distribution over paths. At time $t$, $\pi'$ is defined to minimize the
cost to go from $s_t$, given that we follow $\pi$ from $t + 1$ to $T$. Therefore the cost can
only decrease. Formally, let $\mathbb{E}^{\pi}[\cdot]$ be the expectation with respect to policy $\pi$. We
have
$$\mathbb{E}^{\pi}_{s_t}[C_t(s_t)] = \mathbb{E}^{\pi}_{s_t}\mathbb{E}^{\pi}_{a_t}\left[c_t(s_t, a_t) + C_{t+1}(f_t(s_t, a_t))\right]
\geq \mathbb{E}^{\pi}_{s_t}\min_{a_t \in A_t}\left[c_t(s_t, a_t) + C_{t+1}(f_t(s_t, a_t))\right]
= \mathbb{E}^{\pi'}_{s_t}[C_t(s_t)],$$
which completes the inductive step.
3.2.5 Optimal Control Policies
Definition 3.9. A control policy π ∈ ΠM D is called optimal if, for each initial state
s0 , it induces an optimal path hT from s0 .
An alternative definition can be given in terms of policies only. For that pur-
pose, let hT (π; s0 ) denote the path induced by the policy π from s0 . For a given
return functional VT (hT ), denote VT (π; s0 ) = VT (hT (π; s0 )). That is, VT (π; s0 ) is the
cumulative return for the path induced by π from s0 .
Definition 3.10. A control policy π ∈ ΠM D is called optimal if, for each initial state
s0 , it holds that VT (π; s0 ) ≥ VT (π̃; s0 ) for any other policy π̃ ∈ ΠM D .
The naive approach to finding an optimal policy: For finite models (i.e., finite
state and action spaces), the number of feasible paths (or control policies) is finite.
It is therefore possible, in principle, to enumerate all $T$-stage paths, compute the
cumulative return for each one, and choose the one which gives the largest return. Let
us evaluate the number of different paths and control policies. Suppose for simplicity
that the number of states at each stage is the same: $|S_t| = n$, and similarly that the number of
actions at each state is the same: $|A_t(s)| = m$ (with $m \leq n$). The number of feasible
$T$-stage paths for each initial state is seen to be $m^T$. The number of different policies
is $m^{nT}$. For example, for a fairly small problem with $T = n = m = 10$, we obtain
$10^{10}$ paths for each initial state (and $10^{11}$ overall), and $10^{100}$ control policies. Clearly,
it is not computationally feasible to enumerate them all. Fortunately, Dynamic
Programming offers a drastic reduction of the computational complexity for this
problem, as presented in the next section.
The DP technique for dynamic systems is based on a general observation called
Bellman's Principle of Optimality. Essentially, it states the following (for determin-
istic problems): Any sub-path of an optimal path is itself an optimal path between
its end points.
To see why this should hold, consider a sub-path which is not optimal. We can
replace it by an optimal sub-path, and improve the return.
Applying this principle recursively from the last stage backward yields the
(backward) Dynamic Programming algorithm. Let us first illustrate the idea with the
following example.
Example 3.4. Shortest path on a decision graph: Suppose we wish to find the shortest
path (minimum cost path) from the initial node in T steps.
The boxed values in the figure are the terminal costs at stage T; the other numbers are the link
costs. Using backward recursion, we may obtain that the minimal path costs from the
two initial states are 7 and 3, as well as the optimal paths and an optimal policy.
We can now describe the DP algorithm. Recall that we consider the dynamic
system
$$s_{t+1} = f_t(s_t, a_t), \qquad t = 0, 1, 2, \ldots, T-1, \qquad s_t \in S_t,\; a_t \in A_t(s_t),$$
and we wish to maximize the cumulative return:
$$V_T = \sum_{t=0}^{T-1} r_t(s_t, a_t) + r_T(s_T).$$
The DP algorithm computes recursively a set of value functions Vt : St → R , where
Vt (st ) is the value of an optimal sub-path ht:T = (st , at , . . . , sT ) that starts at st .
Note that the algorithm involves visiting each state exactly once, proceeding
backward in time. For each time instant (or stage) t, the value function Vt (s) is
computed for all states s ∈ St before proceeding to stage t − 1. The backward
induction step of Algorithm 1 (Finite-horizon Dynamic Programming), along with
similar equations in the theory of DP, is called Bellman’s equation.
where V0π (s) is the expected return of policy π when started at state s.
Proof. We show that the computed policy π ∗ is optimal and its return from time t
is Vt . We will establish the following inductive claim:
For any time t and any state s, the path from s defined by π ∗ is the maximum return
path of length T − t. The value of Vt (s) is the maximum return from s.
The proof is by a backward induction. For the basis of the induction we have:
t = T, and the inductive claim follows from the initialization.
Assume the inductive claim holds for $t + 1$; we prove it for $t$. For contradiction, as-
sume there is a higher return path from $s$. Let the path generated by $\pi^*$ be
$P = (s, s^*_{T-t}, \ldots, s^*_T)$. Let $P_1 = (s, s_{T-t}, \ldots, s_T)$ be the alternative path with higher
return. Let $P_2 = (s, s_{T-t}, s'_{T-t-1}, \ldots, s'_T)$ be the path generated by following $\pi^*$ from
$s_{T-t}$. Since $P_1$ and $P_2$ are identical except for the last $t$ stages, we can use the
inductive hypothesis, which implies that $V(P_1) \leq V(P_2)$. From the definition of $\pi^*$
we have that $V(P_2) \leq V(P)$. Hence, $V(P_1) \leq V(P_2) \leq V(P)$, which completes the
proof of the inductive hypothesis.
Let us evaluate the computational complexity of finite horizon DP: there is a
total of $nT$ states (excluding the final stage), and in each we need $m$ computations.
Hence, the number of required calculations is $mnT$. For the example above with
$m = n = T = 10$, we need $O(10^3)$ calculations.
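For concreteness, the following is a minimal sketch of the backward recursion described above for a deterministic decision process, using plain Python functions for the dynamics and rewards. The toy instance at the bottom (states 0–3, a "stay"/"up" action, a movement cost and a terminal reward) is an illustrative assumption.

```python
def finite_horizon_dp(states, actions, f, r, r_T, T):
    """Backward dynamic programming for a deterministic decision process.

    states[t]      : list of states at stage t (t = 0..T)
    actions(t, s)  : available actions at stage t in state s
    f(t, s, a)     : next state
    r(t, s, a)     : instantaneous reward
    r_T(s)         : terminal reward
    Returns the value functions V[t][s] and a greedy policy pi[t][s].
    """
    V = [dict() for _ in range(T + 1)]
    pi = [dict() for _ in range(T)]
    for s in states[T]:
        V[T][s] = r_T(s)
    for t in range(T - 1, -1, -1):                       # backward in time
        for s in states[t]:
            best_a, best_v = None, float("-inf")
            for a in actions(t, s):
                v = r(t, s, a) + V[t + 1][f(t, s, a)]    # Bellman's equation
                if v > best_v:
                    best_a, best_v = a, v
            V[t][s] = best_v
            pi[t][s] = best_a
    return V, pi

# Illustrative toy instance: states 0..3 at every stage, actions move up or stay.
T = 3
states = [list(range(4)) for _ in range(T + 1)]
V, pi = finite_horizon_dp(
    states,
    actions=lambda t, s: ["stay", "up"],
    f=lambda t, s, a: min(s + 1, 3) if a == "up" else s,
    r=lambda t, s, a: -1.0 if a == "up" else 0.0,        # moving costs 1
    r_T=lambda s: float(s * s),                          # terminal reward
    T=T,
)
print(V[0][0], pi[0][0])   # optimal value and first action from state 0
```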
Remark 3.8. A similar algorithm that proceeds forward in time (from t = 0 to t = T)
can be devised. We note that this will not be possible for stochastic systems (i.e., the
stochastic MDP model).
Remark 3.9. The celebrated Viterbi algorithm is an important instance of finite-
horizon DP. The algorithm essentially finds the most likely sequence of states in a
Markov chain (st ) that is partially (or noisily) observed. The algorithm was intro-
duced in 1967 for decoding convolutional codes over noisy digital communication links.
It has found extensive applications in communications, and is a basic computational
tool in Hidden Markov Models (HMMs), a popular statistical model that is used ex-
tensively in speech recognition and bioinformatics, among other areas.
Definition 3.12. Path: A path ω on G from v0 to vk is a sequence (v0 , v1 , v2 , . . . , vk )
of vertices such that (vi , vi+1 ) ∈ E. A path is simple if all edges in the path are
distinct. A cycle is a path with v0 = vk .
Definition 3.13. Path length: The length of a path $c(\omega)$ is the sum of the weights
over its edges: $c(\omega) = \sum_{i=1}^{k} c(v_{i-1}, v_i)$.
i=1
A shortest path from u to v is a path from u to v that has the smallest length
c(ω) among such paths. Denote this minimal length as d(u, v) (with d(u, v) = ∞ if
no path exists from u to v). The shortest path problem has the following variants:
• Single pair problem: Find the shortest path from a given source vertex u to a
given destination vertex v.
• Single source problem: Find the shortest path from a given source vertex u to
all other vertices.
• Single destination: Find the shortest path to a given destination node v from
all other vertices.
• All pair problem: Find the shortest path from every source vertex u to every
destination vertex v.
We note that the single-source and single-destination problems are symmetric
and can be treated as one. The all-pair problem can of course be solved by multiple
applications of the other algorithms, but there exist algorithms which are especially
suited for this problem.
3.4.3 The Bellman-Ford Algorithm
This algorithm solves the single destination (or the equivalent single source) shortest
path problem. It allows both positive and negative edge weights. Assume for the
moment that there are no negative-weight cycles.
The output of the algorithm is d[v] = d(v, vd ), the weight of the shortest path
from v to vd , and the routing list π. A shortest path from vertex v is obtained from
π by following the sequence: v1 = π[v], v2 = π[v1 ], . . . , vd = π[vk−1 ]. To understand
the algorithm, we observe that after round i, d[v] holds the length of the shortest
path from v to vd that uses i edges or fewer. To see this, observe that the calculations done up to
round i are equivalent to the calculations in finite-horizon dynamic programming,
where the horizon is i. Since the shortest path uses at most |V| − 1 edges, the above
claim on optimality follows.
The running time of the algorithm is O(|V| · |E|). This is because in each round
i of the algorithm, each edge e is involved in exactly one update of d[v] for some v.
If {d[v] : v ∈ V} does not change at all at some round, then the algorithm may be
stopped early.
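A minimal sketch of the scheme described above, in its single-destination form and with the early-stopping check, is given below. The edge-list graph representation and the small example graph are assumptions made for illustration.

```python
import math

def bellman_ford_to_destination(vertices, edges, vd):
    """Single-destination Bellman-Ford, as in the scheme described above.

    vertices : list of vertex labels
    edges    : list of (u, v, cost) for a directed edge u -> v
    vd       : destination vertex
    Returns (d, nxt): d[v] is the shortest-path cost from v to vd, and nxt[v]
    is the next hop on such a path (the routing list). Assumes there are no
    negative-weight cycles.
    """
    vertices = list(vertices)
    d = {v: math.inf for v in vertices}
    nxt = {v: None for v in vertices}
    d[vd] = 0.0
    for _ in range(len(vertices) - 1):        # at most |V| - 1 rounds
        changed = False
        for (u, v, cost) in edges:            # relax every edge
            if d[v] + cost < d[u]:
                d[u] = d[v] + cost
                nxt[u] = v
                changed = True
        if not changed:                       # early stopping
            break
    return d, nxt

# Illustrative graph with a negative edge but no negative cycle.
V = ["a", "b", "c", "d"]
E = [("a", "b", 1.0), ("b", "c", -2.0), ("a", "c", 2.0), ("c", "d", 3.0)]
d, nxt = bellman_ford_to_destination(V, E, "d")
print(d["a"], nxt["a"])   # 2.0, via 'b'  (1 - 2 + 3)
```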
Remark 3.10. We have assumed above that no negative-weight cycles exist. In fact
the algorithm can be used to check for existence of such cycles: A negative-weight
cycle exists if and only if d[v] changes during an additional step (i = |V|) of the
algorithm.
Remark 3.11. The basic scheme above can also be implemented in an asynchronous
manner, where each node performs a local update of d[v] at its own time. Further,
the algorithm can be started from any initial conditions, although convergence can be
slower. This makes the algorithm useful for distributed environments such as internet
routing.
Let us discuss the running time of Dijkstra's algorithm. Recall that the Bellman-
Ford algorithm visits each edge of the graph up to |V| − 1 times, leading to a running
time of O(|V| · |E|). Dijkstra's algorithm visits each edge only once, which con-
tributes O(|E|) to the running time. The rest of the computation effort is spent on
determining the order of node insertion to S.
The vertices in V\S need to be extracted in increasing order of d[v]. This is
handled by a min-priority queue, and the complexity of the algorithm depends on
the implementation of this queue. With a naive implementation of the queue that
simply keeps the vertices in some fixed order, each extract-min operation takes O(|V|)
time, leading to overall running time of O(|V|2 + |E|) for the algorithm. Using a basic
(binary heap) priority queue brings the running time to O((|V| + |E|) log |V|), and a
more sophisticated one (Fibonacci heap) can bring it down to O(|V| log |V| + |E|).
In the following, we prove that Dijkstra's algorithm is complete, i.e., that it finds the shortest
path. Let d∗ [v] denote the shortest path length from v to vd .
Theorem 3.4. Assume that c(v, u) ≥ 0 for all u, v ∈ S. Then Dijkstra’s algorithm
terminates with d[v] = d∗ [v] for all v ∈ S.
Proof. We first prove by induction that d[v] ≥ d∗ [v] throughout the execution of the
algorithm. This obviously holds at initialization. Now, assume d[v] ≥ d∗ [v] ∀v ∈ V
before a relaxation step of edge (x, y) ∈ E. If d[x] changes after the relaxation we have
d[x] = c(x, y) + d[y] ≥ c(x, y) + d∗ [y] ≥ d∗ [x], where the last inequality is Bellman’s
equation.
We will next prove by induction that throughout the execution of the algorithm,
for each v ∈ S we have d[v] = d∗ [v]. The first vertex added to S is vd , for which
the statement holds. Now, assume by contradiction that u is the first node that is
going to be added to S for which d[u] ≠ d∗ [u]. We must have that u is connected to
vd , otherwise d[u] = d∗ [u] = ∞. Let p denote the shortest path from u to vd . Since
p connects a node in V\S to a node in S, it must cross the boundary of S. We can
thus write it as p = u → x → y → vd , where x ∈ V\S, y ∈ S, and the path y → vd is
inside S. By the induction hypothesis, d[y] = d∗ [y]. Since x is on the shortest path, it
must have been updated when y was inserted into S, so d[x] = d∗ [y] + c(x, y) = d∗ [x].
Since the weights are non-negative, we must have d[x] = d∗ [x] ≤ d∗ [u] ≤ d[u] (the
last inequality is from the induction proof above). But because both u and x were
outside S and we chose to add u, we must have d[x] ≥ d[u], and therefore d∗ [u] = d[u], a contradiction.
Algorithm 4 Dijkstra’s Algorithm (Single Pair Problem)
1: Input: A weighted directed graph G, source node vs , and destination node vd .
2: Initialization:
3: d[vs ] = 0
4: d[v] = ∞ for all v ∈ V \ {vs }
5: π[v] = ∅ for all v ∈ V
6: S=∅
7: while S 6= V do
8: Choose u ∈ V \ S with the minimal value d[u]
9: Add u to S
10: If u == vd
11: break
12: for all (u, v) ∈ E do
13: If d[v] > d[u] + c(u, v)
14: d[v] = d[u] + c(u, v)
15: π[v] = u
16: end for
17: end while
18: return {(d[v], π[v]) | v ∈ V}
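As a complement to the pseudocode above, here is a minimal Python sketch that uses a binary-heap priority queue (heapq), corresponding to the O((|V| + |E|) log |V|) implementation discussed earlier. Stale heap entries are skipped lazily instead of being decreased in place; the adjacency-list representation and the toy graph are illustrative assumptions.

```python
import heapq
import math

def dijkstra(adj, vs, vd=None):
    """Dijkstra's algorithm with a binary-heap priority queue (heapq).

    adj : dict mapping u -> list of (v, cost) with cost >= 0
    vs  : source vertex; vd : optional destination (early exit, as in Algorithm 4)
    Returns (d, parent): shortest distances from vs and predecessor pointers.
    """
    d = {u: math.inf for u in adj}
    parent = {u: None for u in adj}
    d[vs] = 0.0
    heap = [(0.0, vs)]
    done = set()                        # the set S of finalized vertices
    while heap:
        du, u = heapq.heappop(heap)     # extract-min
        if u in done:
            continue                    # stale heap entry, skip it
        done.add(u)
        if u == vd:
            break
        for (v, cost) in adj[u]:
            if du + cost < d[v]:        # relax edge (u, v)
                d[v] = du + cost
                parent[v] = u
                heapq.heappush(heap, (d[v], v))
    return d, parent

# Illustrative graph.
adj = {
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("b", 2.0), ("t", 6.0)],
    "b": [("t", 1.0)],
    "t": [],
}
d, parent = dijkstra(adj, "s", "t")
print(d["t"], parent["t"])   # 4.0, reached via 'b'
```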
3.4.6 From Dijkstra’s Algorithm to A∗
Dijkstra’s algorithm expands vertices in the order of their distance from the source.
When the destination is known (as in the single pair problem), it seems reasonable
to bias the search order towards vertices that are closer to the goal.
The A∗ algorithm implements this idea through the use of a heuristic function
h[v], which is an estimate of the distance from vertex v to the goal. It then expands
vertices in the order of d[v] + h[v], i.e., the (estimated) length of the shortest path
from vs to vd that passes through v.
Algorithm 5 A∗ Algorithm
1: Input: Weighted directed graph G, source vs , destination vd , heuristic h.
2: Initialization:
3: d[vs ] = 0
4: d[v] = ∞ for all v ∈ V \ {vs }
5: π[v] = ∅ for all v ∈ V
6: S=∅
7: while S 6= V do
8: Choose u ∈ V \ S with the minimal value d[u] + h[u]
9: Add u to S
10: If u == vd
11: break
12: for all (u, v) ∈ E do
13: If d[v] > d[u] + c(u, v)
14: d[v] = d[u] + c(u, v)
15: π[v] = u
16: end for
17: end while
18: return {(d[v], π[v]) | v ∈ V}
A heuristic is said to be admissible if it is a lower bound of the shortest path to the
goal, i.e., for every vertex u we have that
h[u] ≤ d[u, vd ],
where we recall that d[u, v] denotes the length of the shortest path between u and v.
It is easy to show that every consistent heuristic is also admissible (exercise: show
it!). It is more difficult to find admissible heuristics that are not consistent. In path
finding applications, a popular heuristic that is both admissible and consistent is the
Euclidean distance to the goal.
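As an illustration, the sketch below runs A* on a small 4-connected grid with unit step costs, using the Euclidean distance to the goal as the heuristic mentioned above (which is consistent in this setting). The grid, the wall, and the start/goal cells are assumptions made for the example.

```python
import heapq
import math

def astar_grid(start, goal, blocked, width, height):
    """A* on a 4-connected grid with unit step costs.

    The heuristic is the Euclidean distance to the goal, which is both
    admissible and consistent for unit-cost grid moves.
    Returns the cost of the shortest path (or None if unreachable).
    """
    def h(cell):
        return math.dist(cell, goal)    # Euclidean heuristic

    d = {start: 0.0}
    heap = [(h(start), start)]          # ordered by d[v] + h[v]
    closed = set()
    while heap:
        f, u = heapq.heappop(heap)
        if u in closed:
            continue
        closed.add(u)
        if u == goal:
            return d[u]
        x, y = u
        for v in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= v[0] < width and 0 <= v[1] < height) or v in blocked:
                continue
            if d[u] + 1.0 < d.get(v, math.inf):
                d[v] = d[u] + 1.0
                heapq.heappush(heap, (d[v] + h(v), v))
    return None

# A 5x5 grid with a small wall; the path must go around it.
blocked = {(2, 1), (2, 2), (2, 3)}
print(astar_grid((0, 2), (4, 2), blocked, 5, 5))   # expected: 8.0
```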
With a consistent heuristic, A∗ is guaranteed to find the shortest path in the
graph. With an admissible heuristic, some extra bookkeeping is required to guarantee
optimality. We will show optimality for a consistent heuristic by showing that A∗ is
equivalent to running Dijkstra’s algorithm on a graph with modified weights.
Proposition 3.5. Assume that c(v, u) ≥ 0 for all u, v ∈ S, and that h is a consistent
heuristic. Then the A∗ algorithm terminates with d[v] = d∗ [v] for all v ∈ S.
Proof. Define new weights ĉ(u, v) = c(u, v) + h(v) − h(u). This transformation does
not change the shortest path from vs to vd (show this!), and the new weights are
non-negative due to the consistency property.
The A∗ algorithm is equivalent to running Dijkstra's algorithm (for the single
pair problem) with the weights ĉ, and defining $\hat{d}[v] = d[v] + h[v]$. The optimality of
A∗ therefore follows from the optimality results for Dijkstra's algorithm.
Remark 3.12. Actually, a stronger result of optimal efficiency can be shown for A∗ :
for a given h that is consistent, no other algorithm that is guaranteed to be optimal
will explore a smaller set of vertices during the search [39].
Remark 3.14. In the proof of Proposition 3.5, the idea of changing the cost function
to make the problem easier to solve without changing the optimal solution is known
as cost shaping, and also plays a role in learning algorithms [85].
3.5 Average cost criteria
The average cost criterion considers the limit of the average costs. Formally:
$$C^{\pi}_{\mathrm{avg}} = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} c_t(s_t, a_t),$$
where the trajectory is generated using $\pi$. The aim is to minimize $\mathbb{E}[C^{\pi}_{\mathrm{avg}}]$. This
implies that any finite prefix has no influence on the final average cost, since its
influence vanishes as $T$ goes to infinity.
For a deterministic stationary policy, the generated trajectory eventually enters a simple cycle, and
the average cost is the average cost of the edges on that cycle. (Recall, we are con-
sidering only DDPs.)
Given a directed graph $G(V, E)$, let $\Omega$ be the collection of all cycles in $G(V, E)$. For
each cycle $\omega = (v_1, \ldots, v_k)$, we define $c(\omega) = \sum_{i=1}^{k} c(v_i, v_{i+1})$, where $(v_i, v_{i+1})$ is the
$i$-th edge in the cycle $\omega$. Let $\mu(\omega) = c(\omega)/k$. The minimum average cost cycle is
$$\mu^* = \min_{\omega \in \Omega} \mu(\omega).$$
We show that cycling around the minimum average cost cycle yields an optimal policy.
Theorem 3.6. For any Deterministic Decision Process (DDP) the optimal average
cost is µ∗ , and an optimal policy is πω that cycles around a simple cycle of average
cost µ∗ , where µ∗ is the minimum average cost cycle.
Next we develop an algorithm for computing the minimum average cost cycle,
which implies an optimal policy for DDP for average costs. The input is a directed
graph G(V, E) with edge cost c : E → R.
We first give a characterization of $\mu^*$. Set a root $r \in V$. Let $F_k(v)$ be the set of paths of
length $k$ from $r$ to $v$. Let $d_k(v) = \min_{p \in F_k(v)} c(p)$, where if $F_k(v) = \emptyset$ then $d_k(v) = \infty$.
The following theorem of Karp [49] gives a characterization of $\mu^*$.
Theorem 3.7. The value of the minimum average cost cycle is
$$\mu^* = \min_{v \in V} \max_{0 \leq k \leq n-1} \frac{d_n(v) - d_k(v)}{n - k},$$
where we define $\infty - \infty$ as $\infty$.
Proof. We have two cases, µ∗ = 0 and µ∗ > 0. We assume that the graph has no
negative cycle (we can guarantee this by adding a large number M to all the weights).
We start with $\mu^* = 0$. This implies that we have in $G(V, E)$ a cycle of weight zero,
but no negative cycle. For the theorem it is sufficient to show that
$\min_{v \in V} \max_{0 \leq k \leq n-1} \frac{d_n(v) - d_k(v)}{n - k} = 0$.
Since there is no negative cycle, for every node $v \in V$ there is a path of length $k \in [0, n-1]$ of cost $d(v)$, the cost
of the shortest path from $r$ to $v$; that is, $d(v) = \min_{0 \leq k \leq n-1} d_k(v)$. This implies that
$d_n(v) \geq d(v)$ for every $v$, and hence the expression above is non-negative.
We therefore need to show that for some $v \in V$ we have $d_n(v) = d(v)$, which implies that
$\min_{v \in V}\{d_n(v) - d(v)\} = 0$.
Consider a cycle $\omega$ of cost $c(\omega) = 0$ (there is one, since $\mu^* = 0$). Let $v$ be a
node on the cycle $\omega$. Consider a shortest path $P$ from $r$ to $v$ which then cycles
around $\omega$ until its length is at least $n$. The path $P$ is a shortest path to $v$ (although
not necessarily simple), since the appended cycle has zero cost. This implies that any prefix of $P$ is also a shortest path to its endpoint.
Let $P'$ be the prefix of $P$ of length $n$ and let it end in $u \in V$. Path $P'$ is a shortest
path to $u$, since it is a prefix of the shortest path $P$. This implies that the cost of $P'$ is
$d(u)$. Since $P'$ is of length $n$, by construction, we have that $d_n(u) = d(u)$. Therefore,
$\min_{v \in V}\{d_n(v) - d(v)\} = 0$, which completes the case that $\mu^* = 0$.
For µ∗ > 0 we subtract a constant ∆ = µ∗ from all the costs in the graph. This
implies that for the new costs we have a zero cycle and no negative cycle. We can
now apply the previous case. It only remains to show that the formula changes by
exactly ∆ = µ∗ .
Formally, for every edge $e \in E$ let $c'(e) = c(e) - \Delta$. For any path $p$ we have
$C'(p) = C(p) - |p|\Delta$, and for any cycle $\omega$ we have $\mu'(\omega) = \mu(\omega) - \Delta$. This implies
that for $\Delta = \mu^*$ we have a cycle of cost zero and no negative cycles. We now consider
the formula,
$$0 = (\mu')^* = \min_{v \in V} \max_{0 \leq k \leq n-1} \left\{ \frac{d'_n(v) - d'_k(v)}{n-k} \right\}
= \min_{v \in V} \max_{0 \leq k \leq n-1} \left\{ \frac{d_n(v) - n\Delta - d_k(v) + k\Delta}{n-k} \right\}
= \min_{v \in V} \max_{0 \leq k \leq n-1} \left\{ \frac{d_n(v) - d_k(v)}{n-k} - \Delta \right\}
= \min_{v \in V} \max_{0 \leq k \leq n-1} \left\{ \frac{d_n(v) - d_k(v)}{n-k} \right\} - \Delta.$$
Therefore we have
$$\mu^* = \Delta = \min_{v \in V} \max_{0 \leq k \leq n-1} \left\{ \frac{d_n(v) - d_k(v)}{n-k} \right\},$$
which completes the proof.
We would like now to recover the minimum average cost cycle. The basic idea
is to recover the cycle from the minimizing vertices in the formula, but some care
needs to be taken. It is true that for some minimizing pair (v, k) the path of length
n from r to v has a cycle of length n − k, which is the suffix of the path. The solution
is that for the path p, from r to v of length n, any simple cycle is a minimum average
cost cycle. (See [20].)
The running time of computing the minimum average cost cycle is O(|V| · |E|).
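A direct implementation of Karp's characterization is sketched below: it fills the table $d_k(v)$ by dynamic programming and then evaluates the min–max formula. To guarantee that every vertex is reachable, an auxiliary root with zero-cost edges to all vertices is added, which does not create or change any cycle; recovering the cycle itself requires the extra bookkeeping discussed above and is omitted. The small example graph is an illustrative assumption.

```python
import math

def min_mean_cycle(vertices, edges):
    """Karp's algorithm: value of the minimum average cost cycle.

    vertices : iterable of vertex labels
    edges    : list of (u, v, cost) directed edges
    Returns mu* (math.inf if the graph has no cycle).
    """
    vertices = list(vertices)
    root = object()                        # fresh auxiliary root vertex
    all_v = vertices + [root]
    all_e = edges + [(root, v, 0.0) for v in vertices]
    n = len(all_v)

    # d[k][v] = minimum cost of a walk of exactly k edges from root to v.
    d = [{v: math.inf for v in all_v} for _ in range(n + 1)]
    d[0][root] = 0.0
    for k in range(1, n + 1):
        for (u, v, cost) in all_e:
            if d[k - 1][u] + cost < d[k][v]:
                d[k][v] = d[k - 1][u] + cost

    # mu* = min_v max_k (d_n(v) - d_k(v)) / (n - k), with inf - inf := inf.
    best = math.inf
    for v in all_v:
        if math.isinf(d[n][v]):
            continue                       # the inner max is infinite for this v
        worst = max((d[n][v] - d[k][v]) / (n - k)
                    for k in range(n) if not math.isinf(d[k][v]))
        best = min(best, worst)
    return best

# Illustrative graph: cycle a->b->a has mean 1.5, cycle b->c->b has mean 1.0.
E = [("a", "b", 2.0), ("b", "a", 1.0), ("b", "c", 0.5), ("c", "b", 1.5)]
print(min_mean_cycle(["a", "b", "c"], E))   # expected: 1.0
```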
A simple approach for solving Problem 3.2 is to use gradient-based optimization.
Note that we can expand the terms in the sum using the known dynamics function
and initial state:
$$V(a_0, \ldots, a_T) = \sum_{t=0}^{T} c_t(s_t, a_t)
= c_0(s_0, a_0) + c_1(f_0(s_0, a_0), a_1) + \cdots + c_T(f_{T-1}(f_{T-2}(\ldots), a_{T-1}), a_T).$$
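As a sketch of this gradient-based approach, the code below optimizes the open-loop action sequence of a toy one-dimensional problem by plain gradient descent, estimating the gradient with finite differences (in practice one would use automatic differentiation). The dynamics, cost, horizon, step size, and iteration count are all illustrative assumptions.

```python
def total_cost(actions, s0, f, c):
    """Evaluate V(a_0, ..., a_T) by rolling the dynamics forward from s0."""
    s, V = s0, 0.0
    for a in actions:
        V += c(s, a)
        s = f(s, a)
    return V

def optimize_actions(s0, f, c, T, lr=0.01, iters=2000, eps=1e-5):
    """Open-loop optimization of the action sequence by finite-difference gradient descent."""
    actions = [0.0] * (T + 1)
    for _ in range(iters):
        base = total_cost(actions, s0, f, c)
        grad = []
        for t in range(T + 1):
            perturbed = list(actions)
            perturbed[t] += eps
            grad.append((total_cost(perturbed, s0, f, c) - base) / eps)
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions

# Toy scalar problem: s_{t+1} = s_t + a_t, cost c(s, a) = s^2 + 0.1 a^2.
f = lambda s, a: s + a
c = lambda s, a: s ** 2 + 0.1 * a ** 2
actions = optimize_actions(s0=1.0, f=f, c=c, T=10)
print(round(total_cost(actions, 1.0, f, c), 3))
# The cost is dominated by the unavoidable initial term s_0^2 = 1; the first
# action is negative, pushing the state toward zero.
print([round(a, 2) for a in actions[:3]])
```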
Proposition 3.8. The value function has a quadratic form: $V_t(s) = s^\top P_t s$, and
$P_t = P_t^\top$.
Proof. We prove by induction. For $t = T$, this holds by definition, as $V_T(s) = s^\top Q_T s$.
Now, assume that $V_{t+1}(s) = s^\top P_{t+1} s$. Writing the Bellman recursion for $V_t(s)$ and
minimizing the resulting quadratic over the action yields a linear expression for the optimal control $a^*_t$.
Substituting back $a^*_t$ in the expression for $V_t(s)$ gives a quadratic expression in $s$.
From the construction in the proof of Proposition 3.8 one can recover the se-
quence of optimal controllers a∗t . By substituting the optimal controls in the forward
dynamics equation, one can also recover the optimal state trajectory.
Note that the DP solution is globally optimal for the LQR problem. Interestingly,
the computational complexity is polynomial in the dimension of the state, and linear
in the time horizon. This is in contrast to the curse of dimensionality, which would
make a discretization-based approach infeasible for high-dimensional problems. This
efficiency is due to the special structure of the dynamics and cost function in the
LQR problem, and does not hold in general.
Remark 3.15. Note that the DP computation resulted in a sequence of linear feedback
controllers. It turns out that these controllers are also optimal in the presence of
Gaussian noise added to the dynamics.
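For completeness, here is a minimal sketch of the backward recursion for a standard finite-horizon LQR instance with cost $\sum_t (s_t^\top Q s_t + a_t^\top R a_t) + s_T^\top Q_T s_T$ and dynamics $s_{t+1} = A s_t + B a_t$. This particular cost convention, and the double-integrator example, are assumptions chosen for illustration; they match the quadratic value functions $V_t(s) = s^\top P_t s$ discussed above.

```python
import numpy as np

def lqr_backward(A, B, Q, R, Q_T, T):
    """Backward Riccati recursion for finite-horizon LQR (time-invariant A, B, Q, R).

    Dynamics: s_{t+1} = A s_t + B a_t
    Cost:     sum_t (s_t^T Q s_t + a_t^T R a_t) + s_T^T Q_T s_T
    Returns the matrices P_t with V_t(s) = s^T P_t s and the feedback gains
    K_t of the optimal linear controllers a_t = K_t s_t.
    """
    P = [None] * (T + 1)
    K = [None] * T
    P[T] = Q_T
    for t in range(T - 1, -1, -1):
        # Minimize s^T Q s + a^T R a + (A s + B a)^T P_{t+1} (A s + B a) over a.
        S = R + B.T @ P[t + 1] @ B
        K[t] = -np.linalg.solve(S, B.T @ P[t + 1] @ A)
        P[t] = Q + A.T @ P[t + 1] @ A + A.T @ P[t + 1] @ B @ K[t]
    return P, K

# Illustrative double-integrator example.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
P, K = lqr_backward(A, B, Q, R, Q_T=np.eye(2), T=50)

s = np.array([1.0, 0.0])
for t in range(50):            # roll the optimal feedback controller forward
    s = A @ s + B @ (K[t] @ s)
print(np.round(s, 3))          # the state is driven toward the origin
```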
A similar derivation holds for the system:
$$\min_{a_0, \ldots, a_T} \sum_{t=0}^{T} c_t(s_t, a_t), \quad
\text{s.t. } s_{t+1} = A_t s_t + B_t a_t + C_t, \quad
c_t = [s_t, a_t]^\top W_t [s_t, a_t] + Z_t [s_t, a_t] + Y_t, \quad \forall t = 0, \ldots, T.$$
In this case, the optimal control is of the form $a^*_t = K_t s_t + \kappa_t$, for some matrices $K_t$
and vectors $\kappa_t$.
3.6.2 Iterative LQR
We now return to the original non-linear problem (3.2). If we linearize the dynam-
ics and quadratize the cost, we can plug in the LQR solution we obtained above.
Namely, given some reference trajectory $\hat{s}_0, \hat{a}_0, \ldots, \hat{s}_T, \hat{a}_T$, we apply a Taylor approx-
imation:
$$\begin{aligned}
f_t(s_t, a_t) &\approx f_t(\hat{s}_t, \hat{a}_t) + \nabla_{s_t, a_t} f_t(\hat{s}_t, \hat{a}_t)[s_t - \hat{s}_t, a_t - \hat{a}_t], \\
c_t(s_t, a_t) &\approx c_t(\hat{s}_t, \hat{a}_t) + \nabla_{s_t, a_t} c_t(\hat{s}_t, \hat{a}_t)[s_t - \hat{s}_t, a_t - \hat{a}_t] \\
&\quad + \frac{1}{2}[s_t - \hat{s}_t, a_t - \hat{a}_t]^\top \nabla^2_{s_t, a_t} c_t(\hat{s}_t, \hat{a}_t)[s_t - \hat{s}_t, a_t - \hat{a}_t]. \qquad (3.4)
\end{aligned}$$
If we define $\delta_s = s - \hat{s}$, $\delta_a = a - \hat{a}$, then the Taylor approximation gives an LQR
problem for $\delta_s, \delta_a$. Its optimal controller is $a^*_t = K_t(s_t - \hat{s}_t) + \kappa_t + \hat{a}_t$. By running
this controller on the non-linear system, we obtain a new reference trajectory. Also
note that the controller $a^*_t = K_t(s_t - \hat{s}_t) + \alpha\kappa_t + \hat{a}_t$ for $\alpha \in [0, 1]$ smoothly transitions
from the previous trajectory ($\alpha = 0$) to the new trajectory ($\alpha = 1$) (show that!).
Therefore we can interpret $\alpha$ as a step size, to guarantee that we stay within the
Taylor approximation limits.
The iterative LQR algorithm works by applying this approximation iteratively,
re-linearizing around the newly obtained reference trajectory at each iteration.
In practice, the iLQR algorithm can converge much faster than the simple gradient
descent approach.
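To illustrate the role of the step size α discussed above, the sketch below rolls out the interpolated controller $a_t = K_t(s_t - \hat{s}_t) + \alpha\kappa_t + \hat{a}_t$ on a toy scalar non-linear system. The dynamics, cost, and the gains $K_t, \kappa_t$ used here are arbitrary stand-ins for the quantities a full iLQR backward pass would compute; the point is only that α = 0 reproduces the reference trajectory exactly, while larger α moves toward the new controller.

```python
import math
import random

def rollout(s0, a_hat, s_hat, K, kappa, alpha, f, c):
    """Roll out a_t = K_t (s_t - s_hat_t) + alpha * kappa_t + a_hat_t and return its cost."""
    s, cost = s0, 0.0
    for t in range(len(a_hat)):
        a = K[t] * (s - s_hat[t]) + alpha * kappa[t] + a_hat[t]
        cost += c(s, a)
        s = f(s, a)
    return cost

# Toy scalar non-linear system and cost (illustrative assumptions).
f = lambda s, a: s + 0.1 * math.sin(s) + 0.1 * a
c = lambda s, a: s ** 2 + 0.01 * a ** 2
T = 20
rng = random.Random(0)

# Reference trajectory: roll out some nominal controls a_hat under f.
a_hat = [0.5 * rng.uniform(-1.0, 1.0) for _ in range(T)]
s_hat = [1.0]
for t in range(T):
    s_hat.append(f(s_hat[t], a_hat[t]))

# Stand-in gains; a real iLQR iteration would obtain K and kappa from an LQR
# backward pass on the linearized dynamics and quadratized cost.
K = [-0.5] * T
kappa = [-0.2] * T

reference_cost = sum(c(s_hat[t], a_hat[t]) for t in range(T))
for alpha in (0.0, 0.5, 1.0):
    print(alpha, round(rollout(s_hat[0], a_hat, s_hat, K, kappa, alpha, f, c), 3))
print(round(reference_cost, 3))   # matches the alpha = 0 rollout cost
```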
A treatment of LQR appears in [56]. Our presentation of the iterative LQR
follows [123], which is closely related to differential dynamic programming [44].
Chapter 4
Markov Chains
$$P(X_{t+1} = j \mid X_t = i) = P(X_1 = j \mid X_0 = i) \triangleq p_{i,j}.$$
The $p_{i,j}$'s are the transition probabilities, which satisfy $p_{i,j} \geq 0$, and for each $i \in X$
we have $\sum_{j \in X} p_{i,j} = 1$, namely, $\{p_{i,j} : j \in X\}$ is a distribution on the next state
following state $i$. The matrix $P = (p_{i,j})$ is the transition matrix. The matrix is
row-stochastic (each row sums to 1 and all entries are non-negative).
Given the initial distribution $p_0$ of $X_0$, namely $P(X_0 = i) = p_0(i)$, we obtain the
finite-dimensional distributions:
$$P(X_0 = i_0, X_1 = i_1, \ldots, X_t = i_t) = p_0(i_0)\, p_{i_0, i_1} \cdots p_{i_{t-1}, i_t}.$$
Define $p^{(m)}_{i,j} = P(X_m = j \mid X_0 = i)$, the $m$-step transition probabilities. It is easy
to verify that $p^{(m)}_{i,j} = [P^m]_{ij}$, where $P^m$ is the $m$-th power of the matrix $P$.
Example 4.1. Consider the following two-state Markov chain, with transition prob-
ability matrix $P$ and initial distribution $p_0$, as follows:
$$P = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}, \qquad p_0 = (0.5,\; 0.5).$$
Initially, both states are equally likely. After one step, the distribution of states
is $p_1 = p_0 P = (0.3,\; 0.7)$. After two steps we have $p_2 = p_1 P = p_0 P^2 = (0.26,\; 0.74)$.
The limit of this sequence is $p_\infty = (0.25,\; 0.75)$, which is called the steady
state distribution and will be discussed later.
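The calculation in the example is easy to reproduce numerically; the snippet below evolves the distribution and checks the steady state (only numpy is assumed).

```python
import numpy as np

P = np.array([[0.4, 0.6],
              [0.2, 0.8]])                 # row-stochastic transition matrix
p0 = np.array([0.5, 0.5])

p1 = p0 @ P                                # after one step:  [0.3, 0.7]
p2 = p0 @ np.linalg.matrix_power(P, 2)     # after two steps: [0.26, 0.74]
print(p1, p2)

# Iterating the update p <- p P converges to the steady state (0.25, 0.75).
p = p0
for _ in range(100):
    p = p @ P
print(np.round(p, 4))                      # approximately [0.25, 0.75]
print(np.round(p @ P, 4))                  # invariance check: p P = p
```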
Definition 4.2. A communicating class (or just class) is a maximal collection of states
that communicate.
For a finite X, this implies that in G(X, E) we have i and j in the same strongly
connected component of the graph. (A strongly connected component has a directed
path between any pair of vertices.)
Definition 4.3. The Markov chain is irreducible if all states belong to a single class
(i.e., all states communicate with each other).
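For a finite chain, irreducibility can be checked directly from the transition matrix by testing mutual reachability between all pairs of states. The following is a minimal sketch of such a check (numpy is assumed).

```python
import numpy as np

def is_irreducible(P):
    """Check irreducibility of a finite Markov chain with transition matrix P.

    State j is reachable from i iff some power of (I + A) has a positive
    (i, j) entry, where A is the boolean adjacency matrix with A[i, j] = 1
    iff p_{i,j} > 0. The chain is irreducible iff all pairs are mutually reachable.
    """
    n = P.shape[0]
    reach = ((P > 0) | np.eye(n, dtype=bool)).astype(int)
    # Repeated squaring: after ceil(log2(n)) rounds, reach[i, j] > 0 iff
    # j can be reached from i in any number of steps.
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        reach = ((reach @ reach) > 0).astype(int)
    return bool((reach > 0).all())

# The two-state chain of Example 4.1 is irreducible:
print(is_irreducible(np.array([[0.4, 0.6], [0.2, 0.8]])))    # True

# A chain with an absorbing state is not (state 0 is unreachable from state 1):
print(is_irreducible(np.array([[0.5, 0.5], [0.0, 1.0]])))    # False
```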
Definition 4.4. State $i$ has a period $d_i = \mathrm{GCD}\{m \geq 1 : p^{(m)}_{i,i} > 0\}$, where GCD is
the greatest common divisor. A state is aperiodic if $d_i = 1$.
State $i$ is periodic with period $d_i \geq 2$ if $p^{(m)}_{i,i} = 0$ for $m \,(\mathrm{mod}\; d_i) \neq 0$, and for any
$m$ such that $p^{(m)}_{i,i} > 0$ we have $m \,(\mathrm{mod}\; d_i) = 0$.
If a state $i$ is aperiodic, then there exists an integer $m_0$ such that for any $m \geq m_0$
we have $p^{(m)}_{i,i} > 0$.
Periodicity is a class property: all states in the same class have the same period.
Specifically, if some state is aperiodic, then all states in its class are aperiodic.
Claim 4.1. For any two states i and j with periods di and dj , in the same communi-
cating class, we have di = dj .
Proof. For contradiction, assume that d_j (mod d_i) ≠ 0. Since i and j are in the same
communicating class, there is a trajectory from i to j of length m_{i,j} and from j to i of
length m_{j,i}. This implies that (m_{i,j} + m_{j,i}) (mod d_i) = 0. Now, there is a trajectory
(which is a cycle) of length m_{j,j} from j back to j such that m_{j,j} (mod d_i) ≠ 0
(otherwise d_i would divide the period of j). Consider the path from i to itself of length
m_{i,j} + m_{j,j} + m_{j,i}. We have that (m_{i,j} + m_{j,j} + m_{j,i}) (mod d_i) = m_{j,j} (mod d_i) ≠ 0. This
is a contradiction to the definition of d_i. Therefore, d_j (mod d_i) = 0 and similarly d_i
(mod d_j) = 0, which implies that d_i = d_j.
The claim shows that periodicity is a class property, and all the states in a class
have the same period.
4.2 Recurrence
We define the following. Starting from state i, let T_i = inf{t ≥ 1 : X_t = i} denote the first return time to state i. State i is recurrent if P(T_i < ∞ | X_0 = i) = 1, and transient otherwise. A recurrent state i is positive recurrent if E[T_i] < ∞, and null recurrent otherwise.
We can relate the state properties of recurrence and transience to the expected number of returns to a state.

Claim 4.2. State i is transient if and only if ∑_{m=1}^∞ p_{i,i}^{(m)} < ∞.
Proof. Assume that state i is transient. Let q_i = P(X_t = i for some t ≥ 1 | X_0 = i).
Since state i is transient we have q_i < 1. Let Z_i be the number of times the trajectory
returns to state i. Note that Z_i is geometrically distributed with parameter q_i, namely
Pr[Z_i = k] = q_i^k (1 − q_i). Therefore the expected number of returns to state i is
q_i/(1 − q_i), which is finite. The expected number of returns to state i equals
∑_{m=1}^∞ p_{i,i}^{(m)}, and hence if a state is transient we have ∑_{m=1}^∞ p_{i,i}^{(m)} < ∞.
For the other direction, assume that ∑_{m=1}^∞ p_{i,i}^{(m)} < ∞. This implies that there
is an m_0 such that ∑_{m=m_0}^∞ p_{i,i}^{(m)} < 1/2. Considering the probability of returning to i
after m_0 stages, this implies that P(X_t = i for some t ≥ m_0 | X_0 = i) < 1/2. Now
consider the probability q_i' = P(X_t = i for some 1 ≤ t ≤ m_0 | X_0 = i). If q_i' < 1, this
implies that P(X_t = i for some t ≥ 1 | X_0 = i) < q_i' + (1 − q_i')/2 = (1 + q_i')/2 < 1,
which implies that state i is transient. If q_i' = 1, this implies that after at most m_0
stages we are guaranteed to return to i, hence the expected number of returns to state
i is infinite, i.e., ∑_{m=1}^∞ p_{i,i}^{(m)} = ∞. This is in contradiction to the assumption that
∑_{m=1}^∞ p_{i,i}^{(m)} < ∞.
Claim 4.6. If the state space X is finite, all recurrent states are positive recurrent.
Proof. This follows since the set of states that are null recurrent cannot have transi-
tions from positive recurrent states and cannot have a transition to transient states.
If the chain never leaves the set of null recurrent states, then some state would have
a return time which is at most the size of the set. If there is a positive probability
of leaving the set (and never returning) then the states are transient. (See the proof
of Theorem 4.10 for a more formal proof of a similar claim for countable Markov
Chains.)
In the following we illustrate some of the notions that we defined. We start with
the classic random walk on the integers, where all the integers (states) are null recurrent.
Example 4.3 (Random walk). Consider the following Markov chain over the integers.
The states are the integers. The initial state is 0. At each state i, with probability 1/2
we move to i + 1 and with probability 1/2 to i − 1. Namely, p_{i,i+1} = 1/2, p_{i,i−1} = 1/2,
and p_{i,j} = 0 for j ∉ {i − 1, i + 1}. We will show that T_i is finite with probability 1
and E[T_i] = ∞. This implies that all the states are null recurrent.
To compute E[T_i], consider what happens after one and two steps. Let Z_{i,j} be the
time to move from i to j. Note that we have

E[Z_{i+2,i}] = E[Z_{i+2,i+1}] + E[Z_{i+1,i}] = 2 E[Z_{i+1,i}]   and   E[Z_{i+1,i}] = 1 + (1/2) E[Z_{i+2,i}],

where the first identity uses the fact that Z_{i+2,i} = Z_{i+2,i+1} + Z_{i+1,i}, since in order to
reach state i from state i + 2 we need to first reach state i + 1 from state i + 2,
and then state i from state i + 1 (and, by translation invariance, E[Z_{i+2,i+1}] = E[Z_{i+1,i}]).
This implies that we have

E[Z_{i+1,i}] = 1 + E[Z_{i+1,i}].

Clearly, there is no finite value for E[Z_{i+1,i}] (in particular for E[Z_{1,0}]) which satisfies both equations, which
implies E[Z_{1,0}] = ∞, and hence E[T_i] = ∞.
To show that state 0 is a recurrent state, note that the probability that at time 2k
we are at state 0 is exactly p_{0,0}^{(2k)} = \binom{2k}{k} 2^{−2k} ≈ c/√k (using Stirling's approximation),
for some constant c > 0. This implies that

∑_{m=1}^∞ p_{0,0}^{(m)} ≈ ∑_{m=1}^∞ c/√m = ∞,
and therefore state 0 is recurrent. (By symmetry, this shows that all the states are
recurrent.)
Note that this Markov chain has a period of 2. This follows since any trajectory
starting at 0 and returning to 0 has an equal number of +1 and −1 and therefore of
even length. Any even number n has a trajectory of this length that starts at 0 and
returns to 0, for example, having n/2 times +1 followed by n/2 times −1.
The next example is a simple modification of the random walk, where each time
we either return to the origin or continue to the next integer with equal probability.
This Markov chain will have all (non-negative) integers as positive recurrent states.
Example 4.4. Random walk with jumps. Consider the following Markov chain over
the integers. The states are the integers. The initial state is 0. At each state i, with
probability 1/2 we move to i + 1 and with probability 1/2 we return to 0. Namely,
pi,i+1 = 1/2, pi,0 = 1/2, and pi,j = 0 for j 6∈ {0, i + 1}. We will show that E[Ti ] < ∞
(which implies that Ti is finite with probability 1).
From any state we return to 0 with probability 1/2, therefore E[T_0] = 2 (the
return time is 1 with probability 1/2, 2 with probability (1/2)^2, and in general k with probability
(1/2)^k; computing the expectation gives ∑_{k=1}^∞ k/2^k = 2). We will show that for
state i we have E[T_i] ≤ 2 + 2 · 2^i. We will decompose T_i into two parts. The first is the
return to 0, this part has expectation 2. The second is to reach state i from state 0.
Consider an epoch as the time between two visits to 0. The probability that an epoch
would reach i is exactly 2−i . The expected time of an epoch is 2 (the expected time
to return to state 0). The expected time to return to state 0, given that we did not
reach state i is less than 2. Therefore, E[Ti ] ≤ 2 + 2 · 2i .
Note that this Markov chain is aperiodic.
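A minimal simulation sketch of Example 4.4 (our own illustration, not part of the text; the function name return_time, the seed, and the sample sizes are arbitrary choices). It estimates E[T_i] and compares it with the bound 2 + 2·2^i:

    import random

    def return_time(i, rng, max_steps=10**6):
        """Simulate the chain of Example 4.4 started at state i; return T_i."""
        s, t = i, 0
        while True:
            t += 1
            s = s + 1 if rng.random() < 0.5 else 0   # move up or jump back to 0
            if s == i or t >= max_steps:
                return t

    rng = random.Random(0)
    for i in [0, 1, 3]:
        est = sum(return_time(i, rng) for _ in range(20000)) / 20000
        print(i, est, 2 + 2 * 2**i)   # estimated E[T_i] vs. the bound from the text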
Clearly, if X_t ∼ µ then X_{t+1} ∼ µ. If X_0 ∼ µ, then the Markov chain (X_t) is a
stationary stochastic process.

Theorem 4.7. Let (X_t) be an irreducible and aperiodic Markov chain over a finite
state space X with transition matrix P. Then there is a unique distribution µ such
that µ^⊤ P = µ^⊤ and µ > 0.
Proof. Assume that x is an eigenvector of P with eigenvalue λ, i.e., P x = λx.
Since P is a stochastic matrix, we have ‖P x‖_∞ ≤ ‖x‖_∞, which implies that |λ| ≤ 1.
Since the matrix P is row stochastic, P 1 = 1 (where 1 is the all-ones vector), which implies that P has a right
eigenvalue of 1 and this is the maximal eigenvalue. Since the sets of right and left
eigenvalues are identical for square matrices, we conclude that there is an x such that
x^⊤ P = x^⊤. Our first task is to show that there is such an x with x ≥ 0.
Since the Markov chain is irreducible and aperiodic, there is an integer m such
that P^m has all entries strictly positive. Namely, for any i, j ∈ X we have
p_{i,j}^{(m)} > 0.
We now show a general property of positive matrices (matrices where all the
entries are strictly positive). Let A = P^m be a positive matrix and x an eigenvector
of A with eigenvalue 1. First, if x has complex entries then Re(x) and Im(x) are
eigenvectors of A with eigenvalue 1 and one of them is non-zero; therefore we can
assume that x ∈ R^d. We would like to show that there is an x ≥ 0 such that
x^⊤ A = x^⊤. If x ≥ 0 we are done. If x ≤ 0 we can take x' = −x and we are done.
We need to show that x cannot have both positive and negative entries.
For contradiction, assume that we have x_k > 0 and x_{k'} < 0. This implies that
for any weight vector w > 0 we have |x^⊤ w| < |x|^⊤ w, where |x| is the point-wise absolute
value. Therefore,

∑_j |x_j| = ∑_j |∑_i x_i A_{i,j}| < ∑_j ∑_i |x_i| A_{i,j} = ∑_i |x_i| ∑_j A_{i,j} = ∑_i |x_i|,

where the first identity follows since x is an eigenvector, the second (strict) inequality since A is strictly
positive, the third is a change of the order of summation, and the last follows since A = P^m is a
row-stochastic matrix, so each row sums to 1. Clearly, we reached a contradiction,
and therefore x cannot have both positive and negative entries.
We have shown so far that there exists a µ such that µ^⊤ P = µ^⊤ and µ ≥ 0. This
implies that µ/‖µ‖_1 is a steady state distribution. Since A = P^m is strictly positive,
µ^⊤ = µ^⊤ A > 0.
To show the uniqueness of µ, assume we have x and y such that x^⊤ P = x^⊤ and
y^⊤ P = y^⊤ and x ≠ y. Recall that we showed that in such a case both x > 0 and
y > 0. Then there is a non-zero linear combination z = ax + by such that for some i we have
z_i = 0. Since z^⊤ P = z^⊤, the argument above shows that z is strictly positive (or strictly negative), i.e., z_i ≠ 0, which
is a contradiction. Therefore, x = y, and hence µ is unique.
We define the average fraction of time that a state j ∈ X occurs, given that we start
with an initial state distribution x_0, as follows:

π_j^{(m)} = (1/m) ∑_{t=1}^m I(X_t = j).
Theorem 4.8. Let (X_t) be an irreducible and aperiodic Markov chain over a finite
state space X with transition matrix P. Let µ be the stationary distribution of P.
Then, for any j ∈ X we have

µ_j = lim_{m→∞} E[π_j^{(m)}] = 1/E[T_j].
Proof. We have that

E[π_j^{(m)}] = E[(1/m) ∑_{t=1}^m I(X_t = j)] = (1/m) ∑_{t=1}^m Pr[X_t = j] = (1/m) ∑_{t=1}^m x_0^⊤ P^t e_j,

where e_j denotes a vector of zeros with 1 only in the j-th element. Let v_1, . . . , v_n
be the eigenvectors of P with eigenvalues λ_1 ≥ . . . ≥ λ_n. By Theorem 4.7 we
have that v_1 = µ, the stationary distribution, and λ_1 = 1 > λ_i for i ≥ 2. Rewrite
x_0 = ∑_i α_i v_i. Since P^m is a stochastic matrix, x_0^⊤ P^m is a distribution, and therefore
lim_{m→∞} x_0^⊤ P^m = µ^⊤.
We will be interested in the limit π_j = lim_{m→∞} π_j^{(m)}, and mainly in the expected
value E[π_j]. From the above we have that E[π_j] = µ_j.
A different way to express E[π_j] is using a variable time horizon, with a fixed
number of occurrences of j. Let T_{k,j} be the time between the k-th and (k+1)-st occurrence
of state j. This implies that

lim_{m→∞} (1/m) ∑_{t=1}^m I(X_t = j) = lim_{n→∞} n / ∑_{k=1}^n T_{k,j}.
We have established the following general theorem.
Theorem 4.9 (Recurrence of finite Markov chains). Let (Xt ) be an irreducible, a-
periodic Markov chain over a finite state space X. Then the following properties
hold:
1. All states are positive recurrent
2. There exists a unique stationary distribution µ, where µ(i) = 1/E[Ti ].
3. Convergence to the stationary distribution: limt→∞ Pr[Xt = j] = µj (∀j)
4. Ergodicity: For any finite f: lim_{t→∞} (1/t) ∑_{s=0}^{t−1} f(X_s) = ∑_i µ_i f(i) ≜ µ · f.
Proof. From Theorem 4.7, we have that µ > 0, and from Theorem 4.8 we have that
E[T_i] = 1/µ_i < ∞. This establishes (1) and (2).
For any initial distribution x_0 we have that

Pr[X_t = j] = x_0^⊤ P^t e_j,

where e_j denotes a vector of zeros with 1 only in the j-th element. Let v_1, . . . , v_n be
the eigenvectors of P with eigenvalues λ_1 ≥ . . . ≥ λ_n. By Theorem 4.7 we have that
v_1 = µ, the stationary distribution, and λ_1 = 1 > λ_i for i ≥ 2. Rewrite x_0 = ∑_i α_i v_i.
We have that

Pr[X_t = j] = ∑_i α_i λ_i^t v_i^⊤ e_j,

and therefore lim_{t→∞} Pr[X_t = j] = α_1 µ^⊤ e_j = α_1 µ_j. Since P^t is a stochastic matrix,
x_0^⊤ P^t is a distribution, and therefore α_1 = 1. This establishes (3).
Finally, we establish (4) following the proof of Theorem 4.8:

lim_{t→∞} (1/t) ∑_{s=0}^{t−1} f(X_s) = lim_{t→∞} (1/t) ∑_{s=0}^{t−1} ∑_i I(X_s = i) f(i)
                                  = ∑_i f(i) lim_{t→∞} (1/t) ∑_{s=0}^{t−1} I(X_s = i)        (4.1)
                                  = ∑_i µ_i f(i).
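The following Python sketch (our own illustration, not part of the text; the 3-state transition matrix and trajectory length are arbitrary choices) numerically checks Theorem 4.9: the stationary distribution obtained from µ^⊤P = µ^⊤ matches 1/E[T_j] estimated from a long trajectory.

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([[0.1, 0.6, 0.3],
                  [0.4, 0.2, 0.4],
                  [0.5, 0.3, 0.2]])

    # Stationary distribution: solve mu^T (P - I) = 0 together with sum(mu) = 1.
    A = np.vstack([P.T - np.eye(3), np.ones(3)])
    b = np.array([0.0, 0.0, 0.0, 1.0])
    mu = np.linalg.lstsq(A, b, rcond=None)[0]

    # One long trajectory; the gaps between visits to j estimate E[T_j].
    T = 100_000
    traj = np.zeros(T, dtype=int)
    s = 0
    for t in range(T):
        traj[t] = s
        s = rng.choice(3, p=P[s])
    for j in range(3):
        gaps = np.diff(np.flatnonzero(traj == j))
        print(j, mu[j], 1.0 / gaps.mean())   # Theorem 4.9: mu_j = 1/E[T_j]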
Proof. Let i be a positive recurrent state; we will show that all states are positive
recurrent. For any state j, since the Markov chain is irreducible, we have for some
m_1, m_2 ≥ 0 that p_{j,i}^{(m_1)}, p_{i,j}^{(m_2)} > 0. This implies that the return time to state j is at
most E[T_j] ≤ 1/p_{j,i}^{(m_1)} + E[T_i] + 1/p_{i,j}^{(m_2)}, and hence j is positive recurrent.
If there is no positive recurrent state, let i be a null recurrent state; we will
show that all states are null recurrent. For any state j, since the Markov chain is
irreducible, we have for some m_1, m_2 ≥ 0 that p_{j,i}^{(m_1)}, p_{i,j}^{(m_2)} > 0. This implies that
∑_{m=0}^∞ p_{j,j}^{(m)} = ∞, since it is at least p_{j,i}^{(m_1)} (∑_{m=0}^∞ p_{i,i}^{(m)}) p_{i,j}^{(m_2)} = ∞,
where we used that i is a recurrent state. This implies that j is a recurrent state. Since there
are no positive recurrent states, it has to be that j is a null recurrent state.
If there are no positive or null recurrent states, then all states are transient.
Xt+1 = (Xt + At − St )+ .
Suppose that (St ) is a sequence of i.i.d. RVs, and similarly (At ) is a sequence of
i.i.d. RVs, with (St ), (At ) and X0 mutually independent. It may then be seen that
(Xt , t ≥ 0) is a Markov chain. Suppose further that each St is a Bernoulli RV with
parameter q, namely P (St = 1) = q, P (St = 0) = 1 − q. Similarly, let At be a
Bernoulli RV with parameter p. Then
p_{i,j} =   p(1 − q),             j = i + 1,
            (1 − p)(1 − q) + pq,  j = i, i > 0,
            (1 − p)q,             j = i − 1, i > 0,
            (1 − p) + pq,         j = i = 0,
            0,                    otherwise.
Denote λ = p(1 − q), η = (1 − p)q, and ρ = λ/η. The detailed balance equations for
this case are:

µ_i p_{i,i+1} = µ_i λ = µ_{i+1} η = µ_{i+1} p_{i+1,i},   ∀i ≥ 0.

These equations have a solution with ∑_i µ_i = 1 if and only if ρ < 1. The solution
is µ_i = µ_0 ρ^i, with µ_0 = 1 − ρ. This is therefore the stationary distribution of this
queue.
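A small numerical check of the queue's stationary distribution (our own illustration, not part of the text; the values p = 0.3, q = 0.5 and the truncation level are arbitrary):

    import numpy as np

    p, q = 0.3, 0.5                       # arrival and service probabilities
    lam, eta = p * (1 - q), (1 - p) * q   # lambda = p(1-q), eta = (1-p)q
    rho = lam / eta                       # here rho < 1, so a stationary distribution exists

    mu = (1 - rho) * rho ** np.arange(50)     # mu_i = (1 - rho) rho^i (truncated)
    # Detailed balance: mu_i * lambda == mu_{i+1} * eta for all i.
    print(np.allclose(mu[:-1] * lam, mu[1:] * eta))   # True
    print(mu.sum())                                    # close to 1 (geometric tail truncated)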
The mixing time τ is defined as the time to reach a total variation distance of at most 1/4:

‖s_0 P^τ − µ‖_TV = ‖p^{(τ)} − µ‖_TV ≤ (1/4) ‖s_0 − µ‖_TV,

where µ is the steady state distribution and p^{(τ)} is the state distribution after τ steps,
starting with an initial state distribution s_0.
Note that after 2τ time steps we have

‖s_0 P^{2τ} − µ‖_TV = ‖p^{(τ)} P^τ − µ‖_TV ≤ (1/4) ‖p^{(τ)} − µ‖_TV ≤ (1/4^2) ‖s_0 − µ‖_TV.

In general, after kτ time steps we have

‖s_0 P^{kτ} − µ‖_TV = ‖p^{((k−1)τ)} P^τ − µ‖_TV ≤ (1/4) ‖p^{((k−1)τ)} − µ‖_TV ≤ (1/4^k) ‖s_0 − µ‖_TV,

where the formal proof is by induction on k ≥ 1.
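A short sketch (our own illustration, reusing the two-state chain of Example 4.1) that tracks the total variation distance ‖s_0 P^t − µ‖_TV, from which the mixing time can be read off:

    import numpy as np

    P = np.array([[0.4, 0.6],
                  [0.2, 0.8]])
    mu = np.array([0.25, 0.75])          # stationary distribution of this chain
    s0 = np.array([1.0, 0.0])            # start deterministically in state 0

    def tv(p, q):
        return 0.5 * np.abs(p - q).sum() # total variation distance

    p, d0 = s0.copy(), tv(s0, mu)
    for t in range(1, 8):
        p = p @ P
        print(t, tv(p, mu))              # decays geometrically; tau is the first t
                                         # with tv(p, mu) <= d0 / 4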
Chapter 5
Figure 5.1: Markov chain
is called the transition law or transition kernel of the controlled Markov process.
Graphical Notation: The state transition probabilities of a Markov chain are often
illustrated via a state transition diagram, such as in Figure 5.1.
A graphical description of a controlled Markov chain is a bit more complicated
because of the additional action variable.

Figure 5.2: Controlled Markov chain

We obtain the diagram in Figure 5.2 (drawn for state s = 1 only, and for a given time t), reflecting the following transition
probabilities:

p(s' = 2 | s = 1, a = 1) = 1,
p(s' | s = 1, a = 2) = 0.3 for s' = 1,  0.2 for s' = 2,  0.5 for s' = 3.
st+1 = ft (st , at , wt ),
ft (1, 1, wt ) = 2
ft (1, 2, wt ) = wt − 3
This state algebraic equation notation is especially useful for problems with continu-
ous state space, but also for some models with discrete states. Equivalently, we can
write
ft (1, 2, wt ) = 1 · I[wt = 4] + 2 · I[wt = 5] + 3 · I[wt = 6],
where I[·] is the indicator function.
Next we recall the definitions of control policies from Chapter 3.
63
Control Policies
• Similarly, we can define the set ΠM S of Markov stochastic control policies, where
πt (·|ht ) is replaced by πt (·|st ), and the set ΠSS of stationary stochastic control
policies, where πt (·|st ) is replaced by π(·|st ), namely the policy is independent
of the time.
• Note that the set ΠHS includes all other policy sets as special cases.
A control policy π, together with an initial state distribution p_0 (a distribution over
S_0), induces a probability distribution over any finite state-action sequence h_T =
(s_0, a_0, . . . , s_{T−1}, a_{T−1}, s_T), given by

Pr(h_T) = p_0(s_0) ∏_{t=0}^{T−1} p_t(s_{t+1} | s_t, a_t) π_t(a_t | h_t),

where h_t = (s_0, a_0, . . . , s_{t−1}, a_{t−1}, s_t). To see this, observe the recursive relation:

Pr(h_{t+1}) = Pr(h_t, a_t, s_{t+1}) = Pr(s_{t+1} | h_t, a_t) Pr(a_t | h_t) Pr(h_t)
            = p_t(s_{t+1} | s_t, a_t) π_t(a_t | h_t) Pr(h_t).
In the last step we used the conditional Markov property of the controlled chain:
Pr(st+1 |ht , at ) = pt (st+1 |st , at ), and the definition of the control policy πt . The
required formula follows by recursion.
Therefore, the state-action sequence h∞ = (sk , ak )k≥0 can now be considered
a stochastic process. We denote the probability law of this stochastic process by
Prπ,p0 (·). The corresponding expectation operator is denoted by Eπ,p0 (·). When the
initial state s0 is deterministic (i.e., p0 (s) is concentrated on a single state s), we
may simply write Prπ,s (·) or Prπ (·|s0 = s).
Under a Markov control policy, the state sequence (st )t≥0 becomes a Markov
chain, with transition probabilities:
Pr(s_{t+1} = s' | s_t = s) = ∑_{a∈A_t} p_t(s' | s, a) π_t(a | s).
If the controlled Markov chain is stationary (time-invariant) and the control policy
is stationary, then the induced Markov chain is stationary as well.
Remark 5.1. For most non-learning optimization problems, Markov policies suffice
to achieve the optimum.
Remark 5.2. Implicit in these definitions of control policies is the assumption that
the current state s_t can be fully observed before the action a_t is chosen. If this is not
the case we need to consider the problem of a Partially Observed MDP (POMDP),
which is more involved and is not discussed in this book.
5.2 Performance Criteria
5.2.1 Finite Horizon Return
Consider the finite-horizon return, with a fixed time horizon T. As in the deterministic
case, we are given a running reward function rt = {rt (s, a) : s ∈ St , a ∈ At } for
0 ≤ t ≤ T − 1, and a terminal reward function rT = {rT (s) : s ∈ ST }. The obtained
reward is Rt = rt (st , at ) at times t ≤ T − 1, and RT = rT (sT ) at the last stage.
(Note that st , at and sT are random variables that depend both on the policy π and
the stochastic transitions.) Our general goal is to maximize the cumulative return:
∑_{t=0}^T R_t = ∑_{t=0}^{T−1} r_t(s_t, a_t) + r_T(s_T).
However, since the system is stochastic, the cumulative return will generally be a
random variable, and we need to specify in which sense to maximize it. A natural
first option is to consider the expected value of the return. That is, define:
V_T^π(s) = E^π(∑_{t=0}^T R_t | s_0 = s) ≡ E^{π,s}(∑_{t=0}^T R_t).
Here π is the control policy as defined above, and s denotes the initial state. Hence,
VTπ (s) is the expected cumulative return under the control policy π. Our goal is to
find an optimal control policy that maximizes VTπ (s).
Remark 5.3. Reward dependence on the next state: In some problems, the obtained
reward may depend on the next state as well: Rt = r̃t (st , at , st+1 ). For control
purposes, when we only consider the expected value of the reward, we can reduce this
reward function to the usual one by defining
r_t(s, a) ≜ E(R_t | s_t = s, a_t = a) ≡ ∑_{s'∈S} p(s' | s, a) r̃_t(s, a, s').
Remark 5.4. Random rewards: The reward R_t may also be random, namely a random
variable whose distribution depends on (s_t, a_t). This can also be reduced to our
standard model for planning purposes by looking at the expected value of R_t, namely
r_t(s, a) ≜ E(R_t | s_t = s, a_t = a).
Remark 5.5. Risk-sensitive criteria: The expected cumulative return is by far the
most common goal for planning. However, it is not the only one possible. For
example, one may consider the following risk-sensitive return function:

V_{T,λ}^π(s) = (1/λ) log E^{π,s}(exp(λ ∑_{t=0}^T R_t)).
For λ > 0, the exponent gives higher weight to high rewards, and the opposite for
λ < 0.
In the case that the rewards are stochastic, but have a discrete support, we
can construct an equivalent MDP in which all the rewards are deterministic and
trajectories have the same distribution of rewards. This implies that the important
challenge is the stochastic state transition function, and the rewards can be assumed
to be deterministic. Formally, given a trajectory we define a rewards trajectory as the
sub-trajectory that includes only the rewards, i.e., for a trajectory (s0 , a0 , r0 , s1 , . . .)
the reward trajectory is (r0 , r1 , . . .).
Theorem 5.1. Given an MDP M(S, A, P, r, s_0) where the rewards are stochastic
with support K = {1, . . . , k}, there is an MDP M'(S × K, A, P', r', s'_0) and a mapping
of policies π of M to policies π' of M', such that if running π in M for horizon T
generates the reward trajectory R = (R_0, . . . , R_T) and running π' in M' for horizon T + 1
generates the reward trajectory R' = (R'_1, . . . , R'_{T+1}), then the distributions of R and R'
are identical.
Proof. For simplicity we assume that the MDP is loop-free, namely you can reach
any state at most once in a trajectory. This is mainly to simplify the notation.
The basic idea is to encode the rewards in the states of M', which are S' = S × K.
For each (s, i) ∈ S' and action a ∈ A we have p'_t((s', j) | (s, i), a) =
p_t(s' | s, a) Pr[R_t(s, a) = j], and p'_T((s', j) | (s, i)) = I(s' = s) Pr[R_T(s) = j]. The
reward is r'_t((s, i), a) = i. The initial state is s'_0 = (s_0, 0).
For any policy π(a|s) in M we have a policy π' in M' where π'(a | (s, i)) = π(a|s).
We map trajectories of M to trajectories of M' which have identical probabilities.
A trajectory (s_0, a_0, R_0, s_1, a_1, R_1, s_2, . . . , R_T) is mapped to ((s_0, 0), a_0, 0,
(s_1, R_0), a_1, R_0, (s_2, R_1), . . . , R_T). Let R and R' be the respective reward trajectories.
Clearly, the two trajectories have identical probabilities. This implies that the
reward trajectories R and R' have identical distributions (up to a shift of one in
the index).
Theorem 5.1 requires the number of rewards to be bounded, and guarantees that
the reward distribution be identical. In the case that the rewards are continuous, we
can have a similar guarantee for linear return functions.
Theorem 5.2. Given an MDP M(S, A, P, r, s_0) where the rewards are stochastic
with support [0, 1], there is an MDP M'(S, A, P, r', s_0) where the rewards are stochastic
with support {0, 1}, such that for any policy π ∈ Π_MS the distribution of the
expected rewards trajectory is identical.

Corollary 5.3. Given an MDP M(S, A, P, r, s_0) where the rewards are stochastic
with support [0, 1], there is an MDP M'(S × {0, 1}, A, P', r', s'_0) and a mapping of
policies π ∈ Π_MS of M to policies π' ∈ Π_MD of M', such that V_T^{π,M}(s_0) = V_{T+1}^{π',M'}(s'_0).
Discounted return: The most common performance criterion for infinite horizon
problems is the expected discounted return:
V_γ^π(s) = E^π(∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s) ≡ E^{π,s}(∑_{t=0}^∞ γ^t r(s_t, a_t)),
where 0 < γ < 1 is the discount factor. Mathematically, the discount factor ensures
convergence of the sum (whenever the reward sequence is bounded). This makes
the problem "well behaved", and relatively easy to analyze. The discounted return
is discussed in Chapter 6.
Average return: Here we are interested to maximize the long-term average return.
The most common definition of the long-term average return is,
V_{av}^π(s) = liminf_{T→∞} (1/T) E^{π,s}(∑_{t=0}^{T−1} r(s_t, a_t)).
Stochastic shortest path: Here the time horizon is not fixed in advance; rather, the problem terminates when a goal state in a given set S_G ⊆ S is reached. Define

τ = inf{t ≥ 0 : s_t ∈ S_G}

as the first time in which a goal state is reached. The total expected return for this
problem is defined as:

V_{ssp}^π(s) = E^{π,s}(∑_{t=0}^{τ−1} r(s_t, a_t) + r_G(s_τ)).

Here r_G(s), s ∈ S_G, specifies the reward at goal states. Note that the length of the
run τ is a random variable.
Stochastic shortest path includes, naturally, the finite horizon case. This can
be shown by creating a leveled MDP where at each time step we move to the next
level and terminate at level T. Specifically, we define a new state space S' = S × {0, . . . , T},
a transition function p'((s', t + 1) | (s, t), a) = p(s' | s, a), and goal states S_G = {(s, T) :
s ∈ S}.
Stochastic shortest path also includes the discounted infinite horizon. To see that,
add a new goal state, and from each state with probability 1 − γ jump to the goal state
and terminate. The expected return of a policy would be the same in both models.
Specifically, we add a state s_G and modify the transition probability to p', such that
p'(s_G | s, a) = 1 − γ for any state s ∈ S and action a ∈ A, and p'(s' | s, a) = γ p(s' | s, a).
The probability that we do not terminate by time t is exactly γ^t. Therefore, the
expected return is E^{π,s}(∑_{t=0}^∞ γ^t r(s_t, a_t)), which is identical to the discounted return.
t=0
This class of problems provides a natural extension of the standard shortest-path
problem to stochastic settings. Some conditions on the system dynamics and reward
function must be imposed for the problem to be well posed (e.g., that a goal state
may be reached with probability one). Stochastic shortest path problems are also
known as episodic MDP problems.
p_t^{π,s_0}(s, a) = P^{π,s_0}(s_t = s, a_t = a),   (s, a) ∈ S_t × A_t,
where π is the control policy used, and s0 is a given initial state. We wish to maximize
the expected return V π (s0 ) over all control policies, and find an optimal policy π ∗
that achieves the maximal expected return V^*(s_0) for all initial states s_0. Thus,

V_T^*(s_0) ≜ V_T^{π^*}(s_0) = max_{π∈Π_HS} V_T^π(s_0).
This principle is not an actual claim, but rather a guiding principle that can
be applied in different ways to each problem. For example, considering our finite-
horizon problem, let π^* = (π_0, . . . , π_{T−1}) denote an optimal Markov policy. Take
any state s_t = s' which has a positive probability of being reached under π^*, namely
P^{π^*,s_0}(s_t = s') > 0. Then the tail policy π^*_{t:T} = (π_t, . . . , π_{T−1}) is optimal for the "tail"
criterion V_{t:T}^π(s') = E^π(∑_{k=t}^T R_k | s_t = s').
Note that the reverse is not true. The prefix of the optimal policy is not optimal
for the “prefix” problem. When we plan for a long horizon, we might start with
non-greedy actions, so we can improve our return in later time steps. Specifically,
the first action taken does not have to be the optimal action for horizon T = 1, for
which the greedy action is optimal.
Lemma 5.5 (Value Iteration). V_k^π(s) may be computed by the backward recursion:

V_k^π(s) = r_k(s, π_k(s)) + ∑_{s'∈S_{k+1}} p_k(s' | s, π_k(s)) V_{k+1}^π(s'),   ∀s ∈ S_k.
Proof. Observe that:

V_k^π(s) = E^π[ R_k + ∑_{t=k+1}^T R_t | s_k = s, a_k = π_k(s) ]
        = E^π[ R_k + E^π(∑_{t=k+1}^T R_t | s_{k+1}) | s_k = s, a_k = π_k(s) ]
        = E^π[ r_k(s_k, a_k) + V_{k+1}^π(s_{k+1}) | s_k = s, a_k = π_k(s) ]
        = r_k(s, π_k(s)) + ∑_{s'∈S_{k+1}} p_k(s' | s, π_k(s)) V_{k+1}^π(s').
The first identity simply writes the value function explicitly, starting at state
s at time k and using action a = π_k(s); we split the sum into R_k, the immediate
reward, and the sum of the later rewards. The second identity uses the law of
total probability: we condition on the state s_{k+1} and take the expectation over
it. The third identity observes that the expected value of the inner sum is exactly the
value function at s_{k+1}. The last identity writes the expectation over s_{k+1} explicitly.
This completes the proof of the lemma.
Remark 5.6. Note that ∑_{s'∈S_{k+1}} p_k(s' | s, a) V_{k+1}^π(s') = E^π(V_{k+1}^π(s_{k+1}) | s_k = s, a_k = a).
Remark 5.7. For the more general reward function r̃_t(s, a, s'), the recursion takes
the form

V_k^π(s) = ∑_{s'∈S_{k+1}} p_k(s' | s, π_k(s)) [ r̃_k(s, π_k(s), s') + V_{k+1}^π(s') ].
Define the optimal "tail" value function

V_k^*(s) = max_{π^k} E^{π^k}(∑_{t=k}^T R_t | s_k = s),

where the maximum is taken over "tail" policies π^k = (π_k, . . . , π_{T−1}) that start from
time k. Note that π^k is allowed to be a general policy, i.e., history-dependent and
stochastic. Obviously, V_0^*(s_0) = V^*(s_0).
Theorem 5.6 (Finite-horizon Dynamic Programming). The following holds:
1. Backward recursion: Set V_T(s) = r_T(s) for s ∈ S_T.
For k = T − 1, . . . , 0, compute V_k(s) using the following recursion:

V_k(s) = max_{a∈A_k} [ r_k(s, a) + ∑_{s'∈S_{k+1}} p_k(s' | s, a) V_{k+1}(s') ],   s ∈ S_k.
Note that Theorem 5.6 specifies an optimal control policy which is a deterministic
Markov policy.
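The backward recursion of Theorem 5.6 is straightforward to implement. The following Python sketch (our own illustration, not part of the text; it assumes a stationary model with arrays P[a, s, s'] and r[s, a], whereas the text allows time-dependent p_t and r_t, and the toy numbers are made up) also records the greedy deterministic Markov policy via the intermediate Q-values:

    import numpy as np

    def finite_horizon_dp(P, r, r_T, T):
        """Backward recursion of Theorem 5.6 for a stationary model.

        P[a, s, s'] are transition probabilities, r[s, a] running rewards,
        r_T[s] terminal rewards.  Returns V_0..V_T and a greedy policy pi_0..pi_{T-1}.
        """
        V = [None] * (T + 1)
        pi = [None] * T
        V[T] = r_T.copy()
        for k in range(T - 1, -1, -1):
            # Q_k(s, a) = r(s, a) + sum_{s'} p(s'|s, a) V_{k+1}(s')
            Q = r + np.einsum('ast,t->sa', P, V[k + 1])
            V[k] = Q.max(axis=1)
            pi[k] = Q.argmax(axis=1)
        return V, pi

    # A toy 2-state, 2-action example (made up for illustration).
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
                  [[0.5, 0.5], [0.6, 0.4]]])   # action 1
    r = np.array([[1.0, 0.0],                  # r[s, a]
                  [0.0, 2.0]])
    V, pi = finite_horizon_dp(P, r, np.zeros(2), T=5)
    print(V[0], pi[0])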
We will first establish that V_k^*(s) ≥ W_k(s), and then that V_k^*(s) ≤ W_k(s).
(a) We first show that V_k^*(s) ≥ W_k(s). For that purpose, it is enough to find a
policy π^k so that V_k^{π^k}(s) = W_k(s), since V_k^*(s) ≥ V_k^π(s) for any policy π.
Fix s ∈ S_k, and define π^k as follows: Choose a_k = ā, where

ā ∈ arg max_{a∈A_k} [ r_k(s, a) + ∑_{s'∈S_{k+1}} p_k(s' | s, a) V_{k+1}(s') ],
and then, after observing s_{k+1} = s', proceed with the optimal tail policy π^{k+1}(s')
that obtains V_{k+1}^{π^{k+1}(s')}(s') = V_{k+1}(s'). Proceeding similarly to the proof of Lemma 5.5
(value iteration for a fixed policy), we obtain:

V_k^{π^k}(s) = r_k(s, ā) + ∑_{s'∈S_{k+1}} p_k(s' | s, ā) V_{k+1}^{π^{k+1}(s')}(s')                (5.1)
            = r_k(s, ā) + ∑_{s'∈S_{k+1}} p_k(s' | s, ā) V_{k+1}(s') = W_k(s),                    (5.2)

as was required.
(b) To establish V_k^*(s) ≤ W_k(s), it is enough to show that V_k^{π^k}(s) ≤ W_k(s) for
any (general, randomized) "tail" policy π^k.
Fix s ∈ S_k. Consider then some tail policy π^k = (π_k, . . . , π_{T−1}). Note that this
means that a_t ∼ π_t(a | h_{k:t}), where h_{k:t} = (s_k, a_k, s_{k+1}, a_{k+1}, . . . , s_t). For each state-
action pair s ∈ S_k and a ∈ A_k, let (π^k | s, a) denote the tail policy π^{k+1} from time
k + 1 onwards which is obtained from π^k given that s_k = s, a_k = a. As before, by
value iteration for a fixed policy,

V_k^{π^k}(s) = ∑_{a∈A_k} π_k(a | s) [ r_k(s, a) + ∑_{s'∈S_{k+1}} p_k(s' | s, a) V_{k+1}^{(π^k|s,a)}(s') ].

Since V_{k+1}^{(π^k|s,a)}(s') ≤ V_{k+1}^*(s') = V_{k+1}(s'), it follows that
V_k^{π^k}(s) ≤ max_{a∈A_k} [ r_k(s, a) + ∑_{s'∈S_{k+1}} p_k(s' | s, a) V_{k+1}(s') ] = W_k(s), as required.
5.4.4 The Q function
Let

Q_k^*(s, a) ≜ r_k(s, a) + ∑_{s'∈S_{k+1}} p_k(s' | s, a) V_{k+1}^*(s').

This is known as the optimal state-action value function, or simply as the Q-function.
Q_k^*(s, a) is the expected return from stage k onward, if we choose a_k = a and then
proceed optimally.
Theorem 5.6 can now be succinctly expressed as

V_k^*(s) = max_{a∈A_k} Q_k^*(s, a),

and

π_k^*(s) ∈ arg max_{a∈A_k} Q_k^*(s, a).
The Q function provides the basis for the Q-learning algorithm, which is one of the
basic Reinforcement Learning algorithms and will be discussed in Chapter 11.
5.5 Summary
• The optimal value function can be computed by backward recursion. This
recursive equation is known as the dynamic programming equation, optimality
equation, or Bellman’s Equation.
• The optimization in each stage is performed in the action space. The total
number of maximization operations needed is T·|S|, each over |A| choices. This
replaces "brute force" optimization in policy space, with tremendous computational
savings, as the number of (deterministic) Markov policies is |A|^{T·|S|}.
Chapter 6
This chapter covers the basic theory and main solution methods for stationary MDPs
over an infinite horizon, with the discounted return criterion, which we will refer to
as discounted MDPs.
The discounted return problem is the most “well behaved” among all infinite
horizon problems (such as average return and stochastic shortest path), and its theory
is relatively simple, both in the planning and the learning contexts. For that reason,
as well as its usefulness, we will consider here the discounted problem and its solution
in some detail.
• γ ∈ (0, 1) is the discount factor.
We observe that γ < 1 ensures convergence of the infinite sum (since the rewards
r(st , at ) are uniformly bounded). With γ = 1 we obtain the total return criterion,
which is harder to handle due to possible divergence of the sum.
Let V_γ^*(s) denote the maximal expected value of the discounted return, over all
(possibly history-dependent and randomized) control policies, i.e., V_γ^*(s) = max_{π∈Π_HS} V_γ^π(s).
Our goal is to find an optimal control policy π ∗ that attains that maximum
(for all initial states), and compute the numeric value of the optimal return Vγ∗ (s).
As we shall see, for this problem there always exists an optimal policy which is a
(deterministic) stationary policy.
Remark 6.1. As usual, the discounted performance criterion can be defined in terms
of cost:

C_γ^π(s) = E^{π,s}(∑_{t=0}^∞ γ^t c(s_t, a_t)),
where c(s, a) is the running cost function. Our goal is then to minimize the discounted
cost Cγπ (s).
Lemma 6.1. For π ∈ Π_SD, the value function V^π satisfies the following set of |S|
linear equations:

V^π(s) = r(s, π(s)) + γ ∑_{s'∈S} p(s' | s, π(s)) V^π(s'),   ∀s ∈ S.      (6.1)
Proof. We first note that

V^π(s) ≜ E^π(∑_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s) = E^π(∑_{t=1}^∞ γ^{t−1} r(s_t, a_t) | s_1 = s),

since both the model and the policy are stationary. Now,

V^π(s) = r(s, π(s)) + E^π(∑_{t=1}^∞ γ^t r(s_t, π(s_t)) | s_0 = s)
       = r(s, π(s)) + E^π[ E^π(∑_{t=1}^∞ γ^t r(s_t, π(s_t)) | s_0 = s, s_1 = s') | s_0 = s ]
       = r(s, π(s)) + ∑_{s'∈S} p(s' | s, π(s)) E^π(∑_{t=1}^∞ γ^t r(s_t, π(s_t)) | s_1 = s')
       = r(s, π(s)) + γ ∑_{s'∈S} p(s' | s, π(s)) E^π(∑_{t=1}^∞ γ^{t−1} r(s_t, a_t) | s_1 = s')
       = r(s, π(s)) + γ ∑_{s'∈S} p(s' | s, π(s)) V^π(s').
The first equality is by the definition of the value function. The second equality
follows from the law of total expectation, conditioning on s_1 = s' and taking the expectation
over it; by definition a_t = π(s_t). The third equality follows similarly to the
finite-horizon case (Lemma 5.5 in Chapter 5). The fourth is simple algebra, taking
one multiple of the discount factor γ outside. The last follows by the observation in the
beginning of the proof.
We can write the linear equations in (6.1) in vector form as follows. Define
the column vector rπ = (rπ (s))s∈S with components rπ (s) = r(s, π(s)), and the
transition matrix Pπ with components Pπ (s0 |s) = p(s0 |s, π(s)). Finally, let V π denote
a column vector with components V π (s). Then (6.1) is equivalent to the linear
equation set
V π = rπ + γPπ V π (6.2)
Lemma 6.2. The set of linear equations (6.1) or (6.2), with V π as variables, has a
unique solution V π , which is given by
V π = (I − γPπ )−1 rπ .
Proof. We only need to show that the square matrix I − γP_π is non-singular. Let
(λ_i) denote the eigenvalues of the matrix P_π. Since P_π is a stochastic matrix (row
sums are 1), we have |λ_i| ≤ 1 (see the proof of Theorem 4.7). Now, the eigenvalues of
I − γP_π are (1 − γλ_i), and they satisfy |1 − γλ_i| ≥ 1 − γ > 0.
Combining Lemma 6.1 and Lemma 6.2, we obtain
Proposition 6.3. Let π ∈ ΠSD . The value function V π = [V π (s)] is the unique
solution of equation (6.2), given by
V π = (I − γPπ )−1 rπ .
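A minimal numerical sketch of Proposition 6.3 (our own illustration, not part of the text; the transition matrix and rewards induced by the fixed policy are made-up values): policy evaluation reduces to a single linear solve.

    import numpy as np

    def evaluate_policy(P_pi, r_pi, gamma):
        """Solve V = r_pi + gamma P_pi V, i.e. V = (I - gamma P_pi)^{-1} r_pi."""
        n = P_pi.shape[0]
        return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

    # Toy example: P_pi(s'|s) and r_pi(s) induced by some fixed stationary policy.
    P_pi = np.array([[0.9, 0.1],
                     [0.3, 0.7]])
    r_pi = np.array([1.0, 0.0])
    V = evaluate_policy(P_pi, r_pi, gamma=0.9)
    print(V)
    print(np.allclose(V, r_pi + 0.9 * P_pi @ V))   # fixed-point check of (6.2)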
Note that Line 3 in Algorithm 7 can equivalently be written in matrix form as:
Vn+1 = rπ + γPπ Vn .
Note that V_n(s) is the n-stage discounted return, with terminal reward r_n(s_n) =
V_0(s_n). Comparing with the definition of V^π, we can see that

V^π(s) − V_n(s) = E^π(∑_{t=n}^∞ γ^t r(s_t, a_t) − γ^n V_0(s_n) | s_0 = s).

Denoting R_max = max_{s,a} |r(s, a)| and V̄_0 = max_s |V_0(s)|, we obtain

|V^π(s) − V_n(s)| ≤ γ^n (R_max/(1 − γ) + V̄_0),

which converges to 0 since γ < 1.
Comments:
• The proof provides an explicit bound on |V^π(s) − V_n(s)|. It may be seen that
the convergence is exponential, with rate O(γ^n).
• Similarly, V^π = ∑_{t=0}^∞ (γP_π)^t r_π.
In summary:
We denote

V^*(s) ≜ V_γ^*(s),   ∀s ∈ S,

and refer to V^* as the optimal value function. Depending on the context, we consider
V^* either as a function V^* : S → R, or as a column vector V^* = [V^*(s)]_{s∈S}.
The following optimality equation provides an explicit characterization of the
value function, and shows that an optimal stationary policy can easily be computed
if the value function is known. (See the proof in Section 6.5.)
The optimality equation (6.3) is non-linear, and generally requires iterative algo-
rithms for its solution. The main iterative algorithms are value iteration and policy
iteration. In the following we provide the algorithms and the basic claims. Later in
this chapter we formally prove the results regarding value iteration (Section 6.6) and
policy iteration (Section 6.7).
Proof. Using our previous results on value iteration for the finite-horizon problem,
namely the proof of Proposition 6.4, it follows that

V_n(s) = max_π E^{π,s}(∑_{t=0}^{n−1} γ^t R_t + γ^n V_0(s_n)).

Comparing to the optimal value function

V^*(s) = max_π E^{π,s}(∑_{t=0}^∞ γ^t R_t),

we obtain, as in the fixed-policy case, |V_n(s) − V^*(s)| ≤ γ^n (R_max/(1 − γ) + V̄_0), which converges to 0 since γ < 1.
The value iteration algorithm iterates over the value functions, with asymptotic
convergence. The policy iteration algorithm iterates over stationary policies, with
each new policy better than the previous one. This algorithm converges to the
optimal policy in a finite number of steps.
1. Each policy πk+1 is improving over the previous one πk , in the sense that
V πk+1 ≥ V πk (component-wise).
Remark 6.2. An additional solution method for DP planning relies on a Linear Pro-
gramming formulation of the problem. See chapter 8.
6.4 Contraction Operators
The basic proof methods of the DP results mentioned above rely on the concept of
a contraction operator. We provide here the relevant mathematical background, and
illustrate the contraction properties of some basic Dynamic Programming operators.
3. ‖x‖ = 0 only if x = 0.
Common examples are the p-norm ‖x‖_p = (∑_{i=1}^d |x_i|^p)^{1/p} for p ≥ 1, and in
particular the Euclidean norm (p = 2). Here we will mostly use the max-norm ‖x‖_∞ = max_{1≤i≤d} |x_i|.
Theorem 6.8 (Banach’s fixed point theorem). Let T : Rd → Rd be a contraction
operator. Then
1. The equation T (v) = v has a unique solution V ∗ ∈ Rd .
2. For any v0 ∈ Rd , limn→∞ T n (v0 ) = V ∗ . In fact, ||T n (v0 ) − V ∗ || ≤ O(β n ), where
β is the contraction coefficient.
Proof. Fix any v_0 and define v_{n+1} = T(v_n). We will show that: (1) there exists a
limit to the sequence, and (2) the limit is a fixed point of T.
Existence of a limit v^* of the sequence (v_n):
We show that the sequence (v_n) is a Cauchy sequence. We consider two elements
v_n and v_{n+m} and bound the distance between them:

‖v_{n+m} − v_n‖ = ‖∑_{k=0}^{m−1} (v_{n+k+1} − v_{n+k})‖
              ≤ ∑_{k=0}^{m−1} ‖v_{n+k+1} − v_{n+k}‖              (triangle inequality)
              = ∑_{k=0}^{m−1} ‖T^{n+k}(v_1) − T^{n+k}(v_0)‖
              ≤ ∑_{k=0}^{m−1} β^{n+k} ‖v_1 − v_0‖                (contraction, applied n + k times)
              = (β^n (1 − β^m)/(1 − β)) ‖v_1 − v_0‖.
Since this coefficient decreases to zero as n increases, for any ε > 0 there exists N > 0 such
that for all n, m ≥ N we have ‖v_{n+m} − v_n‖ < ε. This implies that the sequence is
a Cauchy sequence, and hence the sequence (v_n) has a limit. Let us call this limit v^*.
Next we show that v^* is a fixed point of the operator T.
The limit v^* is a fixed point:
We need to show that T(v^*) = v^*, or equivalently ‖T(v^*) − v^*‖ = 0. Indeed,

0 ≤ ‖T(v^*) − v^*‖
  ≤ ‖T(v^*) − v_n‖ + ‖v_n − v^*‖            (triangle inequality)
  = ‖T(v^*) − T(v_{n−1})‖ + ‖v_n − v^*‖
  ≤ β ‖v^* − v_{n−1}‖ + ‖v_n − v^*‖,

and both terms converge to 0 since v^* is the limit of (v_n), i.e., lim_{n→∞} ‖v_n − v^*‖ = 0. Hence
‖T(v^*) − v^*‖ = 0.
Uniqueness of v^*:
Assume that T(v_1) = v_1, T(v_2) = v_2, and v_1 ≠ v_2. Then

‖v_1 − v_2‖ = ‖T(v_1) − T(v_2)‖ ≤ β ‖v_1 − v_2‖ < ‖v_1 − v_2‖,

which is a contradiction. Therefore v_1 = v_2.
Definition 6.2. For a fixed stationary policy π : S → A, define the Fixed Policy
DP Operator T^π : R^{|S|} → R^{|S|} as follows: For any V = (V(s)) ∈ R^{|S|},

(T^π(V))(s) = r(s, π(s)) + γ ∑_{s'∈S} p(s' | s, π(s)) V(s'),   ∀s ∈ S.
Proof. 1. Fix V_1, V_2. For every state s,

|(T^π(V_1))(s) − (T^π(V_2))(s)| = γ |∑_{s'∈S} p(s' | s, π(s)) [V_1(s') − V_2(s')]|
                                ≤ γ ∑_{s'∈S} p(s' | s, π(s)) |V_1(s') − V_2(s')|
                                ≤ γ ∑_{s'∈S} p(s' | s, π(s)) ‖V_1 − V_2‖_∞ = γ ‖V_1 − V_2‖_∞.
2. (a) Fix V_1, V_2 and a state s, and let ā denote an action attaining the maximum in (T^*(V_1))(s). Then

(T^*(V_1))(s) = r(s, ā) + γ ∑_{s'∈S} p(s' | s, ā) V_1(s'),
(T^*(V_2))(s) ≥ r(s, ā) + γ ∑_{s'∈S} p(s' | s, ā) V_2(s').
Since the same action ā appears in both expressions, we can now continue to
show inequality (a) similarly to 1. Namely,

(T^*(V_1))(s) − (T^*(V_2))(s) ≤ γ ∑_{s'∈S} p(s' | s, ā) (V_1(s') − V_2(s'))
                              ≤ γ ∑_{s'∈S} p(s' | s, ā) ‖V_1 − V_2‖_∞ = γ ‖V_1 − V_2‖_∞.
(b) Showing (T^*(V_2))(s) − (T^*(V_1))(s) ≤ γ ‖V_1 − V_2‖_∞: similarly to the proof of
(a) we have

(T^*(V_2))(s) − (T^*(V_1))(s) ≤ γ ‖V_2 − V_1‖_∞ = γ ‖V_1 − V_2‖_∞.

The inequalities (a) and (b) together imply that |(T^*(V_1))(s) − (T^*(V_2))(s)| ≤
γ ‖V_1 − V_2‖_∞. Since this holds for any state s, it follows that ‖T^*(V_1) −
T^*(V_2)‖_∞ ≤ γ ‖V_1 − V_2‖_∞.
6.5 Proof of Bellman’s Optimality Equation
We prove in this section Theorem 6.5, which is restated here:
2. By definition of π^* we have

T^{π^*}(V^*) = T^*(V^*) = V^*,

where the last equality follows from part 1. Thus the optimal value function
satisfies the equation T^{π^*}(V^*) = V^*. But we already know (from Proposition
6.4) that V^{π^*} is the unique solution of that equation, hence V^{π^*} = V^*.
This implies that π^* achieves the optimal value (for any initial state), and is
therefore an optimal policy as stated.
6.6 Value Iteration (VI)
The value iteration algorithm allows to compute the optimal value function V ∗ iter-
atively to any required accuracy. The Value Iteration algorithm (Algorithm 8) can
be stated as follows:
Vn+1 = T ∗ (Vn ), n ≥ 0.
Note that the number of operations for each iteration is O(|A| · |S|2 ). Theorem 6.6
states that Vn → V ∗ , exponentially fast.
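A compact sketch of value iteration (our own illustration, not part of the text; the model arrays, indexing convention P[a, s, s'], and the tolerance are arbitrary choices). It implements the update V_{n+1} = T^*(V_n) and returns the final greedy policy:

    import numpy as np

    def bellman_operator(V, P, r, gamma):
        """(T* V)(s) = max_a [ r(s,a) + gamma sum_{s'} p(s'|s,a) V(s') ]."""
        Q = r + gamma * np.einsum('ast,t->sa', P, V)
        return Q.max(axis=1), Q.argmax(axis=1)

    def value_iteration(P, r, gamma, tol=1e-8):
        V = np.zeros(P.shape[1])
        while True:
            V_new, pi = bellman_operator(V, P, r, gamma)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, pi
            V = V_new

    # Same toy model as before: P[a, s, s'], r[s, a] (illustrative values).
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.6, 0.4]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    V_star, pi_star = value_iteration(P, r, gamma=0.9)
    print(V_star, pi_star)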
‖V^{π_{n+1}} − V_{n+1}‖ = ‖T^{π_{n+1}}(V^{π_{n+1}}) − V_{n+1}‖        (because V^{π_{n+1}} is the fixed point of T^{π_{n+1}})
                        ≤ ‖T^{π_{n+1}}(V^{π_{n+1}}) − T^*(V_{n+1})‖ + ‖T^*(V_{n+1}) − V_{n+1}‖.

Since π_{n+1} is maximal over the actions using V_{n+1}, it implies that T^{π_{n+1}}(V_{n+1}) =
T^*(V_{n+1}), and we conclude that:
and therefore

‖V_{n+1} − V^*‖ ≤ (γ/(1 − γ)) ‖V_{n+1} − V_n‖ < (γ/(1 − γ)) · ((1 − γ)/(2γ)) · ε = ε/2.

Returning to inequality (6.5), it follows that

‖V^{π_{n+1}} − V^*‖ ≤ (2γ/(1 − γ)) ‖V_{n+1} − V_n‖ < ε.

Therefore the selected policy π_{n+1} is ε-optimal.
Lemma 6.11 (Policy Improvement). Let π be a stationary policy and π̄ be a π-improving
policy. We have V^π̄ ≥ V^π (component-wise), and V^π̄ = V^π if and only if π
is an optimal policy.
Proof. Observe first that
V π = T π (V π ) ≤ T ∗ (V π ) = T π̄ (V π )
The first equality follows since V π is the value function for the policy π, the inequality
follows because of the maximization in the definition of T ∗ , and the last equality by
definition of the improving policy π̄.
It is easily seen that T π is a monotone operator (for any policy π), namely V1 ≤
V2 implies T π (V1 ) ≤ T π (V2 ). Applying T π̄ repeatedly to both sides of the above
inequality V π ≤ T π̄ (V π ) therefore gives
V^π ≤ T^π̄(V^π) ≤ (T^π̄)^2(V^π) ≤ · · · ≤ lim_{n→∞} (T^π̄)^n(V^π) = V^π̄,      (6.6)
where the last equality follows by Theorem 6.6. This establishes the first claim.
We now show that π is optimal if and only if V π̄ = V π . We showed that V π̄ ≥ V π .
If V π̄ > V π then clearly π is not optimal. Assume that V π̄ = V π . We have the
following identities:
V π = V π̄ = T π̄ (V π̄ ) = T π̄ (V π ) = T ∗ (V π ),
where the first equality is by our assumption. The second equality follows since V π̄ is
the fixed point of its operator T π̄ . The third follows since we assume that V π̄ = V π .
The last equality follows since T π̄ and T ∗ are identical on V π .
We have established that V^π = T^*(V^π), i.e., V^π is a fixed point of
T^*, and therefore, by Theorem 6.5, the policy π is optimal.
The policy iteration algorithm performs successive rounds of policy improvement,
where each policy πk+1 improves the previous one πk . Since the number of stationary
deterministic policies is bounded, so is the number of strict improvements, and the
algorithm must terminate with an optimal policy after a finite number of iterations.
In terms of computational complexity, Policy Iteration requires O(|A| · |S|2 +|S|3 )
operations per iteration, while Value Iteration requires O(|A| · |S|2 ) per iteration.
However, in many cases the Policy Iteration has a smaller number of iterations than
Value Iteration, as we show in the next section. Another consideration is that the
number of iterations of Value Iteration increases as the discount factor γ approaches
1, while the number of policies (which upper bound the number of iterations of Policy
Iteration) is independent of γ.
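A corresponding sketch of policy iteration (our own illustration, not part of the text; it reuses the same toy model and the P[a, s, s'], r[s, a] convention from the value-iteration sketch above), alternating exact policy evaluation with greedy improvement:

    import numpy as np

    def policy_iteration(P, r, gamma):
        """Policy iteration: evaluate the current policy exactly, then act greedily."""
        A, S, _ = P.shape
        pi = np.zeros(S, dtype=int)
        while True:
            # Policy evaluation: V^pi = (I - gamma P_pi)^{-1} r_pi.
            P_pi = P[pi, np.arange(S), :]          # P_pi[s, s'] = p(s'|s, pi(s))
            r_pi = r[np.arange(S), pi]
            V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
            # Policy improvement: greedy with respect to V^pi.
            Q = r + gamma * np.einsum('ast,t->sa', P, V)
            pi_new = Q.argmax(axis=1)
            if np.array_equal(pi_new, pi):
                return V, pi
            pi = pi_new

    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.6, 0.4]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    print(policy_iteration(P, r, gamma=0.9))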
6.8 A Comparison between VI and PI Algorithms
In this section we compare the convergence rates of the VI and PI algorithms.
We show that, assuming the two algorithms begin with the same approximate
value function, the PI algorithm converges in fewer iterations.
Theorem 6.12. Let {V In } be the sequence of values created by the VI algorithm (where
V In+1 = T ∗ (V In )) and let {P In } be the sequence of values created by PI algorithm,
i.e., P In = V πn . If V I0 = P I0 , then for all n we have V In ≤ P In ≤ V ∗ .
Since V I_n ≤ P I_n and T^{π'} is monotonic, it follows that

T^{π'}(V I_n) ≤ T^{π'}(P I_n),

and

T^*(P I_n) = T^{π_{n+1}}(P I_n).
6.9 Bibliography notes
The value iteration method dates back to to Bellman [10]. The computational com-
plexity analysis of value iteration first explicitly appeared in [70]. The work of Black-
well [14] introduces the contracting operators and the fixed point for the analysis of
MDPs.
The policy iteration originated in the work of Howard [42]. There has been
significant interest in bounding the number of iteration of policy iterations, with a
dependency only on the number of states and actions. A simple upper bound is the
number of policies, |A||S| , since each policy is selelcted at most once. The work of
[80] shows a lower bound of Ω(2|S|/2 ) for a special class of policy iteration, where
only a single state of all improving states is updated and two actions. The work of
[77] shows that if the policy iteration updates with all the improving states (as it
is define here) then the number of iterations is at most O(|A||S| /|S|). The work of
[32] shows a n-state and Θ(n) action MDP for which the policy iteration requires
Ω(2n/7 ) iterations for the average cost return, and [41] for the discounted return.
Surprisingly, for a constant discount factor, the bound on the number of iterations
is polynomial [132, 38].
Chapter 7
7.1 Definition
We consider a stationary (time-invariant) MDP, with a finite state space S, finite
action set A, a transition kernel P = {p(s0 |s, a)}, and rewards r(s, a).
Stochastic Shortest Path is an important class of planning problems, where the
time horizon is not set beforehand, but rather the problem continues until a certain
event occurs. This event can be defined as reaching some goal state. Let SG ⊂ S
define the set of goal states.
Definition 7.1 (Termination time). Define the termination time as the random vari-
able
τ = inf{t ≥ 0 : st ∈ SG },
the first time in which a goal state is reached, or infinity otherwise.
We shall make the following assumption on the MDP, which states that for any
policy, we will always reach a goal state in finite time.
Assumption 7.1. The state space is finite, and for any policy π, we have that τ < ∞
with probability 1.
For the case of positive rewards, Assumption 7.1 guarantees that the agent cannot
get 'stuck in a loop' and obtain infinite reward. This is similar to the assumption of
no negative cycles in deterministic shortest paths. When the rewards are negative,
the agent will be driven to reach the goal state as quickly as possible, and in principle,
Assumption 7.1 could be relaxed. We will keep it nonetheless, as it significantly
simplifies our analysis.
The total expected return for the Stochastic Shortest Path problem is defined as:

V_{ssp}^π(s) = E^{π,s}(∑_{t=0}^{τ−1} r(s_t, a_t) + r_G(s_τ)).

Here r_G(s), s ∈ S_G, specifies the reward at goal states. Note that the expectation is
taken also over the random length of the run τ.
To simplify the notation, in the following we will assume a single goal state
S_G = {s_G}, and that r_G(s_τ) = 0. We therefore write the value as

V_{ssp}^π(s) =   E^{π,s}(∑_{t=0}^τ r(s_t, a_t)),   s ≠ s_G,
                 0,                                 s = s_G.        (7.1)
Our objective is to find a policy that maximizes V_{ssp}^π(s). Let π^* be the optimal policy
and let V_{ssp}^*(s) be its value, which is the maximal value from each state s.
7.2.1 Finite Horizon Return
Stochastic shortest path includes, naturally, the finite horizon case. This can be
shown by creating a leveled MDP where at each time step we move to the next level
and terminate at level T. Specifically, we define a new state space S 0 = S × T. For
any s ∈ S, action a ∈ A and time i ∈ T we define a transition function p0 ((s0 , i +
1)|(s, i), a) = p(s0 |s, a), and goal states SG = {(s, T) : s ∈ S}. Clearly, Assumption
7.1 is satisfied here.
Proof. From Assumption 7.1, every state s ≠ s_G is transient. For any i, j ∈ S \ {s_G}
let q_{i,j} = Pr(s_t = j for some t ≥ 1 | s_0 = i). Since state j is transient we have
q_{j,j} < 1. Let Z_{i,j} be the number of times the trajectory visits state j when
starting from state i. Note that Pr(Z_{i,j} = k) = q_{i,j} q_{j,j}^{(k−1)} (1 − q_{j,j}) for k ≥ 1.
Therefore the expected number of visits to
state j when starting from state i is q_{i,j}/(1 − q_{j,j}), which is finite.
We can write the value function as

V_{ssp}^π(s) = ∑_{s'∈S\{s_G}} E[Z_{s,s'}] r_π(s') < ∞,

and therefore

V_{ssp}^π = ∑_{t=0}^∞ (P_π)^t r_π.

Now, consider equation (7.2). By unrolling the right hand side and noting that
lim_{t→∞} (P_π)^t = 0 (because the states are transient) we obtain

V = r_π + P_π V = r_π + P_π r_π + (P_π)^2 V = · · · = ∑_{t=0}^∞ (P_π)^t r_π = V_{ssp}^π.

We have thus shown that the linear equation (7.2) has a unique solution V_{ssp}^π, and so
the claim follows.
Remark 7.1. At first sight, it seems that Equation 7.2 is simply Bellman’s equation for
the discounted setting (6.2), just with γ = 1. The subtle yet important differences are
that Equation 7.2 considers states S \ sG , and Proposition 7.1 requires Assumption
7.1 to hold, while in the discounted setting the discount factor guaranteed that a
solution exists for any MDP.
Algorithm 10 Value Iteration (for SSP)
1: Initialization: Set V_0 = (V_0(s))_{s∈S\{s_G}} arbitrarily, V_0(s_G) = 0.
2: For n = 0, 1, 2, . . .
3:    Set V_{n+1}(s) = max_{a∈A} { r(s, a) + ∑_{s'∈S\{s_G}} p(s' | s, a) V_n(s') },   ∀s ∈ S \ {s_G}.

Note that V_n(s) is the n-stage return with terminal reward V_0(s_n),

V_n(s) = max_π E^{π,s}(∑_{t=0}^{n−1} R_t + V_0(s_n)).
Since any policy reaches the goal state with probability 1, and after reaching the goal
state the agent stays at the goal and receives 0 reward, we can write the optimal
value function as

V_{ssp}^*(s) = max_π E^{π,s}(∑_{t=0}^τ R_t) = max_π E^{π,s}(∑_{t=0}^∞ R_t),

where the last equality holds since Assumption 7.1 guarantees that with probability 1
the goal state will be reached, and from that time onwards the agent receives 0
reward.
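A small sketch of Algorithm 10 on a toy SSP instance (our own illustration, not part of the text; the transition numbers are made up, the goal state is kept implicit as the missing probability mass in each row, and the per-step reward of −1 turns the problem into a shortest-path objective):

    import numpy as np

    # Two non-goal states {0, 1}; the goal state is implicit (probability mass
    # leaving each row below goes to the goal and earns no further reward).
    # P[a, s, s'] restricted to non-goal states; r[s, a] = -1 per step.
    P = np.array([[[0.5, 0.3], [0.0, 0.6]],    # action 0
                  [[0.2, 0.2], [0.1, 0.1]]])   # action 1
    r = -np.ones((2, 2))

    V = np.zeros(2)
    for n in range(200):
        V = (r + np.einsum('ast,t->sa', P, V)).max(axis=1)
    print(V)   # converges since every policy reaches the goal with probability 1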
Algorithm 11 Policy Iteration (SSP)
1: Initialization: choose some stationary policy π_0.
2: For k = 0, 1, 2, . . .
3:    Policy Evaluation: Compute V^{π_k}.
4:       (For example, use the explicit formula V^{π_k} = (I − P_{π_k})^{−1} r_{π_k}.)
5:    Policy Improvement: Compute π_{k+1}, a greedy policy with respect to V^{π_k}:
6:       π_{k+1}(s) ∈ arg max_{a∈A} { r(s, a) + ∑_{s'∈S\{s_G}} p(s' | s, a) V^{π_k}(s') },   ∀s ∈ S \ {s_G}.
7:    If π_{k+1} = π_k (or if V^{π_k} satisfies the optimality equation)
8:       Stop
Theorem 7.3 (Convergence of policy iteration for SSP). The following statements
hold:
1. Each policy πk+1 is improving over the previous one πk , in the sense that
V πk+1 ≥ V πk (component-wise).
Definition 7.2. For a fixed stationary policy π : S → A, define the Fixed Policy DP
Operator T^π : R^{|S|−1} → R^{|S|−1} as follows: For any V = (V(s)) ∈ R^{|S|−1},

(T^π(V))(s) = r(s, π(s)) + ∑_{s'∈S\{s_G}} p(s' | s, π(s)) V(s'),   ∀s ∈ S \ {s_G}.
In the discounted MDP setting, we relied on the discount factor to show that the
DP operators are contractions. Here, we will use Assumption 7.1 to show a weaker
contraction-type result.
For any policy π (not necessarily stationary), Assumption 7.1 means that Pr(s_{|S|} =
s_G | s_0 = s) > 0 for all s ∈ S, since otherwise the Markov chain corresponding to π
would have a state that does not communicate with s_G. Let

ε = min_π min_s Pr(s_{|S|} = s_G | s_0 = s),

which is well defined since the space of policies is compact. Therefore, we have that
for a stationary Markov policy π,

∑_j [(P_π)^{|S|}]_{i,j} ≤ 1 − ε,   ∀i ∈ S \ {s_G},      (7.3)

where P_π denotes the transition matrix restricted to the states S \ {s_G}.
From these results, we have that both (P_π)^{|S|} and ∏_{k=1,...,|S|} P_{π_k} are (1 − ε)-contractions.
We are now ready to show the contraction property of the DP operators.

Theorem 7.4. Let Assumption 7.1 hold. Then (T^π)^{|S|} and (T^*)^{|S|} are (1 − ε)-contractions.

Proof. The proof is similar to the proof of Theorem 6.9, and we only describe the
differences. For T^π, note that

((T^π)^{|S|}(V_1))(s) − ((T^π)^{|S|}(V_2))(s) = ((P_π)^{|S|} [V_1 − V_2])(s),

and use the fact that (P_π)^{|S|} is a (1 − ε)-contraction to proceed as in Theorem 6.9.
For (T^*)^{|S|}, note that

((T^*)^{|S|}(V_1))(s) = max_{a_0,...,a_{|S|−1}} [ r(s, a_0) + ∑_{s'} Pr(s_1 = s' | s_0 = s, a_0) r(s', a_1)
        + ∑_{s'} Pr(s_2 = s' | s_0 = s, a_0, a_1) r(s', a_2) + . . .
        + ∑_{s'} Pr(s_{|S|} = s' | s_0 = s, a_0, . . . , a_{|S|−1}) V_1(s') ].

To show ((T^*)^{|S|}(V_1))(s) − ((T^*)^{|S|}(V_2))(s) ≤ (1 − ε) ‖V_1 − V_2‖_∞, let ā_0, . . . , ā_{|S|−1}
denote actions that attain the maximum in ((T^*)^{|S|}(V_1))(s). Then proceed similarly
as in the proof of Theorem 6.9, and use the fact that ∏_{k=1,...,|S|} P_{π_k} is a (1 − ε)-contraction.
Remark 7.2. While T π and T ∗ are not necessarily contractions in the sup-norm,
they can be shown to be contractions in a weighted sup-norm; see, e.g., [13]. For
our discussion here, however, the fact that (T π )|S| and (T ∗ )|S| are contractions will
suffice.
Theorem 7.5 (Bellman’s Optimality Equation for SSP). The following statements
hold:
1. V_{ssp}^* is the unique solution of the following set of (nonlinear) equations:

V(s) = max_{a∈A} { r(s, a) + ∑_{s'∈S\{s_G}} p(s' | s, a) V(s') },   ∀s ∈ S \ {s_G}.      (7.5)
Sketch Proof of Theorem 7.5: The proof is similar to the proof of the discounted
setting, but we cannot use Theorem 6.8 directly as we have not shown that T ∗ is a
contraction. However, a relatively simple extension of the Banach fixed point theo-
rem holds also when (T ∗ )k is a contraction, for some integer k (see, e.g., Theorem 2.4
in [65]). Therefore the proof follows, with Theorem 7.2 replacing Theorem 6.6.
Chapter 8
8.1 Background
A Linear Program (LP) is an optimization problem that involves minimizing (or
maximizing) a linear objective function subject to linear constraints. A standard
form of an LP is

minimize b^⊤ x,   subject to Ax ≥ c, x ≥ 0,      (8.1)

where x = (x_1, x_2, . . . , x_n)^⊤ is a vector of real variables arranged as a column vector.
The set of constraints is linear and defines a convex polytope in R^n, namely a closed
and convex set U that is the intersection of a finite number of half-spaces. The set U
has a finite number of vertices, which are points that cannot be generated as a convex
combination of other points in U. If U is bounded, it equals the convex hull
of its vertices. It can be seen that an optimal solution (if finite) will be at one of
these vertices.
The LP problem has been extensively studied, and many efficient solvers exist.
In 1947, Dantzig introduced the Simplex algorithm, which essentially moves greedily
along neighboring vertices. In the 1980s, effective algorithms (interior point and
others) were introduced which have polynomial time guarantees.
One of the most important notions in linear programming is duality, which in many
cases allows one to gain insight into the solutions of a linear program. The following is the
definition of the dual LP.
The two dual LPs have the same optimal value, and (in many cases) the solution
of one can be obtained from that of the other. The common optimal value can be
understood by the following computation:
the dual is

maximize c^⊤ y,   subject to A^⊤ y = b, y ≥ 0.
Representing a policy: The first step is to decide how to represent a policy, then
compute its expected return, and finally, maximize over all policies. Given a policy
π(a|s) we have seen how to compute its expected return by solving a set of linear
equations (see Lemma 5.5 in Section 5.4.2). However, we are interested in
representing a policy in a way which will allow us to maximize over all policies.
The first natural attempt is to write variables which represent a deterministic
policy, since we know that there is a deterministic optimal policy. We can have a
variable z(s, a) for each action a ∈ A and state s ∈ S. The variable will represent
whether in state s we perform action a. This can be represented by the constraints
z(s, a) ∈ {0, 1} and ∑_a z(s, a) = 1 for every s ∈ S. Given z(s, a) we define a policy
π(a|s) = z(s, a).
One issue that immediately arises is that the Boolean constraints z(s, a) ∈ {0, 1}
are not linear. We can relax the deterministic policies to stochastic policies and have
z(s, a) ≥ 0 and ∑_a z(s, a) = 1. Given z(s, a) we still define a policy π(a|s) = z(s, a),
but now in each state we have a distribution over actions.
The next step is to compute the return of the policy as a linear function. The
main issue is that in order to compute the return of a policy from
state s we need to also compute the probability that the policy reaches the state s.
This probability can be computed by summing, over all states s' and actions a', the probability of
reaching state s' times the probability of performing action a' in state s' times the transition probability, i.e.,
q(s) = ∑_{s',a'} q(s') z(s', a') p(s | s', a'), where q(s) is the probability of reaching state s.
The issue is that both q(·) and z(·, ·) are variables, and therefore the
resulting constraint is not linear in the variables.
There is a simple fix: we can define x(s, a) = q(s) z(s, a), namely, x(s, a)
is the probability of reaching state s and performing action a. Given x(s, a) we
can define a policy π(a|s) = x(s, a)/∑_{a'} x(s, a'). For the finite horizon return, since we are
interested in Markov policies, we will add an index for the time and have x_t(s, a) as
the probability that at time t we are in state s and perform action a. Recall that in
Section 3.2.4 we saw that a sufficient set of parameters is P^{h^0_{t−1}}[a_t = a, s_t = s] =
E_{h^0_{t−1}}[I[s_t = s, a_t = a] | h^0_{t−1}], where h^0_{t−1} = (s_0, a_0, . . . , s_{t−1}, a_{t−1}). We are essentially
using those same parameters here.
The variables: For each time t ∈ T = {0, . . . , T}, state s and action a we will
have a variable xt (s, a) ∈ [0, 1] that indicates the probability that at time t we are
at state s and perform action a. For the terminal states s we will have a variable
xT (s) ∈ [0, 1] that will indicate the probability that we terminate at state s.
The feasibility constraints: Given that we have decided on the representation x_t(s, a),
we now need to define the set of feasible solutions for these variables. The simple
constraints are the non-negativity constraints, i.e., x_t(s, a) ≥ 0 and x_T(s) ≥ 0.
Our main set of constraints will need to impose the dynamics of the MDP. We
can view the feasibility constraints as flow constraints, stating that the probability
mass that leaves state s at time t is equal to the probability mass reaching state
s from time t − 1. Formally,

Σ_a x_t(s, a) = Σ_{s',a'} x_{t−1}(s', a') p_{t−1}(s|s', a').
The objective: Given the variables x_t(s, a) and x_T(s) we can write the expected
return, which we would like to maximize, as

Σ_{t,s,a} r_t(s, a) x_t(s, a) + Σ_s r_T(s) x_T(s).

The main observation is that the expected return depends only on the probabilities
of being at time t in state s and performing action a.
Primal LP: Combining the above, the resulting linear program is the following.

max_{x_t(s,a), x_T(s)}   Σ_{t,s,a} r_t(s, a) x_t(s, a) + Σ_s r_T(s) x_T(s)

such that

Σ_a x_t(s, a) ≤ Σ_{s',a'} x_{t−1}(s', a') p_{t−1}(s|s', a')   ∀s ∈ S_t, t ∈ T
x_T(s) ≤ Σ_{s',a'} x_{T−1}(s', a') p_{T−1}(s|s', a')   ∀s ∈ S_T
x_t(s, a) ≥ 0   ∀s ∈ S_t, a ∈ A, t ∈ {0, . . . , T − 1}
Σ_a x_0(s_0, a) = 1
x_0(s, a) = 0   ∀s ∈ S_0, s ≠ s_0
Remarks: First, note that we replaced the flow equalities with inequalities. In the
optimal solution, since we are maximizing and the rewards are non-negative,
these flow inequalities will hold with equality.

Second, note that we do not explicitly impose the upper bound x_t(s, a) ≤ 1, although it
clearly holds in any feasible solution. While we do not impose it explicitly,
it is implicit in the linear program. To observe this, let Φ(t) = Σ_{s,a} x_t(s, a).
From the initial conditions we have that Φ(0) = 1. When we sum the flow condition
(first inequality) over the states we have that Φ(t) ≤ Φ(t − 1). This implies that
Φ(t) ≤ 1. Again, in the optimal solution these values are maximized and we will
have Φ(t) = Φ(t − 1).
Dual LP: Given the primal linear program we can derive the dual linear program.

min_{z_t(s)}   z_0(s_0)

such that

z_T(s) = r_T(s)   ∀s ∈ S_T
z_t(s) ≥ r_t(s, a) + Σ_{s'} z_{t+1}(s') p_t(s'|s, a)   ∀s ∈ S_t, a ∈ A, t ∈ T
z_t(s) ≥ 0   ∀s ∈ S_t, t ∈ T

One can identify the dual variables z_t(s) with the optimal value function
V_t(s). At the optimal solution of the dual linear program one can show that

z_t(s) = max_a { r_t(s, a) + Σ_{s'} z_{t+1}(s') p_t(s'|s, a) }   ∀s ∈ S_t, t ∈ T,

which are the finite-horizon optimality equations.
We will start with the primal linear program, which will compute the optimal
policy. In the finite horizon return we had for each time t state s and action a a
variable xt (s, a). In the discounted return we will consider stationary policies, so we
will drop the dependency on the time t. In addition we will replace the probabilities
by discounted fraction of time. Namely, for each state s and action a we will have
a variable x(s, a) that will indicate the discounted fraction of time we are at state s
and perform action a.
To better understand what we mean by the discounted fraction of time, consider
a fixed stationary policy π and a trajectory (s_0, . . .) generated by π. Define the
discounted time of state-action (s, a) in the trajectory as X^π(s, a) = Σ_t γ^t I(s_t = s, a_t = a),
which is a random variable. We are interested in x^π(s, a) = E[X^π(s, a)],
which is the expected discounted fraction of time policy π is in state s and performs
action a. This discounted fraction of time will be very handy in defining the
objective as well as defining the flow constraints.
Given the discounted fraction of time values x(s, a) for every s ∈ S and a ∈ A, we
essentially have all the information we need. First, the discounted fraction of time
that we are in a state s ∈ S is simply x(s) = Σ_{a∈A} x(s, a). We can recover a policy
that generates those discounted fractions of time by setting

π(a|s) = x(s, a) / Σ_{a'∈A} x(s, a').
All this is under the assumption that the discounted fraction of time values x(s, a)
were generated by some policy. However, in the linear program we will need to
guarantee that those values are indeed feasible, namely, can be generated by the
given dynamics. For this we will introduce feasibility constraints.
The feasibility constraints: As in the finite horizon case, our main constraints will
be flow constraints, stating that the discounted fraction of time we reach state s
equals the discounted fraction of time we exit it, times the discount factor. (We
are multiplying by the discount factor since we are moving one step into the future.)
Technically, it will be sufficient to use only an upper bound, and in the optimal
solution, maximizing the expected return, there will be an equality. Formally, for
s ∈ S,

Σ_a x(s, a) ≤ γ Σ_{s',a'} x(s', a') p(s|s', a') + I(s = s_0).

For the initial state s_0 we add 1 to the incoming flow, since initially we start in it,
and do not reach it from another state.
Let us verify that the constraints indeed imply that when we sum over all states
and actions we get the correct value of 1/(1 − γ). If we sum the inequalities over all
states, we have

Σ_{s,a} x(s, a) ≤ γ Σ_{s',a'} x(s', a') Σ_s p(s|s', a') + 1 = γ Σ_{s',a'} x(s', a') + 1,

which implies that Σ_{s,a} x(s, a) ≤ 1/(1 − γ), as we should expect. Namely, at each
time we are in some state, therefore the sum over states should be Σ_t γ^t = 1/(1 − γ).
The objective: The discounted return, which we would like to maximize, is E[Σ_t γ^t r(s_t, a_t)].
We can regroup the sum by state and action and have

Σ_{s,a} E[ Σ_t γ^t r(s_t, a_t) I(s_t = s, a_t = a) ],

which is equivalent to

Σ_{s,a} r(s, a) E[ Σ_t γ^t I(s_t = s, a_t = a) ].

Since our variables are x(s, a) = E[Σ_t γ^t I(s_t = s, a_t = a)], the expected return is

Σ_{s,a} r(s, a) x(s, a).
Primal LP: Combining all the above, the resulting linear program is the following.

max_{x(s,a)}   Σ_{s,a} r(s, a) x(s, a)

such that

Σ_a x(s, a) ≤ γ Σ_{s',a'} x(s', a') p(s|s', a') + I(s = s_0)   ∀s ∈ S,
x(s, a) ≥ 0   ∀s ∈ S, a ∈ A.
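To make the discounted-return LP concrete, here is a minimal sketch (not from the book) that solves it with scipy.optimize.linprog on a small randomly generated MDP and recovers a policy from the occupancy measure; the array layout P[s, a, s'], R[s, a] and all names are illustrative assumptions.

# Sketch: solve the discounted-return primal LP for the occupancy measure x(s, a)
# and recover a policy pi(a|s) = x(s, a) / sum_{a'} x(s, a').
# Assumes an MDP given as arrays P[s, a, s'] (transition probabilities), R[s, a]
# (rewards), a discount factor gamma, and a start state s0.
import numpy as np
from scipy.optimize import linprog

def solve_discounted_lp(P, R, gamma, s0):
    S, A, _ = P.shape
    n = S * A                                   # one variable x(s, a) per pair
    c = -R.reshape(n)                           # maximize sum R*x  ->  minimize -R*x
    # One flow constraint per state s:
    #   sum_a x(s, a) - gamma * sum_{s', a'} x(s', a') P[s', a', s] <= I(s == s0)
    A_ub = np.zeros((S, n))
    for s in range(S):
        for sp in range(S):
            for ap in range(A):
                A_ub[s, sp * A + ap] = (1.0 if sp == s else 0.0) - gamma * P[sp, ap, s]
    b_ub = np.zeros(S)
    b_ub[s0] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    x = res.x.reshape(S, A)
    pi = x / np.maximum(x.sum(axis=1, keepdims=True), 1e-12)
    return x, pi

# Tiny random example (illustrative only).
rng = np.random.default_rng(0)
S, A, gamma, s0 = 4, 2, 0.9, 0
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
x, pi = solve_discounted_lp(P, R, gamma, s0)
print(x.sum(), "should be close to 1/(1-gamma) =", 1 / (1 - gamma))

At the optimum the flow constraints are tight, so the occupancy measure sums to 1/(1 − γ), which the printout checks.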
Dual LP: Given the primal linear program we can derive the dual linear program.

min_{z(s)}   z(s_0)

such that

z(s) ≥ r(s, a) + γ Σ_{s'} z(s') p(s'|s, a)   ∀s ∈ S, a ∈ A,
z(s) ≥ 0   ∀s ∈ S.

One can identify the dual variables z(s) with the optimal value function
V(s). At the optimal solution of the dual linear program one can show that

z(s) = max_a { r(s, a) + γ Σ_{s'} z(s') p(s'|s, a) }   ∀s ∈ S,

which are the Bellman optimality equations for the discounted return.
Chapter 9
Up until now, we have discussed planning under a known model, such as the MDP.
Indeed, the algorithms we discussed made extensive use of the model, such as it-
erating over all the states, actions, and transitions. In the remainder of this book,
we shall tackle the learning setting – how to make decisions when the model is not
known in advance, or too large for iterating over it, precluding the use of the planning
methods described earlier. Before diving in, however, we shall spend some time on
defining the various approaches to modeling a learning problem. In the next chap-
ters, we will rigorously cover some of these approaches. This chapter, similarly to
Chapter 2, is quite different than the rest of the book, as it discusses epistemological
issues more than anything else.
In the machine learning literature, perhaps the most iconic learning problem is su-
pervised learning, where we are given a training dataset of N samples, X1 , X2 , . . . , XN ,
sampled i.i.d. from some distribution, and corresponding labels Y1 , . . . , YN , generated
by some procedure. We can think of Yi as the supervisor’s answer to the question
“what to do when the input is Xi ?”. The learning problem, then, is to use this data to
find some function Y = f (X), such that when given a new sample X 0 from the data
distribution (not necessarily in the dataset), the output of f (X 0 ) will be similar to
the corresponding label Y 0 (which is not known to us). A successful machine learning
algorithm therefore exhibits generalization to samples outside its training set.
Measuring the success of a supervised learning algorithm in practice is straightfor-
ward – by measuring the average error it makes on a test set sampled from the data
distribution. The Probably Approximately Correct (PAC) framework is a common
framework for providing theoretical guarantees for a learning algorithm. A standard
PAC result gives a bound on the average error for a randomly sampled test data,
given a randomly sampled training set of size N , that holds with probability 1 − δ.
PAC results are therefore important to understand how efficient a learning algorithm
is (e.g., how the error reduces with N ).
In reinforcement learning, we are interested in learning how to solve sequential
decision problems. We shall now discuss the main learning model, why it is use-
ful, how to measure success and provide guarantees, and also briefly mention some
alternative learning models that are outside the scope of this book.
the reinforcement signal. As it turns out, RL algorithms essentially learn to solve
MDPs without requiring an explicit MDP model, and can therefore be applied even
to very large MDPs, for which the planning methods in the previous chapters do not
apply. The important insight is that if we have an RL algorithm, and a simulator of
the MDP, capable of generating r(st , at ) and st+1 ∼ p(·|st , at ), then we can run the
RL algorithm with the simulator replacing the real environment. To date, almost
all RL successes in game playing, control, and decision making have been obtained
under this setting.
Another motivation for this learning model comes from the field of adaptive con-
trol [4]. If the agent has an imperfect model of the MDP (what we called epistemic
uncertainty in Chapter 2), any policy it computes using it may be suboptimal. To
overcome this error, the agent can try and correct its model of the MDP or adapt its
policy during interaction with the real environment. Indeed, RL is very much related
to adaptive optimal control [113], which studies a similar problem.
In contrast with the supervised learning model, where measuring success was
straightforward, we shall see that defining a good RL agent is more involved, and we
shall discuss some dominant ideas in the literature.
throughout learning. A useful measure for this is the regret,

Regret(N) = Σ_{t=0}^{N} r*_t − Σ_{t=0}^{N} r(s_t, a_t),
which measures the difference between the cumulative reward the agent obtained
on the N samples and the sum of rewards that an optimal policy would have ob-
tained (with the same amount of time steps N ), denoted here as r∗t . Any algorithm
that converges to an optimal policy would have N1 Regret(N ) → 0, but we can also
compare algorithms by the rate that the average regret decreases.
Interestingly, for an algorithm to be optimal in terms of regret, it must balance
between exploration – taking actions that yield information about the MDP, and
exploitation – taking actions that simply yield high reward. This is different from
PAC, where the agent should in principle devote all the N samples for exploration.
The challenges of learning from rewards (revisited) We have already discussed the
difficulty of specifying decision making problems using a reward in the preface to the
planning, Chapter 2. In the RL model, we assume that we can evaluate the observed
interaction of the agent with environment by scalar rewards. This is easy if we have
an MDP model or simulator, but often difficult otherwise. For example, if we want
to use RL to automatically train a robot to perform some task (e.g., fold a piece of
cloth), we need to write a reward function that can evaluate whether the cloth was
folded or not – a difficult task in itself. We can also directly query a human expert for
evaluating the agent. However, it turns out that humans find it easier to rank different
interactions than to associate their performance with a scalar reward. The field of
RL from Human Feedback (RLHF) studies such evaluation models, and has been
instrumental for tuning chatbots using RL [88]. It is also important to emphasize
that in the RL model defined above, the agent is only concerned with maximizing
reward, leading to behavior that can be very different from human decision making.
As argued by Lake et al. [64] in the context of video games, humans can easily
imagine how to play the game differently, e.g., how to lose the game as quickly
as possible, or how to achieve certain goals, but such behaviors are outside the
desiderata of the standard RL problem; extensions of the RL problem include more
general reward evaluations such as ‘obtain a reward higher than x’ [108, 21], or goal-
based formulations [46], and a key question is how to train agents that generalize to
new goals.
Interactions with several MDPs can also be considered at test time, and regret bounds can capture
the tradeoff between identifying the MDPs and maximizing rewards [37]. More gen-
erally, transfer learning in RL concerns how to transfer knowledge between different
decision making problems [119, 59]. It is also possible to search for policies that work
well across many different MDPs, and are therefore robust enough to generalize to
changes in the MDP. One approach, commonly termed domain randomization, trains
a single policy on an ensemble of different MDPs [122]. Another approach optimizes
a policy for the worst case MDP in some set, based on the robust MDP formula-
tion [87]. Yet another learning setting is lifelong RL, where an agent interacts with
an MDP that gradually changes over time [57].
Chapter 10
Until now we looked at planning problems, where we are given a complete model of
the MDP, and the goal is to either evaluate a given policy or compute the optimal
policy. In this chapter we will start looking at learning problems, where we need to
learn from interaction. This chapter will concentrate on model based learning, where
the main goal is to learn an accurate model of the MDP and use it. In following
chapters we will look at model free learning, where we learn a value function or a
policy without recovering the actual underlying model.
We want the contribution of the rewards after time T to be small, i.e., γ^T Rmax/(1 − γ) ≤ ε.
This is equivalent to

T log(1/γ) ≥ log( Rmax / (ε(1 − γ)) ).

Since log(1 + x) ≤ x, we have log(γ) = log(1 − (1 − γ)) ≤ −(1 − γ), and hence
log(1/γ) ≥ 1 − γ. Therefore it is sufficient to have

T ≥ (1/(1 − γ)) log( Rmax / (ε(1 − γ)) ),

and the theorem follows.
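As a small numerical illustration (not from the text), the following computes the sufficient truncation horizon T for a few values of γ; Rmax = 1 and ε = 0.01 are arbitrary choices.

# Sufficient truncation horizon: T >= (1/(1-gamma)) * log(Rmax / (eps * (1-gamma))).
import math

def horizon(gamma, eps, Rmax=1.0):
    return math.ceil((1.0 / (1.0 - gamma)) * math.log(Rmax / (eps * (1.0 - gamma))))

for gamma in (0.9, 0.99):
    print(gamma, horizon(gamma, eps=0.01))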
Lemma 10.2 (Chernoff-Hoeffding). Let R_1, . . . , R_m be m i.i.d. samples of a random
variable R ∈ [0, 1]. Let µ = E[R] and µ̂ = (1/m) Σ_{i=1}^m R_i. For any ε ∈ (0, 1) we have

Pr[ |µ − µ̂| ≥ ε ] ≤ 2e^{−2ε²m}.

In addition,

Pr[ µ̂ ≤ (1 − ε)µ ] ≤ e^{−ε²µm/2}   and   Pr[ µ̂ ≥ (1 + ε)µ ] ≤ e^{−ε²µm/3}.
We will refer to the first bound as additive and the second set of bounds as
multiplicative.
Using the additive bound of Lemma 10.2, we have:

Corollary 10.3. Let R_1, . . . , R_m be m i.i.d. samples of a random variable R ∈ [0, 1].
Let µ = E[R] and µ̂ = (1/m) Σ_{i=1}^m R_i. Fix ε, δ > 0. Then, for m ≥ (1/(2ε²)) log(2/δ), with
probability 1 − δ, we have that |µ − µ̂| ≤ ε.
We can now use the above concentration bound in order to estimate the expected
rewards. For each state-action pair (s, a) let r̂(s, a) = (1/m) Σ_{i=1}^m R_i(s, a) be the average of
m samples. We can show the following:

Claim 10.4. Given m ≥ (R²max/(2ε²)) log(2|S| |A|/δ) samples for each state-action pair (s, a), with
probability 1 − δ we have for every (s, a) that |r(s, a) − r̂(s, a)| ≤ ε.
Proof. First, we need to scale the random variables to [0, 1], which is
achieved by dividing them by Rmax. Then, by the Chernoff-Hoeffding bound (Corollary 10.3),
using ε' = ε/Rmax and δ' = δ/(|S| |A|), we have for each (s, a), with probability at least
1 − δ/(|S| |A|), that |r(s, a)/Rmax − r̂(s, a)/Rmax| ≤ ε/Rmax.

We bound the probability over all state-action pairs using a union bound,

Pr[ ∃(s, a) : |r(s, a)/Rmax − r̂(s, a)/Rmax| > ε/Rmax ]
  ≤ Σ_{(s,a)} Pr[ |r(s, a)/Rmax − r̂(s, a)/Rmax| > ε/Rmax ]
  ≤ Σ_{(s,a)} δ/(|S| |A|) = δ.
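A minimal sketch (illustrative, not the book's code) of the sample size from Claim 10.4 together with the empirical reward estimates r̂(s, a); the sampling interface sample_reward is an assumed placeholder.

# Per-pair sample size from Claim 10.4, and the empirical reward estimates r_hat(s, a).
import math
import numpy as np

def sample_size(Rmax, eps, delta, S, A):
    return math.ceil((Rmax ** 2 / (2 * eps ** 2)) * math.log(2 * S * A / delta))

def estimate_rewards(sample_reward, S, A, m):
    # sample_reward(s, a) is assumed to return one i.i.d. reward sample in [0, Rmax].
    r_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            r_hat[s, a] = np.mean([sample_reward(s, a) for _ in range(m)])
    return r_hat

print(sample_size(Rmax=1.0, eps=0.1, delta=0.05, S=10, A=4))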
Influence of reward estimation errors: Finite horizon

Define the expected return of a policy π under the true rewards,

V^π_T(s_0) = E^{π,s_0}[ Σ_{t=0}^{T−1} r_t(s_t, a_t) + r_T(s_T) ],

and under the estimated rewards,

V̂^π_T(s_0) = E^{π,s_0}[ Σ_{t=0}^{T−1} r̂_t(s_t, a_t) + r̂_T(s_T) ].
Note that in both cases we use the true transition probabilities. For a given trajectory
σ = (s_0, a_0, . . . , s_{T−1}, a_{T−1}, s_T) we define

error(π, σ) = ( Σ_{t=0}^{T−1} r_t(s_t, a_t) + r_T(s_T) ) − ( Σ_{t=0}^{T−1} r̂_t(s_t, a_t) + r̂_T(s_T) ).
Lemma 10.5. Assume that for every (s, a) and t we have |r_t(s, a) − r̂_t(s, a)| ≤ ε
and for every s we have |r_T(s) − r̂_T(s)| ≤ ε. Then, for any policy π ∈ Π^{MS} we have
error(π) ≤ ε(T + 1).

Proof. Since π ∈ Π^{MS}, the policy π depends only on the time t and the state s_t.
Therefore, the probability of each trajectory σ = (s_0, a_0, . . . , s_{T−1}, a_{T−1}, s_T) is the same
under the true rewards r_t(s, a) and the estimated rewards r̂_t(s, a).
For each trajectory σ = (s_0, a_0, . . . , s_{T−1}, a_{T−1}, s_T), we have

|error(π, σ)| = | ( Σ_{t=0}^{T−1} r_t(s_t, a_t) + r_T(s_T) ) − ( Σ_{t=0}^{T−1} r̂_t(s_t, a_t) + r̂_T(s_T) ) |
            = | Σ_{t=0}^{T−1} ( r_t(s_t, a_t) − r̂_t(s_t, a_t) ) + ( r_T(s_T) − r̂_T(s_T) ) |
            ≤ Σ_{t=0}^{T−1} | r_t(s_t, a_t) − r̂_t(s_t, a_t) | + | r_T(s_T) − r̂_T(s_T) |
            ≤ εT + ε.

The lemma follows since error(π) = |E^{π,s_0}[error(π, σ)]| ≤ ε(T + 1), as the bound
holds for every trajectory σ.
Proof. By Lemma 10.5, for any policy π, we have that error(π) ≤ ε(T + 1). This
implies that

V^{π*}_T(s_0) − V̂^{π*}_T(s_0) ≤ error(π*) ≤ ε(T + 1)

and

V̂^{π̂*}_T(s_0) − V^{π̂*}_T(s_0) ≤ error(π̂*) ≤ ε(T + 1).

Since π̂* is optimal for the estimated rewards r̂_t we have

V̂^{π*}_T(s_0) ≤ V̂^{π̂*}_T(s_0).

Combining the three inequalities yields V^{π*}_T(s_0) − V^{π̂*}_T(s_0) ≤ 2ε(T + 1).
Influence of reward estimation errors: discounted return

Fix a stationary stochastic policy π ∈ Π^{SS}. Again, define the expected return of
policy π with the true rewards,

V^π_γ(s_0) = E^{π,s_0}[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ],

and similarly V̂^π_γ(s_0) with the estimated rewards r̂(s, a).
Computing approximate optimal policy: discounted return

We now describe how to compute a near optimal policy for the discounted return.
We need a sample of size m ≥ (R²max/(2ε²)) log(2|S| |A|/δ) for each random variable R(s, a). Given
the sample, we compute r̂(s, a). As we saw in the finite horizon case, with probability
1 − δ, we have for every (s, a) that |r(s, a) − r̂(s, a)| ≤ ε. Now we can compute the
optimal policy π̂* for the estimated rewards r̂(s, a). Again, the main goal is to show that π̂*
is a near optimal policy:

V^{π*}_γ(s_0) − V^{π̂*}_γ(s_0) ≤ 2ε/(1 − γ).

Proof. By Lemma 10.7, for any π ∈ Π^{SS} we have error(π) ≤ ε/(1 − γ). Therefore,

V^{π*}_γ(s_0) − V̂^{π*}_γ(s_0) ≤ error(π*) ≤ ε/(1 − γ)

and

V̂^{π̂*}_γ(s_0) − V^{π̂*}_γ(s_0) ≤ error(π̂*) ≤ ε/(1 − γ).

Since π̂* is optimal for r̂ we have

V̂^{π*}_γ(s_0) ≤ V̂^{π̂*}_γ(s_0).

Combining the three inequalities yields the claim.
Theorem 10.9. Let q_1 and q_2 be two distributions over S. Let f : S → [0, Fmax].
Then,

|E_{s∼q_1}[f(s)] − E_{s∼q_2}[f(s)]| ≤ Fmax ‖q_1 − q_2‖_1,

where ‖q_1 − q_2‖_1 = Σ_{s∈S} |q_1(s) − q_2(s)|.

Proof. Consider the following derivation,

|E_{s∼q_1}[f(s)] − E_{s∼q_2}[f(s)]| = | Σ_{s∈S} f(s)q_1(s) − Σ_{s∈S} f(s)q_2(s) |
  ≤ Σ_{s∈S} f(s)|q_1(s) − q_2(s)|
  ≤ Fmax ‖q_1 − q_2‖_1,

where the first identity writes the expectations explicitly, the second step is by the triangle
inequality, and the third is by bounding the values of f by the maximum possible
value.
When we measure the distance between two Markov chains M_1 and M_2, it is
natural to consider the next-state distribution of each state i, namely M[i, ·]. The
distance between the next-state distributions for state i can be measured by the L_1
norm, i.e., ‖M_1[i, ·] − M_2[i, ·]‖_1. We would like to take the worst case over states,
and define ‖M‖_{∞,1} = max_i Σ_j |M[i, j]|. The measure that we will consider is ‖M_1 −
M_2‖_{∞,1}, and we assume that ‖M_1 − M_2‖_{∞,1} ≤ α, namely, that for any state, the next-state
distributions differ by at most α in the L_1 norm.

Clearly if α ≈ 0 then the distributions will be almost identical, but we would
like to have a quantitative bound on the difference, which will allow us to derive an
upper bound on the required sample size m.
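For concreteness, a small sketch (not from the book) that computes the ‖·‖_{∞,1} distance between two transition matrices given as numpy arrays:

# ||M1 - M2||_{inf,1}: worst case over states of the L1 distance between the rows.
import numpy as np

def dist_inf_1(M1, M2):
    return np.abs(M1 - M2).sum(axis=1).max()

M1 = np.array([[0.9, 0.1], [0.5, 0.5]])
M2 = np.array([[0.8, 0.2], [0.5, 0.5]])
print(dist_inf_1(M1, M2))  # 0.2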
Theorem 10.10. Assume that ‖M_1 − M_2‖_{∞,1} ≤ α. Let q^t_1 and q^t_2 be the distributions
over states after trajectories of length t of M_1 and M_2, respectively. Then,

‖q^t_1 − q^t_2‖_1 ≤ αt.

Proof. Let p_0 be the distribution of the start state. Then q^t_1 = p_0^⊤ M_1^t and q^t_2 = p_0^⊤ M_2^t.
This implies the following two simple facts. First, let q be a distribution, i.e.,
‖q‖_1 = 1, and M a matrix such that ‖M‖_{∞,1} ≤ α. Then,

‖q^⊤ M‖_1 ≤ α.   (10.2)

Second, for any vector z and any stochastic matrix M,

‖z^⊤ M‖_1 ≤ ‖z‖_1.   (10.3)

For the induction step, let z^t = q^t_1 − q^t_2, and assume that ‖z^{t−1}‖_1 ≤ α(t − 1). We
have,

‖z^t‖_1 = ‖q^{t−1}_1 M_1 − q^{t−1}_2 M_2‖_1 = ‖q^{t−1}_1 (M_1 − M_2) + z^{t−1} M_2‖_1 ≤ α + α(t − 1) = αt,

where the last inequality is derived as follows: for the first term we used Eq. (10.2),
and for the second term we used Eq. (10.3) with the inductive claim.
Lemma 10.11. Fix α ≤ ε/(Rmax T²), and assume that model M̂ is an α-approximate model
of M. For the finite horizon return, for any policy π, we have |V^π_T(s_0; M) − V^π_T(s_0; M̂)| ≤ ε.

Proof. By Theorem 10.10 the distance between the state distributions of M and M̂
at time t is bounded by αt. Since the maximum reward is Rmax, by Theorem 10.9
the difference is bounded by Σ_{t=0}^T αtRmax ≤ αT²Rmax. For α ≤ ε/(Rmax T²) this implies that
the difference is at most ε.
We now present the simulation lemma for the discounted return case, which also
guarantees that approximate models have similar return.

Lemma 10.12. Fix α ≤ (1 − γ)²ε/Rmax, and assume that model M̂ is an α-approximate model
of M. For the discounted return, for any policy π ∈ Π^{SS}, we have

|V^π_γ(s_0; M) − V^π_γ(s_0; M̂)| ≤ ε.

Proof. By Theorem 10.10 the distance between the state distributions of M and M̂
at time t is bounded by αt. Since the maximum reward is Rmax, by Theorem 10.9
the difference is bounded by Σ_{t=0}^∞ αtRmax γ^t. The sum

Σ_{t=0}^∞ t γ^t = (γ/(1 − γ)) Σ_{t=0}^∞ t γ^{t−1}(1 − γ) = γ/(1 − γ)² < 1/(1 − γ)²,

where the last equality uses the expected value of a geometric distribution with
parameter γ. Using the bound on α implies that the difference is at most ε.
Lemma 10.13. Let p = (p_1, . . . , p_k) be a distribution over [k], and let n̂_i be the number of
occurrences of i in n i.i.d. samples from p. Then, for any λ > 0,

Pr[ Σ_{i=1}^k | n̂_i/n − p_i | ≥ λ ] ≤ 2^k e^{−nλ²/2}.

For completeness we give the proof. (The proof can also be found as Proposition
A6.6 of [126].)

Proof. Note that

Σ_{i=1}^k | n̂_i/n − p_i | = 2 max_{S⊂[k]} Σ_{i∈S} ( n̂_i/n − p_i ),

which follows by taking S = {i : n̂_i/n ≥ p_i}.

We can now apply a concentration bound (Chernoff-Hoeffding, Lemma 10.2)
for each subset S ⊂ [k], and get that the deviation is at least λ with probability at most
e^{−nλ²/2}. Using a union bound over all 2^k subsets S we derive the lemma.
The above lemma implies that to get, with probability 1 − δ, accuracy α for each
(s, a), it is sufficient to sample m = O( (|S| + log(|S| |A|/δ)) / α² ) samples for each state-action
pair (s, a). Plugging in the value of α, for the finite horizon, we have

m = O( (R²max/ε²) T⁴ (|S| + log(|S| |A|/δ)) ),

and for the discounted return

m = O( (R²max/(ε²(1 − γ)⁴)) (|S| + log(|S| |A|/δ)) ).
Assume we have a sample of size m for each (s, a). Then with probability 1 − δ we
have an α-approximate model M̂ of M. We compute an optimal policy π̂* for M̂. This
implies that π̂* is a 2ε-optimal policy, namely,

|V*(s_0) − V^{π̂*}(s_0)| ≤ 2ε.
When considering the total sample size, we need to consider all state-action pairs.
For the finite horizon, the total sample size is

m T |S| |A| = O( (R²max/ε²) |S|² |A| T⁵ log(|S| |A|/δ) ),

and for the discounted return

m |S| |A| = O( (R²max/(ε²(1 − γ)⁴)) |S|² |A| log(|S| |A|/δ) ).
We can now look at the dependence of the sample complexity on the various parameters.

1. The required sample size scales like R²max/ε², which looks like the right bound, even
for estimating the expectation of a single random variable.

3. The dependency on the number of states |S| and actions |A| is due to the fact
that we want a very accurate approximation of the next-state distributions. We need
to approximate |S|²|A| parameters, so for this task the bound is reasonable.
However, we will show that if we restrict the task to computing an approximately
optimal policy we can reduce the sample size by a factor of approximately |S|.
10.2.4 Improved sample bound: Approximate Value Iteration (AVI)
We would like to exhibit a better sample complexity for the very interesting case of
deriving an approximately optimal policy. The following approach is off-policy, but
not model based, as we will not build an explicit model M̂. Instead, the construction
and proof use the samples to approximate the Value Iteration algorithm (see
Chapter 6.6). Recall that the Value Iteration algorithm works as follows. Initially,
we set the values arbitrarily,

V_0 = {V_0(s)}_{s∈S}.

In iteration n we compute for every s ∈ S

V_{n+1}(s) = max_{a∈A} { r(s, a) + γ Σ_{s'∈S} p(s'|s, a) V_n(s') }
           = max_{a∈A} { r(s, a) + γ E_{s'∼p(·|s,a)}[V_n(s')] }.

We showed that lim_{n→∞} V_n = V*, and that the error rate is O( (γ^n/(1 − γ)) Rmax ). This
implies that if we run for N iterations, where N = (1/(1 − γ)) log( Rmax/(ε(1 − γ)) ), we have an error of
at most ε. (See Chapter 6.6.)
We would like to approximate the Value Iteration algorithm using a sample.
Namely, for each (s, a) we have a sample of size m, i.e., {(s, a, r_i, s'_i)}_{i∈[1,m]}. The
Approximate Value Iteration (AVI) update using the sample is

V̂_{n+1}(s) = max_{a∈A} { r̂(s, a) + γ (1/m) Σ_{i=1}^m V̂_n(s'_i) },

where r̂(s, a) = (1/m) Σ_{i=1}^m r_i(s, a).

The intuition is that if we have a large enough sample, AVI will approximate
Value Iteration. We set m such that, with probability 1 − δ, for every (s, a) and any
iteration n ∈ [1, N] we have

| E[V̂_n(s')] − (1/m) Σ_{i=1}^m V̂_n(s'_i) | ≤ ε'

and also

| r̂(s, a) − r(s, a) | ≤ ε'.

This holds for m = O( (V²max/ε'²) log(N|S| |A|/δ) ), where Vmax bounds the maximum value,
i.e., for the finite horizon Vmax = T Rmax and for the discounted return Vmax = Rmax/(1 − γ).
Assume that for every state s ∈ S we have

|V̂_n(s) − V_n(s)| ≤ λ.

Then

|V̂_{n+1}(s) − V_{n+1}(s)| = | max_a { r̂(s, a) + γ (1/m) Σ_{i=1}^m V̂_n(s'_i) } − max_a { r(s, a) + γ E_{s'∼p(·|s,a)}[V_n(s')] } |
  ≤ max_a | r̂(s, a) + γ (1/m) Σ_{i=1}^m V̂_n(s'_i) − r(s, a) − γ E_{s'∼p(·|s,a)}[V_n(s')] |
  ≤ max_a { | r̂(s, a) − r(s, a) | + γ | (1/m) Σ_{i=1}^m V̂_n(s'_i) − E_{s'∼p(·|s,a)}[V_n(s')] | }
  ≤ ε' + γ | (1/m) Σ_{i=1}^m V̂_n(s'_i) − E_{s'∼p(·|s,a)}[V̂_n(s')] | + γ | E_{s'∼p(·|s,a)}[V̂_n(s')] − E_{s'∼p(·|s,a)}[V_n(s')] |
  ≤ ε' + γε' + γλ.
10.3 On-Policy Learning
In the off-policy setting, when given some trajectories, we learn the model and use it
to get an approximate optimal policy. Essentially, we assumed that the trajectories
are exploratory enough, in the sense that each (s, a) has a sufficient number of
samples. In the online setting it is the responsibility of the learner to perform the
exploration. This will be the main challenge of this section.
We will consider two (similar) tasks. The first is to reconstruct the MDP to
sufficient accuracy. Given such a reconstruction we can compute the optimal policy
for it and be guaranteed that it is a near optimal policy in the true MDP. The second
is to reconstruct only the parts of the MDP which have a significant influence on the
optimal policy. In this case we will be able to show that in most time steps we are
playing a near optimal action.
Theorem 10.15. For any strongly connected DDP there is a strategy ρ which recovers
the DDP in at most O(|S|²|A|) time steps.
Proof. We first define the explored model. Given an observation set {(s_t, a_t, r_t, s_{t+1})},
we define an explored model M̃, where f̃(s_t, a_t) = s_{t+1} and r̃(s_t, a_t) = 0. For (s, a)
which do not appear in the observation set, we define f̃(s, a) = s and r̃(s, a) = Rmax.

We can now present the on-policy exploration algorithm. Initially set M̃_0 to have
f̃(s, a) = s and r̃(s, a) = Rmax for every (s, a). Initialize t = 0. At time t do the
following.

1. Compute π̃*_t ∈ Π^{SD}, the optimal policy for M̃_t, for the infinite horizon average
reward return.

2. If the return of π̃*_t on M̃_t is zero, then terminate.
We first define the optimistic observed model. Given an observation set {(s_t, a_t, r_t, s_{t+1})},
we define an optimistic observed model M̂, where f̂(s_t, a_t) = s_{t+1} and r̂(s_t, a_t) = r_t.
For (s, a) which do not appear in the observation set, we define f̂(s, a) = s and
r̂(s, a) = Rmax.

First, we claim that for any π ∈ Π^{SS} the optimistic observed model M̂ can only
increase the value compared to the true model M. Namely,

V̂^π(s; M̂) ≥ V^π(s; M).

The increase holds for any trajectory; note that once π reaches an (s, a) that was
not observed, its reward in M̂ will be Rmax forever. (This is since π ∈ Π^{SS}.)

We can now present the on-policy learning algorithm. Initially set M̂_0 to have,
for every (s, a), f̂(s, a) = s and r̂(s, a) = Rmax. Initialize t = 0. At time t do the
following.

1. Compute π̂*_t ∈ Π^{SD}, the optimal policy for M̂_t (for the infinite horizon average reward return).

2. Use a_t = π̂*_t(s_t).

3. Observe the reward r_t and the next state s_{t+1} and add (s_t, a_t, r_t, s_{t+1}) to the
observation set.

4. Modify M̂_t to M̂_{t+1} by setting for state s_t and action a_t the transition f̂(s_t, a_t) =
s_{t+1} and the reward r̂(s_t, a_t) = r_t. (Again, note that this will have an effect
only the first time we encounter (s_t, a_t).)
We can now state the convergence of the algorithm to the optimal policy.
Proof. We first claim that the model M̂_t can change at most |S| |A| times (i.e.,
M̂_t ≠ M̂_{t+1}). Each time we change the observed model M̂_t, we observe a new (s, a)
for the first time. Since there are |S| |A| such pairs, this bounds the number of
changes of M̂_t.

Next, we show that we either make a change in M̂_t during the next |S| steps or
we never make any more changes. The model M is deterministic, so if we do not change
the policy in the next |S| time steps, the policy π̂*_τ ∈ Π^{SD} reaches a cycle and continues
on this cycle forever. Hence, the model will never change.
We showed that the number of changes is at most |S| |A|, and the time between
changes is at most |S|. This implies that after time τ ≤ |S|²|A| we never change.

The return of π̂*_τ after time τ is identical in M̂_τ and M, since all the edges it
traverses are known. Therefore, V^{π̂*_τ}(s; M) = V^{π̂*_τ}(s; M̂_τ). Since π̂*_τ is the optimal
policy in M̂_τ we have that V^{π̂*_τ}(s; M̂_τ) ≥ V^{π*}(s; M̂_τ), where π* is the optimal policy
in M. By the optimism we have V^{π*}(s; M̂_τ) ≥ V^{π*}(s; M). We established that
V^{π̂*_τ}(s; M) ≥ V^{π*}(s; M), but due to the optimality of π* we have π* = π̂*_τ.
In this section we used the infinite horizon average reward, however this is not
critical. If we are interested in the finite horizon, or the discounted return, we can
use them to define the optimal policy, and the claims would be almost identical.
As in the DDP case we will maintain an explored model. Given an observation set
{(s_t, a_t, r_t, s_{t+1})}, we call a state-action pair (s, a) known if there are m times t_i,
1 ≤ i ≤ m, with s_{t_i} = s and a_{t_i} = a; otherwise it is unknown. We define the
observed distribution of a known state-action (s, a) to be the empirical next-state
distribution, p̂(s'|s, a) = |{i : s_{t_i} = s, a_{t_i} = a, s_{t_i + 1} = s'}| / m.

We define the explored model M̃ as follows. We add a new state s_1. For each
known state-action (s, a), we set the next-state distribution p̃(·|s, a) to be the observed
distribution p̂(·|s, a), and the reward to be zero, i.e., r̃(s, a) = 0. For an unknown state-action
(s, a), we define p̃(s' = s_1|s, a) = 1 and r̃(s, a) = 1. For state s_1 we have
p̃(s' = s_1|s_1, a) = 1 and r̃(s_1, a) = 0 for any action a ∈ A. The terminal reward of
any state s is zero, i.e., r̃_T(s) = 0. Note that the expected value of any policy π in
M̃ is exactly the probability that it reaches an unknown state-action pair.
We can now specify the E³ (Explicit Explore or Exploit) algorithm. The algorithm
has three parameters: (1) m, how many samples we need to change a
state-action pair from unknown to known, (2) T, the finite horizon parameter, and (3)
ε, δ ∈ (0, 1), the accuracy and confidence parameters.

Initially all state-action pairs are unknown and we set M̃ accordingly. We initialize
t = 0, and at time t do the following.

1. Compute π̃*_t, the optimal policy for M̃, for the finite horizon return with horizon
T.

2. If the expected return of π̃*_t on M̃ is at most ε/2, terminate.

3. Execute π̃*_t for T time steps.

4. Add the observed transitions (s_t, a_t, r_t, s_{t+1}) to the observation set.

5. For each (s, a) which became known for the first time, update the M̃ entries for
(s, a).
At termination we define M' as follows. For each known state-action pair (s, a),
we set the next-state distribution to be the observed distribution p̂(·|s, a), and the
reward to be the observed reward r̂(s, a). For unknown (s, a), we can define
the rewards and next-state distribution arbitrarily. For concreteness, we will use a
self-loop, p̂(s|s, a) = 1, and r̂(s, a) = Rmax.
Theorem 10.17. Let m ≥ (|S| + log(T|S| |A|/δ))/α² and α = ε/(4 Rmax T²). The E³ (Explicit Explore
or Exploit) algorithm recovers an MDP M', such that for any policy π the expected
returns on M' and M differ by at most ε(TRmax + 1), i.e.,

|V^π_{M'}(s_0) − V^π_M(s_0)| ≤ εTRmax + ε.

In addition, the expected number of time steps until termination is at most O(mT|S| |A|/ε).
Proof. We set the sample size m such that with probability 1 − δ, for every state s and
action a, the observed and true next-state distributions are α-close and the difference
between the observed and true reward is at most α. Namely, ‖p(·|s, a) − p̂(·|s, a)‖_1 ≤ α
and |r(s, a) − r̂(s, a)| ≤ α. As we saw before, by Lemma 10.13, it is sufficient to have
m ≥ c (|S| + log(T|S| |A|/δ))/α², for some constant c > 0.
Let M̃_t be the model at time t. We define an intermediate model M̃'_t to be
the model where we replace the observed next-state distributions with the true next-state
distributions for the known state-action pairs. Since the two models are α-approximate,
their expected returns differ by at most αT²Rmax ≤ ε/4.
Note that the probability of reaching some unknown state in the true model M
and in the intermediate model M̃'_t is identical. This is since the two models
agree on the known states, and once an unknown state is reached, we are done.
We will show that while the probability of reaching some unknown state in the
true model is large (larger than 0.75ε) we will not terminate. This will guarantee that
when we terminate the probability of reaching any unknown state is negligible, and
hence we can conceptually ignore such state and still be near optimal. The second
part is to show that we do terminate and bound the expected time until termination.
For this part we will show that once every policy has a low probability of reaching
some unknown state in the true model (less than 0.25ε) then we will terminate.
Assume there is a policy π that at time t in the true model M has probability at
least (3/4)ε to reach an unknown state. (Note that the sets of known and unknown
states change with t.) Recall that this implies that π has the same probability in M̃'_t.
Therefore, this policy π has probability at least (1/2)ε to reach an unknown state
in M̃_t, since M̃'_t and M̃_t are α-approximate. This implies that we will not terminate
while there is such a policy π.
Similarly, once at time t every policy π has probability at most (1/4)ε to reach an
unknown state in the true model M, we are guaranteed to terminate.
This is since the probability of π reaching an unknown state is identical in M and M̃'_t.
Since the expected returns of π in M̃'_t and M̃_t differ by at most ε/4, the probability
of π reaching an unknown state in M̃_t is at most ε/2. This is exactly our termination
condition, and we will terminate.

Assume termination at time t. At time t every policy π has probability at most
(1/2)ε to reach some unknown state in M̃_t. This implies that π has probability
of at most (3/4)ε to reach some unknown state in M.
After the algorithm terminates, we define the model M' using the observed distributions
and rewards for any known state-action pair. Since every known state-action
pair is sampled m times, we have that with probability 1 − δ the model M' is an
α-approximation of the true model M on the known state-action pairs.

When we compare |V^π_{M'}(s_0) − V^π_M(s_0)| we separate the difference due to trajectories
that include unknown states and trajectories in which all the states are
known. The contribution of trajectories with unknown states is at most εTRmax,
since the probability of reaching any unknown state is at most (3/4)ε < ε and the
maximum return is TRmax. The difference over trajectories in which all the states are
known is at most ε/4 < ε, since M and M' are α-approximate, and the selection
of α guarantees that the difference in expectation is at most ε/4 (Lemma 10.11).

In each iteration, until we terminate, we have probability at least ε/4 to
reach some unknown state-action pair. We can reach unknown state-action pairs at most
m|S| |A| times. Therefore the expected number of time steps is O(mT|S| |A|/ε).
Initialization: Initially, we set for each state-action pair (s, a) a next-state distribution
which always returns to s, i.e., p(s|s, a) = 1 and p(s'|s, a) = 0 for s' ≠ s. We set
the reward to be maximal, i.e., r(s, a) = Rmax. We mark (s, a) as unknown.

Execution: At time t: (1) build a model M̂_t, as explained below; (2) compute π̂*_t,
the optimal finite horizon policy for M̂_t, where T is the horizon; and (3) execute
π̂*_t(s_t) and observe a trajectory (s_0, a_0, r_0, s_1, . . . , s_T).

Building a model: At time t, if the number of samples of (s, a) is for the first time
at least m, then: modify p(·|s, a) to the observed transition distribution p̂(·|s, a), and
r(s, a) to the average observed reward r̂(s, a), and mark (s, a) as known. Note that
we update each (s, a) only once, when it moves from unknown to known.
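A compact sketch of the R-MAX bookkeeping and planning steps (an illustration under simplifying assumptions, not the book's pseudocode); the class interface and array layout are assumptions, and planning uses finite-horizon value iteration on the current optimistic model.

# R-MAX sketch: optimistic model (self-loop with reward Rmax for unknown pairs),
# empirical model for known pairs, finite-horizon value iteration for planning.
import numpy as np

class RMax:
    def __init__(self, S, A, T, m, Rmax):
        self.S, self.A, self.T, self.m = S, A, T, m
        self.counts = np.zeros((S, A), dtype=int)
        self.trans_counts = np.zeros((S, A, S))
        self.reward_sums = np.zeros((S, A))
        self.P = np.zeros((S, A, S))                 # initial model: self-loops
        self.P[np.arange(S), :, np.arange(S)] = 1.0
        self.R = np.full((S, A), Rmax)               # optimistic rewards
        self.known = np.zeros((S, A), dtype=bool)

    def observe(self, s, a, r, s_next):
        self.counts[s, a] += 1
        self.trans_counts[s, a, s_next] += 1
        self.reward_sums[s, a] += r
        if not self.known[s, a] and self.counts[s, a] >= self.m:
            # (s, a) becomes known: switch to the observed model (done only once).
            self.P[s, a] = self.trans_counts[s, a] / self.counts[s, a]
            self.R[s, a] = self.reward_sums[s, a] / self.counts[s, a]
            self.known[s, a] = True

    def plan(self):
        # Finite-horizon value iteration on the current model; returns a
        # time-dependent policy policy[t, s].
        V = np.zeros(self.S)
        policy = np.zeros((self.T, self.S), dtype=int)
        for t in reversed(range(self.T)):
            Q = self.R + self.P @ V                  # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
            policy[t] = Q.argmax(axis=1)
            V = Q.max(axis=1)
        return policy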
Note that there are two main differences between R-MAX and E 3 . First, when a
state-action becomes known, we set the reward to be the observed reward (and not
zero, as in E 3 ). Second, there is no test for termination, but we continuously run the
algorithm (although at some point the policy will stop changing).
Here is the basic intuition for algorithm R-MAX. We consider the finite horizon
return with horizon T. In each episode we run π̂*_t for T time steps. Either, with some
non-negligible probability, we explore a state-action pair (s, a) which is unknown, in which
case we make progress on the exploration; this can happen at most m|S| |A| times.
Alternatively, with high probability we do not reach any unknown state-action pair (s, a),
in which case we are optimal on the observed model, and near optimal on
the true model.
For the analysis define the event NEW_t, which is the event that we visit some
unknown state-action pair (s, a) during iteration t.

Claim 10.18. For the return of π̂*_t we have

V^{π̂*_t}(s_0) ≥ V*(s_0) − Pr[NEW_t] T Rmax − λ,

where λ is the approximation error between any two models which are α-approximate.
Proof. Let π* be the optimal policy in the true model M. Since we selected policy
π̂*_t as optimal for our model M̂_t, we have V^{π̂*_t}(s_0; M̂_t) ≥ V^{π*}(s_0; M̂_t).

We now define an intermediate model M̂'_t which replaces the transitions and
rewards in the known state-action pairs by the true transition probabilities and rewards.
We have that M̂'_t and M̂_t are α-approximate. By the definition of λ we have
V^{π*}(s_0; M̂_t) ≥ V^{π*}(s_0; M̂'_t) − λ. In addition, V^{π*}(s_0; M̂'_t) ≥ V^{π*}(s_0; M) = V*(s_0),
since in M̂'_t we only increased the rewards of the unknown state-action pairs, such
that when we reach them we are guaranteed maximal rewards until the end of the
trajectory.
For our policy π̂*_t we have that V^{π̂*_t}(s_0; M̂'_t) + λ ≥ V^{π̂*_t}(s_0; M̂_t), since the models
are α-approximate. In M and M̂'_t, any trajectory that does not reach an unknown
state-action pair has the same probability in both models. This implies that
V^{π̂*_t}(s_0; M) ≥ V^{π̂*_t}(s_0; M̂'_t) − Pr[NEW_t] T Rmax, since the maximum return is TRmax.
Combining all the inequalities derives the claim.
We set the sample size m such that λ ≤ ε/2.

We consider two cases, depending on the probability of NEW_t. First, consider the
case that the probability of NEW_t is small, i.e., Pr[NEW_t] ≤ ε/(2TRmax). Then
V^{π̂*_t}(s_0) ≥ V*(s_0) − ε/2 − ε/2, since we assume that λ ≤ ε/2.

Second, consider the case that the probability of NEW_t is large, i.e., Pr[NEW_t] >
ε/(2TRmax). Then there is a non-negligible probability of visiting an unknown state-action
pair (s, a), but this can happen at most m|S| |A| times. Therefore, the expected number
of such iterations is at most m|S| |A| · 2TRmax/ε. This implies the following theorem.
Theorem 10.19. With probability 1 − δ, the number of iterations in which algorithm R-MAX
is not ε-optimal, i.e., has an expected return less than V* − ε, is at most

m|S| |A| · 2TRmax/ε.
Remark: Note that we do not guarantee a termination time after which we can fix the
policy. The main technical issue is that the probability of the event
NEW_t is not monotone non-increasing: when we switch policies, we
might considerably increase the probability of reaching unknown state-action pairs.
For this reason we settle for the weaker guarantee that the number of sub-optimal
iterations is bounded. Note that in E³ we separated exploration and exploitation
with a clear transition between the two, and therefore we could terminate and
output a near-optimal policy.
The work of [24] gives a Probably Approximately Correct (PAC) bound for reinforcement
learning in the finite horizon setting: an upper bound of Õ( R²max |S|² T² |A| log(1/δ) / ε² )
and a lower bound of Ω̃( R²max |S| |A| T² / ε² ). Other PAC bounds for MDPs include [116, 117,
66].
The PhD thesis of Kakade [47] introduced the PAC-MDP model. The model
considers the number of episodes in which the expected value of the learner's policy is
more than ε away from the optimal value. The R-MAX algorithm [17] was presented
before the introduction of the PAC-MDP model, although conceptually it falls in
this category. The PAC-MDP model has been further studied in [110, 109, 69]. The
analysis of the R-MAX algorithm as a PAC-MDP algorithm appears in [110, 116].
Another line of model-based learning algorithms is based on learning the dynamics
without considering the rewards. Later, the learner can adapt to any reward function
and derive an optimal policy for it; this is also named "Best Policy Identification
(BPI)". The first work in this direction is [33], which gives an efficient algorithm
for the discounted return under a reset assumption. The Explicit Explore or Exploit
(E³) algorithm of [54] improves on this in two ways: it allows a wide range of return
functions and does not need the reset assumption.
The term "reward free exploration" is due to [45], which gives a polynomial complexity
using a reduction to online learning. The work of [51] improves the bound;
their algorithm is based on that of [33], and shows that O(|S|²|A|T⁴ log(1/δ)/ε²)
episodes suffice to learn a near optimal model. This bound was improved in [81], reducing
the dependency on the horizon from T⁴ to T³.
Chapter 11

Reinforcement Learning: Model Free
In this chapter we consider model-free learning algorithms. The main idea of model-free
algorithms is to avoid learning the MDP model directly. The model-based
methodology was the following: during learning we estimate a model of the MDP,
and later we derive the optimal policy of the estimated model. The main point was
that an optimal policy of a near-accurate MDP is a near-optimal policy in the true
MDP.
The model-free methodology is going to be different. We will never learn an
estimated model, but rather we will directly learn the value function of the MDP.
The value function can be either the Q-function (as is the case in Q-learning and
SARSA) or the V-function (as is the case in Temporal Difference (TD) algorithms
and the Monte-Carlo approach).
We will first look at the case of deterministic MDPs, and develop a Q-learning
algorithm that learns the Q-function directly from interaction with the MDP. We will
then extend our approach to general MDPs, where our handling of stochasticity will
be based on the stochastic approximation technique. We will first look at learning V π
for a fixed policy, using either temporal difference or Monte-Carlo methods, and then
look at learning the optimal Q-function, using the Q-learning and SARSA methods.
At the end of the chapter we have a few miscellaneous topics, including, evaluating
one policy while following a different policy (using importance sampling) and the
actor-critic methodology.
current state s_t, the current action a_t, the current reward r_t = r(s_t, a_t), and the
resulting next state s_{t+1} ∼ P(·|s_t, a_t). Throughout the interaction, the agent collects
transition tuples (s_t, a_t, r_t, s_{t+1}), which will effectively be the data used for learning
the MDP's value function. That is, all our learning algorithms will take as input
transition tuples, and output estimates of value functions. For some algorithms, the
time index of the tuples in the data is not important, and we shall sometimes denote the
tuples as (s, a, r, s'), understanding that both notations are equivalent.
As with any learning method, the data we learn from has substantial influence
on what we can ultimately learn. In our setting, the agent can control the data
distribution, through its choice of actions. For example, if the agent chooses actions
according to a Markov policy π, we should expect to obtain tuples that roughly
follow the stationary distribution of the Markov chain corresponding to π. If π
is very different from the optimal policy, for example, this data may not be very
useful for estimating V ∗ . Therefore, different from the supervised machine learning
methodology, in reinforcement learning the agent must consider both how to learn
from data, but also how to collect it. As we shall see, the agent will need to explore
the MDP’s state space in its data collection, to guarantee that the optimal value
function can be learned. In this chapter we shall devise several heuristics for effective
exploration. In proceeding chapters we will dive deeper into how to provably explore
effectively.
st+1 = f (st , at )
rt = r(st , at )
Recall our definition of the Q-function (or state-action value function), specialized
to the present deterministic setting:

Q*(s, a) = r(s, a) + γ V*(f(s, a)),

or, in terms of Q*:

Q*(s, a) = r(s, a) + γ max_{a'} Q*(f(s, a), a').
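A minimal sketch (not from the book) of Q-learning for a DDP, using the update Q̂_{t+1}(s_t, a_t) := r_t + γ max_{a'} Q̂_t(s_{t+1}, a') analyzed in Theorem 11.1 below; the callables f and r and the uniformly random exploration are illustrative assumptions.

# Q-learning for a deterministic decision process: each observed quadruple
# (s, a, r, s_next) triggers the update Q[s, a] = r + gamma * max_a' Q[s_next, a'].
import numpy as np

def q_learning_ddp(f, r, S, A, gamma, num_steps, rng):
    Q = np.zeros((S, A))
    s = 0
    for _ in range(num_steps):
        a = int(rng.integers(A))   # any exploration that visits every (s, a) infinitely often
        s_next = f(s, a)
        Q[s, a] = r(s, a) + gamma * Q[s_next].max()
        s = s_next
    return Q, Q.argmax(axis=1)     # learned Q-function and its greedy policy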
Theorem 11.1 (Convergence of Q-learning for DDP). Assume a DDP model. If each
state-action pair is visited infinitely often, then lim_{t→∞} Q̂_t(s, a) = Q*(s, a) for all (s, a).
Proof. The proof considers the maximum difference between Q̂_t and Q*. Let

∆_t ≜ ‖Q̂_t − Q*‖_∞ = max_{s,a} |Q̂_t(s, a) − Q*(s, a)|.

The first step is to show that after an update at time t, the difference between Q̂ and Q*
at the updated state-action pair (s_t, a_t) can be bounded by γ∆_t. This does not imply
that ∆_t shrinks, since it is the maximum over all state-action pairs. Later we
show that after we update each state-action pair at least once, the difference is
guaranteed to shrink by a factor of at least γ.

First, at every stage t:

|Q̂_{t+1}(s_t, a_t) − Q*(s_t, a_t)| = γ | max_{a'} Q̂_t(s'_t, a') − max_{a''} Q*(s'_t, a'') |
  ≤ γ max_{a'} |Q̂_t(s'_t, a') − Q*(s'_t, a')|
  ≤ γ∆_t,

where the first inequality uses the fact that |max_{x_1} f_1(x_1) − max_{x_2} f_2(x_2)| ≤ max_x |f_1(x) −
f_2(x)|, and the second inequality follows from the bound ‖Q̂_t − Q*‖_∞ = ∆_t. This
implies that the difference at (s_t, a_t) is bounded by γ∆_t, but this does not imply
that ∆_{t+1} ≤ γ∆_t, since ∆_{t+1} is the maximum over all state-action pairs.

Next, we show that eventually ∆_{t+τ} is at most γ∆_t. Consider an interval [t, t_1]
over which each state-action pair (s, a) appears at least once. Using
the above relation and a simple induction, it follows that ∆_{t_1} ≤ γ∆_t. Since each state-action
pair is visited infinitely often, there is an infinite number of such intervals,
and since γ < 1, it follows that ∆_t → 0 as t goes to infinity.
Remark 11.1. Note that the Q-learning algorithm does not need to receive a continuous
trajectory, but can receive arbitrary quadruples (s_t, a_t, r_t, s'_t). We do need that
for any state-action pair (s, a) there are infinitely many times t for which s_t = s and
a_t = a.
Remark 11.2. We could also relax the update to use a step-size α ∈ (0, 1) as follows:

Q̂_{t+1}(s_t, a_t) := (1 − α)Q̂_t(s_t, a_t) + α ( r_t + γ max_{a'} Q̂_t(s_{t+1}, a') ).

The proof follows similarly, only with the bound |Q̂_{t+1}(s_t, a_t) − Q*(s_t, a_t)| ≤ (1 − α(1 − γ))∆_t,
and it is clear that (1 − α(1 − γ)) < 1 when γ < 1. For the deterministic case, there
is no reason to choose α < 1. However, we shall see that taking smaller update steps
will be important in the non-deterministic setting.
Remark 11.3. We note that in the model based setting, if we have a single sample
for each state-action pair (s, a), then we can completely reconstruct the DDP. The
challenge in the model free setting is that we are not reconstructing the model, but
rather running a direct approximation of the value function. The DDP model is used
here mainly to give intuition to the challenges that we will later encounter in the
MDP model.
Figure 11.1: First vs. every visit example
First visit: We update every state that appears in the episode, but update it only
once. Given an episode (s_1, a_1, r_1, . . . , s_k, a_k, r_k), for each state s that appears in
the episode, we consider the first appearance of s, say s_j, and update V̂^π(s) using
G_s = Σ_{i=j}^k r_i. Namely, we compute the actual return from the first visit to state s,
and use it to update our approximation. This is clearly an unbiased estimator of the
return from state s, i.e., E[G_s] = V^π(s).
Every visit: We do an update at each step of the episode. Namely, given an episode
(s_1, a_1, r_1, . . . , s_k, a_k, r_k), for each state s_j that appears in the episode, we update
V̂^π(s_j) using G_{s_j} = Σ_{i=j}^k r_i. We compute the actual return from every visit
until the end of the episode and use it to update our approximation. Note that a state can be
updated multiple times in a single episode using this approach. We will later show
that this estimator is biased, due to the dependency between different updates of the
same state in the same episode.
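The two estimators can be sketched as follows (illustrative code, not from the book), where an episode is a list of (state, reward) pairs and rewards are undiscounted; the every-visit estimate here pools all updates and divides by their number, as discussed later in this section.

# First-visit and every-visit Monte-Carlo estimates of V^pi from undiscounted episodes.
# Each episode is a list of (state, reward) pairs.
from collections import defaultdict

def mc_estimates(episodes):
    first_sum, first_cnt = defaultdict(float), defaultdict(int)
    every_sum, every_cnt = defaultdict(float), defaultdict(int)
    for episode in episodes:
        G, returns = 0.0, []
        for s, r in reversed(episode):       # returns-to-go, computed backwards
            G += r
            returns.append((s, G))
        seen = set()
        for s, G in reversed(returns):       # forward order again
            every_sum[s] += G; every_cnt[s] += 1     # every visit: all occurrences
            if s not in seen:                        # first visit: first occurrence only
                first_sum[s] += G; first_cnt[s] += 1
                seen.add(s)
    first = {s: first_sum[s] / first_cnt[s] for s in first_cnt}
    every = {s: every_sum[s] / every_cnt[s] for s in every_cnt}
    return first, every

# The trajectory (s1, s1, s1, s1, s2) of Figure 11.1: first visit gives 4, every visit 2.5.
print(mc_estimates([[("s1", 1), ("s1", 1), ("s1", 1), ("s1", 1)]]))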
First versus Every visit: To better understand the difference between first visit
and every visit we consider the following simple test case. We have a two state
MDP, actually a Markov Chain. In the initial state s1 we have a reward of 1 and
with probability 1 − p we stay in that state and with probability p move to the
terminating state s2 . See Figure 11.1.
The expected value is V(s1 ) = 1/p, which is the expected length of an episode.
(Note that the return of an episode is its length, since all the rewards are 1.) Assume
we observe a single trajectory, (s1 , s1 , s1 , s1 , s2 ), and all the rewards are 1. What
would be a reasonable estimate for the expected return from s1 .
First visit takes the naive approach, considers the return from the first occur-
rence of s1 , which is 4, and uses this as an estimate. Every visit considers four runs
from state s1 , we have: (s1 , s1 , s1 , s1 , s2 ) with return 4, (s1 , s1 , s1 , s2 ) with return 3,
(s1 , s1 , s2 ) with return 2, and (s1 , s2 ) with return 1. Every visit averages the four
and has G = (4 + 3 + 2 + 1)/4 = 2.5. On the face of it, the estimate of 4 seems to
make more sense. We will return to this example later.
Theorem 11.2. Assume that we execute n episodes using policy π and each episode
has length at most H. Then, with probability 1 − δ, for any α-good state s, we have
|V̂^π(s) − V^π(s)| ≤ λ, assuming n ≥ (2m/α) log(2|S|/δ) and m = (H²/λ²) log(2|S|/δ).
Proof. Let p(s) be the probability that policy π visits state s in an episode. Since s is
α-good, the expected number of episodes in which s appears is p(s)n ≥ 2m log(2|S|/δ).
Using the multiplicative Chernoff-Hoeffding bound (Lemma 10.2), the probability
that we have at least m samples of state s is at least 1 − δ/(2|S|).

Given that we have at least m samples from state s, using the additive Chernoff-Hoeffding
bound (Lemma 10.2) we have with probability at least 1 − δ/(2|S|)
that |V̂^π(s) − V^π(s)| ≤ λ. (Since episodes have return in the range [0, H], we need to
normalize by dividing the rewards by H, which creates the H² term in m. A more
refined bound can be derived by noticing that the variance of the return of an episode
is bounded by H and not H², and using an appropriate concentration bound, say
Bernstein's inequality.)

Finally, the theorem follows from a union bound over the bad events.
Next, we relate the First Visit Monte-Carlo updates to the maximum likelihood
model for the MDP. Going back to the example of Figure 11.1, suppose we observe the
sequence (s_1, s_1, s_1, s_1, s_2). The only unknown parameter is p.

The maximum likelihood approach would select the value of p that maximizes
the probability of observing the sequence (s_1, s_1, s_1, s_1, s_2). The likelihood of
the sequence is (1 − p)³p, so we would like to solve for

p* = argmax_p (1 − p)³ p.

Taking the derivative we have (1 − p)³ − 3(1 − p)²p = 0, which gives p* = 1/4. For the
maximum likelihood (ML) model M we have p* = 1/4 and therefore V(s_1; M) = 4.
In general the maximum likelihood model value does not always coincide with the
First Visit Monte-Carlo estimate. However, we can make the following interesting
connection.
Clearly, when updating state s using First Visit, we ignore all the episodes
that do not include s, and also for each of the remaining episodes, that do include
s, we ignore the prefix until the first appearance of s. Let us modify the sample by
deleting those parts (episodes in which s does not appear, and for each episode that
s appears, start it at the first appearance of s). Call this the reduced sample.
Maximum Likelihood model: The maximum likelihood model, given a set of episodes,
is simply the observed model. (We will not show here that the observed model is
indeed the maximum likelihood model, but it is a good exercise for the reader to
show it.) Namely, for each state-action pair (s, a) let n(s, a) be the number of times
it appears, and let n(s, a, s') be the number of times s' is observed following the execution of
action a in state s. The observed transition model is p̂(s'|s, a) = n(s, a, s')/n(s, a).
Assuming that in the i-th execution of action a in state s we observe a reward r_i,
the observed reward is r̂(s, a) = (1/n(s, a)) Σ_{i=1}^{n(s,a)} r_i.
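A short sketch (illustrative) of computing the observed model from a list of transitions (s, a, r, s'); the array layout is an assumption.

# Observed (maximum likelihood) model from a list of transitions (s, a, r, s_next).
import numpy as np

def observed_model(transitions, S, A):
    n = np.zeros((S, A))
    n_next = np.zeros((S, A, S))
    r_sum = np.zeros((S, A))
    for s, a, r, s_next in transitions:
        n[s, a] += 1
        n_next[s, a, s_next] += 1
        r_sum[s, a] += r
    denom = np.maximum(n, 1)                 # avoid dividing by zero for unvisited pairs
    return n_next / denom[..., None], r_sum / denom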
Theorem 11.3. Let M be the maximum likelihood MDP for the reduced sample. The
expected value of s_0 in M, i.e., V(s_0; M), is identical to the First Visit estimate of
s_0, i.e., V̂^π(s_0).
Proof. Assume that we have N episodes in the reduced sample and that the sum of the
rewards in the i-th episode is G_i. The First Visit Monte-Carlo estimate is
V̂^π(s_0) = (1/N) Σ_{i=1}^N G_i.

Consider the maximum likelihood model. Since we have a fixed deterministic
policy, we can ignore actions, and define n(s) = n(s, π(s)) and r̂(s) = r̂(s, π(s)).
We set the initial state s_0 to be the state we are updating.

We want to compute the expected number of visits µ(s) to each state s in the
ML model M. We will show that µ(s) = n(s)/N. This implies that the expected
return from state s_0 in M is

V^π(s_0; M) = Σ_v µ(v) r̂(v) = Σ_v (n(v)/N) (1/n(v)) Σ_{i=1}^{n(v)} r_i^v = (1/N) Σ_{j=1}^N G_j,

where the last equality follows by changing the order of summation (from states to
episodes).

It remains to show that µ(s) = n(s)/N. We have the following identities. For
v ≠ s_0:

µ(v) = Σ_u p̂(v|u) µ(u),

and for s_0:

µ(s_0) = 1 + Σ_u p̂(s_0|u) µ(u).

Note that n(v) = Σ_u n(u, v) for v ≠ s_0 and n(s_0) = N + Σ_u n(u, s_0), and recall that
p̂(v|u) = n(u, v)/n(u). One can verify the identities by plugging in these values.
Consider the Every Visit updates of a fixed state s: let n_i denote the number of visits
to s in episode i, and let G_{i,j} denote the return from the j-th visit to s in episode i.
One way to average the updates is to average the updates within each episode and then
average across episodes. Namely,

(1/N) Σ_{i=1}^N (1/n_i) Σ_{j=1}^{n_i} G_{i,j}.

An alternative approach is to sum the updates and divide by the total number of updates,

( Σ_{i=1}^N Σ_{j=1}^{n_i} G_{i,j} ) / ( Σ_{i=1}^N n_i ).

We will use the latter scheme, but it is worthwhile understanding the difference
between the two. Consider for example the case where we have 10 episodes: in 9 of them we
have a single visit to s and a return of 1, and in the 10-th we have 11 visits to s and
all the returns are zero. The first averaging would give an estimate of 9/10 while the
second would give an estimate of 9/20.
Consider the case of Figure 11.1. For a single episode of length k, the sum of the
rewards over all updates is k(k + 1)/2, since there are updates of lengths k, . . . , 1,
and recall that the return equals the length since all rewards are 1. The number of
updates is k, so the estimate from a single episode is (k + 1)/2. Taking the expectation
we have E[(k + 1)/2] = (1/p + 1)/2, which is different
from the expected value 1/p. (Recall that Every Visit updates k times using
the values k, . . . , 1. In addition, E[k] = 1/p, which is also the expected value.) If we have
a single episode then both averaging schemes are identical.

When we have multiple episodes, we can see the difference between the two averaging
schemes. The first will average biased random variables with expectation E[(k+1)/2] = (1/p+1)/2,
so it will converge to this value rather than 1/p. The second scheme, which we will
use in Every Visit updates, has a bias that decreases with the number of episodes.
The reason is that we sum separately the returns and the number of occurrences.
This implies that we have

E[k(k + 1)/2] = (E[k²] + E[k])/2 = 1/p²,

since E[k²] = 2/p² − 1/p. Hence the ratio of the expected total return to the expected
number of updates is (1/p²)/(1/p) = 1/p, and if we average many episodes we will get
an almost unbiased estimate using Every Visit.
We did all this on the example of Figure 11.1, but it indeed generalizes. Given
an arbitrary episodic MDP, consider the following mapping. For each episode, mark
the places where state s appears (the state whose value we want to approximate). We
now have a distribution of rewards for going from s back to s. Since we are in an
episodic MDP, we also have to terminate; for this we can add another state, to
which we transition from state s, with reward distribution given by the rewards
from the last appearance of s until the end of the episode. This gives the
two-state MDP described in Figure 11.2.

Figure 11.2: The situated agent
For this MDP, the value is V π (s1 ) = 1−p
p
r1 + r2 . The single episode expected
1−p
estimate of Every Visit is V π (s1 ) = 2p r1 + r2 . The m episodes expected estimate
m 1−p
of Every Visit is V (s1 ) = m+1 p
r1 + r2 . This implies that if we have a large
number of episodes the bias of the estimate becomes negligible. (For more details,
see Theorem 7 in [106].)
P
and the total squared error is SE = s SE(s).
Our goal is to select a value Vb se (s) for every state, which would minimize the SE.
The minimization is achieved by minimizing the square error of each s, and setting
151
the values
P
i,j:s=si,j Gi,j
Vb se (s) = ,
|(i, j) : s = si,j |
We can also use the Monte-Carlo methodology to learn the optimal policy. The main
idea is to learn the Qπ function. This is done by simply updating for every (s, a).
(The updates can be either Every Visit or First Visit.) The problem is that we
need the policy to be “exploring”, otherwise we will not have enough information
about the actions the policy does not perform.
For the control, we can maintain an estimates of the Qπ function, where the
current policy is π. After we have a good estimate of Qπ we can switch to a policy
which is greedy with respect to Qπ . Namely, each time we reach a state s, we select
a “near-greedy” action, for example, use ε-greedy.
We will show that updating from one ε-greedy policy to another ε-greedy policy,
using policy improvement, does increase the value of the policy. This will guarantee
that we will not cycle, and eventually converge.
Recall that an ε-greedy policy, can be define in the following way. For every state
s there is an action ās , which is the preferred action. The policy does the following:
(1) with probability 1 − ε selects action ās . (2) with probability ε, selects each action
a ∈ A, with probability ε/|A|.
Assume we have an ε-greedy policy π1 . Compute Qπ1 and define π2 to be ε-greedy
with respect to Qπ1 .
Theorem 11.4. For any ε-greedy policy π1 , the ε-greedy improvement policy π2 has
V π2 ≥ V π1 .
Proof. Let ās = arg maxa Qπ1 (s, a) be the greedy action w.r.t. Qπ1 . We now lower
152
bound the value of Qπ2 .
X
Ea∼π2 (·|s) [Qπ1 (s, a)] = π2 (a|s)Qπ1 (s, a)
a∈A
ε X π1
= Q (s, a) + (1 − ε)Qπ1 (s, ās )
|A| a∈A
ε X π1 X π1 (a|s) − ε/|A|
≥ Q (s, a) + (1 − ε) Qπ1 (s, a)
|A| a∈A a∈A
1 − ε
X
π1 π1
= π1 (a|s)Q (s, a) = V (s)
a∈A
The inequality follows, since we are essentially concentrating of the action that π1 (·|s)
selects with probability 1 − ε, and clearly ās , by definition, guarantees a higher value.
It remains to show, similar to the basic policy improvement, that we have
V π2 (s) ≥ max Qπ1 (s, a) ≥ Ea∼π2 (·|s) [Qπ1 (s, a)] ≥ V π1 (s).
a
T π2 (V π1 ) = Tε∗ (V1π ) ≥ T π1 (V π1 ) = V π1
(T π2 )k (V π1 ) ≥ (T π2 )k−1 (V π1 ) ≥ · · · ≥ V π1
153
2. Does not assume the environment is Markovian
Going back to the Q-learning algorithm in Section 11.2, we see that Monte-Carlo
methods do not use the bootstrapping idea, which can mitigate the first two draw-
backs, by updating the estimates online, before an episode is over. In the following
we will develop bootstrapping based methods for model-free learning in MDPs. To
facilitate our analysis of these methods, we shall first describe a general framework
for online algorithms.
154
a fixed point, while the second relates Eq. 11.1 to an ordinary differential equation
(ODE), and looks at convergence to stable equilibrium points of the ODE. While
the technical details of each approach are different, the main idea is similar: in both
cases we will choose step sizes that are large enough such that the expected update
converges, yet are small enough such that the noise terms do not take the iterates
too far away from the expected behavior. The contraction method will be used to
analyse the model-free learning algorithms in this section, while the ODE method
will be required for the analysis in later chapters, when function approximation is
introduced.
This can be seen as a special case of the general stochastic approximation form in
Eq. 11.1, where f (X) = H(X) − X.
We call aP sequence of learning rates {αt (s, a)} is well P
formed if For every (s, a)
we have (1) t αt (s, a)I(st = s, at = a) = ∞, and (2) t αt2 (s, a)I(st = s, at =
a) = O(1).
We will mainly look at (B, γ) well behaved iterative algorithms, where B > 0 and
γ ∈ (0, 1), which have the following properties:
1. Step size: sequence of learning rates {αt (s, a)} is well formed.
2. Noise: E[ωt (s)|ht−1 ] = 0 and |ωt (s)| ≤ B, where ht−1 is the history up to time
t.
155
We will not give a proof of this important theorem, but we will try to sketch the
main proof methodology.
There are two distinct parts to the iterative algorithms. The part (HXt ) is
contracting, in a deterministic manner. If we had only this part (say, ωt = 0 always)
then the contraction property of H will give the convergence (as we saw before in
Remark 11.2). The main challenge is the addition of the stochastic noise ωt . The
noise is unbiased, so on average the expectation is zero. Also, the noise is bounded
by a constant B. This implies that if we average the noise over a long time interval,
then the average should be very close to zero.
The proof considers the kXt − X ∗ k, and works in phases. In phase i, at any time
t in the phase we have kXt − X ∗ k ≤ λi . In each phase we have a deterministic
contraction using the operator H. The deterministic contraction implies that the
space contracts by a factor γ < 1. Taking into account the step size αi , following
Remark 11.2, let γ̃i = (1 − αi (1 − γ)) < 1. We have to take care of the stochastic
noise. We make the phase long enough so that the average of the noise is less than
λi (1 − γ̃i )/2 factor. This implies that the space contracts by λi+1 ≤ γ̃i λi + (1 −
γ̃i )λi /2 = λi (1 + γ̃i )/2 < λi . To complete our proof, we need to show that the
decreasing sequence λi convergesQto zero. Without loss of generality, let λ0 = 1.
Then we need to evaluate λ∞ = ∞ 1−γ
i=0 1 − α i 2 . We have
∞ !! ∞ ! ∞
!
Y 1−γ X 1−γ 1−γ X
exp log 1 − αi = exp log 1 − αi ≈ exp − αi ,
2 2 2
i=0 i=0 i=0
which converges to zero for a well-behaved algorithm, due to the first step size rule.
d
θ(t) = f (θ(t)),
dt
or θ̇ = f (θ).
2
We provide an introduction to ODEs in Appendix B.
156
Given {Xt , αt }, we define a continuous-time process θ(t) as follows. Let
t−1
X
tt = αk .
k=0
Define
θ(tt ) = Xt ,
and use linear interpolation in-between the tt ’s.
Thus, the time-axis t is rescaled according to the gains {αt }.
θn
θ0 θ2
θ1
θ3
n
0 1 2 3
θ (t)
t
t0 t1 t2 t3
α0 α1 α2
Note that over a fixed ∆t, the “total gain” is approximately constant:
X
αk ' ∆t ,
k∈K(t,∆t)
where K(t, ∆t) = {k : t ≤ tk < t + ∆t}. Plugging in the update of Eq. 11.1, we have
X
θ(t + ∆t) = θ(t) + αt [f (Xt ) + ωt ] .
t∈K(t,∆t)
We now make two observations about the terms in the sum above:
157
1. For large t, αt becomes small and the summation
P is over many terms; thus the
noise term is approximately “averaged out”: αt ωt → 0.
2. For small ∆t, Xt is approximately constant over K(t, ∆t) : f (Xt ) ' f (θ(t)).
We thus obtain:
θ(t + ∆t) ' θ(t) + ∆t · f (θ(t)),
and rearranging gives,
θ(t + ∆t) − θ(t)
' f (θ(t)).
∆t
For ∆t → 0, this reduces to the ODE:
θ̇(t) = f (θ(t)).
158
Remark 11.4. More generally, even if the ODE is not globally stable, Xt can be
shown to converge to an invariant set of the ODE (e.g., a limit cycle).
Remark 11.5. A major assumption in the last result is the boundedness of (Xt ).
In general this assumption has to be verified independently. However, there exist
several results that rely on further properties of f to deduce boundedness, and hence
convergence. One technique is to consider the function fc (θ) = f (cθ)/c, c ≥ 1. If
fc (θ) → f∞ (θ) uniformly, one can consider the ODE with f∞ replacing f [16]. In
particular, for a linear f , we have that fc = f , and this result shows that boundedness
is guaranteed. We make this explicit in the following theorem.
159
example of such a system. In Chapter 12, we will encounter a similar case when
establishing convergence of an RL algorithm with linear function approximation.
Example 11.1. Consider the following linear recurrence equation in R2 , where for
simplicity we omit the noise term
Xt+1 = Xt + αt AXt ,
where A ∈ R2×2 . Clearly, X ∗ = [0, 0] is a fixed point. Let X0 = [0, 1], and con-
−0.9 −0.9
sider two different values of the matrix A, namely, Acontraction = , and
0 −0.9
−3 −3
Ano-contraction = . Note that the resulting operator H(X) = X + AX can be
2.1 1.9
0.1 −0.9 −2 −3
either Hcontraction = , and Hno-contraction = . It can be verified
0 0.1 2.1 2.9
that kHcontraction Xk < kXk for any X 6= 0. However, note that kHno-contraction X0 k =
k[−3, 2.9]k > k[0, 1]k, therefore Hno-contraction is not a contraction in the Euclidean
norm (nor in any other weighted p-norm).
The next plot shows the evolution of the recurrence when starting from X0 , for a
constant step size αt = 0.2. Note that both iterates converge to X ∗ , as it is an
asymptotically stable fixed point for the ODE Ẋ = AX for both values of A. How-
ever, the iterates for Hcontraction always reduce the distance to X ∗ , while the iterates
for Hno-contraction do not. Thus, for Hno-contraction , only the ODE method would have
worked for showing convergence.
160
11.5 Temporal Difference algorithms
In this section we will look at temporal differences methods, which work in an online
fashion. We will start with T D(0) which uses only the most recent observations for
the updates, and we will continue with methods that allow for a longer look-ahead,
and then consider T D(λ) which averages multiple look-ahead estimations.
In general, temporal differences (TD) methods, learn directly from experience,
and therefore are model-free methods. Unlike Monte-Carlo algorithms, they will use
incomplete episodes for the updates, and they are not restricted to episodic MDPs.
The TD methods update their estimates given the current observation and in that
direction, similar in spirit to Q-learning and SARSA.
11.5.1 TD(0)
Fix a policy π ∈ ΠSD , stationary and deterministic. The goal is to learn the value
function V π (s) for every s ∈ S. (The same goal as Monte-Carlo learning.) The
TD algorithms will maintain an estimate of the value function of the policy π, i.e.,
maintain an estimate Vbt (s) for V π (s). The TD algorithms will use their estimates Vb
for the updates. This implies that unlike Monte-Carlo, there will be an interaction
between the estimates of different states and at different times.
161
As a starting point, we can recall the value iteration algorithm.
E π [Vbt (st )] = E π [rt + γ Vbt (st+1 )] = E π [r(s, a) + γ Vbt (s0 )|s = st , a = π(s)].
The T D(0) will do an update in this direction, namely, [rt + γ Vbt (st+1 )].
∆t = rt + γ Vb (st+1 ) − Vb (st )
We would like to compare the T D(0) and the Monte-Carlo (MC) algorithms. Here
is a simple example with four states S = {A, B, C, D} where {C, D} are terminal
states and in {A, B} there is one action (essentially, the policy selects a unique
action). Assume we observe eight episodes. One episode is (A, 0, B, 0, C), one episode
(B, 0, C), and six episodes (B, 1, D). We would like to estimate the value function
of the non-terminal states. For V (B) both T D(0) and M C will give 6/8 = 0.75.
The interesting question would be: what is the estimate for A? MC will average
only the trajectories that include A and will get 0 (only one trajectory which gives 0
reward). The T D(0) will consider the value from B as well, and will give an estimate
162
Figure 11.3: TD(0) vs. Monte-Carlo example
of 0.75. (Assume that the T D(0) continuously updates using the same episodes until
it converges.)
We would like to better understand the above example. For the above example
the empirical MDP will have a transition from A to B, with probability 1 and reward
0, from B we will have a transition to C with probability 0.25 and reward 0 and a
transition to D with probability 0.75 and reward 1. (See, Figure 11.3.) The value of
A in the empirical model is 0.75. In this case the empirical model agrees with the
T D(0) estimate, we show that this holds in general.
The following theorem states that the value of the policy π on the maximum
likelihood model (Definition 11.1), which is the empirical model, is identical to that
of T D(0) (running on the sample until convergence, namely, continuously sampling
uniformly t ∈ [1, T ] and using (st , ar , rt , st+1 ) for the T D(0) update).
Theorem 11.8. Let VTπD be the estimated value function of π when we run T D(0)
π
until convergence. Let VEM be the value function of π on the empirical model. Then,
π π
VT D = VEM .
Proof sketch. The update of T D(0) is Vb (st ) = Vb (st ) + αt (st , at )∆t , where ∆t =
rt + γ Vb (st+1 ) − Vb (st ). At convergence we have E[∆t ] = 0 and hence,
1 X
Vb (s) = r(s, a) + γEs0 ∼bp(·|s,a) [Vb (s0 )]
rt + γ Vb (st+1 ) = b
n(s, a) s :s =s,a =a
t+1 t t
where a = π(s).
It is worth to compare the above theorem to the case of Monte Carlo (Theo-
rem 11.3). Here we are using the entire sample, and we have the same ML model for
163
any state s. In the Monte-Carlo case we used a reduced sample, which depends on
the state s and therefore we have a different ML model for each state, based on its
reduced sample.
Theorem 11.9 (Convergence T D(0)). If the sequence of learning rates {αt (s, a)} is
well formed then Vb converges to V π , with probability 1.
We will show the convergence using the general theorem for stochastic approxi-
mation iterative algorithm (Theorem 11.5).
We first define a linear operator H for the policy π,
X
(Hv)(s) = r(s, π(s)) + γ p(s0 |s, π(s))v(s0 )
s0
Note that H is the operator T π we define in Section 6.4.3. Theorem 6.9 shows that
the operator H is a γ-contracting.
We now would like to re-write the T D(0) update to be a stochastic approximation
iterative algorithm. The T D(0) update is,
The requirement of the step sizes follows since they are well formed. The noise ωt
has both E[ωt |ht−1 ] = 0 and |ωt | ≤ Vmax . The operator H is γ-contracting with a
fix-point V π . Therefore, using Theorem 11.5, we established Theorem 11.9.
164
Figure 11.4: Markov Reward Chain
Comparing T D(0) and M C algorithms: 3 We can see the difference between T D(0)
and M C in the Markov Chain in Figure 11.4. To get an approximation of state s2 ,
i.e., |Vb (s2 ) − 12 | ≈ ε. The Monte-Carlo will require O(1/(βε2 )) episodes (out of which
only O(1/ε2 ) start at s2 ) and the T D(0) will require only O(1/ε2 + 1/β) since the
estimate of s3 will converge after 1/ε2 episodes which start from s1 .
Algorithm 14 Q-learning
1: Initialize: Set Q0 (s, a) = 0, for all s, a.
2: For t = 0, 1, 2, . . .
3: Observe: (st , at , rt , s0t ).
4: Update:
h i
0 0
Qt+1 (st , at ) := Qt (st , at ) + αt (st , at ) rt + γ max
0
Q (s
t t , a ) − Q (s ,
t t t a )
a
3
YM: needed to check if the example is from Sutton
165
It is worth to try and gain some intuition regarding the Q learning algorithm. Let
Γt = rt +γ maxa0 Qt (s0t , a0 )−Qt (st , at ). For simplicity assume we already converged,
Qt = Q∗ . Then we have that E[Γt ] = 0 and (on average) we maintain that Qt = Q∗ .
Clearly we do not want to assume that we converge, since this is the entire goal of the
algorithm. The main challenge in showing the convergence is that in the updates we
use Qt rather than Q∗ . We also need to handle the stochastic nature of the updates,
where there are both stochastic rewards and stochastic next state.
The next theorem states the main convergence property of Q-learning.
Theorem 11.10 (Q-learning convergence).
Assume every state-action pair (s, a) occurs infinitely often, and the sequence of
learning rates {αt (s, a)} is well formed. Then, Qt converges with probability 1 to Q∗
Note that the statement of the theorem has two requirements. The first is that
every state-action pair occurs infinitely often. This is clearly required for convergence
(per state-action). Since Q-learning is an off-policy algorithm, it has no influence
on the sequence of state-action it observes, and therefore we have to make this
assumption. The second requirement is two properties regarding the learning rates
α. The first states that the learning rates are large enough that we can (potentially)
reach a value. The second states that the learning rates are sufficiently small (sum
of squares finite) so that we will be able to converge locally.
We will show the convergence proof by the general technique of stochastic ap-
proximation.
≤ γkq1 − q2 k∞ .
166
In this section we re-write the Q-learning algorithm to follow the iterative stochas-
tic approximation algorithms, so that we will be able to apply Theorem 11.5.
Recall that,
Qt+1 (st , at ) := (1 − αt (st , at ))Qt (st , at ) + αt (st , at )[rt + γ max
0
Qt (s0t , a0 )]
a
Let Φt = rt + γ maxa0 Qt (st+1 , a0 ). This implies that E[Φt ] = (HQt )(st , at ). We can
define the noise term as ωt (st , at ) = Φt −(HQt )(st , at ) and have E[ωt (st , at )|ht−1 ] =
0. In addition |ωt (st , at )| ≤ Vmax = R1−γ
max
.
We can now rewrite the Q-learning, as follows,
Qt+1 (st , at ) := (1 − αt (st , at ))Qt (st , at ) + αt (st , at )[(HQt )(st , at ) + ωt (st , at )]
In order to apply Theorem 11.5, we have the properties of the noise ωt , and of
the contraction of H. Therefore, we can derive Theorem 11.10, since the step size
requirement is part of the theorem.
167
PN
1. Linear step
P∞ size: α t (s, a) = 1/n(s,
P∞ a). We have that n=1 1/n = ln(N ) and
2 2
therefore n=1 1/n = ∞. Also, n=1 1/n = π /6 = O(1)
a) = 1/(n(s, a))θ . We
2. PolynomialPstep size: For θ ∈ (1/2, 1) we have αt (s,P
have that N θ
n=1 1/n ≈ (1 − θ) N
−1 1−θ
and therefore ∞ θ
n=1 1/n = ∞. Also,
P ∞ 2θ 1
n=1 1/n ≤ 2θ−1 , since 2θ > 1.
The linear step size, although many times popular in practice, might lead to slow
converges. Here is a simple example. We have a single state s and single action a
and r(s, a) = 0. However, suppose we start with Q0 (s, a) = 1. We will analyze the
convergence with the linear step size. Our update is,
1 1 1−γ
Qt = (1 − )Qt−1 + [0 + γQt−1 ] = (1 − )Qt−1
t t t
When we solve the recursion we get that Qt = Θ(1/t1−γ ).4 This implies that for
t ≤ (1/ε)1/(1−γ) we have Qt ≥ ε.
In contrast, if we use a polynomial step size, we have,
1 1 1−γ
Qt = (1 − θ
)Qt−1 + θ [0 + γQt−1 ] = (1 − )Qt−1
t t tθ
1−θ
When we solve the recursion we get that Qt = Θ(e−(1−γ)t ). This implies that for
1
t ≥ 1−γ log1/(1−θ) (1/ε) we have Qt ≤ ε. This is a poly-logarithmic dependency on
ε, which is much better. Also, note that θ is under our control, and we can set for
example θ = 2/3. Note that unlike θ, the setting of the discount factor γ has a huge
influence on the objective function and the effective horizon.
168
The specific algorithm that we present is called SARSA. The name comes from
the fact that the feedback we observe (st , at , rt , st+1 , at+1 ), ignoring the subscripts
we have SARSA. Note that since it is an on-policy algorithm, the actions are actually
under the control of the algorithm, and we would need to specify how to select them.
When designing the algorithm we need to think of two contradicting objectives
in selecting the actions. The first is the need to explore, perform each action in-
finitely
P often. This implies that we need, for each state s and action a, to have
that t πt (a|s) = ∞. Then by the Borel-Cantelli lemma we have with probability
1 an infinite number of times that we select action a in state s (actually, we need
independence of the events, or at least a Martingale property, which holds in our
case). On the other hand we would like not only our estimates to converge, as done
in Q-learning, but also the return to be near optimal. For this we need the action
selection to converge to being greedy with respect to the Q function.
Algorithm 15 SARSA
1: Initialize: Set Q0 (s, a) = 0, for all s, a.
2: For t = 0, 1, 2, . . .
3: Observe: (st , at , rt , st+1 ).
4: Select at+1 = π(st+1 ; Qt ).
5: Update:
Selecting the action: As we discussed before, one of the main tasks of an on-policy
algorithm is to select the actions. It would be natural to select the action is state st
as a function of our current approximation Qt of the optimal Q function.
Given a state s and a Q function Q, we first define the greedy action in state s
according to Q as
ā = arg max Q(s, a)
a
The first idea might be to simply select the greedy action ā, however this might be
devastating. The main issue is that we might be avoiding exploration. Some actions
might look better due to errors, and we will continue to execute them and not gain
any information about alternative actions.
For a concrete example, assume we initialize Q0 to be 0. Consider an MDP with a
single state and two actions a1 and a2 . The reward of action a1 and a2 are a Bernoulli
random variables with parameters 1/3 and 3/4, respectively. If we execute action a1
169
first and get a reward of 1, then we have Q1 (s, a1 ) > 0 and Q1 (s, a2 ) = 0. If we select
the greedy action, we will always select action a1 . We will both be sub-optimal in
the return and never explore a2 which will result that we will not converge to Q∗ .
For this reason we would not select deterministically the greedy action.
In the following we will present two simple ways to select the action by π(s; Q)
stochastically. Both ways will give all actions a non-zero probability, and thus guar-
antee exploration.
The εn -greedy, has as a parameter a sequence of εn and selects the actions as
follows. Let nt (s) be the number of times state s was visited up to time t. At
time t in state s policy εn -greedy (1) with probability 1 − εn sets π(s; Q) = ā where
n = nt (s), and (2) with probability εn /|A|, selects π(s; Q) = a, for each a ∈ A.
Common values for εn are linear, εt = 1/n, or polynomial, εt = 1/nθ for θ ∈ (0.5, 1).
The soft-max, has as a parameter a sequence of βt ≥ 0 and selects π(s; Q) = a, for
βt Q(s,a)
each a ∈ A, with probability P 0 e eβt Q(s,a0 ) . Note that for βt = 0 we get the uniform
a ∈A
distribution and for βt → ∞ we get the maximum. We would like the schedule of
the βt to go to infinity (become greedy) but need it to be slow enough (so that each
action appears infinitely often).
Therefore, T ∗, will converge to a fix-point Q∗, . Now we want to relate the fix-point
of Q∗, to the optimal Q∗ , which is a fixed point of T ∗ . For Q∗, , since it is the fix
point of T ∗, , we have
" #
X
Q∗, (s, a) =r(s, a) + γEs0 ∼p(·|s,a) Q∗, (s0 , b0 ) + (1 − ) max Q∗, (s0 , a0 )
|A| b0 ∈A a0 ∈A
170
For Q∗ , since it is the fix point of T ∗ , we have
∆ ≤ γ∆ + γVmax
Proof. First, we show that for any state s we have V ∗ (s) − Q∗ (s, π(s)) ≤ 2∆.
Since kQ − Q∗ k∞ ≤ ∆ we have |Q∗ (s, π(s)) − Q(s, π(s))| ≤ ∆ and |Q∗ (s, a∗ ) −
Q(s, a∗ )| ≤ ∆, where a∗ is the optimal action in state s. This implies that Q∗ (s, a∗ )−
Q∗ (s, π(s)) ≤ Q(s, a∗ ) − Q(s, π(s)) + 2∆. Since policy π is greedy w.r.t. Q we have
Q(s, π(s)) ≥ Q(s, a∗ ), and hence V ∗ (s)−Q∗ (s, π(s)) = Q∗ (s, a∗ )−Q∗ (s, π(s)) ≤ 2∆.
Next,
171
where r0 = E[R(s, π(s))] and s1 is the state reached when doing action π(s) in state
s. As we role out to time t we have,
t−1
X t
X
∗ i ∗
V (s) ≤ E[ t
γ ri ] + γ E[V (st )] + 2∆γ i
i=0 i=1
where ri is the reward in time i in state si , si+1 is the state reached when doing
action π(si ) in state si , and we start with s0 = s. This implies that in the limit we
have
2∆
V ∗ (s) ≤ V π (s) + ,
1−γ
since V π (s) = E[ ∞ i
P
i=0 γ ri ].
The above lemma uses the greedy policy, but as we discussed before, we would
like to add exploration. We would like to claim that if ε is small, then the difference
in return between the greedy policy and the ε-greedy policy would be small. We will
show a more general result, showing that for any policy, if we add a perturbation of
ε to the action selection, then the effect on the expected return is at most O(ε).
Fix a policy π and let πε be a policy such that for any state s we have that
kπ(·|s) − πε (·|s)k1 ≤ ε. Namely, there is a policy ρ(a|s) such that πε (a|s) = (1 −
ε)π(a|s) + ερ(a|s). Hence, at any state, with probability at least 1 − ε policy πε and
policy π use the same action selection.
Lemma 11.13. Fix πε and policy π such that for any state s we have that kπ(·|s) −
πε (·|s)k1 ≤ ε. Then, for any state s we have
εγ ε
|V πε (s) − V π (s)| ≤ ≤
(1 − γ)(1 − γ(1 − ε)) (1 − γ)2
172
Since the rewards are bounded, namely, rt ∈ [0, Rmax ], the difference is maximized if
we set all the rewards to Rmax , and have
∞
X ∞
X
π πε t
V (s) − V (s) ≤ γ Rmax − (1 − ε)t γ t Rmax
t=0 t=0
Rmax Rmax
= −
1 − γ 1 − γ(1 − ε)
εγRmax
=
(1 − γ)(1 − γ(1 − ε))
We can now combine the results and claim that SARSA with ε-greedy converges
to the optimal policy. We will need that εn -greedy uses a sequence of εn > 0 such
that εn converges to zero as n increases. Call such a policy monotone εn -greedy
policy.
Theorem 11.14. For any λ > 0 there is a time τ such that at any time t > τ the
algorithm SARSA, using a monotone εn -greedy policy, plays a λ-optimal policy.
Proof. Consider the sequence εn . Since it converges to zero, there exists a value N
such that for any n ≥ N we have εn ≤ 0.25λ(1 − γ)2 .
Since we are guaranteed that each state action is sampled infinitely often, there
is a time τ1 such that each state is sampled at least N times.
Since Qt converges to Q∗ , there is a time τ2 such that for any t ≥ τ2 we have
kQt − Q∗ k∞ ≤ ∆ = 0.25λ(1 − γ).
Set τ = max{τ1 , τ2 }. By Lemma 11.13 the difference between the εn -greedy policy
and the greedy policy differs by at most 2εn /(1 − γ)2 ≤ λ. By Lemma 11.12 the
difference between the optimal and greedy policy is bounded by 2∆/(1 − γ) = λ/2.
This implies that the policies played at time t > τ are λ-optimal.
173
(n)
We can relate the ∆t to the ∆t as follows:
n−1
X
(n)
∆t = γ i ∆t+i
i=0
We can use any parameter n for the n-step look-ahead. If the episode ends before
step n we can pad it with rewards zero. This implies that for n = ∞ we have that n-
step look-ahead is simply the Monte-Carlo estimate. However, we need to select some
parameter n. An alternative idea is to simply average over the possible parameters
n. One simple way to average is to use exponential averaging with a parameter
λ ∈ (0, 1). This implies that the weight of each parameter n is (1 − λ)λn−1 .
This leads us to the T D(λ) update:
∞
X (n)
Vb (st ) = Vb (st ) + αt (1 − λ) λn−1 ∆t .
n=1
Remark: While both γ and λ are used to generate exponential decaying values, their
goal is very different. The discount parameter γ defines the objective of the MDP,
the goal that we like to maximize. The exponential averaging parameter λ is used
by the learning algorithm to average over the different look-ahead parameters, and
is selected to optimize the convergence.
The above describes the forward view of T D(λ), where we average over future
rewards. If we will try to implement it in a strict way this will lead us to wait until
the end of the episode, since we will need to first observe all the rewards. Fortunately,
174
there is an equivalent form of the T D(λ) which uses a backward view. The backward
view updates at each time step, using an incomplete information. At the end of the
episode, the updates of the forward and backward updates will be the same.
The basic idea of the backward view is the following. Fix a time t and a state
s = st . We have at time t a temporal difference ∆t = rt + γVt (st+1 ) − Vt (st ).
Consider how this ∆t affects all the previous times τ < t where sτ = s = st . The
influence is exactly (γλ)t−τ ∆t . This implies that for every such τ we can do the
desired update, however, we can aggregate all those updates to a single update. Let,
X t
X
et (s) = (γλ)t−τ = (γλ)t−τ I(sτ = s)
τ ≤t:sτ =s τ =1
The above et (s) defines the eligibility trace and we can compute it online using
Note that for T D(0) we have that λ = 0 and the eligibility trace becomes et (s) =
I(s = st ). This implies that we update only st and Vbt+1 (st ) = Vbt (st ) + αt ∆t .
T D(λ) algorithm
175
∆0 ∆1 ∆2 ∆3 ∆4 ∆5 ∆6 ∆7
s0 = s 1 λγ (λγ) (λγ) (λγ) (λγ) (λγ)6
2 3 4 5
(λγ)7
s2 = s 1 λγ (λγ)2 (λγ)3 (λγ)4 (λγ)5
s5 = s 1 λγ (λγ)2
e0 (s) e1 (s) e2 (s) e3 (s) e4 (s) e5 (s) e6 (s) e7 (s)
Figure 11.5: An example for T D(λ) updates of state s that occurs at times 0, 2
and 5. The forward update appear the rows. Each column is the coefficients of the
update of ∆i , and their sum equals ei (s).
∞
X ∞
X
∆VtB (s) = ∆VtF (s)I(st = s)
t=0 t=0
6
YM: should we move to the Harm van Seijen, Richard S. Sutton: True Online TD(lambda).
ICML 2014: 692-700
176
Proof. Consider the sum of the forward updates for state s:
∞
X ∞
X ∞
X (n)
∆VtF (s) = α(1 − λ) λn−t ∆t I(s = st )
t=0 t=0 n=t
∞
X ∞
X n
X
= α(1 − λ) λn−t γ i ∆t+i I(s = st )
t=0 n=t i=0
∞ X
X ∞ X
n
= α(1 − λ)λn−k λk−t γ k−t ∆k I(s = st )
t=0 n=0 k=t
X∞ X ∞ ∞
X
k−t
= α(γλ) ∆k I(s = st ) (1 − λ)λi
t=0 k=t i=0
X∞ X ∞
= α(γλ)k−t ∆k (s)I(s = st ) (11.2)
t=0 k=t
(n)
where
Pn the first identity is the definition, the second identity follows since ∆t =
i
i=0 γ ∆t+i , in the third identity we substitute k for t + i and sum over n, k and t,
in the forth identity we substitute i for Pn − k and isolate the terms that depend on
i, and in the last identity we note that ∞ i
i=0 (1 − λ)λ = 1.
For the backward view for state s we have
∞
X ∞
X
∆VtB (s) = α∆t (s)et (s) (11.3)
t=0 t=0
X∞ t
X
= α∆t (s) (γλ)t−k I(s = st )
t=0 k=0
∞ X
X ∞
= α(γλ)t−k ∆t (s)I(s = st ) (11.4)
k=0 t=k
Note that if we interchange k and t in Eq. (11.2) and in Eq. (11.4), then we have
the identical expressions.
11.5.8 SARSA(λ)
We can use the idea of eligibility traces also in other algorithms, such as SARSA.
Recall that given (st , at , rt , st+1 , at+1 ) the update of SARSA is
177
(n) Pn−1 i
Similarly, we can define an n-step look-ahead qt = i=0 γ rt+i + γ n Qt (st+n , at+n )
(n)
and set Qt+1 (st , at ) = Qt (st , at ) + αt (qt − Qt (st , at )).
We can now define SARSA(λ) using exponential averaging with parameter λ.
Namely, we define qtλ = (1 − λ) ∞ n−1 (n)
P
n=1 λ qt . This makes the forward view of
SARSA(λ) to be Qt+1 (st , at ) = Qt (st , at ) + αt (qtλ − Qt (st , at )).
Similar to T D(λ), we can define a backward view using eligibility traces:
e0 (s, a) = 0
et (s, a) = γλet−1 (s, a) + I(s = st , a = at )
11.6 Miscellaneous
11.6.1 Importance Sampling
Importance sampling is a simple general technique to estimate the mean with respect
to a given distribution, while sampling from a different distribution. To be specific,
let Q be the sampling distribution and P the evaluation distribution. The basic idea
is the following
X X P (x) P (x)
Ex∼P [f (x)] = P (x)f (x) = Q(x) f (x) = Ex∼Q [ f (x)]
x x
Q(x) Q(x)
This implies that given a sample {x1 , . . . , xm } from Q, we can estimate Ex∼P [f (x)]
using m
P P (xi )
i=1 Q(xi ) f (xi ). The importance sampling gives an unbiased estimator, but
the variance of the estimator might be huge, since it depends on P (x)/Q(x).
We would like to apply the idea of importance sampling to learning in MDPs.
Assume that there is a policy π that selects the actions, and there is a policy ρ that
we would like to evaluate. For the importance sampling, given a trajectory, we need
to take the ratio of probabilities under ρ and π.
T
ρ(s1 , a1 , r1 , . . . , sT , aT , rT , sT +1 ) Y ρ(at |st )
=
π(s1 , a1 , r1 , . . . , sT , aT , rT , sT +1 ) t=1 π(at |st )
178
where the equality follows since the reward and transition probabilities are identical,
and cancel.
For Monte-Carlo, the estimates would be
T T
ρ/π
Y ρ(at |st ) X
G = ( rt )
t=1
π(at |st ) t=1
and we have
Vb ρ (s1 ) = Vb ρ (s1 ) + α(Gρ/π − Vb ρ (s1 ))
This updates might be huge, since we are multiplying the ratios of many small
numbers.
For the T D(0) the updates will be
and we have
ρ/π
Vb ρ (s1 ) = Vb ρ (s1 ) + α(∆t − Vb ρ (s1 ))
This update is much more stable, since we have only one factor multiplying the
observed reward.
Example 11.2. Consider an MDP with a single state and two actions (also called
multi-arm bandit, which we will cover in Chapter 14). We consider a finite hori-
zon return with parameter T. Policy π at each time selects one of the two actions
uniformly at random. The policy ρ selects action one always.
Using the Monte Carlo approach, when considering complete trajectories, only
after expected 2T trajectories we have a trajectory in which for T times action one
was selected. (Note that the update will have weight 2T .)
Using the T D(0) updates, each time action one is selected by π we can do an
update the estimates of ρ (with a factor of 2).
To compare the two approaches, consider the number of trajectories required to
get an approximation for the return of ρ. Using Monte-Carlo, we need O(T2T /2 )
trajectories, in expectation. In contrast, for T D(0) we need only O(T/2 ) trajectories.
The huge gap is due to the fact that T D(0) utilizes partial trajectories while Monte-
Carlo requires the entire trajectory to agree with ρ.
179
11.6.2 Algorithms for Episodic MDPs
Modifying the learning algorithms above from the discounted to the episodic setting
requires a simple but important change. We show it here for Q-learning, but the
extension to the other algorithms is immediate.
Note that we removed the discount factor, and also explicitly used the fact that
the value of a goal state is 0. The latter is critical for the algorithm to converge,
under the Assumption 7.1 that a goal state will always be reached.
180
Chapter 12
This chapter starts looking at the case where the MDP model is large. In the current
chapter we will look at approximating the value function. In the next chapter we
will consider learning directly a policy and optimizing it.
When we talk about a large MDP, it can be due to a few different reasons. The
most common is having a large state space. For example, Backgammon has over 1020
states, Go has over 10170 and robot control typically has a continuous state space.
The curse of dimensionality is a common term for this problem, and relates to states
that are composed of several state variables. For example, the configuration of a
robot manipulator with N joints can be described using N variables for the angles
at each joint. Assuming that each variable can take on M different values, the size
of the state space, M N , i.e., grows exponentially with the number of state variables.
Another dimension is the action space, which can even be continuous in many
applications (say, robots). Finally, we might have complex dynamics which are hard
to describe succinctly (e.g., the next state is the result of a complex simulation), or
are not even known to sufficient accuracy.
Recall Bellman’s dynamic programming equation,
( )
X
V(s) = max r(s, a) + γ p(s0 |s, a)V(s0 ) ∀s ∈ S.
a∈A
s0 ∈S
Dynamic programming requires knowing the model and is only feasible for small
problems, where iterating over all states and actions is feasible. The model-free and
model-based learning algorithms described in Chapters 11 and 10 do not require
knowing the model, but require storing either value estimates for each state and
181
action, or state transition probabilities for every possible state, action, and next
state. Scaling up our planning and RL algorithms to very large state and action
spaces is the challenge we shall address in this chapter.
182
We mention that the approaches above are not mutually exclusive, and often in
practice, the best performance is obtained by combining different approaches. For
example, a common approach is to combine a T step lookahead with an approximate
terminal value function,
"t+T−1 #
0
X
π(st ) = argmax Eπ b ∗ (st+T )) .
r(st0 ) + V
π 0 ∈Π
t0 =t
We shall also see, in the next chapter, that value function approximations will be a
useful component in approximate policy optimization. In the rest of this chapter,
we focus on value function approximation. We will consider (mainly) the discounted
return with a discount parameter γ ∈ (0, 1). The results extend very naturally to
the finite horizon and episodic settings.
where w ∈ Rd are the model parameters and φ(s) ∈ Rd are the model’s features
(a.k.a. basis functions). Similarly, for state-action value functions, we use state-
action features, φ(s, a) ∈ Rd , and approximate the value by Q bπ (s, a; w) = wT φ(s, a).
Popular example of state feature vectors include radial basis functions φj (s) ∝
(s−µ )2
exp( σj j ), and tile features, where φj (s) = 1 for a set of states Aj ⊂ S, and
φj (s) = 0 otherwise. For state-action features, when the number of actions is finite
A = {1, 2, . . . , |A|}, a common approach is to extend the state features independently
for every action. That is, consider the following construction for φ(s, i) ∈ Rd·|A| ,
i ∈ A:
183
where 0 is a vector of d zeros.
For most interesting problems, however, designing appropriate features is a dif-
ficult problem that requires significant domain knowledge, as the structure of the
value function may be intricate. In the following, we assume that the features φ(s)
(or φ(s, a)) are given to us in advance, and we will concentrate on general methods
for calculating the weights w in a way that minimizes the approximation error as best
as possible, with respect to the available features.
b 0 )]].
b(s) = arg max[r(s, a) + γEs0 ∼p(·|s,a) [V(s
π
a
Proof. Consider two operators T π and T ∗ (see Chapter 6.4.3). The first, T π , is
and it converges to V ∗ (see Theorem 6.9). In addition, recall that we have shown
that both T π and T ∗ are γ-contracting (see Theorem 6.9).
b is greedy with respect to V
Since π b = T ∗V
b we have T πb V b (but this does not hold
for other value functions V 0 6= V).
b
184
Then,
kV πb − V ∗ k∞ = kT πb V πb − V ∗ k∞
≤ kT πb V πb − T πb Vk b − V ∗ k∞
b ∞ + kT πb V
b ∞ + kT ∗ V
≤ γkV πb − Vk b − T ∗ V ∗ k∞
≤ γkV πb − Vk b − V ∗ k∞
b ∞ + γkV
≤ γ(kV πb − V ∗ k∞ + kV ∗ − Vk b − V ∗ k∞ ,
b ∞ ) + γkV
where in the second inequality we used the fact that since since π is greedy with
respect to V b = T ∗ V.
b then T πb V b
Reorganizing the inequality and recalling that kV ∗ − Vk
b ∞ ≤ ε, we have
(1 − γ)kV πb − V ∗ k∞ ≤ 2εγ,
185
We first need to discuss how to sample the states si in an i.i.d. way. We can generate
a trajectory, but we need to be careful, since adjacent states are definitely dependent!
One solution is to space the sampling from the trajectory using the mixing time of
π.1 This will give us samples si which are sampled (almost) from the stationary
distribution of π and are (almost) independent. In the episodic setting, we can
sample different episodes, and states from different episodes are guaranteed to be
independent.
Second, we need to define a loss function, which will tradeoff the different approx-
imation errors. Since P the value is a real scalar, a natural candidate is the average
squared error loss, m m 1 bπ π 2
i=1 (V (si ) − V (si )) . With this loss, the corresponding
supervised learning problem is least squares regression.
The hardest, and most confusing, ingredient is the labels V π (si ). In supervised
machine leaning we assume that someone gives us the labels to build a classifier.
However, in our problem, the value function is exactly what we want to learn, and
it is not realistic to assume any ground truth samples from it!
Our main task, therefore, would be to replace the ground truth labels with quan-
tities that we can measure, using simulation or interaction with the system. We shall
start by formally defining least squares regression in a way that will be convenient
to extend later to RL.
1
See Chapter 4 for definition.
186
A practical iterative algorithm for solving (12.1) is the stochastic gradient descent
(SGD) method, which updates the parameters by
1 1 T T
ŵLS = min (Φ̂w − Ŷ )T (Φ̂w − Ŷ ) = min w (Φ̂ Φ̂)w − 2wT Φ̂T Ŷ + Ŷ T Ŷ . (12.3)
w N w N
Noting that (12.3) is a quadratic form, the least squares solution is calculated to be:
Proposition 12.2. Assume that ΦT ΞΦ is not singular. We have that limN →∞ ŵLS =
wLS , where
wLS = (ΦT ΞΦ)−1 ΦT ΞY.
1 T
Similarly, limN →∞ N
Φ̂ Ŷ = ΦT ΞY . Plugging into Eq. (12.4) completes the proof.
187
Using the stochastic approximation technique, a similar result holds for the SGD
update.
Proposition 12.3. Consider the SGD update in Eq. (12.2) with linear features,
Note that the expected LS solution can also be written as the solution to the
following expected least squares problem:
Observe that ΦwLS ∈ R|X| denotes a vector that contains the approximated function
g(x; wLS ) for every x. This is the best approximation, in terms of expected least
square error, of f onto the linear space spanned by the features φ(x). Recalling that
Y is a vector of ground truth f values, we view this approximation as a projection of
Y onto the space spanned by Φw, and we can write the projection operator explicitly
as:
Πξ Y = ΦwLS = Φ(ΦT ΞΦ)−1 ΦT ΞY.
Geometrically, Πξ Y is the vector that is closest to Y on the linear subspace
p Φw,
where the distance function is the ξ-weighted Euclidean norm, ||z||ξ = hz, ziξ ,
where hz, z 0 iξ = z T Ξz 0 .
We conclude this discussion by noting that although we derived Eq. (12.5) as the
expectation of the least square method, we could also take an alternative view: the
least squares method in (12.4) and the SGD algorithm are two different sampling
based approximations to the expected least squares solution in (12.5). We will take
this view when we develop our RL algorithms later on.
188
Figure 12.1: Example: MC vs. TD with function approximation.
Chapter
PT 11.3, simply sums the observed discounted reward from a state Rt (s) =
τ
τ =0 γ rτ , starting at the first visit of s in episode t. Clearly, we have E[Rt (s)] =
π
V (s), since samples are independent, so we can set Ut (s) = Rt (s).
For calculating the approximation, we can apply the various least squares algo-
rithms outlined above. In particular, for a linear approximation, and a large sample,
we understand that the solution will approach the projection, Φw = Πξ V π .
189
approximates V π (s3 ) and V π (s4 ) as the same value, V b3/4 . In this approximation we
effectively use the full N samples to estimate V b3/4 , resulting in variance 1/N . We can
π b π (s2 ), which will result in variance
now use bootstrapping to estimate V b (s1 ) and V
1/N + 1/(N/2) = 3/N , smaller than the MC estimate!
However, note that for ε 6= 0, the bootstrapping solution will also be biased:
taking N → ∞ we see that V b3/4 will converge to ε/2, and therefore V b π (s1 ) and
b π (s2 ) will converge to 1 + ε/2 and 2 + ε/2, respectively.
V
Thus, we see that bootstrapping, when combined with function approximation,
allowed us to reduce variance by exploiting the similarity between values of different
states, but at the cost of a possible bias in the expected solution. As it turns out,
this phenomenon is not limited to the example above, but can be shown to hold more
generally [52].
In the following, we shall develop a rigorous formulation of bootstrapping with
function approximation, and use it to suggest several approximation algorithms. We
will also bound the bias incurred by this approach.
Φw = Πξ T π {Φw}, (12.6)
where Πξ is the projection operator onto Sb under some ξ-weighted Euclidean norm.
Let us try to intuitively interpret the PBE. We are looking for an approximate
value function Φw ∈ R|S| , which by definition is within our linear approximation
190
space, such that after we apply to it T π , and project the result (which does not
necessarily belong to Sb anymore) back to S,
b we obtain the same approximate value.
Since the true value is a fixed point of T π , we have reason to believe that a fixed
point of Πξ T π may provide a reasonable approximation. In the following, we shall
investigate this hypothesis, and build on Eq. (12.6) to develop various learning al-
gorithms. We remark that the PBE is not the only way of defining an approximate
value function, and other approaches have been proposed in the literature. However,
the PBE is the basis for the most popular RL algorithms today.
3. If Πξ T π has a fixed point Φw∗ , how far is it from the best approximation
possible, namely, Πξ V π ?
Answering the first two points will characterize the approximate solution we seek.
The third point above relates to the bias of the bootstrapping approach, as described
in the example in Section 12.3.3.
Let us assume the following:
Assumption 12.1. The Markov chain corresponding to π has a single recurrent class
and no transient states. We further let
N
1 X
ξj = lim P (st = j|s0 = s) > 0,
N →∞ N
t=1
which is the probability of being in state j when the process reaches its steady state,
given any arbitrary s0 = s.
191
2. The unique fixed point Φw∗ of Πξ T π satisfies,
1
||V π − Φw∗ ||ξ ≤ ||V π − Πξ V π ||ξ , (12.7)
1−γ
and
1
||V π − Φw∗ ||2ξ ≤ ||V π − Πξ V π ||2ξ . (12.8)
1 − γ2
We remark that the bound in (12.8) is stronger than the bound in (12.7) (show
this!). We nevertheless include the bound (12.7) for didactic purpose, as it’s proof is
slightly different. Proposition 12.4 shows that for the particular projection defined
by weighting the Euclidean norm according to the stationary distribution of the
Markov chain, we can both guarantee a solution to the PBE, and bound its bias
with respect to the best solution possible under this weighting, Πξ V π . Fortunately,
we shall later see that this specific weighting is suitable for developing on-policy
learning algorithms. However, the reader should note that for a different ξ, the
conclusions of Proposition 12.4 do not necessarily hold.
P Pn 2
where the last equality is since by definition of ξi , i ξi pij = ξj , and j=1 ξj zj =
||z||2ξ .
192
We claim that J − Πξ J and Πξ J − Jb are orthogonal under h·, ·iξ (this is known as
the error orthogonality for weighted Euclidean-norm projections). To see this, recall
that
Πξ = Φ(ΦT ΞΦ)−1 ΦT Ξ,
so
ΞΠξ = ΞΦ(ΦT ΞΦ)−1 ΦT Ξ = ΠTξ Ξ.
Now,
b ξ = (J − Πξ J)T Ξ(Πξ J − J)
hJ − Πξ J, Πξ J − Ji b
= J T ΞΠξ J − J T ΞJb − J T ΠTξ ΞΠξ J + J T ΠTξ ΞJb
= J T ΞΠξ J − J T ΞJb − J T ΞΠξ Πξ J + J T ΞΠξ Jb
= J T ΞΠξ J − J T ΞΠξ J − J T ΞJb + J T Jb = 0,
||Πξ J1 −Πξ J2 ||2ξ = ||Πξ (J1 −J2 )||2ξ ≤ ||Πξ (J1 −J2 )||2ξ +||(I−Πξ )(J1 −J2 )||2ξ = ||J1 −J2 ||2ξ ,
where the first inequality is by linearity of Πξ , and the last equality is by the
Pythagorean theorem of Lemma 12.6, where we set J = J1 − J2 and Jb = 0.
In order to prove the contraction ∀J1 , J2 ∈ R|S| :
Πξ non-expansive
||Πξ T π J1 − Πξ T π J2 ||ξ ≤ ||T π J1 − T π J2 ||ξ
definition of T π Lemma 12.5
= γ||P π (J1 − J2 )||ξ ≤ γ||J1 − J2 ||ξ ,
and therefore Πξ T π is a contraction operator.
We now prove the error bound in (12.7).
||V π − Φw∗ ||ξ ≤ ||V π − Πξ V π ||ξ + ||Πξ V π − Φw∗ ||ξ
= ||V π − Πξ V π ||ξ + ||Πξ T π V π − Πξ T π Φw∗ ||ξ
≤ ||V π − Πξ V π ||ξ + γ||V π − Φw∗ ||ξ ,
193
where the first inequality is by the triangle inequality, the second equality is since
V π is T π ’s fixed point, and Φw∗ is Πξ T π ’s fixed point, and the second inequality is
by the contraction of Πξ T π . Rearranging gives (12.7).
We proceed to prove the error bound (12.8).
where the first equality is by the Pythagorean theorem, and the remainder follows
similarly to the proof of (12.7) above.
Solution approaches:
1. Matrix inversion (LSTD): We have that
w∗ = A−1 b.
194
Proposition 12.8. We have that
Es∼ξ [φ(s)r(s, π(s))] = b,
and
Es∼ξ,s0 ∼P π (·|s) φ(s)(φT (s) − γφT (s0 )) = A.
Proof. We have
X
Es∼ξ [φ(s)r(s, π(s))] = φ(s)ξ(s)r(s, π(s)) = ΦT ΞRπ = b.
s
Also,
Es∼ξ,s0 ∼P π (·|s) φ(s)(φT (s) − γφT (s0 ))
X
= ξ(s)P π (s0 |s)φ(s)(φT (s) − γφT (s0 ))
s,s0
X X X
= φ(s)ξ(s)φT (s) − γ φ(s)ξ(s) P π (s0 |s)φT (s0 )
s s s0
= Φ ΞΦ − γΦT ΞP π Φ = A.
T
7: Compute A
bN :
N
bN = 1
X
A φ(st )(φT (st ) − γφT (st+1 ))
N t=1
8: b−1bbN
Return wN = AN
195
From the ergodicity property of Markov chains (Theorem 4.9), we have the
following result.
with probability 1.
Remark 12.1. Projected value iteration can be used with more general regression
algorithm. Let Πgen denote a general regression algorithm, such as a non-
linear least squares fit, or even a non-parametric regression such as K-nearest
neighbors. We can consider the iterative algorithm:
To realize this algorithm, we use the same samples as above, and only replace
the regression algorithm. Note that convergence in this case is not guaranteed,
as in general, Πgen T π is not necessarily a contraction in any norm.
196
Algorithm 18 TD(0) with Linear Function Approximation
1: Initialize: Set w0 = 0.
2: For t = 0, 1, 2, . . .
3: Observe: (st , at , rt , st+1 ).
4: Update:
where the temporal difference term is the approximation (w.r.t. the weights at
time t) of r(st , π(st )) + γV(st+1 ) − V(st ).
This algorithm can be written as a stochastic approximation:
wt+1 = wt + αt (b − Awt + ωt ),
197
Proof. We write Eq. (12.10) as
wt+1 = wt + αt (b − Awt + ωt ),
where the noise ωt = r(st , π(st ))φ(st ) − b + (γφ(s0t )> − φ(st )> )wt φ(st ) + Awt
satisfies:
E[ωt |ht−1 ] = E[ωt |wt ] = 0,
where the first equality is since the states are drawn i.i.d., and the second is
from Proposition 12.8. We would like to use Theorem 11.7 to show convergence.
From Proposition 12.4 we already know that w∗ corresponds to the unique fixed
point of the linear dynamical system f (w) = −Aw + b. We proceed to show
that w∗ is globally asymptotically stable, by showing that the eigenvalues of A
have a positive real part. Let z ∈ R|S| . We have that
z T ΞP π z = z T Ξ1/2 Ξ1/2 P π z
≤ kΞ1/2 zkkΞ1/2 P π zk
= kzkξ kP π zkξ
≤ kzkξ kzkξ = z T Ξz.
where the first inequality is by Cauchy-Schwarz, and the second is by Lemma
12.5.
We claim that the matrix Ξ(I − γP π ) is positive definite. To see this, observe
that for any z ∈ R|S| 6= 0 we have
198
Therefore,
xT Ax + y T Ay
α= > 0.
xT x + y T y
Remark 12.3. A similar convergence result holds for the standard TD(0) of
Eq. 12.11, using a more sophisticated proof technique that accounts for noise
that is correlated (depends on the state). The main idea is to show that since
the Markov chain mixes quickly, the average noise is still close to zero with
high probability [124].
For a general (not necessarily linear) function approximation, the TD(0) algorithm
takes the form:
wn+1 = wn + αn r(sn , π(sn )) + V(s
b n+1 , wn ) − V(s
b n , wn ) ∇w V(s
b n , w).
It can be derived as a stochastic gradient descent algorithm for the loss function
b w) − V π (s)||ξ ,
Loss(w) = ||V(s,
and replacing the unknown V π (s) with a Bellman-type estimator r(s, π(s))+f (s0 , w).
199
12.4.1 Approximate Policy Iteration
The algorithm: iterate between projection of V πk onto Sb and policy improvement
via a greedy policy update w.r.t. the projected V πk .
The key question in approximate policy iteration, is how errors in the value-
function approximation, and possibly also errors in the greedy policy update, affect
the error in the final policy. The next result shows that if we can guarantee that the
value-function approximation error is bounded at each step of the algorithm, then the
error in the final policy will also be bounded. This result suggests that approximate
policy iteration is a fundamentally sound idea.
Theorem 12.11. If for each iteration k the policies are approximated well over S:
bk (s) − V πk (s)| ≤ δ,
max |V
s
max |T πk+1 V
bk − T V
bk | < ε,
s
Then
ε + 2γδ
lim sup max |V πk (s) − V ∗ (s)| ≤ .
k s (1 − γ)2
Online - SARSA
As we have seen earlier, it is easier to define a policy improvement step using the
Q function. We can easily modify the TD(0) algorithm above to learn Q bπ (s, a) =
f (s, a; w).
200
Algorithm 19 SARSA with Function Approximation
1: Initialize: Set w0 = 0.
2: For t = 0, 1, 2, . . .
3: Observe: st
4: Choose action: at
5: Observe rt , st+1
6: Update:
N
bbk = 1
X
N φ(st , at )r(st , at )
N t=1
5: Solve:
bkN )−1bbkN
wk = (A
201
It is also possible to collect data from the modified at each iteration k, instead of
from the initial policy.
Online - Q Learning
The function approximation version of online Q-learning resembles SARSA, only
with an additional maximization over the next action:
Algorithm 21 Q-learning with Function Approximation
1: Initialize: Set w0 = 0.
2: For t = 0, 1, 2, . . .
3: Observe: st
4: Choose action: at
5: Observe rt , st+1
6: Update:
b t+1 , a; wt ) − Q(s
wt+1 = wt + αt r(st , at ) + γ max Q(s b t , at ; wt ) ∇w Q(s
b t , at , w).
a
202
Figure 12.2: Two state snippet of an MDP
Batch – Fitted Q
In this approach, we iteratively project (fit) the Q function based on the projected
equation:
b n+1 ) = ΠT ∗ Q(w
Q(w b n ).
Assume we have a data set of samples {si , ai , s0i , ri },obtained from some data
collection policy. Then, the rightn hand side of the equation denotes o a regression
0
problem where the samples are: (si , ai ), ri + γ maxa Q(s b i , a; wn ) . Thus, by solv-
ing a sequence of regression problems we approximate a solution to the projected
equation.
Note that approximate VI algorithms are off-policy algorithms. Thus, in both Q-
learning and fitted-Q, the policy that explores the MDP can be arbitrary (assuming
of course it explores ‘enough’ interesting states).
203
Figure 12.3: The three state MDP. All rewards are zero.
When we use on-policy, we have all transitions. Assume that the second transition
happens n ≥ 0 times. Then we have
wt+1
= (1 + α(2γ − 1))(1 − 4α(1 − γ))n (1 − 4α) < 1 − α
wt
This implies that wt converges to zero, as desired.
Now consider an off-policy that truncates the episodes after n transitions of the
second state, where n 1/p, and in addition γ > 1 − 1/(40n). This implies that in
most updates we do not reach the terminal state and we have
wt+1
= (1 + α(2γ − 1))(1 − 4α(1 − γ))n > 1
wt
204
and therefore, for the some setting of n we have that the weight wt diverges.
We might hope that the divergence is due to the online nature of the TD updates.
We can consider an algorithm that in each iteration minimizes the square error.
Namely, X
wt+1 = arg min b t ; w) − E π [rt + γ V(s
[V(s b t+1 ; wt )]]2
w
s
205
206
Chapter 13
This chapter continues looking at the case where the MDP models are large state
space. In the previous chapter we looked at approximating the value function. In
this chapter we will consider learning directly a policy and optimizing it.
where τ is the termination time, which we will assume to bounded with probability
one. We are given a distribution over the initial state of the MDP, µ(s0 ), and
define J(θ) , E [V π (s0 )] = µ> V π to be the expected value of the policy (where the
expectation is with respect to µ).
The optimization problem we consider is:
207
This maximization problem can be solved in multiple ways. We will mainly
explore gradient based methods.
In the setting that the MDP is not known, we shall assume that we are allowed
to simulate ‘rollouts’ from a given policy, s0 , a0 , r0 , . . . , sτ , aτ , rτ , where s0 ∼ µ,
at ∼ π(·|st , θ), and st+1 ∼ P(·|st , at ). We shall devise algorithms that use such
rollouts to modify the policy parameters θ in a way that increases J(θ).
Log linear policy We will assume a feature encoding of the state and action pairs,
i.e., φ(s, a) ∈ Rd . Given the parameter θ, The linear part will compute ξ(s, a) =
φ(s, a)> θ. Given the values of ξ(s, a) for each a ∈ A, the policy selects action a with
probability proportional to eξ(s,a) . Namely,
eξ(s,a)
π(a|s, θ) = P ξ(s,b)
b∈A e
Gaussian linear policy This policy representation applies when the action space
is a real number, i.e., A = R. The encoding is of states, i.e., φ(s) ∈ Rd , and the
actions are any real number. Given a state s we compute ξ(s) = φ(s)> θ. We
select an action a from the normal distribution with mean ξ(s) and variance σ 2 , i.e.,
N (ξ(s), σ 2 ). (The Gaussian policy has an additional parameter σ.)
Non-linear policy Note that in both the log linear and Gaussian linear policies
above, the dependence of µ on θ was linear. It is straightforward to extend these
policies such that µ depends on θ in a more expressive and non-linear manner. A
popular parametrization is a feed-forward neural network, also called a multi-layered
perceptron (MLP). An MLP with d inputs, 2 hidden layers of sizes h1 , h2 , and k
outputs has parameters θ0 ∈ Rd×h1 , θ1 ∈ Rh1 ×h2 , θ2 ∈ Rh2 ×k . The MLP computes
µ ∈ Rk as follows:
ξ(s) = θ2T fnl θ1T fnl θ0T φ (s) ∈ Rk ,
where fnl is some non-linear function that is applied element-wise to each component
of a vector, for example the Rectified Linear Unit (ReLU) defined as ReLU(x) =
208
max(0, x). Once µ is computed, selecting an action proceeds similarly as above, e.g.,
by sampling from the normal distribution with mean ξ(s) and variance σ 2 .
Simplex policy This policy representation will be used mostly for pedagogical rea-
sons, and can express any Markov stochastic policy. For a finite state and action
space, let θ ∈ [0, ∞)S×A , and denote θs,a the parameter corresponding to state s
and action a. We define
$$\pi(a|s,\theta) = \frac{\theta_{s,a}}{\sum_{a'} \theta_{s,a'}}.$$
Clearly, any Markov policy π̃ can be represented by setting $\theta_{s,a} = \tilde{\pi}(a|s)$.
Proof. We have,
$$\begin{aligned}
d^\pi(s) &= \mu(s) + \sum_{t=1}^{\infty} P(s_t = s \mid \mu, \pi) \\
&= \mu(s) + \sum_{t=1}^{\infty} \sum_{s'} P(s_{t-1} = s' \mid \mu, \pi)\, P^\pi(s|s') \\
&= \mu(s) + \sum_{s'} P^\pi(s|s') \sum_{t=1}^{\infty} P(s_{t-1} = s' \mid \mu, \pi) \\
&= \mu(s) + \sum_{s'} d^\pi(s')\, P^\pi(s|s').
\end{aligned}$$
Writing the result in matrix notation gives the first result. For the second result,
Proposition 7.1 showed that (I − P π ) is invertible.
To deal with large state spaces, as we did in previous chapters, we will want to use
sampling to approximate quantities that depend on all states. Note that expectations
over the state visitation frequencies can be approximated by sampling from policy
rollouts.
Proof. We have
$$\begin{aligned}
\mathbb{E}^\pi\left[\sum_{t=0}^{\tau} g(s_t)\right]
&= \mathbb{E}^\pi\left[\sum_{t=0}^{\tau} \sum_{s} \mathbb{I}(s_t = s)\, g(s_t)\right]
= \mathbb{E}^\pi\left[\sum_{s} \sum_{t=0}^{\tau} \mathbb{I}(s_t = s)\, g(s_t)\right] \\
&= \mathbb{E}^\pi\left[\sum_{s} \sum_{t=0}^{\tau} \mathbb{I}(s_t = s)\, g(s)\right]
= \sum_{s} g(s)\, \mathbb{E}^\pi\left[\sum_{t=0}^{\tau} \mathbb{I}(s_t = s)\right]
= \sum_{s} g(s)\, d^\pi(s).
\end{aligned}$$
Lemma 13.3. For any two policies, π and π′, corresponding to parameters θ and θ′, we have
$$J(\theta') - J(\theta) = \sum_{s} d^{\pi'}(s) \sum_{a} \pi'(a|s)\left(Q^{\pi}(s,a) - V^{\pi}(s)\right). \tag{13.2}$$
Proof. We have that $V^{\pi'} = (I - P^{\pi'})^{-1} r$, and therefore
$$\begin{aligned}
V^{\pi'} - V^{\pi} &= (I - P^{\pi'})^{-1} r - (I - P^{\pi'})^{-1}(I - P^{\pi'})V^{\pi} \\
&= (I - P^{\pi'})^{-1}\left(r + P^{\pi'} V^{\pi} - V^{\pi}\right).
\end{aligned}$$
Multiplying both sides from the left by $\mu^\top$, and using Proposition 13.1, $d^{\pi'} = \mu^\top (I - P^{\pi'})^{-1}$, this gives
$$J(\theta') - J(\theta) = d^{\pi'}\left(r + P^{\pi'} V^{\pi} - V^{\pi}\right).$$
Finally, note that $\sum_a \pi'(a|s)\, Q^{\pi}(s,a) = r(s) + \sum_{s'} P^{\pi'}(s'|s)\, V^{\pi}(s')$.
Given some policy π(a|s), an improved policy π 0 (a|s) must satisfy that the right
hand side of Eq. 13.2 is positive. Let us try to intuitively understand this crite-
rion. First, consider the simplex policy parametrization above, which can express
any Markov policy. Consider the policy iteration update π 0 (s) = arg maxa Qπ (s, a).
Substituting in the right hand side of Eq. 13.2 yields a non-negative value for every
s, and therefore an improved policy as expected.
For some policy parametrizations, however, the terms in the sum in Eq. 13.2
cannot be made positive for all s. To obtain policy improvement, the terms need
to be balanced such that a positive sum is obtained. This is not straightforward for
two reasons. First, for large state spaces, it is not tractable to compute the sum over
s, and sampling must be used to approximate this sum. However, straightforward sampling of states from a fixed policy will not work, as the weights in the sum, $d^{\pi'}(s)$, depend on the policy π′! The basic insight is that when we modify π, we directly
influence the action distribution, but we also indirectly change the state distribution,
which influences the expected reward.
The following example shows that indeed, balancing the sum with respect to
weights that correspond to the current policy π does not necessarily lead to a policy
improvement.
Example 13.1. Consider the finite horizon MDP in Figure 13.1, where the policy is parametrized by $\theta = [\theta_1, \theta_2] \in [0,1]^2$, and let π correspond to $\theta_1 = \theta_2 = 1/4$. It is easy to verify that $d^\pi(s_1) = 1$, $d^\pi(s_2) = 1/4$, and $d^\pi(s_3) = 3/4$. Simple calculations give the values $Q^\pi(s,a) - V^\pi(s)$, which we now plug in for the three states, with the weights $d^\pi$ in place of $d^{\pi'}$ in the right hand side of Eq. 13.2.

Figure 13.1: Example MDP

For state $s_1$ we have $\theta_1\left(\tfrac{3}{4} - \tfrac{3}{8}\right) + (1-\theta_1)\left(\tfrac{1}{4} - \tfrac{3}{8}\right) = \tfrac{\theta_1}{2} - \tfrac{1}{8}$. For state $s_2$ we have $\tfrac{1}{4}\left(-\tfrac{3}{4}\theta_2 + (1-\theta_2)\tfrac{1}{4}\right) = \tfrac{1}{16} - \tfrac{\theta_2}{4}$. For state $s_3$ we have $\tfrac{3}{4}\left(\tfrac{3}{4}\theta_2 + (1-\theta_2)\left(-\tfrac{1}{4}\right)\right) = \tfrac{3\theta_2}{4} - \tfrac{3}{16}$. Maximizing over θ we have,
$$\arg\max_{\theta_1}\left[\frac{\theta_1}{2} - \frac{1}{8} + \frac{1}{16} - \frac{\theta_2}{4} + \frac{3\theta_2}{4} - \frac{3}{16}\right] = \arg\max_{\theta_1} \frac{\theta_1}{2} = 1,$$
$$\arg\max_{\theta_2}\left[\frac{\theta_1}{2} - \frac{1}{8} + \frac{1}{16} - \frac{\theta_2}{4} + \frac{3\theta_2}{4} - \frac{3}{16}\right] = \arg\max_{\theta_2} \frac{\theta_2}{2} = 1.$$
However, for π′ that corresponds to θ′ = [1, 1] we have that $V^{\pi'}(s_1) = 0 < V^{\pi}(s_1)$.
Intuitively, we expect that if the difference π′ − π is ‘small’, then the difference in the state visitation frequencies $d^{\pi'} - d^{\pi}$ would also be ‘small’, allowing us to safely replace $d^{\pi'}$ in the right hand side of Eq. 13.2 with $d^{\pi}$. This is the route taken by
several algorithmic approaches, which differ in the way of defining a ‘small’ policy
perturbation. Of particular interest to us is the case of an infinitesimal perturbation,
that is, the policy gradient ∇θ J(θ). In the following, we shall describe in detail
several algorithms for estimating the policy gradient.
The basic policy gradient algorithm updates the parameters in the direction of the gradient,
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta),$$
where α is a learning rate. For a small enough learning rate, each update is guaranteed to increase J(θ).
In the following, we shall explore several different approaches for calculating the
gradient ∇θ J(θ) using rollouts from the MDP.
$$\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{\hat{J}(\theta + \delta e_i) - \hat{J}(\theta)}{\delta},$$
where $\hat{J}(\theta)$ is an unbiased estimator of J(θ). A more symmetric approximation is sometimes better,
$$\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{\hat{J}(\theta + \delta e_i) - \hat{J}(\theta - \delta e_i)}{2\delta}.$$
The problem is that we need to average many samples of $\hat{J}(\theta \pm \delta e_i)$ to overcome the noise. Another weakness is that we need to do the computation per dimension. In addition, the selection of δ is also critical. A small δ might have a large noise rate that we need to overcome (by using many samples). A large δ runs the risk of facing the non-linearity of J.
Rather than performing the computation and optimization separately per dimension, we can take a more global approach and use a least squares estimation of the gradient. Consider a random vector $u_i$; then we have
$$J(\theta + \delta u_i) - J(\theta) \approx \delta\, u_i^\top G,$$
where G is our estimate for ∇J(θ).
We can reformulate the problem in matrix notation and define $\Delta J^{(i)} = J(\theta + \delta u_i) - J(\theta)$ and $\Delta J = [\cdots, \Delta J^{(i)}, \cdots]^\top$. We define $\Delta\theta^{(i)} = \delta u_i$, and the matrix $[\Delta\Theta] = [\cdots, \Delta\theta^{(i)}, \cdots]^\top$, where the i-th row is $\Delta\theta^{(i)}$. We would like to solve for the gradient, i.e.,
$$\Delta J \approx [\Delta\Theta]\, x.$$
One issue that we neglected is that we actually do not have the value of J(θ). The solution is to solve also for the value of J(θ). We can define a matrix $M = [\mathbf{1}, [\Delta\Theta]]$, i.e., adding a column of ones, a vector of unknowns $x = [J(\theta), \nabla J(\theta)]$, and have the target be $z = [\cdots, J(\theta + \delta u_i), \cdots]$. We can now solve $z \approx M x$ in the least squares sense, and this will recover an estimate also for J(θ).
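The following sketch (ours; it assumes access to a noisy return estimator `J_hat`) implements this least squares estimate, including the column of ones that accounts for the unknown J(θ):

```python
import numpy as np

def least_squares_gradient(J_hat, theta, delta=0.1, num_probes=100, seed=0):
    """Jointly estimate J(theta) and grad J(theta) from random perturbations."""
    rng = np.random.default_rng(seed)
    d = theta.shape[0]
    M = np.ones((num_probes, d + 1))        # first column of ones corresponds to the unknown J(theta)
    z = np.zeros(num_probes)
    for i in range(num_probes):
        u = rng.standard_normal(d)
        M[i, 1:] = delta * u                # Delta theta^(i) = delta * u_i
        z[i] = J_hat(theta + delta * u)     # noisy evaluation of J at the perturbed parameters
    x, *_ = np.linalg.lstsq(M, z, rcond=None)   # solve z ~ M x in the least squares sense
    return x[0], x[1:]                      # estimates of J(theta) and grad J(theta)
```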
Assumption 13.1. The gradient ∇π(a|s, θ) exists and is finite for every θ ∈ Rd ,
s ∈ S, and a ∈ A.
Theorem 13.4 (Policy Gradient Theorem).
$$\nabla J(\theta) = \sum_{s\in S} d^{\pi}(s) \sum_{a\in A} \nabla \pi(a|s,\theta)\, Q^{\pi}(s,a).$$
In what follows we will mainly make sure that we are able to use this result to obtain estimates, and that the required quantities are indeed observable by the learner.
Proof. For simplicity we consider that θ is a scalar; the extension to the vector case is immediate. By definition we have that
$$\begin{aligned}
\nabla J(\theta) &= \lim_{\delta\theta\to 0} \frac{J(\theta+\delta\theta) - J(\theta)}{\delta\theta} \\
&= \lim_{\delta\theta\to 0} \frac{1}{\delta\theta}\sum_s d^{\pi_{\theta+\delta\theta}}(s)\sum_a \pi_{\theta+\delta\theta}(a|s)\left(Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)\right) \\
&= \lim_{\delta\theta\to 0} \sum_s d^{\pi_{\theta+\delta\theta}}(s)\sum_a \frac{\pi_{\theta+\delta\theta}(a|s) - \pi_{\theta}(a|s)}{\delta\theta}\, Q^{\pi_\theta}(s,a) \\
&= \sum_s d^{\pi_\theta}(s)\sum_a \nabla\pi_{\theta}(a|s)\, Q^{\pi_\theta}(s,a),
\end{aligned}$$
where the second equality uses Lemma 13.3 and the third equality is since $\sum_a \pi_{\theta+\delta\theta}(a|s)\, V^{\pi_\theta}(s) = V^{\pi_\theta}(s)$, and $V^{\pi_\theta}(s) = \sum_a \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)$. The fourth equality holds by definition of the derivative, and using Assumption 13.1. Note that Assumption 13.1 guarantees that π is continuous in θ, and therefore $P^\pi$ is continuous in θ, and by Proposition 13.1 we must have $\lim_{\delta\theta\to 0} d^{\pi_{\theta+\delta\theta}}(s) = d^{\pi_\theta}(s)$.
The Policy Gradient Theorem gives us a way to compute the gradient. We can
sample states from the distribution dπ (s) using the policy π. We still need to resolve
the sampling of the action. We are going to observe the outcome of only one action
in state s, and the theorem requires summing over all of them! In the following we
will slightly modify the theorem so that we will be able to use only the action a
selected by the policy π, rather than summing over all actions.
Consider the following simple identity,
$$\nabla f(x) = f(x)\frac{\nabla f(x)}{f(x)} = f(x)\nabla \log f(x). \tag{13.3}$$
This implies that we can restate the Policy Gradient Theorem as the following corol-
lary,
Corollary 13.5 (Policy Gradient Corollary). Consider a random rollout from the policy, $s_0, a_0, r_0, \ldots, s_\tau, a_\tau, r_\tau$, where $s_0 \sim \mu$, $a_t \sim \pi(\cdot|s_t,\theta)$, $s_{t+1} \sim P(\cdot|s_t,a_t)$, and τ is the termination time. We have
$$\begin{aligned}
\nabla J(\theta) &= \sum_{s\in S} d^\pi(s) \sum_{a\in A} \pi(a|s)\, Q^\pi(s,a)\, \nabla\log\pi(a|s) \\
&= \mathbb{E}^\pi\left[\sum_{t=0}^{\tau} Q^\pi(s_t,a_t)\, \nabla\log\pi(a_t|s_t)\right].
\end{aligned}$$
Proof. The first equality is by the identity above, and the second is by definition of
dπ (s), similarly to Proposition 13.2.
Note that in the above corollary both the state s and action a are sampled using
the policy π. This avoids the need to sum over all actions, and leaves only the action
selected by the policy.
We next provide some examples for the policy gradient theorem.
Example 13.2. Consider an MDP with a single state s (which is also called Multi-
Arm Bandit, see Chapter 14). Assume we have only two actions, action a1 has
expected reward r1 and action a2 has expected reward r2 .
The policy π is defined with a parameter $\theta = (\theta_1, \theta_2)$, where $\theta_i \in \mathbb{R}$. Given θ, the probability of action $a_i$ is $p_i = e^{\theta_i}/(e^{\theta_1} + e^{\theta_2})$. We will also select a horizon of length one, i.e., T = 1. This implies that $Q^\pi(s, a_i) = r_i$.
In this simple case we can compute directly J(θ) and ∇J(θ). The expected return is simply,
$$J(\theta) = p_1 r_1 + p_2 r_2 = \frac{e^{\theta_1}}{e^{\theta_1}+e^{\theta_2}}\, r_1 + \frac{e^{\theta_2}}{e^{\theta_1}+e^{\theta_2}}\, r_2.$$
Note that $\frac{\partial}{\partial\theta_1} p_1 = p_1 - p_1^2 = p_1(1-p_1)$ and $\frac{\partial}{\partial\theta_2} p_1 = -p_1 p_2 = -p_1(1-p_1)$. The gradient is
$$\nabla J(\theta) = \begin{pmatrix} p_1(1-p_1) \\ -p_1(1-p_1)\end{pmatrix} r_1 + \begin{pmatrix} -p_1(1-p_1) \\ p_1(1-p_1)\end{pmatrix} r_2 = (r_1 - r_2)\, p_1(1-p_1) \begin{pmatrix} +1 \\ -1 \end{pmatrix}.$$
Updating in the direction of the gradient, in the case that r1 > r2 , would increase θ1
and decrease θ2 , and eventually p1 will converge to 1.
To apply the policy gradient theorem we need to compute the gradient of the policy,
$$\nabla_\theta \pi(a_1|s;\theta) = \nabla p_1 = \begin{pmatrix} p_1(1-p_1) \\ -p_1(1-p_1)\end{pmatrix}, \qquad \nabla_\theta \pi(a_2|s;\theta) = \nabla p_2 = -\nabla p_1.$$
The policy gradient theorem then gives
$$\nabla J(\theta) = Q^\pi(s,a_1)\nabla\pi(a_1|s;\theta) + Q^\pi(s,a_2)\nabla\pi(a_2|s;\theta) = (r_1 - r_2)\, p_1(1-p_1)\begin{pmatrix} +1 \\ -1 \end{pmatrix},$$
where we used the fact that there is only a single state s, and that $Q^\pi(s, a_i) = r_i$.
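As a quick numeric sanity check of this example (a sketch we add; the reward values are arbitrary), one can compare the analytic gradient $(r_1-r_2)\,p_1(1-p_1)\,(+1,-1)^\top$ with a finite-difference estimate:

```python
import numpy as np

r = np.array([1.0, 0.2])           # hypothetical expected rewards with r1 > r2
theta = np.array([0.3, -0.1])

def J(th):
    p = np.exp(th - th.max()); p /= p.sum()
    return p @ r                    # J(theta) = p1*r1 + p2*r2

p = np.exp(theta - theta.max()); p /= p.sum()
analytic = (r[0] - r[1]) * p[0] * (1 - p[0]) * np.array([1.0, -1.0])

eps = 1e-6
numeric = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps) for e in np.eye(2)])
print(analytic, numeric)            # the two gradients agree up to numerical error
```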
Example 13.3. Consider the following deterministic MDP. We have states S =
{s0 , s1 , s2 , s3 } and actions A = {a0 , a1 }. We start at s0 . Action a0 from any state
leads to s3 . Action a1 moves from s0 to s1 , from s1 to s2 and from s2 to s3 . All the
rewards are zero except the terminal reward at s2 which is 1. The horizon is T = 2.
This implies that the optimal policy performs in each state a1 and has a return of 1.
We have a log-linear policy parameterized by θ ∈ R4 . In state s0 it selects action
a1 with probability p1 = eθ1 /(eθ1 + eθ2 ), and in state s1 it selects action a1 with
probability p2 = eθ3 /(eθ3 + eθ4 ).
For this simple MDP we can specify the expected return $J(\theta) = p_1 p_2$. We can also compute the gradient and have
$$\nabla J(\theta) = \begin{pmatrix} p_1(1-p_1)p_2 \\ -p_1(1-p_1)p_2 \\ p_1 p_2(1-p_2) \\ -p_1 p_2(1-p_2) \end{pmatrix} = p_1 p_2 \begin{pmatrix} (1-p_1) \\ -(1-p_1) \\ (1-p_2) \\ -(1-p_2) \end{pmatrix}.$$
The policy gradient theorem will use the following ingredients. The Qπ is: Qπ (s0 , a1 ) =
p2 , Qπ (s1 , a1 ) = 1 and all the other entries are zero. The weights of the states are
dπ (s0 ) = 1, dπ (s1 ) = p1 , dπ (s2 ) = p1 p2 and dπ (s3 ) = 2 − p1 − p1 p2 . The gradient of
the action in each state is:
$$\nabla\pi(a_1|s_0;\theta) = \begin{pmatrix} p_1 - p_1^2 \\ -p_1(1-p_1) \\ 0 \\ 0 \end{pmatrix} = p_1(1-p_1)\begin{pmatrix} 1 \\ -1 \\ 0 \\ 0 \end{pmatrix}.$$
Similarly,
$$\nabla\pi(a_1|s_1;\theta) = \begin{pmatrix} 0 \\ 0 \\ p_2 - p_2^2 \\ -p_2(1-p_2) \end{pmatrix} = p_2(1-p_2)\begin{pmatrix} 0 \\ 0 \\ 1 \\ -1 \end{pmatrix}.$$
The policy gradient theorem states that the expected return gradient is
$$d^\pi(s_0)\, Q^\pi(s_0,a_1)\,\pi(a_1|s_0;\theta)\,\nabla\log\pi(a_1|s_0;\theta) + d^\pi(s_1)\, Q^\pi(s_1,a_1)\,\pi(a_1|s_1;\theta)\,\nabla\log\pi(a_1|s_1;\theta),$$
where we dropped all the terms that evaluate to zero. Plugging in our values we have
$$p_2\, p_1(1-p_1)\begin{pmatrix} 1 \\ -1 \\ 0 \\ 0 \end{pmatrix} + p_1\, p_2(1-p_2)\begin{pmatrix} 0 \\ 0 \\ 1 \\ -1 \end{pmatrix} = p_1 p_2 \begin{pmatrix} (1-p_1) \\ -(1-p_1) \\ (1-p_2) \\ -(1-p_2) \end{pmatrix},$$
which is identical to ∇J(θ).
Example 13.4. Consider the bandit setting with continuous action A = R, where the
MDP has only a single state and the horizon is T = 1. The policy and reward are
given as follows:
$$r(a) = a, \qquad \pi(a) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(a-\theta)^2}{2\sigma^2}\right),$$
where the parameter is θ ∈ R and σ is fixed and known. As in Example 13.2, we have
that Qπ (s, a) = a. Also, J(θ) = Eπ [a] = θ, and thus ∇J(θ) = 1. Using Corollary
13.5, we calculate:
$$\nabla\log\pi(a) = \frac{a-\theta}{\sigma^2},$$
$$\nabla J(\theta) = \mathbb{E}^\pi\left[\frac{a(a-\theta)}{\sigma^2}\right] = \frac{1}{\sigma^2}\left(\mathbb{E}^\pi[a^2] - (\mathbb{E}^\pi[a])^2\right) = 1.$$
Note the intuitive interpretation of the policy gradient here: we average the difference
of an action from the mean action a − θ and the value it yields Qπ (s, a) = a. In
this case, actions above the mean lead to higher reward, thereby ‘pushing’ the mean
action θ to increase. Note that indeed the optimal value of θ is infinite.
Simplex policy For the Simplex policy class, we have
$$\nabla_{\theta_{s,a}} \log\pi(a|s;\theta) = \nabla\log\theta_{s,a} - \nabla\log\sum_b \theta_{s,b} = \frac{1}{\theta_{s,a}} - \frac{1}{\sum_b \theta_{s,b}},$$
and for $b' \neq a$,
$$\nabla_{\theta_{s,b'}} \log\pi(a|s;\theta) = -\nabla\log\sum_b \theta_{s,b} = -\frac{1}{\sum_b \theta_{s,b}}.$$
Algorithm 22 REINFORCE
1: Input step size α
2: Initialize $\theta_0$ arbitrarily
3: For j = 0, 1, 2, . . .
4:   Sample rollout $(s_0, a_0, r_0, \ldots, s_\tau, a_\tau, r_\tau)$ using policy $\pi_{\theta_j}$.
5:   Set $R_{t:\tau} = \sum_{i=t}^{\tau} r_i$
6:   Update policy parameters:
$$\theta_{j+1} = \theta_j + \alpha \sum_{t=0}^{\tau} R_{t:\tau}\, \nabla\log\pi(a_t|s_t;\theta_j)$$
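A minimal sketch of Algorithm 22 for a tabular softmax policy (ours; the episodic environment interface `env.reset()` / `env.step(a)` returning `(next_state, reward, done)` is an assumption):

```python
import numpy as np

def reinforce(env, num_states, num_actions, alpha=0.01, num_rollouts=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((num_states, num_actions))     # tabular log-linear policy parameters

    def pi(s):
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    for _ in range(num_rollouts):
        s, done, traj = env.reset(), False, []
        while not done:                             # sample a rollout with the current policy
            a = rng.choice(num_actions, p=pi(s))
            s_next, r, done = env.step(a)
            traj.append((s, a, r))
            s = s_next
        G, grad = 0.0, np.zeros_like(theta)
        for s, a, r in reversed(traj):              # accumulate returns-to-go R_{t:tau}
            G += r
            glog = -pi(s)                           # grad_theta[s] log pi(a|s): -pi(b|s) for b != a,
            glog[a] += 1.0                          # and 1 - pi(a|s) for b = a
            grad[s] += G * glog
        theta += alpha * grad                       # theta_{j+1} = theta_j + alpha * sum_t R_{t:tau} grad log pi
    return theta
```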
Baseline function
One caveat with the REINFORCE algorithm as stated above is that it tends to have high variance in estimating the policy gradient, which in practice leads to slow convergence.¹ A common and elegant technique to reduce variance is to add to REINFORCE a baseline function, also termed a ‘control variate’.

¹We implicitly assume that no state appears twice in the trajectory, and therefore the ‘every visit’ and ‘first visit’ Monte-Carlo updates are equivalent.
The baseline function b(s) can depend in an arbitrary way on the state, but
does not depend on the action. The main observation would be that we can add or
subtract any such function from our Qπ (s, a) estimate, and it will still be unbiased.
This follows since
$$\sum_a b(s)\nabla\pi(a|s;\theta) = b(s)\nabla\sum_a \pi(a|s;\theta) = b(s)\nabla 1 = 0. \tag{13.4}$$
This gives us a degree of freedom to select b(s). Note that by setting b(s) = 0 we
get the original theorem. In many cases it is reasonable to use for b(s) the value of
the state, i.e., b(s) = V π (s). The motivation for this is to reduce the variance of the
estimator. If we assume that the magnitude of the gradients $\|\nabla\log\pi(a|s)\|$ is similar for all actions $a \in A$, we are left with $\mathbb{E}^\pi[(Q^\pi(s,a) - b(s))^2]$, which is minimized by $b(s) = \mathbb{E}^\pi[Q^\pi(s,a)] = V^\pi(s)$.
The following example shows this explicitly.
Example 13.5. Consider the bandit setting of Example 13.4, where we recall that $r(a) = a$ and $\pi_\theta(a) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(a-\theta)^2}{2\sigma^2}\right)$. Find a fixed baseline b that minimizes the variance of the policy gradient estimate.
The policy gradient formula in this case is:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\frac{(a-b)(a-\theta)}{\sigma^2}\right] = 1,$$
and we can calculate the variance
$$\begin{aligned}
\mathrm{Var}\left[\frac{1}{\sigma^2}(a-b)(a-\theta)\right] &= \frac{1}{\sigma^4}\,\mathbb{E}\left[\big((a-b)(a-\theta)\big)^2\right] - 1 \\
&= \frac{1}{\sigma^4}\,\mathbb{E}\left[\big((a-\theta)(a-\theta) + (\theta-b)(a-\theta)\big)^2\right] - 1 \\
&= \frac{1}{\sigma^4}\,\mathbb{E}\left[(a-\theta)^4 + 2(\theta-b)(a-\theta)^3 + (\theta-b)^2(a-\theta)^2\right] - 1 \\
&= \frac{1}{\sigma^4}\,\mathbb{E}\left[(a-\theta)^4 + (\theta-b)^2(a-\theta)^2\right] - 1,
\end{aligned}$$
where the last equality holds since the third central moment of a Gaussian is zero. This is minimized for $b = \theta = V(s)$.
We are left with the challenge of approximating V π (s). On the one hand this
is part of the learning. On the other hand we have developed tools to address this
in the previous chapter on value function approximation (Chapter 12). We can use
V π (s) ≈ V (s; w) = b(s). The good news is that any b(s) will keep the estimator
unbiased, so we do not depend on V (s; w) to be unbiased.
We can now describe the REINFORCE algorithm with baseline function. We will
use a Monte-Carlo sampling to estimate V π (s) using a class of value approximation
functions V (·; w) and this will define our baseline function b(s). Note that now we
have two parameter vectors: θ for the policy, and w for the value function.
Note that the update for θ follows the policy gradient theorem with a baseline
V (st ; w), and the update for w is a stochastic gradient descent on the mean squared
error with step size β.
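Since the algorithm box itself is not reproduced here, the following is a sketch (ours) of the two updates just described, for a tabular softmax policy θ and a tabular value baseline w:

```python
import numpy as np

def reinforce_baseline_update(trajectory, theta, w, alpha=0.01, beta=0.1):
    """One REINFORCE update with a state-value baseline b(s) = V(s; w) (both tabular).

    trajectory is a list of (s, a, r) tuples from a single rollout; theta[s, a] are
    softmax policy parameters and w[s] approximates V^pi(s).
    """
    def pi(s):
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    G = 0.0
    dtheta, dw = np.zeros_like(theta), np.zeros_like(w)
    for s, a, r in reversed(trajectory):
        G += r                               # Monte-Carlo return R_{t:tau}
        delta = G - w[s]                     # baseline-corrected return
        glog = -pi(s); glog[a] += 1.0        # grad_theta[s] log pi(a|s)
        dtheta[s] += delta * glog
        dw[s] += delta                       # SGD step direction on 0.5*(G - w[s])^2
    theta += alpha * dtheta                  # policy gradient step with baseline V(s; w)
    w += beta * dw                           # value update with step size beta
    return theta, w
```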
The critic maintains an approximate Q function Q(s, a; w). For each time t it defines the TD error to be $\Gamma_t = r_t + Q(s_{t+1}, a_{t+1}; w) - Q(s_t, a_t; w)$. The update is $\Delta w = \alpha \Gamma_t \nabla Q(s_t, a_t; w)$. The critic sends the actor the TD error $\Gamma_t$.
The actor maintains a policy π which is parameterized by θ. Given a TD error $\Gamma_t$ it updates $\Delta\theta = \beta\Gamma_t \nabla\log\pi(a_t|s_t;\theta)$. Then it selects $a_{t+1} \sim \pi(\cdot|s_{t+1};\theta)$.
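A tabular sketch (ours) of one such actor-critic step, with Q(s, a; w) = w[s, a] and a softmax actor:

```python
import numpy as np

def actor_critic_step(s, a, r, s_next, a_next, theta, w, alpha=0.1, beta=0.01):
    """One tabular actor-critic update along the lines described above (our sketch).

    The critic maintains Q(s, a; w) = w[s, a]; the actor maintains a softmax policy
    parameterized by theta[s, a]. The next action a_next is assumed already sampled.
    """
    # Critic: TD error Gamma_t and update Delta w = alpha * Gamma_t * grad_w Q(s_t, a_t; w)
    td_error = r + w[s_next, a_next] - w[s, a]
    w[s, a] += alpha * td_error

    # Actor: Delta theta = beta * Gamma_t * grad_theta log pi(a_t|s_t; theta)
    z = np.exp(theta[s] - theta[s].max()); pi_s = z / z.sum()
    glog = -pi_s; glog[a] += 1.0
    theta[s] += beta * td_error * glog
    return theta, w
```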
We need to be careful in the way we select the function approximation Q(·; w), since it might introduce a bias (note that here we use the function approximation to estimate Q(s, a) directly, and not the baseline as in the REINFORCE method above). The following theorem identifies a special case which guarantees that we will not have such a bias.
Let the expected squared error of w be
$$SE(w) = \frac{1}{2}\,\mathbb{E}^\pi\left[(Q^\pi(s,a) - Q(s,a;w))^2\right].$$
A value function is compatible if
$$\nabla_w Q(s,a;w) = \nabla_\theta \log\pi(a|s;\theta).$$
Theorem 13.6. Assume that Q is compatible and w minimizes SE(w). Then,
$$\nabla_\theta J(\theta) = \mathbb{E}^\pi\left[\sum_{t=1}^{\tau} Q(s_t,a_t;w)\,\nabla\log\pi(a_t|s_t;\theta)\right].$$
We can summarize the various updates for the policy gradient as follows:
• REINFORCE (which is a Monte-Carlo estimate) uses $\mathbb{E}^\pi[R_t\,\nabla\log\pi(a|s;\theta)]$.
• Q-function with actor-critic uses $\mathbb{E}^\pi[Q(s_t,a_t;w)\,\nabla\log\pi(a|s;\theta)]$.
• A-function with actor-critic uses $\mathbb{E}^\pi[A(s_t,a_t;w)\,\nabla\log\pi(a|s;\theta)]$, where $A(s,a;w) = Q(s,a;w) - V(s;w)$. The A-function is also called the Advantage function.
• TD with actor-critic uses $\mathbb{E}^\pi[\Gamma\,\nabla\log\pi(a|s;\theta)]$, where Γ is the TD error.
(a) A convex function with a single global minimum. (b) Non-convex function with a sub-optimal local minimum. (c) Non-convex function with a single global minimum.
Figure 13.2: Gradient descent with a proper step size will converge to a global optimum in (a) and (c), but not in (b).
of the value in the policy. Perhaps a more intuitive way of combining two policies is
by selecting which policy to run at the beginning of an episode, and using only that
policy throughout the episode. For such a non-Markovian policy, the expected value
will simply be the average of the values of the two policies.
Remark 13.2. From the linear programming formulation in Chapter 8.3, we know
that the value is linear (and thereby convex) in the state-action frequencies. While a
policy can be inferred from state-action frequencies, this mapping is non-linear, and
as the example above shows, renders the mapping from policy to value not necessarily
convex.
Following Example 13.6, we should not immediately expect policy gradient al-
gorithms to converge to a globally optimal policy. Interestingly, in the following we
shall show that nevertheless, for the simplex policy there are no local optima that
are not globally optimal.
Before we show this, however, we must handle a delicate technical issue. The
simplex policy is only defined for $\theta_{s,a} \geq 0$. What happens if some $\theta_{s,a} = 0$ and $\frac{\partial J(\theta)}{\partial \theta_{s,a}} < 0$? We shall assume that in this case, the policy gradient algorithm will maintain $\theta_{s,a} = 0$. We can therefore consider a modified gradient at $\theta_{s,a} = 0$:
$$\frac{\partial \tilde{J}(\theta)}{\partial \theta_{s,a}}\bigg|_{\theta_{s,a}=0} = \max\left\{0,\ \frac{\partial J(\theta)}{\partial \theta_{s,a}}\right\}, \qquad \frac{\partial \tilde{J}(\theta)}{\partial \theta_{s,a}}\bigg|_{\theta_{s,a}\neq 0} = \frac{\partial J(\theta)}{\partial \theta_{s,a}}.$$
We shall make a further assumption that $d^\pi(s) > 0$ for all s and π. To understand why this is necessary, consider an initial policy $\pi_0$ that does not visit a particular state s at all, and therefore $d^{\pi_0}(s) = 0$. From the policy gradient theorem, we will have that $\frac{\partial J(\theta)}{\partial \theta_{s,a}} = 0$, and therefore the policy at s will not improve. If the optimal policy in other states does not induce a transition to s, we cannot expect convergence to the optimal policy in s. In other words, the policy must explore enough to cover the state space.
Furthermore, for simplicity, we shall assume that the optimal policy is unique.
Let us now calculate the policy gradient for the simplex policy.
$$\frac{\partial \pi(a'|s)}{\partial \theta_{s,a}} = \begin{cases} \dfrac{\sum_{a''}\theta_{s,a''} - \theta_{s,a'}}{\left(\sum_{a''}\theta_{s,a''}\right)^2}, & \text{if } a' = a, \\[2ex] \dfrac{-\theta_{s,a'}}{\left(\sum_{a''}\theta_{s,a''}\right)^2}, & \text{if } a' \neq a. \end{cases}$$
Plugging this into the policy gradient theorem gives
$$\frac{\partial J(\theta)}{\partial \theta_{s,a}} = \frac{d^\pi(s)}{\sum_{a''}\theta_{s,a''}}\left(Q^\pi(s,a) - V^\pi(s)\right).$$
Now, assume that π is not optimal; therefore there exists some s for which $\max_a Q^\pi(s,a) > V^\pi(s)$ (otherwise, $V^\pi$ would satisfy the Bellman optimality equation and would therefore be optimal). In this case, for $a = \arg\max_{a'} Q^\pi(s,a')$ we have that $\frac{\partial J(\theta)}{\partial \theta_{s,a}} > 0$, and therefore θ is not a local optimum.
Lastly, we should verify that the optimal policy π∗ is indeed a global optimum. The unique optimal policy is deterministic, and satisfies
$$\pi^*(a|s) = \begin{cases} 1, & \text{if } Q^{\pi^*}(s,a) = V^{\pi^*}(s), \\ 0, & \text{else}. \end{cases}$$
Consider any θ∗ such that for all s, a it satisfies
$$\theta^*_{s,a} \begin{cases} > 0, & \text{if } Q^{\pi^*}(s,a) = V^{\pi^*}(s), \\ = 0, & \text{else}. \end{cases}$$
By the above, we have that for the optimal action $\frac{\partial J(\theta)}{\partial \theta_{s,a}} = 0$, and for non-optimal actions $Q^{\pi^*}(s,a) - V^{\pi^*}(s) < 0$, therefore $\frac{\partial J(\theta)}{\partial \theta_{s,a}} < 0$ and $\frac{\partial \tilde{J}(\theta)}{\partial \theta_{s,a}} = 0$. Hence θ∗ is a stationary point of the modified gradient.
13.8 Proximal Policy Optimization
Recall our discussion about the policy difference lemma: if the difference π′ − π is ‘small’, then the difference in the state visitation frequencies $d^{\pi'} - d^{\pi}$ would also be ‘small’, allowing us to safely replace $d^{\pi'}$ in the right hand side of Eq. 13.2 with $d^{\pi}$. The Proximal Policy Optimization (PPO) algorithm is a popular heuristic that takes this approach, and has proved to perform very well empirically.
To simplify our notation we write the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$. The idea is to search for a policy that maximizes the policy improvement
$$\max_{\pi'\in\Pi} \sum_s d^{\pi'}(s)\sum_a \pi'(a|s)\, A^\pi(s,a),$$
by replacing $d^{\pi'}$ with the visitation frequencies of the current policy $d^\pi$, and performing the search over a limited set of policies Π that are similar to π. The main trick in PPO is that this constrained optimization can be done implicitly, by maximizing the following objective:
$$\mathrm{PPO}(\pi) = \max_{\pi'} \sum_s d^{\pi}(s)\sum_a \pi(a|s)\,\min\left\{\frac{\pi'(a|s)}{\pi(a|s)}\, A^\pi(s,a),\ \mathrm{clip}\!\left(\frac{\pi'(a|s)}{\pi(a|s)},\, 1-\epsilon,\, 1+\epsilon\right) A^\pi(s,a)\right\}, \tag{13.5}$$
where $\mathrm{clip}(x, x_{\min}, x_{\max}) = \min\{\max\{x, x_{\min}\}, x_{\max}\}$, and ε is some small constant. Intuitively, the clipping in this objective prevents the ratio between the new policy π′(a|s) and the previous policy π(a|s) from moving outside $[1-\epsilon, 1+\epsilon]$, assuring that maximizing the objective indeed leads to an improved policy.
To optimize the PPO objective using a sample rollout, we let $\Gamma_t$ denote an estimate of the advantage at state $s_t, a_t$, and take gradient ascent steps on:
$$\nabla_\theta \sum_{t=0}^{\tau} \min\left\{\frac{\pi'(a_t|s_t,\theta)}{\pi(a_t|s_t)}\,\Gamma_t,\ \mathrm{clip}\!\left(\frac{\pi'(a_t|s_t,\theta)}{\pi(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon\right)\Gamma_t\right\}.$$
Algorithm 24 PPO
1: Input step sizes α, β, inner loop optimization steps K, clip parameter ε
2: Initialize θ, w arbitrarily
3: For j = 0, 1, 2, . . .
4:   Sample rollout $(s_0, a_0, r_0, \ldots, s_\tau, a_\tau, r_\tau)$ using policy π.
5:   Set $R_{t:\tau} = \sum_{i=t}^{\tau} r_i$
6:   Set $\Gamma_t = R_{t:\tau} - V(s_t; w)$
7:   Set $\theta_{\mathrm{prev}} = \theta$
8:   For k = 1, . . . , K
9:     Update policy parameters:
$$\theta := \theta + \alpha\, \nabla_\theta \sum_{t=0}^{\tau} \min\left\{\frac{\pi(a_t|s_t,\theta)}{\pi(a_t|s_t,\theta_{\mathrm{prev}})}\,\Gamma_t,\ \mathrm{clip}\!\left(\frac{\pi(a_t|s_t,\theta)}{\pi(a_t|s_t,\theta_{\mathrm{prev}})},\, 1-\epsilon,\, 1+\epsilon\right)\Gamma_t\right\}$$
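For concreteness, a small sketch (ours) of the clipped surrogate used in line 9; in practice the gradient of this quantity with respect to θ is obtained with automatic differentiation and used for the K inner ascent steps:

```python
import numpy as np

def ppo_surrogate(ratio, adv, eps=0.2):
    """Clipped PPO surrogate for one rollout.

    ratio[t] = pi(a_t|s_t, theta) / pi(a_t|s_t, theta_prev) and adv[t] = Gamma_t.
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).sum()   # quantity maximized over the inner steps
```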
In this section, for didactic purposes, we show two alternative proofs for the policy
gradient theorem (Theorem 13.4). The first proof is based on an elegant idea of
unrolling of the value function, and the second is based on a trajectory-based view.
The trajectory-based proof will also lead to an interesting insight about partially
observed systems.
Proof. For each state s we have
$$\begin{aligned}
\nabla V^\pi(s) &= \nabla \sum_a \pi(a|s)\, Q^\pi(s,a) \\
&= \sum_a \left[ Q^\pi(s,a)\nabla\pi(a|s) + \pi(a|s)\nabla Q^\pi(s,a) \right] \\
&= \sum_a Q^\pi(s,a)\nabla\pi(a|s) + \sum_a \pi(a|s)\sum_{s_1} P(s_1|s,a)\nabla V^\pi(s_1) \\
&= \sum_a Q^\pi(s,a)\nabla\pi(a|s) + \sum_{s_1} P^\pi(s_1|s)\nabla V^\pi(s_1) \\
&= \sum_a Q^\pi(s,a)\nabla\pi(a|s) + \sum_{s_1} P^\pi(s_1|s)\sum_a Q^\pi(s_1,a)\nabla\pi(a|s_1) + \sum_{s_1,s_2} P^\pi(s_2|s_1)P^\pi(s_1|s)\nabla V^\pi(s_2) \\
&\;\;\vdots \\
&= \sum_{s'\in S}\sum_{t=0}^{\infty} P(s_t = s' \mid s_0 = s, \pi)\sum_a Q^\pi(s',a)\nabla\pi(a|s'),
\end{aligned}$$
where the first identity follows since, by averaging $Q^\pi(s,a)$ over the actions a with the probabilities induced by π(a|s), we obtain both the correct expectation of the immediate reward and the correct distribution of the next state. The second equality follows from the gradient of a product, i.e., ∇AB = A∇B + B∇A. The third follows since $\nabla Q^\pi(s,a) = \nabla\left[r(s,a) + \sum_{s'} P(s'|s,a)\, V^\pi(s')\right]$. The next two identities roll the policy one step into the future. The last identity follows from unrolling $s_1$ to $s_2$ etc., and then reorganizing the terms. The term that depends on $\nabla V^\pi(s_2)$ vanishes for $t\to\infty$ because we assume that the termination time is bounded with probability 1.
Using this we have
$$\begin{aligned}
\nabla J(\theta) &= \nabla \sum_s \mu(s)\, V^\pi(s) \\
&= \sum_s \mu(s)\sum_{s'}\left(\sum_{t=0}^{\infty} P(s_t = s' \mid s_0 = s, \pi)\right)\sum_a Q^\pi(s',a)\nabla\pi(a|s') \\
&= \sum_{s'}\left(\sum_{t=0}^{\infty} P(s_t = s' \mid \mu, \pi)\right)\sum_a Q^\pi(s',a)\nabla\pi(a|s') \\
&= \sum_{s'} d^\pi(s')\sum_a \nabla\pi(a|s')\, Q^\pi(s',a),
\end{aligned}$$
where the last equality is by definition of $d^\pi$.
Pr(X) = µ(s0 )π(a0 |s0 , θ)P(s1 |s0 , a0 )π(a1 |s1 , θ) · · · P(sG |sτ , aτ ). (13.6)
where the first equality is by (13.6), and the second equality is since the transitions
and initial distribution do not depend on θ. We therefore have that
" τ τ
#
X X
π
∇J(θ) = E ∇ log π(at |st , θ) r(st0 , at0 ) . (13.7)
t=0 t0 =0
We next show that in the sums in (13.7), it suffices to only consider rewards that come after $\nabla\log\pi(a_t|s_t,\theta)$. For $t' < t$, we have
$$\mathbb{E}^\pi\left[\nabla\log\pi(a_t|s_t,\theta)\, r(s_{t'},a_{t'})\right] = \mathbb{E}^\pi\Big[\mathbb{E}^\pi\big[\nabla\log\pi(a_t|s_t,\theta)\, r(s_{t'},a_{t'}) \,\big|\, s_0,a_0,\ldots,s_t\big]\Big] = \mathbb{E}^\pi\Big[r(s_{t'},a_{t'})\,\mathbb{E}^\pi\big[\nabla\log\pi(a_t|s_t,\theta) \,\big|\, s_0,a_0,\ldots,s_t\big]\Big] = 0,$$
where the first equality is from the law of total expectation, and the last is similar to (13.4). So we have
" τ τ
#
X X
π
∇J(θ) = E ∇ log π(at |st , θ) r(st0 , at0 ) . (13.8)
t=0 t0 =t
Note that the REINFORCE Algorithm 22 can be seen as estimating the expectation
in (13.8) from a single roll out. To finally obtain the policy gradient theorem, using
the law of total expectation again, we have
" τ τ
# "∞ ∞
#
X X X X
Eπ ∇ log π(at |st , θ) r(st0 , at0 ) = Eπ ∇ log π(at |st , θ) r(st0 , at0 )
t=0 t0 =t t=0 t0 =t
∞
" ∞
#
X X
= Eπ ∇ log π(at |st , θ) r(st0 , at0 )
t=0 t0 =t
∞
" " ∞
##
X X
= Eπ Eπ ∇ log π(at |st , θ) r(st0 , at0 ) st , at
t=0 t0 =t
∞
" " ∞
##
X X
= Eπ ∇ log π(at |st , θ)Eπ r(st0 , at0 ) st , at
t=0 t0 =t
∞
X
= Eπ [∇ log π(at |st , θ)Qπ (st , at )]
t=0
" τ
#
X
= Eπ ∇ log π(at |st , θ)Qπ (st , at ) ,
t=0
which is equivalent to the expression in Corollary 13.5. The first equality is since the
terminal state is absorbing, and has reward zero. The justification for exchanging
the expectation and infinite sum in the second equality is not straightforward. In
this case it holds by the Fubini theorem, using Assumption 7.1.
Partially Observed States We note that the derivation of (13.7) follows through
if we consider policies that cannot access the state, but only some encoding φ of it,
π(a|φ(s)). Even though the optimal Markov policy in an MDP is deterministic, the
encoding may lead to a system that is not Markovian anymore, by coalescing certain
states which have identical encoding. Considering stochastic policies and using a
policy gradient approach can be beneficial in such situations, as demonstrated in the
following example.
Figure 13.3: Grid-world example
Example 13.7 (Aliased Grid-world). Consider the example in Figure 13.3. The green state is the good goal and the red ones are bad. The encoding of each state is the location of the walls. In each state we need to choose a direction. The problem is that we have two states which are indistinguishable (marked by a question mark).
It is not hard to see that any deterministic policy would fail from some start state (either the left or the right one). Alternatively, we can use a randomized policy in those states: with probability half go right and with probability half go left. For such a policy the expected time to reach the green goal state (and avoid the red states) is rather short.
The issue here was that two different states had the same encoding, and thus violated the Markovian assumption. This can occur when we encode the state with a small set of features, and some (hopefully, similar) states coalesce to a single representation.
Remark 13.3. The state aliasing example above is a specific instance of a more general
decision making problem with partial observability, such as the partially observed
MDP (POMDP). While a treatment of POMDPs is not within the scope of this book,
we mention that the policy gradient approach applies to such models as well [8].
Chapter 14
Multi-Arm Bandits
We consider a simplified model of an MDP where there is only a single state and a
fixed set A of k actions (a.k.a., arms). We consider a finite horizon problem, where
the horizon is T. Clearly, the planning problem is trivial, simply select the action
with the highest expected reward. We will concentrate on the learning perspective,
where the expected reward of each action is unknown. In the learning setting we
would have a single episode of length T.
At each round 1 ≤ t ≤ T the learner selects and executes an action. After executing the action, the learner observes the reward of that action. However, the rewards of the other actions in A are not revealed to the learner.
The reward for action i at round t is denoted by rt (i) ∼ Di , where the support
of the reward distribution Di is [0, 1]. We assume that the rewards are i.i.d. (inde-
pendent and identically distributed) across time steps, but can be correlated across
actions in a single time step.
Motivation
1. News: a user visits a news site and is presented with a news header. The user
either clicks on this header or not. The goal of the website is to maximize the
number of clicks. So each possible header is an action in a bandit problem, and the clicks are the rewards.
2. Medical Trials: Each patient in the trial is prescribed one treatment out of
several possible treatments. Each treatment is an action, and the reward for
each patient is the effectiveness of the prescribed treatment.
3. Ad selection: In website advertising, a user visits a webpage, and a learning
algorithm selects one of many possible ads to display. If an advertisement is
displayed, the website observes whether the user clicks on the ad, in which
case the advertiser pays some amount va ∈ [0, 1]. So each advertisement is an
action, and the paid amount is the reward.
Model
• The expected reward of action i is $\mu_i = \mathbb{E}_{X\sim D_i}[X]$.
• The learner observes either full feedback, the reward for each possible action, or bandit feedback, only the reward $r_t$ of the selected action $a_t$. For most of the chapter we will consider the bandit setting.
The regret as defined above is a random variable and we can consider the expected
regret, i.e., E[Regret]. This regret is a somewhat unachievable objective, since even
if the learner would have known the complete model, and would have selected the
optimal action in each time, it would still have a regret. This would follow from
the difference between the expectation and the realizations of the rewards. For this
reason we would concentrate on the Pseudo Regret, which compares the learner’s
expected cumulative reward to the maximum expected cumulative reward.
" T # " T #
X X
Pseudo Regret = maxE rt (i) − E rt (at )
i
t=1 t=1
T
X
∗
= µ ·T− µa t
t=1
Note that the difference between the regret and the Pseudo Regret is related to the
difference between taking the expected maximum (in Regret) versus the maximum
expectation (in the Pseudo Regret). In this chapter we will only consider the pseudo regret (and will sometimes call it simply regret).
We will use extensively the following concentration bound.
Theorem 14.1 (Hoeffding’s inequality). Given $X_1, \ldots, X_m$ i.i.d. random variables such that $X_i \in [0,1]$ and $\mathbb{E}[X_i] = \mu$, we have
$$\Pr\left[\frac{1}{m}\sum_{i=1}^{m} X_i - \mu \geq \epsilon\right] \leq \exp(-2\epsilon^2 m),$$
or alternatively, for $m \geq \frac{1}{2\epsilon^2}\log(1/\delta)$, with probability $1-\delta$ we have that $\frac{1}{m}\sum_{i=1}^{m} X_i - \mu \leq \epsilon$.
• In time t + 1 we choose the action with the highest empirical average so far:
$$a_{t+1} = \arg\max_i \mathrm{avg}_t(i).$$
We now would like to compute the expected regret of the greedy policy. W.l.o.g.,
we assume that µ1 ≥ µ2 , and define ∆ = µ1 − µ2 ≥ 0.
$$\mathrm{Pseudo\ Regret} = (\mu_1 - \mu_2)\sum_{t=1}^{\infty} \Pr\left[\mathrm{avg}_t(2) \geq \mathrm{avg}_t(1)\right].$$
Note that the above is an equivalent formulation of the pseudo regret. In each time step in which greedy selects the optimal action, the difference is clearly zero, so we can ignore those time steps. In time steps in which greedy selects the alternative action, action 2, it has a regret of $\mu_1 - \mu_2$ compared to action 1. This is why we sum, over all time steps, the probability that we select action 2 times the regret in that case, i.e., $\mu_1 - \mu_2$. Since we select action 2 at time t when $\mathrm{avg}_t(2) \geq \mathrm{avg}_t(1)$, the probability that we select action 2 is exactly the probability that $\mathrm{avg}_t(2) \geq \mathrm{avg}_t(1)$.
We would like now to upper bound the probability of avgt (2) ≥ avgt (1). Clearly,
at any time t,
E[avgt (2) − avgt (1)] = µ2 − µ1 = −∆
$$\begin{aligned}
\mathbb{E}\left[\mathrm{Pseudo\ Regret}\right] &= \Delta \sum_{t=1}^{\infty} \Pr\left[\mathrm{avg}_t(2) \geq \mathrm{avg}_t(1)\right] \\
&\leq \sum_{t=1}^{\infty} \Delta e^{-2\Delta^2 t}
\leq \int_0^{\infty} \Delta e^{-2\Delta^2 t}\, dt
= \left[-\frac{1}{2\Delta} e^{-2\Delta^2 t}\right]_0^{\infty}
= \frac{1}{2\Delta}.
\end{aligned}$$
We have established the following theorem.
Theorem 14.2. In the full information two actions multi-arm bandit model, the greedy
algorithm guarantees a pseudo regret of at most 1/2∆, where ∆ = |µ1 − µ2 |.
Notice that this regret bound does not depend on the horizon T!
14.0.2 Stochastic Multi-Arm Bandits: lower bound
We will now see that we cannot get a regret that does not depend on T for the bandit
feedback, when we observe only the reward of the action we selected.
Consider the following example. For action $a_1$ we have the following distribution,
$$a_1 \sim \mathrm{Br}\left(\tfrac{1}{2}\right).$$
For action $a_2$ there are two alternative, equally likely distributions, each with probability 1/2,
$$a_2 \sim \mathrm{Br}\left(\tfrac{1}{4}\right) \text{ w.p. } \tfrac{1}{2} \quad\text{or}\quad a_2 \sim \mathrm{Br}\left(\tfrac{3}{4}\right) \text{ w.p. } \tfrac{1}{2}.$$
In this setting, since the distribution of action $a_1$ is known, the optimal policy will select action $a_2$ for some time M (potentially, M = T is also possible) and then switch to action $a_1$. The reason is that once we switch to action $a_1$ we will not
receive any new information regarding the optimal action, since the distribution of
action a1 is known.
Let $S_i = \{t : a_t = i\}$ be the set of times where we played action i. Assume by way of contradiction that
$$\mathbb{E}\left[\sum_{i\in\{1,2\}} \Delta_i |S_i|\right] = \mathbb{E}\left[\mathrm{Pseudo\ Regret}\right] = R.$$
14.1 Explore-Then-Exploit
We will now develop an algorithm with a vanishing average regret. The algorithm
will have two phases. In the first phase it will explore each action for M times. In
the second phase it will exploit the information from the exploration, and will always
play on the action with the highest average reward in the first phase.
1. In the first kM rounds, explore by playing each of the k actions exactly M times.
2. After kM rounds we always choose the action that had the highest average reward during the explore phase.
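A sketch of this explore-then-exploit scheme (ours; `pull(i)` is an assumed interface returning a reward in [0, 1] for action i):

```python
import numpy as np

def explore_then_exploit(pull, k, T, M):
    """Phase 1: explore each of the k actions M times.
    Phase 2: play the action with the highest empirical average for the rest of the horizon."""
    sums = np.zeros(k)
    total_reward = 0.0
    for i in range(k):                     # explore phase
        for _ in range(M):
            r = pull(i)
            sums[i] += r
            total_reward += r
    best = int(np.argmax(sums / M))        # maximizer of the empirical averages mu_hat_j
    for _ in range(T - k * M):             # exploit phase
        total_reward += pull(best)
    return total_reward, best
```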
Define:
$$S_j = \{t : a_t = j,\ t \leq k\cdot M\}, \qquad \hat{\mu}_j = \frac{1}{M}\sum_{t\in S_j} r_j(t), \qquad \mu_j = \mathbb{E}[r_j(t)], \qquad \Delta_j = \mu^* - \mu_j,$$
where $\Delta_j$ is the difference between the expected reward of action j and that of the optimal action.
We can now write the regret as a function of those parameters:
$$\mathbb{E}\left[\mathrm{Pseudo\ Regret}\right] = \underbrace{\sum_{j=1}^{k} \Delta_j \cdot M}_{\text{Explore}} + \underbrace{(T - k\cdot M)\sum_{j=1}^{k} \Delta_j \Pr\left[j = \arg\max_i \hat{\mu}_i\right]}_{\text{Exploit}}.$$
Setting $\lambda = \sqrt{\frac{2\log T}{M}}$, Hoeffding’s inequality and a union bound over the actions give
$$\Pr\left[\exists j : |\hat{\mu}_j - \mu_j| \geq \lambda\right] \leq \frac{2k}{T^4} \underset{\text{for } k\leq T}{\leq} \frac{2}{T^3}.$$
Define the “bad event” $B = \{\exists j : |\hat{\mu}_j - \mu_j| \geq \lambda\}$. If B did not happen, then for each action j such that $\hat{\mu}_j \geq \hat{\mu}^*$ we have
$$\mu_j + \lambda \geq \hat{\mu}_j \geq \hat{\mu}^* \geq \mu^* - \lambda,$$
and therefore $\Delta_j = \mu^* - \mu_j \leq 2\lambda$.
Then, we can bound the expected regret as follows:
$$\begin{aligned}
\mathbb{E}\left[\mathrm{Pseudo\ Regret}\right] &\leq \underbrace{\sum_{j=1}^{k}\Delta_j M}_{\text{Explore}} + \underbrace{(T - k\cdot M)\cdot 2\lambda}_{B \text{ didn't happen}} + \underbrace{\frac{2}{T^3}\cdot T}_{B \text{ happened}} \\
&\leq k\cdot M + 2\sqrt{\frac{2\log T}{M}}\cdot T + \frac{2}{T^2}.
\end{aligned}$$
If we optimize the number of exploration phases M and choose $M = T^{2/3}$, we get:
$$\mathbb{E}\left[\mathrm{Pseudo\ Regret}\right] \leq k\cdot T^{2/3} + 2\sqrt{2\log T}\cdot T^{2/3} + \frac{2}{T^2},$$
which is sub-linear, but more than the $O(\sqrt{T})$ rate we would expect.
where the $t_\tau$’s are the rounds when we chose action i. Now we fix m and get:
$$\forall i\ \forall m: \quad \Pr\left[\left|\hat{V}_m(i) - \mu_i\right| \leq \sqrt{\frac{2\log T}{m}}\right] \geq 1 - \frac{2}{T^4}.$$
If G happened then:
$$\forall i\ \forall t: \quad \mu_i \in [LCB_t(i),\, UCB_t(i)].$$
Therefore:
$$\Pr\left[\forall i\ \forall t: \ \mu_i \in [LCB_t(i),\, UCB_t(i)]\right] \geq 1 - \frac{2}{T^2}.$$
• For each j ∈ S, if there exists i ∈ S such that $UCB_t(j) < LCB_t(i)$, we remove j from S, that is, we update $S \leftarrow S - \{j\}$.
Theorem 14.3. The pseudo regret of successive action elimination is bounded by $O\!\left(\sum_i \frac{1}{\Delta_i}\log T\right)$.
Note that the bound blows up when $\Delta_i \approx 0$. This is not really an issue, since such actions also have very small regret when we use them. Formally, we can partition the actions according to $\Delta_i$. Let $A_1 = \{i : \Delta_i < \sqrt{k/T}\}$ be the set of actions with low $\Delta_i$, and $A_2 = \{i : \Delta_i \geq \sqrt{k/T}\}$. We can now re-analyze the pseudo regret, as follows,
$$\begin{aligned}
\mathbb{E}\left[\mathrm{Pseudo\ Regret}\right] &= \sum_{i=1}^{k} \Delta_i\, n_i(T)
= \sum_{i\in A_1} \Delta_i\, n_i(T) + \sum_{i\in A_2} \Delta_i\, n_i(T) \\
&\leq \sqrt{\frac{k}{T}}\sum_{i\in A_1} n_i(T) + \sum_{i\in A_2} \frac{32}{\Delta_i}\log T + \underbrace{\frac{2}{T^2}\cdot T}_{\text{The bad event}} \\
&\leq \sqrt{\frac{k}{T}}\cdot T + 32k\sqrt{\frac{T}{k}}\log T + \frac{2}{T} \\
&\leq 34\sqrt{kT}\log T.
\end{aligned}$$
• Afterwards we choose:
$$a_t = \arg\max_i UCB_t(i).$$
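A sketch of the resulting UCB algorithm (ours; the `pull(i)` interface is assumed as before, and each action is pulled once for initialization):

```python
import numpy as np

def ucb(pull, k, T):
    """UCB for k actions over horizon T; pull(i) returns a reward in [0, 1]."""
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    total_reward = 0.0
    for t in range(T):
        if t < k:
            a = t                                       # pull each action once to initialize
        else:
            means = sums / counts
            bonus = np.sqrt(2.0 * np.log(T) / counts)   # lambda_t(i) = sqrt(2 log T / n_t(i))
            a = int(np.argmax(means + bonus))           # a_t = arg max_i UCB_t(i)
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        total_reward += r
    return total_reward, counts
```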
Using the definition of UCB and the assumption that G holds, whenever action i is chosen at time t we have
$$\mu_i + 2\lambda_t(i) \geq UCB_t(i) \geq UCB_t(i^*) \geq \mu^*.$$
Rearranging, we have
$$2\lambda_t(i) \geq \mu^* - \mu_i = \Delta_i.$$
Each time we chose action i, we could not have made a very big mistake, because
$$\Delta_i \leq 2\sqrt{\frac{2\log T}{n_t(i)}},$$
and therefore, if i is very far off from the optimal action, we would not choose it too many times. We can bound the number of times action i is used by
$$n_t(i) \leq \frac{8}{\Delta_i^2}\log T.$$
$$\begin{aligned}
\mathbb{E}\left[\mathrm{Pseudo\ Regret}\right] &= \sum_{i=1}^{k}\Delta_i\, \mathbb{E}[n_t(i)] + \underbrace{\frac{2}{T^2}\cdot T}_{\text{The bad event}} \\
&\leq \sum_{i=1}^{k} \frac{c}{\Delta_i}\log T + \frac{2}{T}.
\end{aligned}$$
14.4 From Multi-Arm Bandits to MDPs
Many of the techniques used in the case of multi-arm bandits can be extended naturally to the case of MDPs. In this section we sketch a simple extension where the dynamics of the MDP is known, but the rewards are unknown.
We first need to define the model for online learning in MDPs, which will be very similar to the one in MAB. We will concentrate on the case of a finite horizon return. The learner interacts with the MDP for K episodes.
At each episode $t \in [K]$, the learner selects a policy $\pi_t$ and observes a trajectory $(s^t_1, a^t_1, r^t_1, \ldots, s^t_T)$, where the actions are selected using $\pi_t$, i.e., $a^t_\tau = \pi_t(s^t_\tau)$.
The goal of the learner is to minimize the pseudo regret. Let $V^*(s_1)$ be the optimal value function from the initial state $s_1$. The pseudo regret is defined as
$$\mathbb{E}[\mathrm{Regret}] = \mathbb{E}\left[\sum_{t\in[K]}\left(V^*(s_1) - \sum_{\tau=1}^{T} r^t_{s^t_\tau, a^t_\tau}\right)\right].$$
We would now like to introduce a UCB-like algorithm. We will first assume that the learner knows the dynamics, but does not know the rewards. This implies that the learner, given a reward function, can compute an optimal policy.
Let $\mu_{s,a} = \mathbb{E}[r_{s,a}]$ be the expected reward for (s, a). As in the case of UCB we will define an Upper Confidence Bound for each reward. Namely, for each state s and action a we will maintain an empirical average $\hat{\mu}^t_{s,a}$ and a confidence parameter $\lambda^t_{s,a} = \sqrt{\frac{2\log KSA}{n^t_{s,a}}}$, where $n^t_{s,a}$ is the number of times we visited state s and performed action a.
We define the good event similarly to before.
In the following we use the notation V(·|R) to imply that we are using the reward
function R. We denote by R∗ the true reward function, i.e., R∗ (s, a) = E[rs,a ].
Lemma 14.8. Assume the good event G holds. Then, for any episode t we have that $V^{\pi^t}(s|\bar{R}^t) \geq V^*(s|R^*)$.
Proof. Since $\pi^t$ is optimal for the rewards $\bar{R}^t$, we have that $V^{\pi^t}(s|\bar{R}^t) \geq V^{\pi^*}(s|\bar{R}^t)$. Since $\bar{R}^t \geq R^*$, we have $V^{\pi^*}(s|\bar{R}^t) \geq V^{\pi^*}(s|R^*)$. Combining the two inequalities yields the lemma.
Optimism is a very powerful property, as it lets us bound the pseudo regret as a function of quantities we observe, namely $\bar{R}^t$, rather than unknown quantities, such as the true rewards $R^*$ or the unknown optimal policy $\pi^*$.
Lemma 14.9. Assume the good event G holds. Then,
$$\mathbb{E}[\mathrm{Regret}] \leq \mathbb{E}\left[\sum_{t\in[K]}\sum_{\tau=1}^{T} 2\lambda^t_{s^t_\tau, a^t_\tau}\right].$$
Note that
$$\mathbb{E}\left[\sum_{\tau=1}^{T} r^t_{s^t_\tau, a^t_\tau}\right] = \mathbb{E}\left[V^{\pi^t}(s_1|R^*)\right],$$
and, using Lemma 14.8, we have
$$\mathbb{E}[\mathrm{Regret}] \leq \sum_{t\in[K]}\left(\mathbb{E}\left[V^{\pi^t}(s_1|\bar{R}^t)\right] - \mathbb{E}\left[V^{\pi^t}(s_1|R^*)\right]\right) = \mathbb{E}\left[\sum_{t\in[K]}\sum_{\tau=1}^{T} (\bar{R}^t - R^*)(s^t_\tau, a^t_\tau)\right] \leq \mathbb{E}\left[\sum_{t\in[K]}\sum_{\tau=1}^{T} 2\lambda^t_{s^t_\tau, a^t_\tau}\right].$$
We are now left with upper bounding the sum of the confidence bounds. We can upper bound this sum regardless of the realization.
Lemma 14.10.
$$\sum_{t\in[K]}\sum_{\tau=1}^{T} \lambda^t_{s^t_\tau, a^t_\tau} \leq 2\sqrt{SAK\log(KSA)}.$$
In the above, τ is the index of the τ-th visit to the state-action pair (s, a) at some time t. During that visit we have that $n^t_{s,a} = \tau$. This explains the expression for the confidence intervals.
Since $1/\sqrt{x}$ is a convex function, we can upper bound the sum using the Jensen inequality, and have $\sum_{\tau=1}^{N} 1/\sqrt{\tau} \leq 2\sqrt{N}$, and hence
$$\sum_{t\in[K]}\sum_{\tau=1}^{T} \lambda^t_{s^t_\tau, a^t_\tau} \leq 2\sqrt{\log KSA}\sum_{s,a}\sqrt{2\, n^K_{s,a}}.$$
Recall that $\sum_{s,a} n^K_{s,a} = K$. This implies that $\sum_{s,a}\sqrt{2\, n^K_{s,a}}$ is maximized when all the $n^K_{s,a}$ are equal, i.e., $n^K_{s,a} = K/(SA)$. Hence,
$$\sum_{t\in[K]}\sum_{\tau=1}^{T} \lambda^t_{s^t_\tau, a^t_\tau} \leq 2\sqrt{SAK\log KSA}.$$
PAC criteria An action i is ε-optimal if $\mu_i \geq \mu^* - \epsilon$. The PAC criterion is that, given ε, δ > 0, with probability at least 1 − δ, we find an ε-optimal action.
Complexity: During phase l we have $|S_l| = \frac{k}{2^{l-1}}$ actions. We set the accuracy and confidence parameters as follows:
$$\epsilon_l = \left(\frac{3}{4}\right)^{l-1}\frac{\epsilon}{4}, \qquad \delta_l = \frac{\delta}{2^l}.$$
Algorithm 25 Best Arm Identification
1: Input: ε, δ > 0
2: Output: ā ∈ A
3: Init: $S_1 = A$, $\epsilon_1 = \epsilon/4$, $\delta_1 = \delta/2$, $l = 1$
4: repeat
5:   for all $i \in S_l$ do
6:     Sample action i for $m(\epsilon_l, \delta_l) = \frac{1}{(\epsilon_l/2)^2}\log\frac{3}{\delta_l}$ times
7:     $\hat{\mu}_i \leftarrow$ average reward of action i (only of samples during the l-th phase)
8:   end for
9:   $\mathrm{median}_l \leftarrow \mathrm{median}\{\hat{\mu}_i : i \in S_l\}$
10:  $S_{l+1} \leftarrow \{i \in S_l : \hat{\mu}_i \geq \mathrm{median}_l\}$
11:  $\epsilon_{l+1} \leftarrow \frac{3}{4}\epsilon_l$
12:  $\delta_{l+1} \leftarrow \frac{\delta_l}{2}$
13:  $l \leftarrow l + 1$
14: until $|S_l| = 1$
15: Output â where $S_l = \{\hat{a}\}$
This implies that the sum of the accuracy and confidence parameters over the phases is bounded by
$$\sum_l \epsilon_l \leq \frac{\epsilon}{4}\sum_l \left(\frac{3}{4}\right)^{l-1} \leq \epsilon, \qquad\text{and}\qquad \sum_l \delta_l \leq \sum_l \frac{\delta}{2^l} \leq \delta.$$
In phase l we have $S_l$ as the set of actions. For each action in $S_l$ we take $m(\epsilon_l, \delta_l)$ samples. The total number of samples is therefore:
$$\sum_l |S_l|\cdot\frac{4}{\epsilon_l^2}\log\frac{3}{\delta_l} = \sum_l \frac{k}{2^{l-1}}\cdot\frac{64}{\epsilon^2}\left(\frac{16}{9}\right)^{l-1}\log\frac{3\cdot 2^l}{\delta} = \frac{64k}{\epsilon^2}\sum_l \left(\frac{8}{9}\right)^{l-1}\left(\log\frac{1}{\delta} + l\log 2 + \log 3\right) = O\!\left(\frac{k}{\epsilon^2}\log\frac{1}{\delta}\right).$$
Correctness: The following lemma is the main tool in establishing the correctness of the algorithm. It shows that when we move from phase l to phase l + 1, with high probability ($1 - \delta_l$) the decrease in accuracy is at most $\epsilon_l$.
Lemma 14.12. Given $S_l$, we have
$$\Pr\Big[\underbrace{\max_{j\in S_l}\mu_j}_{\text{best action in } S_l} \leq \underbrace{\max_{j\in S_{l+1}}\mu_j}_{\text{best action in } S_{l+1}} + \epsilon_l\Big] \geq 1 - \delta_l.$$
Proof. Let $\mu^*_l = \max_{j\in S_l}\mu_j$ be the expected reward of the best action in $S_l$, and $a^*_l = \arg\max_{j\in S_l}\mu_j$ be the best action in $S_l$. Define the bad event $E_l = \{\hat{\mu}^*_l < \mu^*_l - \frac{\epsilon_l}{2}\}$. (Note that $E_l$ depends only on the action $a^*_l$.) Since we sample $a^*_l$ for $m(\epsilon_l, \delta_l)$ times, we have that $\Pr[E_l] \leq \frac{\delta_l}{3}$. If $E_l$ did not happen, we define a bad set of actions:
$$\mathrm{Bad} = \{j : \mu^*_l - \mu_j > \epsilon_l,\ \hat{\mu}_j \geq \hat{\mu}^*_l\}.$$
The set Bad includes the actions which have a better empirical average than $a^*_l$, while the difference in expectation is more than $\epsilon_l$. We would like to show that $S_{l+1} \not\subseteq \mathrm{Bad}$, and hence $S_{l+1}$ includes at least one action whose expectation is at most $\epsilon_l$ below $\mu^*_l$.
Consider an action j such that $\mu^*_l - \mu_j > \epsilon_l$. Then:
$$\Pr\Big[\hat{\mu}_j > \hat{\mu}^*_l \,\Big|\, \underbrace{\hat{\mu}^*_l \geq \mu^*_l - \tfrac{\epsilon_l}{2}}_{\neg E_l}\Big] \leq \Pr\Big[\hat{\mu}_j > \mu^*_l - \tfrac{\epsilon_l}{2} \,\Big|\, \neg E_l\Big] \leq \Pr\Big[\hat{\mu}_j \geq \mu_j + \tfrac{\epsilon_l}{2} \,\Big|\, \neg E_l\Big] \leq \frac{\delta_l}{3},$$
where the second inequality follows since $\mu^*_l - \epsilon_l/2 > \mu_j + \epsilon_l/2$, which follows since $\mu^*_l - \mu_j > \epsilon_l$.
Note that the failure probability is not negligible, and our main aim is to avoid a union bound, which would introduce a $\log k$ factor. We will show that this cannot happen to too many such actions. We bound the expectation of the size of Bad,
$$\mathbb{E}\left[|\mathrm{Bad}| \,\middle|\, \neg E_l\right] \leq |S_l|\frac{\delta_l}{3},$$
and with Markov’s inequality we get:
$$\Pr\left[|\mathrm{Bad}| \geq \frac{|S_l|}{2} \,\middle|\, \neg E_l\right] \leq \frac{\mathbb{E}\left[|\mathrm{Bad}| \,\middle|\, \neg E_l\right]}{|S_l|/2} \leq \frac{2}{3}\delta_l.$$
Therefore, with probability $1-\delta_l$: $\hat{\mu}^*_l \geq \mu^*_l - \frac{\epsilon_l}{2}$ and $|\mathrm{Bad}| < \frac{|S_l|}{2}$, and hence there exists $j \notin \mathrm{Bad}$ with $j \in S_{l+1}$.
Given the above lemma, we can conclude with the following theorem.
Theorem 14.13. The median elimination algorithm guarantees that with probability at least 1 − δ we have that $\mu^* - \mu_{\hat{a}} \leq \epsilon$.
Proof. With probability at least $1 - \sum_l \delta_l \geq 1 - \delta$ we have that during each phase l it holds that $\max_{j\in S_l}\mu_j \leq \max_{j\in S_{l+1}}\mu_j + \epsilon_l$. By summing the inequalities over the different phases, this implies that
$$\mu^* = \max_{j\in A}\mu_j \leq \mu_{\hat{a}} + \sum_l \epsilon_l \leq \mu_{\hat{a}} + \epsilon.$$
Appendix A
Dynamic Programming
In this book, we focused on Dynamic Programming (DP) for solving problems that
involve dynamical systems. The DP approach applies more broadly, and in this
chapter we briefly describe DP solutions to computational problems of various forms.
An in-depth treatment can be found in Chapter 15 of [23].
The dynamic programming recipe can be summarized as follows: solve a large
computation problem by breaking it down into sub-problems, such that the optimal
solution of each sub-problem can be written as a function of optimal solutions to
sub-problems of a smaller size. The key is to order the computation such that each
sub-problem is solved only once.
We remark that in most cases of interest, the recursive structure is not evident
or unique, and its proper identification is part of the DP solution. To illustrate this
idea, we proceed with several examples.
Fibonacci Sequence
The Fibonacci sequence is defined by:
V0 = 0
V1 = 1
Vt = Vt−2 + Vt−1 .
Our ‘problem’ is to calculate the T-th number in the sequence, $V_T$. Here, the recursive structure is easy to identify from the problem description, and a DP algorithm for computing $V_T$ proceeds as follows:
1. Set $V_0 = 0$, $V_1 = 1$.
2. For t = 2, . . . , T, set $V_t = V_{t-2} + V_{t-1}$.
Our choice of notation here matches the finite horizon DP problems in Chapter 3:
the effective ‘size’ of the problem T is similar to the horizon length, and the quantity
that we keep track of for each sub-problem V is similar to the value function. Note
that by ordering the computation in increasing t, each element in the sequence is
computed exactly once, and the complexity of this algorithm is therefore O(T).
We will next discuss problems where the DP structure is less obvious.
Maximum Contiguous Sum
Given a sequence of (possibly negative) numbers $x_1, \ldots, x_T$, the goal is to find a contiguous subsequence whose sum is maximal. An exhaustive search needs to examine $O(T^2)$ sums. We will now devise a more efficient DP solution. Let
$$V_t = \max_{1\leq t'\leq t}\sum_{\ell=t'}^{t} x_\ell$$
denote the maximal sum over all contiguous subsequences that end exactly at $x_t$. We have that
$$V_1 = x_1,$$
and
$$V_t = \max\{V_{t-1} + x_t,\ x_t\}.$$
Our DP algorithm thus proceeds as follows:
Our DP algorithm thus proceeds as follows:
1. Set V1 = x1 , π1 = 1
2. For t = 2, . . . , T, set
Vt = max{Vt−1 + xt , xt },
(
πt−1 , if Vt−1 + xt > xt
πt =
t else.
254
4. Return V ∗ = Vt∗ , tstart = πt∗ , tend = t∗ .
This algorithm requires only O(T) calculations, i.e., linear time. Note also that in
order to return the range of elements that make up the maximal contiguous sum
[tstart , tend ], we keep track of πt – the index of the first element in the maximal sum
that ends exactly at xt .
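A compact sketch of this algorithm in Python (ours; 0-based indices):

```python
def max_contiguous_sum(x):
    """Return (V*, t_start, t_end) for the maximal contiguous sum of x."""
    best_val, best_start, best_end = x[0], 0, 0
    v, start = x[0], 0                    # V_t and pi_t for the sum ending exactly at x[t]
    for t in range(1, len(x)):
        if v + x[t] > x[t]:
            v = v + x[t]                  # extend the previous subsequence
        else:
            v, start = x[t], t            # start a new subsequence at t
        if v > best_val:
            best_val, best_start, best_end = v, start, t
    return best_val, best_start, best_end
```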
A related problem is finding the longest (strictly) increasing subsequence of $x_1, \ldots, x_T$. Let $V_t$ denote the length of the longest increasing subsequence that ends exactly at $x_t$.¹ Then
$$V_1 = 1, \qquad V_t = \begin{cases} 1, & \text{if } x_{t'} \geq x_t \text{ for all } t' < t, \\ \max\{V_{t'} : t' < t,\ x_{t'} < x_t\} + 1, & \text{else}. \end{cases}$$

¹We note that this can be further improved to $O(T\log T)$. See Chapter 15 of [23].
The Knapsack problem
Given T items with values $r_1, \ldots, r_T$ and (integer) sizes $s_1, \ldots, s_T$, and a capacity C, the goal is to choose a subset of items $A \subseteq \{1, \ldots, T\}$ that maximizes $\sum_{t\in A} r_t$ subject to
$$\sum_{t\in A} s_t \leq C.$$
Note that the number of item subsets is $2^T$. We will now devise a DP solution.
Let $V(t, t')$ denote the maximal value for filling exactly capacity $t'$ with items from the set $\{1, \ldots, t\}$. If the capacity $t'$ cannot be matched by any such subset, set $V(t, t') = -\infty$. Also set $V(0, 0) = 0$, and $V(0, t') = -\infty$ for $t' \geq 1$. Then
$$V(t, t') = \max\{V(t-1, t'),\ V(t-1, t' - s_t) + r_t\},$$
which can be computed recursively for $t = 1 : T$, $t' = 1 : C$. The required value is obtained by $V^* = \max_{0\leq t'\leq C} V(T, t')$. The running time of this algorithm is O(TC). We note that the recursive computation of $V(t, t')$ requires O(C) space. To obtain the indices of the items in the optimal subset some additional book-keeping is needed, which requires O(TC) space.
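A sketch of this DP in Python (ours), using the O(C) space implementation mentioned above; sizes are assumed to be positive integers:

```python
def knapsack(sizes, values, C):
    """DP for the knapsack problem, following the recursion above.

    V[c] holds the maximal value of a subset of the items processed so far whose sizes
    sum to exactly c (-infinity if no such subset exists).
    """
    NEG_INF = float("-inf")
    V = [NEG_INF] * (C + 1)
    V[0] = 0.0                                      # V(0, 0) = 0, V(0, c) = -inf for c >= 1
    for s_t, r_t in zip(sizes, values):
        # iterate capacities downwards so each item is used at most once
        for c in range(C, s_t - 1, -1):
            if V[c - s_t] != NEG_INF:
                V[c] = max(V[c], V[c - s_t] + r_t)  # V(t, c) = max{V(t-1, c), V(t-1, c - s_t) + r_t}
    return max(V)                                   # V* = max_{0 <= c <= C} V(T, c)
```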
Further examples
Additional important DP problems include, among others:
• The Edit-Distance problem: find the distance (or similarity) between two
strings, by counting the minimal number of “basic operations” that are needed
to transform one string to another. A common set of basic operations is:
delete character, add character, change character. This problem is frequently
encountered in natural language processing and bio-informatics (e.g., DNA se-
quencing) applications, among others.
• The Matrix-Chain Multiplication problem: Find the optimal order to compute
a matrix multiplication M1 M2 · · · Mn (for non-square matrices).
Appendix B
Consider a first-order linear ordinary differential equation (ODE) of the form
$$y' = ay + b,$$
where a and b are constants. This equation can be solved using an integrating factor. The integrating factor, µ(x), is given by $\mu(x) = e^{-ax}$. Multiplying through by this integrating factor and integrating, the equation becomes:
$$e^{-ax} y = -\frac{b}{a} e^{-ax} + C,$$
where C is the constant of integration. Solving for y, we obtain:
$$y(x) = -\frac{b}{a} + C e^{ax}.$$
Note that if a < 0, we have that $\lim_{x\to\infty} y(x) = -\frac{b}{a}$ for all solutions of the ODE.
B.1.1 Systems of Linear Differential Equations
When dealing with multiple interdependent variables, we can extend the concept
of linear ODEs to systems of equations. These are particularly useful in modeling
multiple phenomena that influence each other.
Consider a system of linear differential equations represented in matrix form as
follows:
$$y' = Ay + b, \tag{B.1}$$
where y is a vector of unknown functions, A is a matrix of coefficients, and b is a
vector of constants. This compact form encapsulates a system where each derivative
of the component functions in y depends linearly on all other functions in y and
possibly some external inputs b. We shall now present the general solution to the
ODE (B.1).
Let us first define the matrix exponential.
Definition B.1. The matrix exponential, $e^{Ax}$, where A is a matrix, is defined similarly to the scalar exponential function but extended to matrices,
$$e^{Ax} = \sum_{k=0}^{\infty} \frac{x^k A^k}{k!}.$$
Differentiating the series term by term, using the power rule and the properties of matrix multiplication, we find:
$$\frac{d}{dx} e^{Ax} = \sum_{k=1}^{\infty} \frac{k\, A (Ax)^{k-1}}{k!} = A\sum_{k=1}^{\infty} \frac{(Ax)^{k-1}}{(k-1)!} = A e^{Ax}.$$
Therefore,
$$\frac{d}{dx} y(x) = A e^{Ax} y_0.$$
Substituting $y(x)$ back into the original differential equation $y' = Ay$ gives $A e^{Ax} y_0 = A y$. Since $y(x) = e^{Ax} y_0$, it follows that:
$$A e^{Ax} y_0 = A y(x).$$
To show that $e^{Ax} y_0$ is the only possible solution, note that at $x = 0$, $e^{Ax} y_0 = y_0$. Therefore, for any initial condition we have found a solution, and the uniqueness follows from Theorem B.1.
Now, for the case $b \neq 0$, let $y_p$ be such that $A y_p = -b$. Then for $y(x) = e^{Ax} y_0 + y_p$ we have $y'(x) = A e^{Ax} y_0 = A e^{Ax} y_0 + A y_p - A y_p = A y(x) + b$.
We have the following result for the system of linear differential equations in (B.1).
Theorem B.3. Consider the ODE in (B.1), and let $A \in \mathbb{R}^{N\times N}$ be diagonalizable. If all the eigenvalues of A have a negative real part, then $y = y_p$, where $A y_p = -b$, is a globally asymptotically stable solution.
Proof. We have already established that every solution is of the form $y(x) = e^{Ax} y_0 + y_p$. Let $\lambda_i, v_i$ denote the eigenvalues and eigenvectors of A. Since A is diagonalizable, we can write $y_0 = \sum_{i=1}^{N} c_i v_i$ for some coefficients $c_i$, so
$$e^{Ax} y_0 = \sum_{k=0}^{\infty} \frac{x^k A^k y_0}{k!} = \sum_{k=0}^{\infty}\sum_{i=1}^{N} \frac{x^k \lambda_i^k c_i v_i}{k!} = \sum_{i=1}^{N} e^{\lambda_i x} c_i v_i.$$
If $\lambda_i$ has a negative real part, then $\lim_{x\to\infty} e^{\lambda_i x} = 0$. Thus, if all the eigenvalues of A have a negative real part, $\lim_{x\to\infty} e^{Ax} y_0 = 0$ for all $y_0$, and the claim follows.
A similar result can be shown to hold for general (not necessarily diagonalizable) matrices. We state here a general theorem (see, e.g., Theorem 4.5 in [55]) without proof.
Bibliography
[2] Alekh Agarwal, Sham M. Kakade, and Lin F. Yang. Model-based reinforcement
learning with a generative model is minimax optimal. In Jacob D. Abernethy
and Shivani Agarwal, editors, Conference on Learning Theory, COLT, 2020.
[4] K.J. Åström and B. Wittenmark. Adaptive Control. Dover Books on Electrical
Engineering. Dover Publications, 2008.
[5] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the
multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, 2002.
[6] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Mini-
max PAC bounds on the sample complexity of reinforcement learning with a
generative model. Mach. Learn., 91(3):325–349, 2013.
[7] Andrew G. Barto and Michael O. Duff. Monte carlo matrix inversion and rein-
forcement learning. In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector,
editors, Advances in Neural Information Processing Systems 6, [7th NIPS Con-
ference, Denver, Colorado, USA, 1993], pages 687–694. Morgan Kaufmann,
1993.
[9] Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf,
Chelsea Finn, and Shimon Whiteson. A survey of meta-reinforcement learning.
arXiv preprint arXiv:2301.08028, 2023.
[10] Richard Bellman. Dynamic Programming. Dover Publications, 1957.
[11] Alberto Bemporad and Manfred Morari. Control of systems integrating logic,
dynamics, and constraints. Automatica, 35(3):407–427, 1999.
[13] Dimitri P. Bertsekas. Dynamic programming and optimal control, 3rd Edition.
Athena Scientific, 2005.
[18] Murray Campbell, A.Joseph Hoane, and Feng hsiung Hsu. Deep blue. Artificial
Intelligence, 134(1):57–83, 2002.
[19] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cam-
bridge University Press, 2006.
[20] Mmanu Chaturvedi and Ross M. McConnell. A note on finding minimum mean
cycle. Inf. Process. Lett., 127:21–22, 2017.
[21] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha
Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision trans-
former: Reinforcement learning via sequence modeling. Advances in neural
information processing systems, 34:15084–15097, 2021.
[22] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein.
Introduction to algorithms. MIT press, 2009.
[23] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to Algorithms, 3rd Edition. MIT Press, 2009.
[24] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-
horizon reinforcement learning. In Neural Information Processing Systems
(NeurIPS), 2015.
[26] Peter Dayan. The convergence of TD(λ) for general λ. Mach. Learn., 8:341–362, 1992.
[27] Peter Dayan and Terrence J. Sejnowski. TD(λ) converges with probability 1. Mach. Learn., 14(1):295–301, 1994.
[30] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and
stopping conditions for the multi-armed bandit and reinforcement learning
problems. J. Mach. Learn. Res., 7:1079–1105, 2006.
[31] Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5:1–25, 2003.
[32] John Fearnley. Exponential lower bounds for policy iteration. In Automata,
Languages and Programming (ICALP), volume 6199, pages 551–562, 2010.
[34] Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al.
Bayesian reinforcement learning: A survey. Foundations and Trends® in Ma-
chine Learning, 8(5-6):359–483, 2015.
[36] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction
techniques for gradient estimates in reinforcement learning. Journal of Machine
Learning Research, 5(9), 2004.
[37] Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual markov decision
processes. arXiv preprint arXiv:1502.02259, 2015.
[38] Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy it-
eration is strongly polynomial for 2-player turn-based stochastic games with a
constant discount factor. J. ACM, 60(1):1:1–1:16, 2013.
[39] Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the
heuristic determination of minimum cost paths. IEEE transactions on Systems
Science and Cybernetics, 4(2):100–107, 1968.
[40] Morris W Hirsch, Stephen Smale, and Robert L Devaney. Differential equa-
tions, dynamical systems, and an introduction to chaos. Academic press, 2013.
[43] Tommi S. Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the conver-
gence of stochastic iterative dynamic programming algorithms. Neural Com-
put., 6(6):1185–1201, 1994.
[45] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-
free exploration for reinforcement learning. In International Conference on
Machine Learning (ICML), 2020.
[46] Leslie Pack Kaelbling. Learning to achieve goals. In IJCAI, volume 2, pages
1094–8. Citeseer, 1993.
[47] Sham Kakade. On the sample complexity of reinforcement learning. PhD thesis,
University College London, 2003.
[48] Sham Kakade and John Langford. Approximately optimal approximate rein-
forcement learning. In Proceedings of the Nineteenth International Conference
on Machine Learning, pages 267–274, 2002.
[49] Richard M. Karp. A characterization of the minimum cycle mean in a digraph.
Discret. Math., 23(3):309–311, 1978.
[51] Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jons-
son, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In
Algorithmic Learning Theory (ALT), 2021.
[52] Michael J Kearns and Satinder Singh. Bias-variance error bounds for temporal
difference updates. In COLT, pages 142–147, 2000.
[55] H.K. Khalil. Nonlinear Systems. Pearson Education. Prentice Hall, 2002.
[56] IS Khalil, JC Doyle, and K Glover. Robust and optimal control, volume 2.
Prentice hall, 1996.
[57] Khimya Khetarpal, Matthew Riemer, Irina Rish, and Doina Precup. Towards
continual reinforcement learning: A review and perspectives. Journal of Arti-
ficial Intelligence Research, 75:1401–1476, 2022.
[59] Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. A
survey of zero-shot generalisation in deep reinforcement learning. Journal of
Artificial Intelligence Research, 76:201–264, 2023.
[60] Jon Kleinberg and Éva Tardos. Algorithm Design. Addison Wesley, 2006.
[62] H.J. Kushner and D.S. Clark. Stochastic Approximation Methods for Con-
strained and Unconstrained Systems. Springer-Verlag, New York, 1978.
[63] H.J. Kushner and G. Yin. Stochastic approximation and recursive algorithms
and applications. Springer Verlag, 2003.
[64] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Ger-
shman. Building machines that learn and think like people. Behavioral and
brain sciences, 40:e253, 2017.
[65] Abdul Latif. Banach contraction principle and its generalizations. Topics in
fixed point theory, pages 33–64, 2014.
[66] Tor Lattimore and Marcus Hutter. Near-optimal PAC bounds for discounted
mdps. Theor. Comput. Sci., 558:125–143, 2014.
[67] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University
Press, 2020.
[68] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end
training of deep visuomotor policies. Journal of Machine Learning Research,
17(39):1–40, 2016.
[69] Lihong Li. Sample Complexity Bounds of Exploration, pages 175–204. Springer
Berlin Heidelberg, Berlin, Heidelberg, 2012.
[70] Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the
complexity of solving Markov decision problems. In Conference on Uncertainty
in Artificial Intelligence (UAI), pages 394–402. Morgan Kaufmann, 1995.
[73] Omid Madani, Mikkel Thorup, and Uri Zwick. Discounted deterministic
Markov decision processes and discounted all-pairs shortest paths. ACM Trans.
Algorithms, 6(2):33:1–33:25, 2010.
[75] Shie Mannor and Nahum Shimkin. A geometric approach to multi-criterion
reinforcement learning. The Journal of Machine Learning Research, 5:325–
360, 2004.
[76] Shie Mannor and John N Tsitsiklis. Algorithmic aspects of mean–variance
optimization in Markov decision processes. European Journal of Operational
Research, 231(3):645–653, 2013.
[77] Yishay Mansour and Satinder Singh. On the complexity of policy iteration.
In Conference on Uncertainty in Artificial Intelligence (UAI), pages 401–408,
1999.
[78] Peter Marbach and John N. Tsitsiklis. Simulation-based optimization of
Markov reward processes. IEEE Trans. Autom. Control., 46(2):191–209, 2001.
[79] Peter Marbach and John N. Tsitsiklis. Approximate gradient methods in
policy-space optimization of Markov reward processes. Discret. Event Dyn.
Syst., 13(1-2):111–148, 2003.
[80] Mary Melekopoglou and Anne Condon. On the complexity of the policy im-
provement algorithm for Markov decision processes. INFORMS J. Comput.,
6(2):188–192, 1994.
[81] Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kauf-
mann, Edouard Leurent, and Michal Valko. Fast active learning for pure ex-
ploration in reinforcement learning. In International Conference on Machine
Learning (ICML), 2021.
[82] N. Metropolis and S. Ulam. The Monte Carlo method. Journal of the American
Statistical Association, 44:335–341, 1949.
[83] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Ve-
ness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland,
Georg Ostrovski, et al. Human-level control through deep reinforcement learn-
ing. Nature, 518(7540):529–533, 2015.
[84] Rémi Munos. Performance bounds in Lp-norm for approximate value iteration.
SIAM Journal on Control and Optimization, 46(2):541–561, 2007.
[85] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under
reward transformations: Theory and application to reward shaping. In Inter-
national Conference on Machine Learning, volume 99, pages 278–287, 1999.
[86] Andrew Y Ng and Stuart Russell. Algorithms for inverse reinforcement
learning. In International Conference on Machine Learning (ICML), 2000.
[87] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision pro-
cesses with uncertain transition matrices. Operations Research, 53(5):780–798,
2005.
[88] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright,
Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray,
et al. Training language models to follow instructions with human feedback.
Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[90] Ian Post and Yinyu Ye. The simplex method is strongly polynomial for de-
terministic Markov decision processes. In Symposium on Discrete Algorithms
(SODA), pages 1465–1473. SIAM, 2013.
[95] Stuart J Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Pearson, 2016.
[97] Herbert Scarf. The optimality of (s, S) policies in the dynamic inventory prob-
lem. In Kenneth J. Arrow, Samuel Karlin, and Patrick Suppes, editors, Math-
ematical Methods in the Social Sciences, chapter 13, pages 196–202. Stanford
University Press, Stanford, CA, 1959.
[98] Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and
conservative policy iteration as boosted policy search. In Machine Learning
and Knowledge Discovery in Databases: European Conference, ECML PKDD
2014, Nancy, France, September 15-19, 2014. Proceedings, Part III 14, pages
35–50. Springer, 2014.
[99] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and
Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint
arXiv:1707.06347, 2017.
[100] L. S. Shapley. Stochastic games. Proc Natl Acad Sci USA, 39:1095–1100,
1953.
[101] David Silver. UCL course on RL, 2015. https://www.davidsilver.uk/teaching/.
[103] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre,
George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Ve-
davyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe,
John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine
Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the
game of Go with deep neural networks and tree search. Nature, 529(7587):484–
489, 2016.
[104] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja
Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian
Bolton, et al. Mastering the game of Go without human knowledge. Nature,
550(7676):354–359, 2017.
[105] Satinder Singh, Tommi S. Jaakkola, Michael L. Littman, and Csaba Szepesvári.
Convergence results for single-step on-policy reinforcement-learning algorithms.
Mach. Learn., 38(3):287–308, 2000.
[106] Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing
eligibility traces. Machine Learning, 22(1-3):123–158, 1996.
[107] Aleksandrs Slivkins. Introduction to multi-armed bandits. Found. Trends
Mach. Learn., 12(1-2):1–286, 2019.
[108] Rupesh Kumar Srivastava, Pranav Shyam, Filipe Mutz, Wojciech Jaśkowski,
and Jürgen Schmidhuber. Training agents using upside-down reinforcement
learning. arXiv preprint arXiv:1912.02877, 2019.
[109] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learn-
ing in finite MDPs: PAC analysis. Journal of Machine Learning Research,
10:2413–2444, 2009.
[114] Richard S. Sutton, David A. McAllester, Satinder Singh, and Yishay Mansour.
Policy gradient methods for reinforcement learning with function approxima-
tion. In NIPS, pages 1057–1063, 1999.
[116] Istvan Szita and András Lörincz. Optimistic initialization and greediness lead
to polynomial time learning in factored MDPs. In Andrea Pohoreckyj Danyluk,
Léon Bottou, and Michael L. Littman, editors, International Conference on
Machine Learning (ICML), 2009.
[117] Istvan Szita and Csaba Szepesvári. Model-based reinforcement learning with
nearly tight exploration complexity bounds. In International Conference on
Machine Learning (ICML), 2010.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36,
pages 8423–8431, 2022.
[119] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning
domains: A survey. Journal of Machine Learning Research, 10(7), 2009.
[120] Gerald Tesauro. Temporal difference learning and TD-Gammon. Commun.
ACM, 38(3):58–68, 1995.
[121] Gerald Tesauro. Programming backgammon using self-teaching neural nets.
Artif. Intell., 134(1-2):181–199, 2002.
[122] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and
Pieter Abbeel. Domain randomization for transferring deep neural networks
from simulation to the real world. In 2017 IEEE/RSJ international conference
on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
[123] Emanuel Todorov and Weiwei Li. A generalized iterative LQG method for
locally-optimal feedback control of constrained nonlinear stochastic systems.
In Proceedings of the 2005 American Control Conference, pages 300–306. IEEE,
2005.
[124] J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with
function approximation. IEEE Trans. on Automatic Control, 42(5):674–690,
1997.
[125] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning.
Mach. Learn., 16(3):185–202, 1994.
[126] A.W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes:
With Applications to Statistics. Springer Series in Statistics. Springer, 1996.
[127] Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco A. Wiering.
A theoretical and empirical analysis of expected SARSA. In IEEE Symposium on
Adaptive Dynamic Programming and Reinforcement Learning, ADPRL 2009,
Nashville, TN, USA, March 31 - April 1, 2009, pages 177–184, 2009.
[128] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, An-
drew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds,
Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent rein-
forcement learning. Nature, 575(7782):350–354, 2019.
[129] Andrew Viterbi. Error bounds for convolutional codes and an asymptoti-
cally optimum decoding algorithm. IEEE Transactions on Information Theory,
13(2):260–269, 1967.
[132] Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial
for the Markov decision problem with a fixed discount rate. Math. Oper. Res.,
36(4):593–603, 2011.
[133] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-Agent Reinforcement
Learning: A Selective Overview of Theories and Algorithms, pages 321–384.
Springer International Publishing, Cham, 2021.
[134] Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenen-
baum, and Chuang Gan. Planning with large language models for code gener-
ation. In Proceedings of the International Conference on Learning Representa-
tions (ICLR), 2023.