EE290 Lecture 16
In this lecture, we present various Markov Decision Process (MDP) formulations, including finite-horizon
and indefinite-horizon MDPs. We also begin discussing well-known planning algorithms for MDPs,
including policy iteration and value iteration.
1 MDP Formulations
Continuing from the March 9 lecture, which introduced Bellman equations and MDP formulations, we pick up our
discussion of the latter. As a reminder, the MDP formulations we have discussed thus far are: indefinite-
horizon, undiscounted; and infinite-horizon, discounted. To recap:
• Indefinite-horizon, undiscounted: Some states are terminal, the length of the trajectory is not
fixed, and for terminal states s, the value function is V (s) = 0.
• Infinite-horizon, discounted: The length of the trajectory is infinite. The reward at timestep t is
discounted by γ^{t−1}. Because later rewards count for less, this formulation encourages the agent to finish (i.e., collect reward) sooner.
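As a quick numeric illustration of this discounting, the following snippet computes the discounted return of a short, hypothetical reward sequence (the rewards and γ = 0.9 are made up purely for illustration):

# Discounted return: sum over t of gamma**(t-1) * r_t, with timesteps starting at t = 1.
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 10.0]   # hypothetical rewards r_1, ..., r_4
ret = sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))
print(ret)                        # 1.0 + 0.9**3 * 10.0 ≈ 8.29

Because of the discount, the reward of 10 arriving at t = 4 contributes only 7.29, which is why discounting pushes the agent to collect reward sooner rather than later.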
Next, we move onto finite-horizon MDPs. In finite-horizon MDPs, all trajectories stop in precisely H
steps. There is no “natural” terminal state, but termination is enforced “externally” via this H parameter.
In particular, we discuss two variants: finite-horizon, time-invariant and finite-horizon, time-varying.
In both of these variations, the MDP can be expressed as (S, A, R, P, H), where H denotes the length of the
finite time horizon. Note that in the infinite-horizon version of this problem, we would instead have the
discount factor γ in place of H.
Now, we further discuss the two variations of finite-horizon MDP:
1. Finite-horizon, time-invariant
2. Finite-horizon, time-varying
A natural question is how sensitive these formulations are to shifting every reward by a constant. First, consider the infinite-horizon, discounted problem with R(s, a) → R(s, a) + c. Does the optimal
policy change? The answer is no: the value of every policy shifts by the same constant, so the relative values
used to compute the optimal policy do not change.
V^π(s) → V^π(s) + c/(1 − γ)
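To see where this shift comes from, expand the definition of the discounted return:
V^π_new(s) = E[ Σ_{t=1}^{∞} γ^{t−1} (R(s_t, a_t) + c) ] = V^π(s) + c Σ_{t=1}^{∞} γ^{t−1} = V^π(s) + c/(1 − γ),
which is the same additive shift for every policy π and every state s.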
Second, consider either of the finite-horizon cases. Does the optimal policy change? Again the answer is no,
for the same reason: every policy's value shifts by the same amount (now c times the remaining horizon rather than c/(1 − γ)), so the ordering of policies is preserved.
Then we arrive at the indefinite-horizon case, which is in fact sensitive to reward shifting. Consider a
Gridworld example to illustrate this. As a reminder, in Gridworld the agent receives a reward of -1 on every step before reaching
the goal, and a reward of 0 once the goal is reached. What if we shifted the rewards so that −1 → 1 and 0 → 2? Now the agent
is rewarded for wandering, and the optimal behavior is to maximize its time in Gridworld rather than head to the goal. This illustrates that in the indefinite-horizon, undiscounted setting, the sign of the reward
actually has meaning.
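As a minimal numerical sketch of this effect, consider a hypothetical 5-cell corridor with the goal at the right end (the corridor, the function name, and the loop counts below are illustrative assumptions, not the Gridworld from lecture):

# Corridor of 5 cells: the agent starts in cell 0 and the goal (terminal) is cell 4,
# so the shortest path to the goal takes 4 steps.
def corridor_return(step_reward, goal_reward, extra_loops):
    """Undiscounted return of a trajectory that wanders back and forth
    extra_loops times (2 steps per loop) before walking straight to the goal."""
    steps_before_goal = 4 + 2 * extra_loops
    return step_reward * steps_before_goal + goal_reward

# Original rewards: -1 per step, 0 at the goal -> reaching the goal quickly is best.
print(corridor_return(-1, 0, extra_loops=0))    # -4
print(corridor_return(-1, 0, extra_loops=10))   # -24

# Shifted rewards: +1 per step, +2 at the goal -> dawdling pays, and the return
# grows without bound as extra_loops increases, so the agent never wants to finish.
print(corridor_return(+1, 2, extra_loops=0))    # 6
print(corridor_return(+1, 2, extra_loops=10))   # 26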
2 Planning Algorithms for MDPs
A classical approach to planning in MDPs is (approximate) dynamic programming (DP), for which we
refer the reader to [2, 3]. However, approximate DP has recently been dominated by learning-based and statistical-analysis
perspectives. We therefore start by discussing the two other famous MDP planning algorithms:
policy iteration and value iteration. Both of them are special cases of the fixed-point iteration method.
For the policy evaluation step, the value of Q^{π_k} can be easily computed using the Bellman equation, which
gives us the following relation
Q^{π_k} = (I − γ P^{π_k})^{−1} r    (1)
where I is the identity matrix of the appropriate dimension and P^{π_k} is the transition matrix on state-action
pairs induced by the stationary policy π_k.
The policy improvement step gives us the next policy, which can be interpreted as the greedy policy with respect to the Q function obtained in the latest iteration. In other words, we have
π_{k+1}(s) = argmax_{a ∈ A} Q^{π_k}(s, a)    (2)
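As a concrete illustration, the following is a minimal numpy sketch of tabular policy iteration that alternates the evaluation step (1) and the greedy improvement step (2). For compactness it solves for V^{π_k} over states and then forms Q^{π_k}, which is equivalent to inverting (I − γP^{π_k}) in (1); the array shapes and names are illustrative assumptions rather than a reference implementation.

import numpy as np

def policy_iteration(P, R, gamma, max_iters=1000):
    """Exact policy iteration on a tabular MDP.

    P: transition probabilities, shape (S, A, S), with P[s, a, s2] = P(s2 | s, a)
    R: rewards, shape (S, A)
    gamma: discount factor in [0, 1)
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)              # arbitrary initial deterministic policy

    for _ in range(max_iters):
        # Policy evaluation: V^pi = (I - gamma * P^pi)^{-1} r^pi, then form Q from V.
        P_pi = P[np.arange(S), pi]           # (S, S) state-to-state transitions under pi
        r_pi = R[np.arange(S), pi]           # (S,) one-step rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        Q = R + gamma * P @ V                # (S, A): Q(s, a) = R(s, a) + gamma * <P(.|s, a), V>

        # Policy improvement: act greedily with respect to Q^{pi_k}.
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):       # greedy policy unchanged -> stop
            return pi, Q
        pi = new_pi
    return pi, Q

The loop stops as soon as the greedy policy stops changing, at which point, by the improvement guarantee stated next, the current policy is optimal.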
The following result analyzes the convergence of the policy iteration algorithm.
Theorem 1. The policy iteration algorithm guarantees improvement of the policy, i.e., Q^{π_k}(s, a) ≥
Q^{π_{k−1}}(s, a) for all s ∈ S, a ∈ A, and k ≥ 1, and the improvement is strict for at least one state-action pair, i.e.,
Q^{π_k}(s, a) > Q^{π_{k−1}}(s, a) for some (s, a), until the optimal policy π∗ is found.
We omit the proof of this theorem and refer the reader to [4]. Next, we analyze an upper bound on the
number of iterations required for the policy iteration algorithm to converge. Given |S| states and |A| actions,
the number of deterministic stationary policies is |A|^{|S|}. The number of iterations required for policy iteration
to converge is bounded by this number of policies, |A|^{|S|}, in the worst case. In practice, however,
the algorithm typically converges much faster than the worst case. For
instance, on Gridworld problems, policy iteration is observed to converge much faster than the value
iteration algorithm. In addition, other variants of policy iteration have been proposed in the literature, such
as policy iteration with relative entropy regularization [5], which are shown to be computationally more tractable
than the vanilla algorithm presented here.
Beyond this worst-case iteration complexity, policy iteration also admits a linear convergence bound when the
goal is to compute a near-optimal policy rather than the exact optimal policy. The following theorem
formalizes this bound.
Theorem 2. Let Q∗ be the (unique) optimal value function for a given MDP specification (S, A, P, r, γ).
Then,
||Q∗ − Q^{π_{k+1}}||_∞ ≤ γ ||Q∗ − Q^{π_k}||_∞    (3)
where the value difference Q∗ − Q^{π_{k+1}} is viewed as an S × A dimensional matrix and ||·||_∞ denotes the
supremum norm, i.e., the maximum absolute entry: ||X||_∞ = max_{i,j} |x_{ij}|.
This result gives a linear convergence bound since γ ∈ [0, 1), and it implies that the Q function of the iterates
converges exponentially fast to the optimal Q function. We also note that the convergence result of policy
iteration is expressed in terms of the Q function instead of the policy π. That is because the optimal Q
function is unique, whereas there might be multiple optimal policies corresponding to this unique
optimal Q value.
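Unrolling the bound in (3) over k improvement steps gives a concrete iteration count for reaching a target accuracy ε (a direct consequence of Theorem 2, stated here for illustration):
||Q∗ − Q^{π_k}||_∞ ≤ γ^k ||Q∗ − Q^{π_0}||_∞,
so ||Q∗ − Q^{π_k}||_∞ ≤ ε as soon as k ≥ log( ||Q∗ − Q^{π_0}||_∞ / ε ) / log(1/γ).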
To prove Theorem 2, we need two additional results, which will be covered in the next lecture.
Before presenting these results, we first introduce the Bellman optimality operator.
Definition 3. Let f be an S × A dimensional matrix. Then the Bellman optimality operator T is defined such that
T f(s, a) = R(s, a) + γ⟨P(·|s, a), V_f⟩,  where V_f(s′) = max_{a′ ∈ A} f(s′, a′)    (4)
For f = Q∗, the Bellman operator gives us the sum of two terms: i) the instantaneous reward obtained
by implementing action a at state s, and ii) the expectation of the future values under the greedy policy induced by
the function f. Next, we define the Bellman operator for an arbitrary policy π, T^π.
Definition 4. Let f be an S × A dimensional matrix. Then the Bellman operator T^π for an arbitrary policy π
is defined such that
T^π f(s, a) = R(s, a) + γ⟨P(·|s, a), V_f^π⟩    (5)
where V_f^π(s′) = f(s′, π(s′)).
Both of these Bellman operators are helpful for explicitly writing down the Bellman equations for policy
evaluation and optimization in the policy iteration algorithm. Specifically, the Bellman optimality operator T is
used for policy optimization, whereas the Bellman operator T^π, for an arbitrary policy π, is used for policy
evaluation. The respective Bellman equations for policy evaluation and optimization are given as
Q^π = T^π Q^π   and   Q∗ = T Q∗,
i.e., Q^π and Q∗ are fixed points of T^π and T, respectively.
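For concreteness, here is a small numpy sketch of both operators acting on a tabular f of shape (S, A); the array shapes and function names mirror the policy iteration sketch above and are illustrative assumptions.

import numpy as np

def bellman_optimality_operator(f, P, R, gamma):
    """(T f)(s, a) = R(s, a) + gamma * <P(.|s, a), V_f>, with V_f(s') = max_a' f(s', a')."""
    V_f = f.max(axis=1)              # greedy value of each next state under f
    return R + gamma * P @ V_f       # shape (S, A)

def bellman_policy_operator(f, P, R, gamma, pi):
    """(T^pi f)(s, a) = R(s, a) + gamma * <P(.|s, a), V_f^pi>, with V_f^pi(s') = f(s', pi(s'))."""
    S = f.shape[0]
    V_f_pi = f[np.arange(S), pi]     # value of each next state when following pi
    return R + gamma * P @ V_f_pi    # shape (S, A)

# Sanity checks of the Bellman equations: once Q^pi (e.g., from equation (1)) or Q* is available,
# np.allclose(Q_pi, bellman_policy_operator(Q_pi, P, R, gamma, pi)) and
# np.allclose(Q_star, bellman_optimality_operator(Q_star, P, R, gamma)) should both hold.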
References
[1] T. M. Moerland, J. Broekens, and C. M. Jonker, “A framework for reinforcement learning and planning,”
ArXiv, vol. abs/2006.15009, 2020.
[2] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley
& Sons, 2007, vol. 703.
[3] W. Powell, “What you should know about approximate dynamic programming,” Naval Research Logistics,
vol. 56, pp. 239–249, 2009.
[4] A. Agarwal, N. Jiang, and S. M. Kakade, “Reinforcement learning: Theory and algorithms,” 2019.
[5] A. Abdolmaleki, J. T. Springenberg, J. Degrave, S. Bohez, Y. Tassa, D. Belov, N. Heess, and M. Ried-
miller, “Relative entropy regularized policy iteration,” arXiv preprint arXiv:1812.02256, 2018.