
20 Markov Decision Process

The main goal of this chapter is to introduce a principled framework for sequential decision
making, namely the Markov Decision Process (MDP). Specifically, this chapter
• introduces the Markov Decision Process (MDP) framework,

• models real-world sequential decision-making problems using an MDP,

• discusses the finite horizon and infinite horizon classes of MDPs,

• presents the Bellman optimality equation and its numerical solution.

Overview
Consider the following classical dice game, where in each round you choose to either (stay and) play the game or quit. If you quit, you get $10 and the game ends. If you choose to stay and play, you get $4 and a die is rolled. If the die comes up as 1 or 2, the game ends; otherwise the game goes to the next round. There are several important aspects of the game above that merit attention:
1. Probabilistic transition: The "state" of the game is not deterministic; it depends on the decision of the player and probabilistically on the outcome of the die roll.

2. Cumulative reward: The player is, obviously, interested in choosing decisions that maximize the cumulative payout at the end of the game (or for as long as the player can play). It is straightforward to argue that the optimal decision is to stay and play at every round: the game continues with probability 2/3, so the expected number of rounds is 3 and the expected payout of always playing is $12, which exceeds the $10 from quitting. A short simulation of this game is sketched after this list.
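The claim above can also be checked by simulation. The sketch below (an illustrative sketch, not part of the original text) estimates the expected payout of the "always stay" policy and compares it with quitting immediately.

```python
import random

def play_dice_game(stay_policy=True, rounds_cap=10_000):
    """Simulate one play of the dice game: quitting pays $10; staying pays $4
    per round and the game ends when the die shows 1 or 2."""
    total = 0
    for _ in range(rounds_cap):
        if not stay_policy:
            return total + 10          # quit: collect $10 and stop
        total += 4                      # stay: collect $4
        if random.randint(1, 6) <= 2:   # die shows 1 or 2: game ends
            return total
    return total

n = 100_000
always_stay = sum(play_dice_game(True) for _ in range(n)) / n
quit_now = sum(play_dice_game(False) for _ in range(n)) / n
print(f"always stay = {always_stay:.2f}, quit immediately = {quit_now:.2f}")
# Expected output: roughly 12.0 versus 10.0
```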

The above aspects are the intrinsic properties of a sequential decision-making problem: A
decision maker (or player) is offered a choice at every time instant (or round) where the
state of the system is probabilistic and the decision maker chooses a decision to maximize
the cumulative reward. Note that the decision choice affects not only the reward at the current time but also future choices, and hence the cumulative reward.
The sequential decision-making problem appears in a variety of contexts:
• Robotics: Deciding when and where to move, while actuators and sensors can fail and unexpected obstacles can appear.

• Resource allocation: Deciding what and how much to produce in a factory, with
uncertainty over customer demand for various products.

• Agriculture: Deciding what to plant, while the weather conditions, and hence the crop yield, are unknown.

In this chapter, we introduce the Markov Decision Process, a class of sequential decision-making problems where the probabilistic transitions obey the Markov property. Before we start, Section 20.1 provides an overview of Markov chains. As a prelude to the Markov decision process in Section 20.3, we discuss the Markov reward process in Section 20.2. Finally, Section 20.4 presents numerical algorithms for solving the Markov decision process.

20.1 Markov chain


Let $X_1, X_2, \cdots, X_N$ be a set of random variables indexed by time. Consider the conditional distribution $P(X_n \mid X_1, \cdots, X_{n-1})$ at time $n$, i.e. the conditional distribution of the "current" random variable conditioned on the history $(X_1, \cdots, X_{n-1})$. If the above conditional distribution is the same as $P(X_n \mid X_{n-1})$, i.e. conditioned on the random variable at the previous time instant the conditional distribution is independent of the history, then we say that the set of random variables satisfies the Markov property.

Markov Property: $$P(X_n \mid X_1, \cdots, X_{n-1}) = P(X_n \mid X_{n-1}) \qquad (1)$$

A process that obeys the Markov property and takes values in a finite set $S$ is called a Markov chain. The state of a Markov chain at time $n$ is the value of $X_n$, and hence $S$ is referred to as the state space. A Markov chain can be characterized by $P(X_n \mid X_{n-1})$, known as the one step transition probability (or simply transition probability). An alternate convenient notation for the transition probability is given below:

$$P^n_{i,j} = P(X_n = j \mid X_{n-1} = i). \qquad (2)$$

Note that in the definition in (2), the transition probabilities depend on the time index $n$. For a Markov chain with stationary transition probabilities, $P^n_{i,j}$ is independent of $n$. In this book, we will only deal with Markov chains with stationary transition probabilities.
Example 1 (Gambling model) Consider a gambler who, at each play of the game, wins $1 with probability p and loses $1 with probability 1 − p. The gambler stops gambling when either he is broke or when he reaches $N.
The gambler's fortune at each time, X_n, can be modeled as a Markov chain with X_n ∈ {0, 1, ⋯, N−1, N}, since the gambler stops playing either when he is broke (fortune of 0) or when he reaches a fortune of $N.
The transition probabilities of the Markov chain are given by:

$$P_{i,j} = \begin{cases} p & j = i+1 \\ 1-p & j = i-1 \\ 0 & \text{otherwise} \end{cases} \quad i = 1, 2, \cdots, N-1, \qquad P_{0,0} = P_{N,N} = 1.$$

Transition Probability Matrix of a Markov Chain: Sometimes it is convenient to write the transition probabilities of the Markov chain in matrix form, $P = [P_{i,j}]$, where the rows are indexed by the previous state and the columns are indexed by the next state. For example, the transition probability matrix of the Markov chain in Example 1 for p = 0.4 and N = 4 is given by

$$P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0.6 & 0 & 0.4 & 0 & 0 \\ 0 & 0.6 & 0 & 0.4 & 0 \\ 0 & 0 & 0.6 & 0 & 0.4 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$

Transition diagram of a Markov Chain: A transition diagram is another way of visualizing the Markov chain. For example, the transition diagram of the Markov chain in Example 1 is given in Figure 1.

Figure 1 Transition diagram of the Markov chain in Example 1 for N = 4 and p = 0.4
Here the circles denote the possible states of the Markov chain and the arrows represent the transitions from one state to another, with the probability of transition indicated beside the arrow.
Properties of Markov Chain: An important property of the Markov chain is given in (3). The property is intuitive: given that the Markov chain is in state i, in the next time step it must transition to one of the possible states.

$$\sum_{j} P_{i,j} = 1 \qquad (3)$$

A consequence of the above property is that the rows of the transition probability matrix sum to 1, i.e. P is a stochastic matrix.
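As an illustration, the sketch below builds the transition probability matrix of the gambling model in Example 1 (for the illustrative values p = 0.4 and N = 4 used above) and checks property (3); this is an illustrative sketch, not part of the original text.

```python
import numpy as np

def gambler_transition_matrix(p: float, N: int) -> np.ndarray:
    """Transition matrix of the gambler's fortune chain on states 0, ..., N."""
    P = np.zeros((N + 1, N + 1))
    P[0, 0] = 1.0          # broke: absorbing state
    P[N, N] = 1.0          # reached $N: absorbing state
    for i in range(1, N):
        P[i, i + 1] = p        # win $1
        P[i, i - 1] = 1 - p    # lose $1
    return P

P = gambler_transition_matrix(p=0.4, N=4)
print(P)
# Every row sums to 1, i.e. P is a stochastic matrix (property (3)).
assert np.allclose(P.sum(axis=1), 1.0)
```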

20.2 Markov reward process


A Markov reward process is a Markov chain augmented with a reward function. Formally, a
Markov reward process is a tuple ⟨ S , P , r , γ ⟩,
• S and P are the state space and the transition probability of the Markov chain,
respectively.

• r : S → ℝ is the reward function, which maps the state of the Markov chain to a real value.

• γ ∈ [ 0,1 ] is the discount factor. The role of the discount factor will be discussed later in
the section.

The Gambling model in Example 1 is an example of a Markov reward process. If the reward at time t corresponds to the wealth of the gambler, then r(s_t) = s_t.

As mentioned in the introduction to this chapter, decision makers are interested in maximizing the cumulative reward. Even though decision making is not part of a Markov reward process, we can formally define the cumulative reward as follows.

Given a sequence of states τ_t = {s_t, s_{t+1}, ⋯, s_T} starting at time t for a time horizon of T < ∞, the cumulative reward is

$$G_t^T(\tau_t) = r_t + r_{t+1} + \cdots + r_T, \qquad (4)$$

where r_t = r(s_t).

Remark 1: r_t = r(s_t) is the immediate reward and the others are future rewards. Hence, the cumulative reward G_t^T is the sum of the immediate reward and the future rewards.

The above formulation for t ≤ T < ∞ is referred to as the finite horizon Markov reward process. In contrast, the case T → ∞ is referred to as the infinite horizon Markov reward process.
The cumulative reward for the infinite horizon case is defined as follows.

Infinite horizon: Given a sequence of states τ_t = {s_t, s_{t+1}, ⋯} starting at time t, the cumulative reward is

$$G_t(\tau_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \qquad (5)$$

Note that, in contrast to the finite horizon definition in (4), the definition for the infinite horizon case in (5) includes the discount factor γ.

Why discount factor?


1. Technical requirement: This is perhaps the most important reason for including the discount factor. As T → ∞ in (4), the sum G_t^T can grow unbounded (or, in engineering terminology, could go to infinity). It is easy to see that, when a discount factor γ < 1 is included,

$$G_t < \frac{r_{\max}}{1-\gamma}, \quad \text{where } r_{\max} = \max_s r(s).$$

2. Economic interpretation: This arises out of the preferences of human decision makers. As humans, we prefer to receive $1 today instead of tomorrow. For example, if we were to deposit the $1 received today in a bank, the bank pays a fixed interest and the value of the deposit tomorrow is higher than $1.

In addition, the discount factor provides a methodology to model the short-term and long-term ambitions of decision makers. As γ → 0, decision makers are concerned only with the "short term" immediate rewards; as γ → 1, decision makers are concerned with the "long term" cumulative reward, where they could sacrifice short-term rewards to improve the long-term cumulative reward.

3. Game theoretic interpretation: This interpretation results from thinking of the Markov reward process as a game played between the environment and a player. The player obtains a reward based on the state of the environment. At each time, with some probability 1 − γ, the environment may decide to stop playing the game; the expected reward under this setting is the same as in the formulation in (5). For example, the movement of a stock can be interpreted as a Markov reward process; there is some small, but finite, probability that a financial crisis could cause the stock to go "under".

Since many of the reasons for including a discount factor are also valid for the finite horizon case, we will include a discount factor for the finite horizon as well.

Finite horizon with discount factor: Given a sequence of states τ_t = {s_t, s_{t+1}, ⋯, s_T} starting at time t for a time horizon of T < ∞, the cumulative reward is

$$G_t^T(\tau_t) = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t} r_T, \qquad (6)$$

where r_t = r(s_t).
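The discounted cumulative reward in (6) is a one-line computation; the sketch below shows it for an arbitrary reward sequence (the numbers are made up for illustration).

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_t + gamma*r_{t+1} + ... + gamma^{T-t}*r_T for a list of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: rewards observed from time t onward, with discount factor 0.9.
print(discounted_return([1.0, 0.0, 5.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81*5 + 0.729*2
```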

20.2.1 Value Function and Bellman’s Theorem


How do we characterize the various states in a Markov reward process? It might be the case that in some states the reward is high, but the system "quickly" transitions to states with lower reward, leading to a lower cumulative reward. In other states, even though the reward might be low, the system might transition to higher-rewarding states more often, leading to a higher cumulative reward. One way to characterize the states in a Markov reward process is to find the expected long-term reward that can be accumulated starting from any state, referred to as the value function. Formally, the value function V_t(s_t) of state s_t at time t is defined as the total expected reward accrued from time t onward.

20.2.1.1 Finite Horizon Markov reward process


For the finite horizon reward process, the value function in terms of the cumulative reward is given in (7).

$$V_t(s_t) = E_{\tau_t \mid s_t}\left[ G_t^T \right]. \qquad (7)$$

The expectation in (7) is taken with respect to all the trajectories starting with state s_t at time t, for a horizon of T.
From (7), it looks like finding the value function is quite complicated. Fortunately,
Bellman’s theorem provides a method to compute the value functions.
Theorem 1 Bellman's Theorem (finite horizon Markov reward process)
The value function for a finite horizon Markov reward process is given by

$$V_t(s) = r(s) + \gamma \sum_{s'} P_{s,s'} V_{t+1}(s') \qquad (8)$$

$$V_T(s) = r(s) \qquad (9)$$

Proof. It is straightforward to note that V_T(s) = r(s). For t < T,

$$
\begin{aligned}
V_t(s) = V_t(s_t = s) &= E_{\tau_t^T \mid s_t = s}\left[ G_t^T(\tau_t^T) \right] \\
&= \sum_{\tau_t^T \mid s_t = s} P\left[\tau_t^T \mid s_t = s\right] G_t^T(\tau_t^T) \\
&= r(s) + \gamma \sum_{\tau_{t+1}^T \mid s_t = s} P\left[\tau_{t+1}^T \mid s_t = s\right] G_{t+1}^T(\tau_{t+1}^T) \\
&= r(s) + \gamma \sum_{s'} P_{s,s'} \sum_{\tau_{t+1}^T \mid s_{t+1} = s'} P\left[\tau_{t+1}^T \mid s_{t+1} = s', s_t = s\right] G_{t+1}^T(\tau_{t+1}^T) \\
&= r(s) + \gamma \sum_{s'} P_{s,s'} \sum_{\tau_{t+1}^T \mid s_{t+1} = s'} P\left[\tau_{t+1}^T \mid s_{t+1} = s'\right] G_{t+1}^T(\tau_{t+1}^T) \\
&= r(s) + \gamma \sum_{s'} P_{s,s'} V_{t+1}(s').
\end{aligned}
$$

The decomposition over the next state s' is due to the law of total probability, and dropping the conditioning on s_t = s uses the Markov property.
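The backward recursion in Theorem 1 is straightforward to implement. The sketch below computes V_t for all t on a small made-up Markov reward process; the transition matrix, reward vector, horizon, and discount factor are illustrative assumptions, not from the text.

```python
import numpy as np

def finite_horizon_mrp_values(P, r, T, gamma=1.0):
    """Backward recursion of Theorem 1: V_T = r, then V_t = r + gamma * P V_{t+1}."""
    V = np.zeros((T + 1, len(r)))
    V[T] = r                                  # equation (9)
    for t in range(T - 1, -1, -1):
        V[t] = r + gamma * P @ V[t + 1]       # equation (8)
    return V

# Illustrative 2-state chain: state 0 moves to state 1 half the time; state 1 is absorbing.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
r = np.array([0.0, 1.0])
print(finite_horizon_mrp_values(P, r, T=5, gamma=0.9))
```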

20.2.1.2 Infinite Horizon Markov reward process


Before we proceed to the Bellman theorem for the infinite horizon, note that in the infinite horizon the following property holds.
Property 1 In an infinite horizon Markov reward process, for any times t and t',

$$V_t(s) = V_{t'}(s) \quad \forall s \in S. \qquad (10)$$

Property 1 states that the value function in an infinite horizon Markov reward process is stationary, i.e. it doesn't depend on the time. The proof is straightforward, and the main idea is that since there is an infinite amount of time left, it doesn't matter when the value function is evaluated. Now we are ready to state Bellman's theorem for the infinite horizon Markov reward process.
Theorem 2. Bellman's Theorem (infinite horizon discounted Markov reward process)
The value function is given by

$$V(s) = r(s) + \gamma \sum_{s'} P_{s,s'} V(s'). \qquad (11)$$

The proof proceeds exactly as in the finite horizon case and hence is omitted.
Remark 2. The value function is the fixed point of the set of equations in (11). The existence of the fixed point can be shown by using the Banach fixed-point theorem and by showing that the right-hand side of (11) is a contraction mapping.
Example 2 Consider the two-state Markov chain in Figure 2. The reward function is indicated alongside each state. Assume an infinite horizon discounted Markov reward process. Compute the value function for each of the two states.

Figure 2 Transition diagram of a 2-state Markov reward process
If the Markov chain starts in state 1, then there is only one sequence of states possible (and it occurs with probability 1), and the cumulative reward is given by

$$G_t \mid \{S_t = 1\} = 1 + \gamma \times 1 + \gamma^2 \times 1 + \cdots = \frac{1}{1-\gamma}. \qquad (12)$$

Hence, V(s=1) = E[G_t | {S_t = 1}] = 1/(1−γ).

Starting from state 0, the cumulative reward depends on how long the chain stays in state 0 before moving to state 1: with probability 0.5, G_t | {S_t = 0} = 0 + γ×1 + γ²×1 + ⋯ = γ/(1−γ); with probability 0.25, it equals 0 + γ×0 + γ²×1 + ⋯ = γ²/(1−γ); with probability 0.125, it equals γ³/(1−γ); and so on.

Compared to G_t | {S_t = 1} in (12), G_t | {S_t = 0} is a random variable (this is usually the case). Hence,

$$V(s=0) = E\left[G_t \mid \{S_t = 0\}\right] = 0.5 \times \frac{\gamma}{1-\gamma} + 0.25 \times \frac{\gamma^2}{1-\gamma} + \cdots = \frac{1}{1-\gamma}\cdot\frac{0.5\gamma}{1-0.5\gamma}.$$

It is easy to see that V(s=0) is less than V(s=1) for all values of γ. This is intuitive, since state 1 has a higher reward and there is less uncertainty about the reward.
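To check Example 2 numerically, the fixed point of (11) can be obtained by solving the linear system (I − γP)V = r directly. The sketch below does this for the two-state chain, assuming (as suggested by Figure 2) that state 0 has reward 0 and moves to state 1 with probability 0.5, while state 1 has reward 1 and is absorbing.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5],   # state 0: stay with prob 0.5, move to state 1 with prob 0.5
              [0.0, 1.0]])  # state 1: absorbing
r = np.array([0.0, 1.0])

# Bellman equation (11): V = r + gamma * P V  <=>  (I - gamma*P) V = r
V = np.linalg.solve(np.eye(2) - gamma * P, r)
print(V)  # V(0) = 0.5*gamma / ((1-gamma)(1-0.5*gamma)) ~ 8.18,  V(1) = 1/(1-gamma) = 10
```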

20.3 Markov Decision Process


A Markov decision process is an extension of the Markov reward process, where the rewards and state transitions can be influenced by the actions of the decision maker. Formally, a Markov decision process is a tuple ⟨S, P, r, A, γ⟩,
• S is the state space of the Markov chain.

• A is the action space of the decision maker, i.e. the set of actions available to the
decision maker. For example, in the Gambling model in Example 1 the action space is
either to gamble or not. Hence, A={ Gamble, Not Gamble }.

When the action space is dependent on the state of the Markov chain, it is denoted
by  A( s).

• P(⋅) is the transition probability of the Markov chain. However, in contrast to the Markov reward process, the transition probability also depends on the action of the decision maker. For example, P_{i,j}(a) is the probability of transition from state i to state j when action a is taken.

• r : S × A → ℝ is the reward function, which maps the state of the Markov chain and the action taken by the decision maker to a real value. In contrast to the Markov reward process, the reward depends not only on the state but also on the action of the decision maker.

• As in Markov reward process, γ ∈ [ 0,1 ] is the discount factor.


Remark 3. An equivalent formulation also considers the case where the reward function,
in addition to the current state and action of the decision maker, also depends on the next
state. The reader is asked to show the equivalence of the two formulations in Exercise 8 .
As in the Markov reward process, we proceed by considering the finite horizon and the
infinite horizon cases, in turn.

20.3.1 Finite Horizon MDP


In the case of a finite horizon MDP, the decision maker accrues reward for a finite number of time steps. However, in contrast to the Markov reward process, the actions of the decision maker also influence the reward.
Policy: The "policy" of the decision maker prescribes an action to the decision maker. Since the objective of the decision maker is to maximize the expected cumulative reward, an optimal policy is one that maximizes the expected cumulative reward. For a finite horizon MDP, it is sufficient to consider policies that are a deterministic function of the current state, i.e.

$$a_k = \mu_k(s_k), \qquad (13)$$

where μ_k is the policy at time k. The policy of the decision maker for a finite horizon MDP is the vector

$$\mu = (\mu_0, \mu_1, \ldots, \mu_{N-1}), \qquad (14)$$

where μ_k, as defined in (13), is the policy at time k.


An optimal policy μ* is one that maximizes the expected cumulative reward,

$$\mu^* = \arg\max_{\mu} \; E\left[\sum_{k=0}^{N-1} r_k\left(s_k, \mu_k(s_k)\right) + r_N(s_N)\right]. \qquad (15)$$

For notational convenience, define

$$J_\mu(x) = E\left[\sum_{k=0}^{N-1} r_k\left(s_k, \mu_k(s_k)\right) + r_N(s_N) \,\middle|\, s_0 = x\right]. \qquad (16)$$

J_μ(x) is the expected cumulative reward for a policy μ with the Markov chain starting at state x. Hence, the optimal policy in (15) is simply the policy, among all policies, that maximizes the expected cumulative reward for all possible starting states.

Finding the optimal policy for an N horizon Markov decision problem amounts to finding an optimal mapping from state to action for N time instants. From the optimization problem in (15), finding the optimal policy looks complicated. Fortunately, Bellman's dynamic programming theorem provides a recursive solution to compute the optimal policy.
Theorem 3 Bellman's Theorem (finite horizon discounted Markov decision process)

$$J_k(s) = \max_{a \in A}\left[ r_k(s, a) + \gamma \sum_{s'} P_{s,s'}(a) J_{k+1}(s') \right] \qquad (17)$$

$$\mu_k(s) = \arg\max_{a \in A}\left[ r_k(s, a) + \gamma \sum_{s'} P_{s,s'}(a) J_{k+1}(s') \right]$$

with J_N(s) = r_N(s).


The proof of Theorem 3 is similar to that of Theorem 1, the Bellman theorem for the Markov reward process; hence, the proof is omitted. In subsequent chapters, we will extensively use the Q-function notation. The Q-function for the finite horizon discounted MDP is as below:

$$Q_k(s, a) = r_k(s, a) + \gamma \sum_{s'} P_{s,s'}(a) J_{k+1}(s'). \qquad (18)$$

Using the Q-function notation, (17) can be written as:

$$J_k(s) = \max_{a \in A} Q_k(s, a), \qquad \mu_k(s) = \arg\max_{a \in A} Q_k(s, a). \qquad (19)$$
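The recursion in Theorem 3 translates directly into code. The sketch below is a generic backward induction for a tabular finite horizon MDP; the (A, S, S) transition tensor, the time-invariant reward array, and the zero terminal reward are simplifying assumptions for illustration, not part of the original text.

```python
import numpy as np

def backward_induction(P, r, N, gamma=1.0):
    """Finite horizon dynamic programming (Theorem 3).

    P: array of shape (A, S, S), P[a, s, s'] = transition probability under action a.
    r: array of shape (S, A), reward for taking action a in state s
       (assumed time-invariant, with terminal reward r_N(s) = 0 for simplicity).
    Returns J[k, s] and a greedy policy mu[k, s] for k = 0, ..., N-1.
    """
    A, S, _ = P.shape
    J = np.zeros((N + 1, S))           # J[N] = terminal reward, taken to be 0 here
    mu = np.zeros((N, S), dtype=int)
    for k in range(N - 1, -1, -1):
        # Q[s, a] = r(s, a) + gamma * sum_{s'} P_{s,s'}(a) J_{k+1}(s'), as in (18)
        Q = r + gamma * np.einsum('ast,t->sa', P, J[k + 1])
        J[k] = Q.max(axis=1)           # equation (19)
        mu[k] = Q.argmax(axis=1)
    return J, mu

# Example usage with a random 2-action, 3-state MDP (made-up numbers).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(2, 3))     # shape (A=2, S=3, S'=3)
r = rng.uniform(size=(3, 2))
J, mu = backward_induction(P, r, N=4, gamma=0.9)
print(J[0], mu[0])
```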

Example 3. Consider the Markov decision problem in Figure 3. Two actions a_1 and a_2 are possible in state 1. No actions are possible in states 2 and 3. The state transition arrows are marked with the action and the probability of transition associated with that action. Assume no discounting (γ = 1). Find the optimal policy for a horizon of T = 2, starting from t = 0.

Figure 3 Transition diagram of the Markov decision problem in Example 3
At time T = 2, J_2(1) = 1, J_2(2) = 5, J_2(3) = −60.
For time t = 1, for states 2 and 3:

$$J_1(3) = r(3) + \sum_{s} P_{3,s} J_2(s) = r(3) + 1 \cdot J_2(1) = -60 + 1 = -59$$

$$J_1(2) = r(2) + \sum_{s} P_{2,s} J_2(s) = r(2) + 0.8 J_2(2) + 0.2 J_2(3) = 5 + 0.8 \times 5 + 0.2 \times (-60) = -3$$

For time t = 1 and state 1, since two actions are possible, we consider each of them in turn. If action a_1 is chosen:

$$Q_1(1, a_1) = r(1) + \sum_{s} P_{1,s}(a_1) J_2(s) = r(1) + 1 \cdot J_2(1) = 1 + 1 = 2.$$

Similarly, if action a_2 is chosen:

$$Q_1(1, a_2) = r(1) + \sum_{s} P_{1,s}(a_2) J_2(s) = r(1) + 1 \cdot J_2(2) = 1 + 5 = 6.$$

$$J_1(1) = \max_{a} Q_1(1, a) = 6, \qquad \mu_1(1) = \arg\max_{a} Q_1(1, a) = a_2.$$

Finding the Q-values and the optimal policy at time t = 0 is left as an exercise.

20.3.1.1 Forward recursion formulation


Consider the following change of notation:

$$V_n(s) = J_{N-n}(s), \quad 0 \le n < N, \qquad V_0(s) = J_N(s) = r_N(s). \qquad (20)$$

V_n(⋅) is referred to as the value function of the Markov decision process. Hence, the Bellman dynamic programming equation using the value function notation is given as:

$$V_n(s) = \max_{a \in A}\left[ r_{N-n}(s, a) + \gamma \sum_{s'} P_{s,s'}(a) V_{n-1}(s') \right] \qquad (21)$$

with V_0(s) = r_N(s).

20.3.2 Infinite Horizon MDP


For the infinite horizon MDP, the expected cumulative reward is

$$J_\mu(x) = E_\mu\left[ \sum_{k=0}^{\infty} \gamma^k r\left(s_k, \mu_k(s_k)\right) \,\middle|\, s_0 = x \right]. \qquad (22)$$

As in the finite horizon MDP, in the infinite horizon formulation the policy is given by

$$\mu = (\mu_0, \mu_1, \cdots), \qquad (23)$$

where μ_k, the policy at time k, is the mapping from state to action. The optimal infinite horizon policy μ* = (μ_0*, μ_1*, ⋯) is the infinite horizon policy that maximizes J_μ(x) in (22) for all possible starting states, i.e.

$$\mu^* = \arg\max_{\mu} J_\mu(x). \qquad (24)$$

From (23), it looks like solving an infinite horizon MDP requires finding infinitely many policies. However, the following property states that solving an infinite horizon MDP requires finding only a single policy.
Property 2. For an infinite horizon discounted MDP, there exists a stationary optimal policy, i.e. μ* = (μ*, μ*, ⋯).
The intuition behind Property 2 is similar to the stationarity of the value function in the infinite horizon discounted Markov reward process: since there is an infinite amount of time left, it doesn't matter when we start the decision making.
Using the finite horizon Bellman equation, we have

$$J_k(s) = \max_{a}\left[ \gamma^k r(s, a) + \sum_{s'} P_{s,s'}(a) J_{k+1}(s') \right] \qquad (25)$$

with $J_{N=\infty}(s) = \lim_{k \to \infty} \gamma^k r(s, a) = 0$.

Similar to the finite horizon case, we change to the value function notation using V_n(s) = γ^{n−N} J_{N−n}(s):

$$V_n(s) = \max_{a}\left[ r(s, a) + \gamma \sum_{s'} P_{s,s'}(a) V_{n-1}(s') \right] \qquad (26)$$

with V_0(s) = 0.
With this, we are now ready to state the Bellman equation for infinite horizon MDP.
Bellman's dynamic programming for the infinite horizon discounted Markov decision process
Theorem 4 The optimal policy μ* and the optimal value function V* are the unique solution of the following set of equations:

$$V^*(s) = \max_{a}\left[ r(s, a) + \gamma \sum_{s'} P_{s,s'}(a) V^*(s') \right] \qquad (27)$$

$$\mu^*(s) = \arg\max_{a}\left[ r(s, a) + \gamma \sum_{s'} P_{s,s'}(a) V^*(s') \right] \qquad (28)$$

Similar to the finite horizon formulation, rewriting in Q-notation:

$$Q(s, a) = r(s, a) + \gamma \sum_{s'} P_{s,s'}(a) V^*(s') \qquad (29)$$

$$\mu^*(s) = \arg\max_{a} Q(s, a) \qquad (30)$$

$$V^*(s) = \max_{a} Q(s, a) \qquad (31)$$

Example 4. This example considers a simplified model of the classical decision-making problem of selling a house. There are B types of buyers. For simplicity, a buyer of type i bids i for the house. In addition, a buyer is of type i with probability p_i, such that $\sum_{i=1}^{B} p_i = 1$. At each time instant, a buyer arrives independently and identically according to this probability distribution. The seller can take one of two actions: accept the bid or reject the bid. If the seller accepts the bid, then the house is sold at the buyer's bid price and no further bids are possible. If the seller rejects the bid, then the bidding continues. In addition, the seller pays a maintenance cost of C (at each time instant) as long as he keeps the house.
(a) Formulate the problem as an MDP. Determine the state space, action space, transition probability, and reward.
(b) Write the infinite horizon Bellman dynamic programming equation for the problem.

Before we start analyzing the problem, let us intuitively understand the tradeoffs in this decision-making problem. Obviously, the seller would want to accept the maximum bid possible for the house. But the seller doesn't know when the maximum bid will be realized, due to the randomness in how the buyers arrive. Holding onto the house, in the hope of receiving the maximum bid, incurs a maintenance cost for the seller. However, selling the house too soon (by accepting one of the lower bids) so as to avoid the maintenance cost is also not optimal. Let us now formulate this problem as an infinite horizon MDP.
The state, at any point of time, is the type of buyer (or equivalently, the bid amount). The state space is all possible types of buyers. The decision maker has two actions (action space): either Accept or Reject. The transition probability can be represented using the state transition diagram in Figure 4.

Figure 4: State diagram of the transition probability of the house selling problem in Example 4
States 1 to B correspond to the types of buyers. We also introduce an end state Z, which corresponds to the state when the house is sold. Any time a bid is accepted ("Accept" action), the state transitions to Z. Once the state transitions to Z, it remains in Z regardless of the action taken. If the bid is rejected ("Reject" action), then the state transitions to one of the other states, corresponding to a new bid being received. Because of the independent and identical nature of how the bids are received, the transition probability doesn't depend on the current state. Hence,

$$P_{i,j}(\text{Reject}) = \begin{cases} p_j & i \neq Z \\ 1 & i = Z,\, j = Z \\ 0 & i = Z,\, j \neq Z \end{cases} \qquad P_{i,j}(\text{Accept}) = \begin{cases} 1 & j = Z \\ 0 & j \neq Z \end{cases}$$

The reward function is straightforward and can be obtained as follows:

$$r(i, \text{Accept}) = \begin{cases} i & i \neq Z \\ 0 & i = Z \end{cases} \qquad r(i, \text{Reject}) = \begin{cases} -C & i \neq Z \\ 0 & i = Z \end{cases}$$

Using the transition probability and the reward function above, the Bellman dynamic programming equation can be obtained as follows:

$$V^*(i) = \max\left[ i,\; -C + \gamma \sum_{j=1}^{B} p_j V^*(j) \right] \qquad (32)$$

$$\mu^*(i) = \arg\max\left[ i,\; -C + \gamma \sum_{j=1}^{B} p_j V^*(j) \right] \qquad (33)$$

The next question we should ask is how to solve the Bellman dynamic programming equations to obtain the optimal policy. For example, how do we solve equations (32) and (33) to decide which bids are to be accepted or rejected?
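As a preview of the numerical methods in Section 20.4, the sketch below solves (32) by simple fixed-point iteration for an illustrative instance of the house-selling problem; the number of buyer types, the uniform bid distribution, the maintenance cost, and the discount factor are all made-up values, not from the text.

```python
import numpy as np

B, C, gamma = 10, 0.5, 0.95
bids = np.arange(1, B + 1, dtype=float)   # buyer of type i bids i
p = np.full(B, 1.0 / B)                   # uniform bid distribution (assumption)

V = np.zeros(B)                           # V*(i) for i = 1, ..., B (and V*(Z) = 0)
for _ in range(1000):                     # fixed-point iteration on equation (32)
    continuation = -C + gamma * p @ V     # value of rejecting the current bid
    V = np.maximum(bids, continuation)    # accept the bid, or keep waiting

continuation = -C + gamma * p @ V
policy = np.where(bids >= continuation, "Accept", "Reject")
print(continuation, policy)
# The optimal policy is a threshold rule: accept any bid above the continuation value.
```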

20.4 Numerical methods for Bellman dynamic programming equation


This section deals with numerical methods to solve the dynamic programming equation. For simplicity, we only consider the infinite horizon Bellman dynamic programming equations; the ideas presented in this section apply equally well to the finite horizon Bellman dynamic programming equations.
The optimal value function V*(⋅) of the infinite horizon Bellman dynamic programming equation is the fixed-point solution of Equation (27).

20.4.1 Value iteration (Successive Approximation)


The idea behind value iteration is that we can run the finite horizon Bellman equation for a "very large" value of N. The steps of value iteration are shown in Algorithm 1.

Initialize V_0(s) = 0 ∀ s
For k = 1, 2, ⋯ do

$$Q_k(s, a) = r(s, a) + \gamma \sum_{s'} P_{s,s'}(a) V_{k-1}(s') \qquad (34)$$

$$V_k(s) = \max_{a} Q_k(s, a), \qquad \mu_k(s) = \arg\max_{a} Q_k(s, a)$$

Algorithm 1 Value Iteration algorithm
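A minimal NumPy sketch of Algorithm 1 for a tabular MDP is given below; the stopping tolerance and the random test problem at the end are illustrative assumptions, not part of the original text.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration (Algorithm 1).

    P: array of shape (A, S, S) with P[a, s, s'] = transition probability.
    r: array of shape (S, A) with r[s, a] = reward.
    Returns the value function V and a greedy policy mu.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # Q[s, a] = r(s, a) + gamma * sum_{s'} P_{s,s'}(a) V(s')   (equation (34))
        Q = r + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # geometric convergence (Property 3)
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)

# Illustrative random MDP with 2 actions and 3 states.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(3), size=(2, 3))
r = rng.uniform(size=(3, 2))
V, mu = value_iteration(P, r, gamma=0.9)
print(V, mu)
```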


The critical question is whether the policy obtained from value iteration in Algorithm 1 converges to the optimal infinite horizon policy. The following property guarantees that it does in fact converge.
Property 3: The value iteration algorithm (Algorithm 1) converges geometrically fast to the optimal value function of the infinite horizon Bellman dynamic programming equation.

Here, we skip the proof of Property 3. The proof utilizes the fact that the Bellman operation (the right-hand side of Equation (34)) is a contraction mapping and hence Algorithm 1 converges to the fixed point of Equation (27).
Example 5. Consider the MDP shown in Figure 5. Assume a discount factor γ of 0.9. The reward depends only on the state and is denoted by r. In state S_1, two actions a_1 and a_2 are possible. All other states have only one action each. The numbers next to the arrows denote the probability of each outcome.

Figure 5 State transition diagram for the MDP in Example 5

Consider doing value iteration to find the optimal policy. Let the initial values for value iteration be V_0(S_0) = V_0(S_1) = V_0(S_2) = V_0(S_3) = 0, where V_k(s) is the value function at iteration k for state s. Find V_1(S_1) and V_2(S_1).

It is easy to see that V_1(S_0) = 1, V_1(S_1) = 2, V_1(S_2) = 3, and V_1(S_3) = 10. To find V_2(S_1), we apply (34):

$$V_2(S_1) = \max\left\{ 2 + 0.9\left[0.5 \times 2 + 0.5 \times 10\right],\; 2 + 0.9\left[1 \times 3\right] \right\} = \max\{7.4,\, 4.7\} = 7.4$$

20.4.2 Policy iteration


In value iteration, we start with some arbitrary value function. As the iterations progress, we "refine" the value function to be closer to the optimal value function. The policy at each iteration is obtained as a "by-product" of the value function iterations. In contrast, in policy iteration, we start with some arbitrary policy, and in each iteration we refine the policy to be closer to the optimal policy. The algorithm for policy iteration is given in Algorithm 2. Policy iteration proceeds in two steps: in Step 1, we find the value function corresponding to the given policy, and in Step 2, we improve the policy.

Start with an arbitrary policy μ_0
For k = 0, 1, ⋯ do

Step 1: Policy evaluation step

$$V_{\mu_k}(s) = r\left(s, \mu_k(s)\right) + \gamma \sum_{s'} P_{s,s'}\left(\mu_k(s)\right) V_{\mu_k}(s')$$

Step 2: Policy improvement step

$$\mu_{k+1}(s) = \arg\max_{a}\left[ r(s, a) + \gamma \sum_{s'} P_{s,s'}(a) V_{\mu_k}(s') \right]$$

Algorithm 2 Policy iteration algorithm

Step 1 in Algorithm 2 involves solving a set of linear equations of the form A V_{μ_k} = b, and any linear solver (such as linsolve in Matlab) can be used for this purpose.
As in the case of value iteration, the policy iteration algorithm is also guaranteed to converge, as stated in the following property.
Property 4. The policy iteration algorithm converges to the optimal policy.
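The sketch below implements Algorithm 2 for a tabular MDP, using numpy.linalg.solve for the policy evaluation step in place of Matlab's linsolve; the input conventions mirror the value iteration sketch above and are illustrative assumptions.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Policy iteration (Algorithm 2) for P of shape (A, S, S) and r of shape (S, A)."""
    A, S, _ = P.shape
    mu = np.zeros(S, dtype=int)              # arbitrary initial policy
    while True:
        # Step 1: policy evaluation, solve (I - gamma * P_mu) V = r_mu exactly.
        P_mu = P[mu, np.arange(S), :]        # P_mu[s, s'] = P_{s,s'}(mu(s))
        r_mu = r[np.arange(S), mu]
        V = np.linalg.solve(np.eye(S) - gamma * P_mu, r_mu)
        # Step 2: policy improvement (greedy with respect to V).
        Q = r + gamma * np.einsum('ast,t->sa', P, V)
        mu_new = Q.argmax(axis=1)
        if np.array_equal(mu_new, mu):       # no change: the policy is optimal
            return V, mu
        mu = mu_new
```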

20.4.3 Linear Programming


A less commonly used method for solving MDPs is the linear programming method, since it tends to be slower than specialized versions of policy iteration or value iteration. However, recently the linear programming method has been used as the basis for approximate solutions of large-scale MDPs.
Consider the infinite horizon Bellman dynamic programming equation in (27), reproduced here for convenience:

$$V^*(s) = \max_{a}\left[ r(s, a) + \gamma \sum_{s'} P_{s,s'}(a) V^*(s') \right]$$

The idea behind the linear programming formulation is that the above equation can be equivalently represented by the following set of linear inequality constraints:

$$V^*(s) \ge r(s, a) + \gamma \sum_{s'} P_{s,s'}(a) V^*(s') \quad \forall a. \qquad (35)$$

Now, consider the following linear programming problem:

$$\text{minimize}_{V} \;\; \sum_{s} V(s) \quad \text{subject to the constraints in (35).} \qquad (36)$$

The following property ensures that the solution of (36) is the solution of the Bellman dynamic programming equation in (27).
Property 5. The optimal value of the linear program in (36) is V*.


In (36), we can replace the objective $\sum_s V(s)$ with any positive linear function of V(s), i.e. $\sum_s d(s) V(s)$ with d(s) > 0, and Property 5 still holds.
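A sketch of the linear programming formulation using scipy.optimize.linprog is given below; as before, the tabular (A, S, S) transition tensor and (S, A) reward array are assumed input conventions, the weights d(s) are set to 1, and the random test problem is made up.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, r, gamma):
    """Solve the LP (36): minimize sum_s V(s) subject to the constraints (35)."""
    A, S, _ = P.shape
    c = np.ones(S)                       # objective weights d(s) = 1
    # Constraint (35): V(s) - gamma * sum_{s'} P_{s,s'}(a) V(s') >= r(s, a) for every (s, a).
    # linprog uses A_ub x <= b_ub, so multiply both sides by -1.
    A_ub = np.vstack([-(np.eye(S)[s] - gamma * P[a, s, :])
                      for s in range(S) for a in range(A)])
    b_ub = np.array([-r[s, a] for s in range(S) for a in range(A)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x                         # the optimal value function V*

# Illustrative random MDP (same conventions as the value iteration sketch).
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(3), size=(2, 3))
r = rng.uniform(size=(3, 2))
print(solve_mdp_lp(P, r, gamma=0.9))
```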

20.5 Exercises
Ex.1 Model of disease spread: Suppose there are N individuals in a population, in which some people have a disease and others are healthy. At any time, two individuals are selected at random and assumed to interact with each other. During the interaction, if one of the persons has the disease and the other person is healthy, disease transmission takes place with probability α = 0.1. Let X_n denote the number of diseased persons. Does X_n follow a Markov chain? If so, construct the transition probability matrix for N = 5.
Ex.2 Balls in an urn: An urn contains n white and m black balls. At each time, a ball is selected at random, without replacement. Let X_k = (w_k, b_k) denote the numbers of white balls and black balls remaining in the urn at time k. Does X_k follow a Markov chain? If so, draw the probability transition diagram for m = n = 5.
Ex.3 An urn contains n white and m black balls. Balls are selected at random, without replacement. The decision maker wins one dollar when a white ball is selected, and loses one dollar when a black ball is selected. At any point in time, the decision maker is allowed to quit playing.
(a) Formulate the problem as an MDP. Determine the state space, action space, transition probability, and reward.
(b) Write the Bellman dynamic programming equation for the problem.
Ex.4 Consider a decision maker who can bet any (non-negative) amount, up to his present fortune. The decision maker wins the bet with probability p, and loses the bet with probability 1 − p. The reward of the decision maker is given by log(x), where x is the present fortune of the decision maker. The decision maker is allowed n bets.
(a) Formulate the above problem as an MDP. Determine the state space, action space, transition probability, and reward.
(b) Write the finite horizon Bellman dynamic programming equation for solving the MDP.
Ex.5 Consider the simplified model for machine replacement. Suppose that at each time, a machine is inspected and its condition (state) is noted. The states are {0, 1, 2, ⋯}, with state 0 being "perfectly new". With each state i, an operating cost of C(i) is incurred at each epoch. After inspecting the state, at each time a decision is made: a = 0 (do not replace the machine) or a = 1 (replace the machine). If the decision is to replace, a cost R > 0 is immediately incurred, and the state of the machine moves to 0 for the next epoch. If the decision is not to replace, then the condition of the machine evolves randomly according to the probabilities P_{i,j}.
Pose the problem as an MDP with infinite horizon and discounted objective. Write the state space, action space, transition probabilities, and reward function.
Write the Bellman dynamic programming equation for the MDP.
Ex.6 Consider an MDP with four states {A, B, C, D} as shown in Figure 6. In each state, we can perform either action a or action b. The figure shows the state transition function and the reward (as a function of state and action) obtained by each action. Consider doing value iteration to find the optimal policy. Let the initial values for value iteration be V_0(A) = V_0(B) = V_0(C) = V_0(D) = 0, where V_k(s) is the value function at iteration k for state s. Find V_1(s) and V_2(s) for s = A, B, C, D.

Figure 6: MDP for Ex. 6. The reward as a function of state and action is denoted alongside
the arrows.
Ex.7 [Programming exercise]: Consider the simplified example of the grid world shown in Figure 7. A robot placed at any location in this grid world has the goal of reaching the reward state (indicated in green, with reward 1) while avoiding the bad state (indicated in red, with reward −100). The reward in all other locations of the grid world is 0. The movement of the robot is inexact (as shown in Figure 7): the robot moves in the desired direction with probability 0.8 and in each of the two perpendicular directions with probability 0.1. Whenever a move would bump the robot against a wall, the robot stays in the same location. Run value iteration and policy iteration, and compare the number of iterations required to converge to the optimal solution (the optimal solution can be obtained using the linear programming formulation).

(a) Grid world (b) Action transition probability


Figure 7: Figure for Exercise 7.

Ex.8 Consider a Markov decision process where the reward r(s, a, s') depends not only on the current state and action, but on the next state as well. Show that the Bellman dynamic programming equations for this MDP can be obtained by replacing r(s, a) with

$$r(s, a) = \sum_{s'} P_{s,s'}(a)\, r(s, a, s').$$
