
arXiv:2002.06238v1 [cs.LG] 14 Feb 2020

On State Variables, Bandit Problems and POMDPs

Warren B. Powell
Department of Operations Research and Financial Engineering
Princeton University

February 18, 2020


Abstract

State variables are easily the most subtle dimension of sequential decision problems. This is especially
true in the context of active learning problems (“bandit problems”) where decisions affect what we
observe and learn. We describe our canonical framework that models any sequential decision problem,
and present our definition of state variables that allows us to claim: Any properly modeled sequential
decision problem is Markovian. We then present a novel two-agent perspective of partially observable
Markov decision problems (POMDPs) that allows us to claim: Any model of a real decision
problem is (possibly) non-Markovian. We illustrate these perspectives using the context of observing
and treating flu in a population, and provide examples of all four classes of policies in this setting.
We close with an indication of how to extend this thinking to multiagent problems.
Contents

1 Introduction
2 Modeling sequential decision problems
  2.1 Elements of a sequential decision problem
  2.2 Energy storage illustration
  2.3 Pure learning problem
3 Designing policies
4 State variables
  4.1 A brief history of state variables
  4.2 A modern definition
  4.3 More illustrations
    4.3.1 With a time-series price model
    4.3.2 With passive learning
    4.3.3 With active learning
    4.3.4 With rolling forecasts
  4.4 A probabilist's perspective of information
5 Partially observable Markov decision processes
6 A learning problem: protecting against the flu
  6.1 A static model
  6.2 Variations of our flu model
    6.2.1 A time-varying model
    6.2.2 A time-varying model with drift
    6.2.3 A dynamic model with a controllable truth
    6.2.4 A flu model with a resource constraint and exogenous state
    6.2.5 A spatial model
    6.2.6 Notes
  6.3 The POMDP perspective
7 A two-agent model of the flu application
  7.1 A two-agent formulation of the POMDP
  7.2 Transition functions for two-agent model
8 Designing policies for the flu problem
  8.1 Policy function approximations
  8.2 Cost function approximations
  8.3 Policies based on value functions
  8.4 Direct lookahead policy
  8.5 A hybrid policy
  8.6 Notes
9 Multiagent systems
References
1 Introduction

Sequential decision problems span a genuinely vast range of applications including engineering, business, economics, finance, health, transportation, and energy. The class encompasses active learning problems that arise in the experimental sciences, medical decision making, e-commerce, and sports. It also includes iterative algorithms for stochastic search, as well as two-agent games and multiagent systems. In fact, we might claim that virtually any human enterprise will include instances of sequential decision problems.

Sequential decision problems consist of sequences: decision, information, decision, information, . . ., where decisions are determined according to a rule or function that we call a policy, which is a mapping from a state to a decision. This is arguably the richest problem class in modern data analytics. Yet, unlike machine learning or deterministic optimization, this field lacks a canonical modeling framework that is broadly used.

The core challenge in modeling sequential decision problems is modeling the state variable. A consistent theme across what I have been calling the "jungle of stochastic optimization" (see jungle.princeton.edu) is the lack of a standard definition of a state variable (with the notable exception of the field of optimal control).

I am using this document to a) offer my own definition of a state variable (taken from Powell (2020)), b) suggest a new perspective on the modeling of learning problems (spanning bandit problems to partially observable Markov decision problems), and c) extend these ideas to multiagent systems. This discussion draws from several sources I have written recently: Powell (2020), Powell (2019b), and Powell (2019a). These are all available at jungle.princeton.edu.

In this article, I am going to argue:

1) All properly modeled problems are Markovian.

2) All models of real applications are (possibly) non-Markovian.

These seem to be contradictory claims, but I am going to show that they reflect the different per-
spectives that lead to each claim.

We are going to begin by presenting our universal framework for modeling sequential decision
problems in section 2. The framework presents sequential decision problems in terms of optimizing
over policies. Section 3 provides a streamlined presentation of how to design policies for any sequential
decision problem.

Section 4 then provides an in-depth discussion of state variables, including a brief history of state
variables, our attempt at a proper definition, followed by illustrations in a variety of settings. Then,

section 5 discusses partially observable Markov decision problems, and presents a two-agent model
that offers a fresh perspective of all learning problems. We illustrate these ideas in section 6 using a
problem setting of learning how to mitigate the spread of flu in a population. We then extend this
thinking in section 9 to the field of multiagent systems.

This article is being published on arXiv only, so it is not subject to peer review. Instead, readers
are invited to comment on this discussion at http://tinyurl.com/statevariablediscussion.

2 Modeling sequential decision problems

Any sequential decision problem can be written as a sequence of state, decision, information, state,
decision, information. Written over time this would be given by

\[ (S_0, x_0, W_1, S_1, x_1, W_2, \ldots, S_t, x_t, W_{t+1}, \ldots, S_T) \]

where St is the “state” (to be defined below) at time t, xt is the decision made at time t (using the
information in St ), and then Wt+1 is the information that arrives between t and t + 1 (which is not
known when we make the decision xt ). Note that we start at time t = 0, and assume a finite horizon
t = T (standard in many communities, but not Markov decision processes).

There are many problems where it is more natural to use a counter n (as in nth event or nth
iteration). We write this as

\[ (S^0, x^0, W^1, S^1, x^1, W^2, \ldots, S^n, x^n, W^{n+1}, \ldots, S^N). \]

Finally, there are times where we are iterating over simulations (e.g. the nth pass over a week-long
simulation of ad-clicks), which we would write using

\[ (S_t^n, x_t^n, W_{t+1}^n)_{t=0}^{T} \quad \text{for } n = 0, \ldots, N. \]

The classical modeling framework used for Markov decision processes is to specify the tuple
(S, A, P, r) where S is the state space, A is the action space (the MDP community uses a for action),
P is the one-step transition matrix with element p(s'|s, a), which is the probability we transition from
state s to s' when we take action a, and r = r(s, a) is the reward if we are in state s and take action a
(see Puterman (2005)[Chapter 3]). This modeling framework has been adopted by the reinforcement
learning community, but in Powell (2019b) we argue that it does not provide a useful model, and
ignores important elements of a real model.

Below we describe the five elements of any sequential decision problem. We first presented this
modeling style in Powell (2011), but as noted in Powell (2019b), this framework is very close to the
style used in the optimal control community (see, for example, Lewis & Vrabie (2012), Kirk (2004)
and Sontag (1998)). After this, we illustrate the framework with a classical inventory problem
(motivated by energy storage) and as a pure learning problem. The framework involves optimizing
over policies, so we close with a discussion of designing policies.

2.1 Elements of a sequential decision problem

There are five dimensions of any sequential decision problem: state variables, decision variables,
exogenous information processes, the transition function and the objective function.

State variables - The state St of the system at time t contains all the information that is necessary
and sufficient to compute costs/rewards, constraints, and the transition function (we return to
state variables in section 4).

Decision variables - Standard notation for decisions might be at for action, ut for control, or xt ,
which is the notation we use since it is standard in math programming. xt may be binary, one
of a finite discrete set, or a continuous or discrete vector. We let Xt be the feasible region for
xt , where Xt may depend on St . We defer the problem of designing the policy until later.
Decisions are made with a decision function or policy, which we denote by X π (St ) where “π”
carries the information about the type of function f ∈ F, and any tunable parameters θ ∈ Θf .
We require that the policy satisfy X π (St ) ∈ Xt .

Exogenous information - We let Wt+1 be any new information that first becomes known at time
t + 1 (we can think of this as information arriving between t and t + 1). Wt+1 may depend on
the state St and/or the decision xt , so it is useful to think of it as the function Wt+1 (St , xt ),
but we write Wt+1 for compactness. This indexing style means any variable indexed by t is
known at time t.

Transition function - We denote the transition function by

\[ S_{t+1} = S^M(S_t, x_t, W_{t+1}), \qquad (1) \]

where S M (·) is also known by names such as system model, state equation, plant model, plant
equation and transfer function. S M (·) contains the equations for updating each element of St .

Objective functions - There are a number of ways to write objective functions. We begin by
making the distinction between state-independent problems, and state-dependent problems.
We let F (x, W ) denote a state-independent problem, where we assume that neither the objec-
tive function F (x, W ), nor any constraints, depends on dynamic information captured in the
state variable. We let C(S, x) capture state-dependent problems, where the objective function
(and/or constraints) may depend on dynamic information.

We next make the distinction between optimizing the cumulative reward versus the final reward.
Optimizing cumulative rewards typically arises when we are solving problems in the field where
we are actually experiencing the effect of a decision, whereas final reward problems typically
arise in laboratory environments.
Below we list the objectives that are most relevant to our discussion:

State-independent, final reward This is the classical stochastic search problem. Here we
go through a learning/training process to find a final design/decision $x^{\pi,N}$, where $\pi$ is our
search policy (or algorithm), and $N$ is the budget. We then have to test the performance
of the policy by simulating $\widehat{W}$, using

\[ \max_\pi \mathbb{E}_{S^0} \mathbb{E}_{W^1,\ldots,W^N|S^0} \mathbb{E}_{\widehat{W}|S^0} F(x^{\pi,N}, \widehat{W}), \qquad (2) \]

where $x^{\pi,N}$ depends on $S^0$ and the experiments $W^1, \ldots, W^N$, and where $\widehat{W}$ represents
the process of testing the design $x^{\pi,N}$.
State-independent, cumulative reward This is the standard representation of multi-armed
bandit problems, where $S^n$ is what we believe about $\mathbb{E}F(x,W)$ after $n$ experiments. The
objective function is written

\[ \max_\pi \mathbb{E}_{S^0} \mathbb{E}_{W^1,\ldots,W^N|S^0} \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1}), \qquad (3) \]

where $F(X^\pi(S^n), W^{n+1})$ is our performance for the $(n+1)$st experiment using experimental
settings $x^n = X^\pi(S^n)$, chosen using our belief $S^n = B^n$ based on what we know from
experiments $1, \ldots, n$.
State-dependent, cumulative reward This is the version of the objective function that is
most widely used in stochastic optimal control (as well as Markov decision processes).
We switch back to time-indexing here since these problems are often evolving over time
(but not always). We write the contribution in the form $C(S_t, x_t, W_{t+1})$ to help with the
comparison to $F(x,W)$, which gives us

\[ \max_\pi \mathbb{E}_{S_0} \mathbb{E}_{W_1,\ldots,W_T|S_0} \left\{ \sum_{t=0}^{T} C(S_t, X^\pi(S_t), W_{t+1}) \,\Big|\, S_0 \right\}. \qquad (4) \]

This is not an exhaustive list of objectives (for example, we did not list state-dependent, final
reward). Other popular choices model regret or posterior-optimal solutions to compare against
a benchmark. Risk is also an important issue.

Note that the objectives in (2) - (4) all involve searching over policies. Writing the model, and
then designing an algorithm to solve the model, is absolutely standard in deterministic optimization.
Oddly, the communities that address sequential decision problems tend to first choose a solution
approach (what we call the policy), and then model the problem around the class of policy.

Our framework applies to any sequential decision problem. However, it is critical to create the
model before we choose a policy. We refer to this style as “model first, then solve.” We address the
problem of designing policies in section 3.
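To make the five elements concrete, the following minimal sketch (ours; the function names are illustrative, not the paper's) simulates a single sample path of states, decisions and exogenous information, and estimates the cumulative-reward objective (4) for a given policy by Monte Carlo.

```python
import numpy as np

def simulate_policy(S0, policy, transition, contribution, sample_W, T, rng):
    """Simulate one sample path (S_0, x_0, W_1, S_1, ...) and return the cumulative
    contribution sum_t C(S_t, x_t, W_{t+1})."""
    S, total = S0, 0.0
    for t in range(T):
        x = policy(S)                   # x_t = X^pi(S_t): uses only the information in S_t
        W = sample_W(S, x, rng)         # exogenous information W_{t+1}, may depend on (S_t, x_t)
        total += contribution(S, x, W)  # C(S_t, x_t, W_{t+1})
        S = transition(S, x, W)         # S_{t+1} = S^M(S_t, x_t, W_{t+1})
    return total

def estimate_objective(S0, policy, transition, contribution, sample_W, T,
                       n_paths=1000, seed=0):
    """Monte Carlo estimate of the cumulative-reward objective (4) for a fixed policy."""
    rng = np.random.default_rng(seed)
    return float(np.mean([simulate_policy(S0, policy, transition, contribution,
                                          sample_W, T, rng) for _ in range(n_paths)]))
```

Searching over policies then means searching over the functions (and tunable parameters) passed in as `policy`.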

2.2 Energy storage illustration

We are going to use a simple energy storage problem to provide a basic illustration of the core
framework. Our problem involves a storage device (such as a large battery) that can be used to
buy/sell energy from/to the grid at a price that varies over time.

State variables State St = (Rt , pt ) where

Rt = energy in the battery at time t,


pt = price of energy on the grid at time t.

Decision variables xt is the amount of energy to purchase from the grid (xt > 0) or sell to the
grid (xt < 0). We introduce the policy (function) X π (St ) that will return a feasible vector xt .
We defer to later the challenge of designing a good policy.

Exogenous information variables Wt+1 = (p̂t+1 ), where p̂t+1 is the price charged at time t + 1
as reported by the grid. The price data could be from historical data, or field observations (for
an online application), or a mathematical model.

Transition function $S_{t+1} = S^M(S_t, x_t, W_{t+1})$, which consists of the equations:

\[ R_{t+1} = R_t + \eta\big(x_t^{GB} + x_t^{EB} - x_t^{BD}\big), \qquad (5) \]
\[ p_{t+1} = \hat{p}_{t+1}. \qquad (6) \]

The transition function needs to include an equation for each element of the state variable.
In real applications, the transition function can become quite complex (“500 lines of Matlab
code” was how one professional described it).

Objective function Let $C(S_t, x_t)$ be the one-period contribution function given by

\[ C(S_t, x_t) = p_t x_t. \]

We wish to find a policy $X^\pi(S_t)$ that maximizes profits over time, so we use the cumulative
reward objective, giving us

\[ \max_\pi \mathbb{E}_{S_0} \mathbb{E}_{W_1,\ldots,W_T|S_0} \left\{ \sum_{t=0}^{T} C(S_t, X^\pi(S_t)) \,\Big|\, S_0 \right\}, \qquad (7) \]

where $S_{t+1} = S^M(S_t, x_t = X^\pi(S_t), W_{t+1})$ and where we are given an information process
$(S_0, W_1, W_2, \ldots, W_T)$.

Of course, this is a very simple problem. We are going to return to this problem in section 4 where we
will use a series of modifications to illustrate how to model state variables with increasing complexity.
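The storage model can be written directly in this form. The sketch below is ours, not the paper's: it treats the decision as a single scalar buy/sell quantity for simplicity, and the price process, parameter values and the simple buy-low/sell-high rule are illustrative assumptions.

```python
import numpy as np

def storage_transition(S, x, p_hat, eta=0.9):
    """S_t = (R_t, p_t); exogenous information W_{t+1} = p_hat_{t+1}.
    A scalar version of the transition equations (5)-(6)."""
    R, p = S
    return (R + eta * x, p_hat)

def buy_low_sell_high(S, buy_below=30.0, sell_above=60.0, rate=1.0, capacity=100.0):
    """An illustrative policy X^pi(S_t): buy when the price is low, sell when it is high."""
    R, p = S
    if p <= buy_below and R < capacity:
        return min(rate, capacity - R)   # purchase from the grid (x_t > 0)
    if p >= sell_above and R > 0:
        return -min(rate, R)             # sell to the grid (x_t < 0)
    return 0.0

def simulate(T=1000, seed=0):
    """One sample path, accumulating the contribution C(S_t, x_t) = p_t x_t as in (7)."""
    rng = np.random.default_rng(seed)
    S, total = (0.0, 45.0), 0.0
    for t in range(T):
        x = buy_low_sell_high(S)
        p_hat = max(0.0, S[1] + rng.normal(0.0, 5.0))  # illustrative exogenous price process
        total += S[1] * x
        S = storage_transition(S, x, p_hat)
    return total

print(simulate())
```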

2.3 Pure learning problem

An important class of problems are pure learning problems, which are widely studied in the literature
under the umbrella of multiarmed bandit problems. Assume that x ∈ X = {x1 , . . . , xM } represents
different configurations for manufacturing a new model of electric vehicle which we are going to
evaluate using a simulator. Let µx = EW F (x, W ) be the expected performance if we could run an
infinitely long simulation. We assume that a single simulation (of reasonable duration) produces the
performance

\[ \hat{F}_x = \mu_x + \varepsilon, \]

where $\varepsilon \sim N(0, \sigma_W^2)$ is the noise from running a single simulation.

Assume we use a Bayesian model (we could do the entire exercise with a frequentist model),
where our prior on the truth µx is given by µx ∼ N (µ̄0x , σ̄x2,0 ). Assume that we have performed n
simulations, and that µx ∼ N (µ̄nx , σ̄x2,n ). Our belief B n about µx after n simulations is then given by

\[ B^n = (\bar{\mu}_x^n, \bar{\sigma}_x^{2,n})_{x \in \mathcal{X}}. \qquad (8) \]

For convenience, we are going to define the precision of an experiment as $\beta^W = 1/\sigma_W^2$, and the
precision of our belief about the performance of configuration $x$ as $\beta_x^n = 1/\bar{\sigma}_x^{2,n}$.

If we choose to try configuration $x^n$ and then run the $(n+1)$st experiment and observe $\hat{F}^{n+1} = F(x^n, W^{n+1})$,
we update our beliefs using

\[ \bar{\mu}_x^{n+1} = \frac{\beta_x^n \bar{\mu}_x^n + \beta^W \hat{F}_x^{n+1}}{\beta_x^n + \beta^W}, \qquad (9) \]
\[ \beta_x^{n+1} = \beta_x^n + \beta^W, \qquad (10) \]

if $x = x^n$; otherwise, $\bar{\mu}_x^{n+1} = \bar{\mu}_x^n$ and $\beta_x^{n+1} = \beta_x^n$. These updating equations assume that beliefs are
independent; it is a minor extension to allow for correlated beliefs.

Also, these equations are for a Bayesian belief model. In section 4.3.1 we are going to illustrate
learning with a frequentist belief model.
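A minimal sketch (ours; the prior, the noise level and the placeholder exploration policy are illustrative assumptions) of the updating equations (9)-(10):

```python
import numpy as np

def update_belief(mu_bar, beta, x, F_hat, beta_W):
    """Bayesian update of independent beliefs after observing F_hat for alternative x.
    mu_bar[x], beta[x] are the prior mean and precision; beta_W is the experiment precision."""
    mu_bar, beta = mu_bar.copy(), beta.copy()
    mu_bar[x] = (beta[x] * mu_bar[x] + beta_W * F_hat) / (beta[x] + beta_W)  # equation (9)
    beta[x] = beta[x] + beta_W                                               # equation (10)
    return mu_bar, beta

# Illustrative use: M = 5 configurations, noisy simulations of an unknown truth.
rng = np.random.default_rng(0)
M, sigma_W = 5, 2.0
truth = rng.normal(10.0, 3.0, M)                        # mu_x, unknown to the learner
mu_bar, beta = np.zeros(M), np.full(M, 1.0 / 3.0**2)    # prior N(0, 3^2) on each mu_x
beta_W = 1.0 / sigma_W**2
for n in range(50):
    x = rng.integers(M)                                 # placeholder exploration policy X^pi(S^n)
    F_hat = truth[x] + rng.normal(0.0, sigma_W)         # noisy observation W^{n+1}
    mu_bar, beta = update_belief(mu_bar, beta, x, F_hat, beta_W)
print(np.argmax(mu_bar))                                # x^{pi,N} = argmax_x mu_bar_x^N
```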

We are now ready to state our model using the canonical framework:

State variables The state variable is the belief S n = B n given by equation (8).

Decision variables The decision variable is the configuration x ∈ X that we wish to test next,
which will be determined by a policy X π (S n ).

Exogenous information This is the simulated performance given by $\hat{F}^{n+1}_{x^n}$.

Transition function These are given by equations (9)-(10) for updating the beliefs.

Objective function This is a state-independent problem (the only state variable is our belief about
the performance). We have a budget to run N simulations of different configurations. When
the budget is exhausted, we choose the best design according to

\[ x^{\pi,N} = \arg\max_{x \in \mathcal{X}} \bar{\mu}_x^N, \]

where we introduce the policy $\pi$ because $\bar{\mu}_x^N$ has been estimated by running experiments using
experimentation policy $X^\pi(S^n)$. The performance of a policy $X^\pi(S^n)$ is given by

\[ F^\pi = \mathbb{E}_{S^0} \mathbb{E}_{W^1,\ldots,W^N|S^0} \mathbb{E}_{\widehat{W}|S^0} F(x^{\pi,N}, \widehat{W}). \]

Our goal is to then solve

\[ \max_\pi \mathbb{E} F^\pi(S^0). \]

Note that when we made the transition from an energy storage problem to a learning problem,
the modeling framework remained the same. The biggest change is the state variable, which is now
a belief state.

The modeling of learning problems is somewhat ragged in the academic literature. In a tutorial
on reinforcement learning, Lazaric (2019) states that bandit problems do not have a state variable
(!!). In contrast, there is a substantial literature on bandit problems in the applied probability
community that studies "Gittins indices," which are based on solving Bellman's equation exactly where
the state is the belief (see Gittins et al. (2011) for a nice overview of this field).

Our position is that a belief state is simply part of the state variable, which may include elements
that we can observe perfectly, as well as beliefs about parameters that can only be estimated. This
leaves us with the challenge of designing policies.

3 Designing policies

There are two fundamental strategies for designing policies, each of which can be further divided
into two classes, producing four classes of policies:

Policy search - Here we use any of the objective functions (2) - (4) to search within a family of
functions to find the policy that works best. Policies in the policy-search class can be further
divided into two classes:

Policy function approximations (PFAs) PFAs are analytical functions that map states
to actions. They can be lookup tables (if the chessboard is in this state, then make this
move), or linear models which might be of the form
\[ X^{PFA}(S^n|\theta) = \sum_{f \in \mathcal{F}} \theta_f \phi_f(S^n). \]

PFAs can also be nonlinear models (buy low, sell high is a form of nonlinear model), or
even a neural network.
Cost function approximations (CFAs) CFAs are parameterized optimization models. A
simple one that is widely used in pure learning problems, called interval estimation, is
given by

\[ X^{CFA-IE}(S^n|\theta^{IE}) = \arg\max_{x \in \mathcal{X}} \big( \bar{\mu}_x^n + \theta^{IE} \bar{\sigma}_x^n \big) \]

(see the code sketch after this list of policy classes). The CFA might be a large linear program, such as that used to schedule aircraft where the
amount of slack for weather delays is set at the θ-percentile of the distribution of travel
times. We can write this generally as

\[ X^{CFA}(S^n|\theta) = \arg\max_{x \in \mathcal{X}^\pi(\theta)} \bar{C}^\pi(S^n, x|\theta), \]

where $\bar{C}^\pi(S^n, x|\theta)$ might be a parametrically modified objective function (e.g. with penalties
for being late), while $\mathcal{X}^\pi(\theta)$ might be parametrically modified constraints (think of
buffer stocks and schedule slack).

Lookahead approximations - We could create an optimal policy if we could solve

\[ X_t^*(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \mathbb{E} \left\{ \max_\pi \mathbb{E} \left\{ \sum_{t'=t+1}^{T} C(S_{t'}, X_{t'}^\pi(S_{t'})) \,\Big|\, S_{t+1} \right\} \,\Big|\, S_t, x_t \right\} \right). \qquad (11) \]

In practice, equation (11) cannot be computed, so we have to resort to approximations. There
are two approaches for creating these approximations:

Value function approximations (VFAs) The ideal VFA policy involves solving Bellman’s
equation

\[ V_t(S_t) = \max_x \Big( C(S_t, x) + \mathbb{E}\{V_{t+1}(S_{t+1}) \mid S_t, x\} \Big). \qquad (12) \]

We can build a series of policies around Bellman's equation:

\begin{align*}
X^{VFA}(S_t|\theta) &= \arg\max_{x \in \mathcal{X}_t} \Big( C(S_t, x) + \mathbb{E}\{V_{t+1}(S_{t+1}) \mid S_t, x\} \Big), \qquad (13) \\
&= \arg\max_{x \in \mathcal{X}_t} \Big( C(S_t, x) + \mathbb{E}\{\bar{V}_{t+1}(S_{t+1}|\theta) \mid S_t, x\} \Big), \qquad (14) \\
&= \arg\max_{x \in \mathcal{X}_t} \Big( C(S_t, x) + \bar{V}_t^x(S_t^x|\theta) \Big), \qquad (15) \\
&= \arg\max_{x \in \mathcal{X}_t} \Big( C(S_t, x) + \sum_{f \in \mathcal{F}} \theta_f \phi_f(S_t, x) \Big), \qquad (16) \\
&= \arg\max_{x \in \mathcal{X}_t} \bar{Q}(S_t, x|\theta). \qquad (17)
\end{align*}

The policy given in equation (13) would be optimal if we could compute Vt+1 (St+1 ) from
(12) exactly. Equation (14) replaces the value function with an approximation, which
assumes that a) we can come up with a reasonable approximation and b) we can compute
the expectation. Equation (15) eliminates the expectation by using the post-decision state
S x (see Powell (2011) for a discussion of post-decision states). Equation (16) introduces
a linear model for the value function approximation. Finally, equation (17) writes the
equation in the form of Q-learning used in the reinforcement learning community.
Direct lookaheads (DLAs) The second approach is to create an approximate lookahead
model. If we are making a decision at time t, we represent our lookahead model using
the same notation as the base model, but replace the state $S_t$ with $\tilde{S}_{tt'}$, the decision
$x_t$ with $\tilde{x}_{tt'}$, which is determined with policy $\tilde{X}^{\tilde{\pi}}(\tilde{S}_{tt'})$, and the exogenous information $W_t$
with $\tilde{W}_{tt'}$. This gives us the approximate lookahead policy

\[ X_t^{DLA}(S_t) = \arg\max_{x_t} \left( C(S_t, x_t) + \tilde{\mathbb{E}} \left\{ \max_{\tilde{\pi}} \tilde{\mathbb{E}} \left\{ \sum_{t'=t+1}^{T} C(\tilde{S}_{tt'}, \tilde{X}^{\tilde{\pi}}(\tilde{S}_{tt'})) \,\Big|\, \tilde{S}_{t,t+1} \right\} \,\Big|\, S_t, x_t \right\} \right). \qquad (18) \]
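As one concrete instance of the policy-search class, here is a minimal sketch (ours; the priors, noise level and tuning parameter are illustrative assumptions) of the interval-estimation CFA applied to the pure learning problem of section 2.3, using the Bayesian updates (9)-(10).

```python
import numpy as np

def interval_estimation_policy(mu_bar, beta, theta_IE=2.0):
    """CFA policy X^{CFA-IE}(S^n | theta): argmax_x ( mu_bar_x + theta * sigma_bar_x )."""
    sigma_bar = 1.0 / np.sqrt(beta)
    return int(np.argmax(mu_bar + theta_IE * sigma_bar))

def run(N=100, M=5, sigma_W=2.0, theta_IE=2.0, seed=0):
    """Apply the interval-estimation policy for N experiments and return the final design."""
    rng = np.random.default_rng(seed)
    truth = rng.normal(10.0, 3.0, M)                 # unknown mu_x
    mu_bar = np.zeros(M)                             # prior means
    beta = np.full(M, 1.0 / 3.0**2)                  # prior precisions
    beta_W = 1.0 / sigma_W**2                        # experiment precision
    for n in range(N):
        x = interval_estimation_policy(mu_bar, beta, theta_IE)
        F_hat = truth[x] + rng.normal(0.0, sigma_W)
        mu_bar[x] = (beta[x] * mu_bar[x] + beta_W * F_hat) / (beta[x] + beta_W)  # (9)
        beta[x] += beta_W                                                        # (10)
    return int(np.argmax(mu_bar)), truth             # x^{pi,N} and the truth for comparison

print(run())
```

In the policy-search strategy, the tunable parameter $\theta^{IE}$ would itself be chosen by optimizing one of the objectives (2)-(3).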

We claim that these four classes are universal, which means that any policy designed for any
sequential decision problem will fall in one of these four classes, or a hybrid of two or more. We
further insist that all four classes are important. Powell & Meisel (2016) demonstrates that each of
the four classes of policies may work best, depending on the characteristics of the datasets, for the
energy storage problem described in section 2.2. Further, all four classes of policies have been used
(by different communities) for pure learning problems.

We emphasize that these are four meta-classes. Choosing one of the meta-classes does not mean
that you are done, but it does help guide the process. Most (almost all) of the literature on decisions
under uncertainty is written with one of the four classes in mind. We think all four classes are
important. Most important is that at least one of the four classes will work, which is why we insist
on “model first, then solve.”

It has been our experience that the most consistent error made in the modeling of sequential
decision problems arises with state variables, so we address this next. It is through the state variable
that we can model problems with physical states, belief states or both. Regardless of the makeup of
the state variable, we will still turn to the four classes of policies for making decisions.

4 State variables

The definition of a state variable is central to the proper modeling of any sequential decision problem,
because it captures the information available to make a decision, along with the information needed
to compute the objective function and the transition function. The policy is the function that derives
information from the state variable to make decisions.

We are going to start in section 4.1 with a brief history of state variables. In section 4.2 we offer
our own definition of a state variable (this is taken from Powell (2020) which in turn is based on
the definition offered in Powell (2011)[Chapter 5, available at http://adp.princeton.edu]). Section 4.3
then provides a series of extensions of our energy storage problem to illustrate history-dependent
problems, passive and active learning, and the widely overlooked issue of modeling rolling forecasts.
We close by giving a probabilist’s measure-theoretic perspective of information and state variables
in section 4.4.

4.1 A brief history of state variables

Our experience is that there is an almost universal misunderstanding of what is meant by a “state
variable.” Not surprisingly, interpretations of the term “state variable” vary between communities.
An indication of the confusion can be traced to attempts to define state variables. For example,
Bellman introduces state variables with “we have a physical system characterized at any stage by a
small set of parameters, the state variables” (Bellman 1957). Puterman’s now classic text introduces
state variables with “At each decision epoch, the system occupies a state.” (Puterman 2005)[p. 18]
(in both cases, the italicized text was included in the original text). As of this writing, Wikipedia
offers “A state variable is one of the set of variables that are used to describe the mathematical state
of a dynamical system.” Note that all three references use the word “state” in the definition of state
variable (which means it is not a proper definition).

In fact, the vast majority of books that deal with sequential decision problems in some form do
not offer a definition of a state variable, with one notable exception: the optimal control community.
There, we have found that books in optimal control routinely offer an explicit definition of a state
variable. For example, Kirk (2004) offers:

A state variable is a set of quantities x1 (t), x2 (t), . . . [WBP: the controls community uses
x(t) for the state variable] which if known at time t = t0 are determined for t ≥ t0 by
specifying the inputs for t ≥ t0 .

Cassandras & Lafortune (2008) has the definition:

The state of a system at time t0 is the information required at t0 such that the output
[cost] y(t) for all t ≥ t0 is uniquely determined from this information and from [the
control] u(t), t ≥ t0 .

We have observed that the pattern of designing state variables is consistent across books in deter-
ministic control, but not stochastic control. We feel this is because deterministic control books are
written by engineers who need to model real problems, while stochastic control books are written by
mathematicians.

There is a surprisingly widespread belief that a system can be “non-Markovian” but can be made
“Markovian” by adding to the state variable. This is nicely illustrated in Cinlar (2011):

The definitions of “time” and “state” depend on the application at hand and the demands
of mathematical tractability. Otherwise, if such practical considerations are ignored,
every stochastic process can be made Markovian by enhancing its state space sufficiently.

We agree with the basic principle expressed in the controls books, which can all be re-stated as
saying “A state variable is all the information you need (along with exogenous inputs) to model the
system from time t onward.” Our only complaint is that this is a bit vague.

On the other hand, we disagree with the widely held belief that stochastic systems “can be made
Markovian” which runs against the core principle in the definitions in the optimal control books that
the state variable is all the information needed to model the system from time t onward. If it is all
the information, then it is Markovian by construction.

There are two key areas of misunderstanding that we are going to address with our discussion.
The first is a surprisingly widespread misunderstanding about “Markov” vs. “history-dependent”
systems. The second, and far more subtle, arises when there are hidden or unobservable variables.

4.2 A modern definition

We offer two definitions, depending on whether we have a system where the structure of the policy
has been specified, or one where it has not (this is taken from Powell (2020)).

A state variable is:

a) Policy-dependent version A function of history that, combined with the exogenous infor-
mation (and a policy), is necessary and sufficient to compute the decision function (the
policy), the cost/contribution function, and the transition function.

b) Optimization version A function of history that, combined with the exogenous informa-
tion, is necessary and sufficient to compute the cost or contribution function, the constraints,
and the transition function.

Both of these definitions lead us to our first claim:

Claim 1: All properly modeled systems are Markovian.

But stay tuned; later, we are going to argue the opposite, but there will be a slight change in the
wording that explains the apparent contradiction.

Note that both of these definitions are consistent with those used in the controls community,
with the only difference that we have specified that we can identify state variables by looking at the
requirements of three functions: the cost/contribution function, the constraints (which is a form of
function), and the transition function.

One issue that we are going to address arises when we ask “What is a transition function?”

The transition function, which we write St+1 = S M (St , xt , Wt+1 ), is the set of equa-
tions that describes how each element of the state variable St evolves over time.

We quickly see that we have circular reasoning: a state variable includes the information we need to
model the transition function, and the transition function is the equations that describe the evolution
of the state variables. It turns out that this circular logic is unavoidable, as we illustrate later (in
section 4.3.4).

We have found it useful to identify three types of state variables:

Physical state Rt - The physical state captures inventories, the location of a device on a graph,
the demand for a product, the amount of energy available from a wind farm, or the status of
a machine. Physical states typically appear in the right hand sides of constraints.

Other information It - The “other information” variable is literally any other information about
observable parameters not included in Rt .

Belief state Bt - The belief state Bt captures the parameters of a probability distribution describing
unobservable parameters. This could be the mean and variance of a normal distribution, or a
set of probabilities.

We present the three types of state variables as being in distinct classes, but it is more accurate
to describe them as a series of nested sets, as depicted in figure 1.

Figure 1: Physical state variables Rt, as a subset of other information It, as a subset of belief state variables Bt.

The physical state variables describe quantities that constrain the system (inventories, location of a truck, demands) that are
known perfectly. We then describe It as “other information” but it might help to think of It as
any parameter that we observe perfectly, which could include Rt . Finally, Bt is the parameters of
probability distributions for any quantity that we do not know perfectly, but a special case of a
probability distribution is a point estimate with zero variance, which could include It (and then Rt ).
Our choice of the three variables is designed purely to help with the modeling process.

We explicitly model the “resource state” because we have found in some communities (and this is
certainly true of operations research) that people tend to equate “state” and “physical state.” We do
not offer an explicit definition of Rt , although we note that it typically includes dynamic information
in right-hand side constraints. Rti might be the number of units of blood of type i at time t; Rta
might be the number of resources with attribute vector a. For example, Rta could be how much we
have invested in an asset, where a captures the type of asset, how long it has been invested, and
other information (such as the current price of the asset).
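As a data structure, the three components might be organized as follows (a sketch of ours; the example fields are illustrative):

```python
from dataclasses import dataclass

@dataclass
class State:
    # Physical state R_t: quantities that constrain decisions (e.g., inventory), known perfectly.
    R: float
    # Other information I_t: any other perfectly observed quantities (e.g., the current price).
    I: dict
    # Belief state B_t: parameters of probability distributions describing quantities we do not
    # know perfectly (e.g., the mean and precision of a belief about an unknown parameter).
    B: dict

S_t = State(R=10.0, I={"price": 45.0}, B={"mu_bar": 0.2, "beta": 4.0})
```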

Some authors find it convenient to distinguish between two types of states:

Exogenous states These are dynamically varying parameters that evolve purely from an exogenous
process.

Controllable states These are the state variables that are directly or indirectly affected by deci-
sions.

In the massive class of problems known as “dynamic resource allocation,” Rt would be the physical
state, and this would also be the controllable state. However, there may be variables in It that are
also controllable (at least indirectly). There will also be states (such as water in a reservoir, or the
state of disease in a patient) that evolve due to a mixture of exogenous and controllable processes.

Later we are going to illustrate the widespread confusion in the handling of “states” (physical
states in our language) and “belief states” in the literature on partially observable Markov decision
processes (POMDPs).

4.3 More illustrations

We are going to use our energy storage problem to illustrate the handling of so-called “history-
dependent” problems (in section 4.3.1), followed by examples of passive and active learning (in
sections 4.3.2 and 4.3.3), closing with an illustration of the circular logic for defining state variables
and transition functions using rolling forecasts (in section 4.3.4). The material in this section is taken
from Powell (2020).

4.3.1 With a time-series price model

Our basic model assumed that prices evolved according to a purely exogenous process (see equation
(6)). Now assume that it is governed by the time series model

\[ p_{t+1} = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} + \varepsilon_{t+1}. \qquad (19) \]

A common mistake is to say that pt is the “state” of the price process, and then observe that it is
no longer Markovian (it would be called “history dependent”), but “it can be made Markovian by
expanding the state variable,” which would be done by including pt−1 and pt−2 . According to our
definition of a state variable, the state is all the information needed to model the process from time
t onward, which means that the state of our price process is (pt , pt−1 , pt−2 ). This means our system
state variable is now


\[ S_t = \big( R_t, (p_t, p_{t-1}, p_{t-2}) \big). \]

We then have to modify our transition function so that the “price state variable” at time t + 1
becomes (pt+1 , pt , pt−1 ).
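A minimal sketch (ours; the coefficient values are illustrative) of the expanded state and its transition:

```python
import numpy as np

def price_transition(price_state, eps, theta=(0.7, 0.2, 0.05)):
    """price_state = (p_t, p_{t-1}, p_{t-2}); returns (p_{t+1}, p_t, p_{t-1}) using (19)."""
    p0, p1, p2 = price_state
    p_next = theta[0] * p0 + theta[1] * p1 + theta[2] * p2 + eps
    return (p_next, p0, p1)

def transition(S, x, eps, eta=0.9):
    """S_t = (R_t, (p_t, p_{t-1}, p_{t-2})): the full system transition S^M."""
    R, prices = S
    return (R + eta * x, price_transition(prices, eps))

rng = np.random.default_rng(0)
S = (0.0, (45.0, 44.0, 46.0))
S = transition(S, x=1.0, eps=rng.normal(0.0, 2.0))
print(S)
```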

4.3.2 With passive learning

We implicitly assumed that our price process in equation (19) was governed by a model where the
coefficients θ = (θ0 , θ1 , θ2 ) were known. Now assume that the vector θ is unknown, which means we
have to use estimates θ̄t = (θ̄t0 , θ̄t1 , θ̄t2 ), which gives us the price model

\[ p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \varepsilon_{t+1}. \qquad (20) \]

We have to adaptively update our estimate θ̄t which we can do using recursive least squares. To do
this, let

\[ \bar{p}_t = (p_t, p_{t-1}, p_{t-2})^T, \]
\[ \bar{F}_t(\bar{p}_t|\bar{\theta}_t) = (\bar{p}_t)^T \bar{\theta}_t. \]

We perform the updating using a standard set of recursive least squares equations, given by

\[ \bar{\theta}_{t+1} = \bar{\theta}_t - \frac{1}{\gamma_t} M_t \bar{p}_t \varepsilon_{t+1}, \qquad (21) \]
\[ \varepsilon_{t+1} = \bar{F}_t(\bar{p}_t|\bar{\theta}_t) - p_{t+1}, \qquad (22) \]
\[ M_{t+1} = M_t - \frac{1}{\gamma_t} M_t \bar{p}_t (\bar{p}_t)^T M_t, \qquad (23) \]
\[ \gamma_t = 1 + (\bar{p}_t)^T M_t \bar{p}_t. \qquad (24) \]

To compute these equations, we need the three-element vector θ̄t and the 3 × 3 matrix Mt . These
then need to be added to our state variable, giving us


\[ S_t = \big( R_t, (p_t, p_{t-1}, p_{t-2}), (\bar{\theta}_t, M_t) \big). \]

We then have to include equations (21) - (24) in our transition function.
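A minimal sketch (ours; the simulated price process and its coefficients are illustrative) of the recursive least squares updates (21)-(24) as part of the transition function:

```python
import numpy as np

def rls_update(theta_bar, M, p_bar, p_next):
    """One recursive least squares step for the price model (20).
    p_bar = (p_t, p_{t-1}, p_{t-2}); p_next = the observed p_{t+1}."""
    eps = p_bar @ theta_bar - p_next                     # (22): prediction error
    gamma = 1.0 + p_bar @ M @ p_bar                      # (24)
    theta_bar = theta_bar - (M @ p_bar) * eps / gamma    # (21)
    M = M - np.outer(M @ p_bar, p_bar @ M) / gamma       # (23)
    return theta_bar, M

# Illustrative use: learn theta from simulated prices.
rng = np.random.default_rng(0)
true_theta = np.array([0.7, 0.2, 0.05])
theta_bar, M = np.zeros(3), np.eye(3)
prices = [45.0, 44.0, 46.0]
for t in range(500):
    p_bar = np.array(prices[-1:-4:-1])                   # (p_t, p_{t-1}, p_{t-2})
    p_next = true_theta @ p_bar + rng.normal(0.0, 1.0)
    theta_bar, M = rls_update(theta_bar, M, p_bar, p_next)
    prices.append(p_next)
print(theta_bar)
```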

4.3.3 With active learning

We can further generalize our model by assuming that our decision xt to buy or sell energy from or
to the grid can have an impact on prices. We might propose a modified price model given by

\[ p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \bar{\theta}_{t3} x_t + \varepsilon_{t+1}. \qquad (25) \]

All we have done is introduce a single term θ̄t3 xt (which specifies how much we buy/sell from/to
the grid) to our price model. Assuming that θ3 > 0, this model implies that purchasing power from
the grid (xt > 0) will increase grid prices, while selling power back to the grid (xt < 0) decreases
prices. This means that purchasing a lot of power from the grid (for example) means we are more
likely to observe higher prices, which may assist the process of learning θ. When decisions control
or influence what we observe, then this is an example of active learning, which we saw in section 2.3
when we described a pure learning problem.

This change in our price model does not affect the state variable from the previous model, aside
from adding one more element to θ̄t , with the required changes to the matrix Mt . The change will,
however, have an impact on the policy. It is easier to learn θ if there is a nice spread in the prices,
which is enhanced by varying xt over a wide range. This means trying values of xt that do not
appear to be optimal given our current estimate of the vector θ̄t . Making decisions partly just to
learn (to make better decisions in the future) is the essence of active learning, best known in the
field of multiarmed bandit problems.

4.3.4 With rolling forecasts

We are going to assume that we are given a rolling forecast from an outside source. This is quite
common, and yet is surprisingly overlooked in the modeling of dynamic systems (including inven-
tory/storage systems, for which there is an extensive literature). We are going to use rolling forecasts
to illustrate the interaction between the modeling of state variables and the creation of the transition
function.

Imagine that we are modeling the energy $E_t$ from wind, which means we would have to add $E_t$
to our state variable. We need to model how $E_t$ evolves over time. Assume we have a forecast $f_{t,t+1}^E$
of the energy $E_{t+1}$ from wind, which means

\[ E_{t+1} = f_{t,t+1}^E + \varepsilon_{t+1,1}, \qquad (26) \]

where $\varepsilon_{t+1,1} \sim N(0, \sigma_\varepsilon^2)$ is the random variable capturing the one-period-ahead error in the forecast.

Equation (26) needs to be added to the transition equations for our model. However, it introduces
a new variable, the forecast $f_{t,t+1}^E$, which must now be added to the state variable. This means we
now need a transition equation to describe how $f_{t,t+1}^E$ evolves over time. We do this by using a
two-period-ahead forecast, $f_{t,t+2}^E$, which is basically a forecast of $f_{t+1,t+2}^E$, plus an error, giving us

\[ f_{t+1,t+2}^E = f_{t,t+2}^E + \varepsilon_{t+1,2}, \qquad (27) \]

where $\varepsilon_{t+1,2} \sim N(0, 2\sigma_\varepsilon^2)$ is the two-period-ahead error (we are assuming that the variance in a
forecast increases linearly with time). Now we have to put $f_{t,t+2}^E$ in the state variable, which generates
a new transition equation. This generalizes to

\[ f_{t+1,t'}^E = f_{t,t'}^E + \varepsilon_{t+1,t'-t}, \qquad (28) \]

where $\varepsilon_{t+1,t'-t} \sim N(0, (t'-t)\sigma_\varepsilon^2)$. This process illustrates the back and forth between defining the
state variable and creating the transition function that we hinted at earlier.
state variable and creating the transition function that we hinted at earlier.

This stops, of course, when we hit the planning horizon H. This means that we now have to add

\[ f_t^E = (f_{tt'}^E)_{t'=t+1}^{t+H} \]

to the state variable, with the transition equations (28) for $t' = t+1, \ldots, t+H$. Combined with the
learning statistics, our state variable is now

\[ S_t = \Big( (R_t, E_t), (p_t, p_{t-1}, p_{t-2}), (\bar{\theta}_t, M_t), f_t^E \Big). \]




It is useful to note that we have a nice illustration of the three elements of our state variable:

(R_t, E_t) = the physical state variables,
(p_t, p_{t-1}, p_{t-2}) = other information,
((\bar{\theta}_t, M_t), f_t^E) = the belief state, since these parameters determine the distribution of belief about variables that are not known perfectly.
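A minimal sketch (ours; how the new H-period-ahead forecast arrives, and the numbers, are assumptions) of the rolling forecast transition given by equations (26) and (28):

```python
import numpy as np

def forecast_transition(f_E, sigma_eps, rng):
    """f_E[i] holds the forecast f^E_{t,t+1+i}, i = 0, ..., H-1, made at time t.
    Returns the realized energy E_{t+1} (equation (26)) and the forecast vector at time t+1
    (equation (28)); how the new H-period-ahead forecast f^E_{t+1,t+1+H} arrives is an
    assumption here (drawn around the previous longest-lead forecast)."""
    H = len(f_E)
    lead = np.arange(2, H + 1)                           # t' - t for the surviving forecasts
    eps = rng.normal(0.0, sigma_eps * np.sqrt(lead))     # eps_{t+1,t'-t} ~ N(0, (t'-t) sigma^2)
    rolled = f_E[1:] + eps                               # f^E_{t+1,t'} = f^E_{t,t'} + eps_{t+1,t'-t}
    new_tail = rolled[-1] + rng.normal(0.0, sigma_eps)   # illustrative assumption
    E_next = f_E[0] + rng.normal(0.0, sigma_eps)         # E_{t+1} = f^E_{t,t+1} + eps_{t+1,1}
    return E_next, np.append(rolled, new_tail)

rng = np.random.default_rng(0)
f_E = np.full(8, 50.0)                                   # planning horizon H = 8
E, f_E = forecast_transition(f_E, sigma_eps=3.0, rng=rng)
print(E, f_E)
```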

4.4 A probabilist’s perspective of information

We would be remiss in a discussion of state variables if we did not cover how the mathematical
probability community thinks of “information” and “state variables.” We note that this section is
completely optional, as will be seen by the end of the section.

We begin by introducing what is widely known as boilerplate language when describing stochastic
processes.

Let (S0 , W1 , W2 , . . . , WT ) be the sequence of exogenous information variables, beginning


with the initial state (that may contain a Bayesian prior), followed by the exogenous
information contained in Wt . Let ω ∈ Ω be a sample sequence of a truth (contained
in S0 ), and a realization of W1 , . . . , WT . Let F be the σ-algebra (also written “sigma-
algebra”) on Ω, which captures all the events that might be defined on Ω. The set F
is the set of all countable unions and complements of the elements of Ω, which is to
say every possible event. Let P be a probability measure on (Ω, F) (if Ω is discrete, P
would be a probability mass function). Now let Ft = σ(S0 , W1 , . . . , Wt ) be the σ-algebra
generated by the process (S0 , W1 , . . . , Wt ), which means it reflects the subsets of Ω that
we can identify using the information that has been revealed up through time t. The
sequence F0 , F1 , . . . , Ft is referred to as a filtration, which means that Ft ⊆ Ft+1 (as
more information is revealed, we are able to see more fine-grained events on Ω, which
acts like a sequence of filters with increasingly finer openings).

This terminology is known as boilerplate because it can be copied and pasted into any model with
a stochastic process (S0 , W1 , . . . , WT ), and does not change with applications (readers are given
permission to copy this paragraph word for word, but as we show below, it is not necessary).

We need this language to handle the following issue. In a deterministic problem, the decisions
x0 , x1 , . . . , xT represent a sequence of numbers (these can be scalars or vectors). In a stochastic
problem, there is a decision xt for each sample path ω which represents a realization of the entire
sequence (S0 , W1 , . . . , Wt , . . . , WT ). This means that if we write xt (ω), it is as if we are telling xt the
entire sample path, which means it gets to see what is going to happen in the future.

We fix this by insisting that the function xt (ω) be “Ft -measurable,” which means that xt is
not allowed to depend on the outcomes of Wt+1 , . . . , WT . We get the same behavior if we write xt
explicitly as a function (that we call a policy) X π (St ) that depends on just the information in the
state St . Note that the state St is purely a function of the history S0 , W1 , . . . , Wt , in addition to the
decisions x0 = X π (S0 ), x1 = X π (S1 ), . . . , xt = X π (St ).

Theoreticians will use Ft or St to represent “information,” but they are not equivalent. Ft con-
tains all the information in the exogenous sequence (S0 , W1 , . . . , Wt ). The state St , on the other
hand, is constructed from the sequence (S0 , W1 , . . . , Wt ), but only includes the information we need
to compute the objective function, constraints and transition function. Also, St can always be repre-
sented as a vector of real-valued numbers, while Ft is a set of events which contain the information
needed to compute St . The set Ft is more general, hence its appeal to mathematicians, while St
contains the information we actually need to model our problem.

The state St can be viewed from three different perspectives, depending on what time it is:

1) We are at time t - In a sequential decision problem, if we are talking about St then it usually
means we are at time t, in which case St is a particular realization of a set of numbers that
capture everything we need from history to model our system moving forward.

2) We are at time t = 0 - We might be trying to choose the best policy (or some other fixed
parameter), in which case we are at time 0, and St would be a random variable since we do
not know what state we will be in at time t when we are at time t = 0.

3) We are at time t = T - Finally, a probabilist sits at time t = T and sees all the outcomes
Ω (and therefore all the events in F), but from his perspective at time T , if you ask him a
question about St at time t, he will recognize only events in Ft (remember that each event
in Ft is a subset of sample paths ω ∈ Ω). For example, if we are running simulations using
historical data, and we cheat and use information from time t0 > t to make a decision at time
t, that implies we are seeing an event that is in Ft0 , but which is not in Ft . In such a case, our
decision would not be “Ft -measurable.”

Readers without training in measure-theoretic probability will find this language unfamiliar, even
threatening. We will just note that the following statements are all completely equivalent.

1) The policy X π (St ) (or decision xt ) is Ft -measurable.

2) The policy X π (St ) (or decision xt ) is nonanticipative.

3) The policy X π (St ) (or decision xt ) is “adapted.”

4) The policy X π (St ) (or decision xt ) is a function of the state St .

Readers without formal training in measure-theoretic probability will likely find statement (4) to
be straightforward and easy to understand. We are here to say that you only need to understand
statement (4), which means you can write models (and even publish papers) without any of the other
formalism in this section.

5 Partially observable Markov decision processes

Partially observable Markov decision processes (POMDPs) broadly describe any sequential decision
problem that involves learning about an environment that cannot be precisely observed. However, the term is
most often associated with problems where decisions can affect the environment, which was not the
case in our pure learning problems in section 2.3.

We are going to describe our environment in terms of a set of parameters that we are trying to
learn. It is helpful to identify three classes of problems:

1) Static unobservable parameters - These are problems where we are trying to learn the values
of a set of static parameters, which might be the response of a function given different inputs,
or the parameters characterizing an unknown function. These experiments could be run in a
simulator or laboratory, or in the field. Examples are:

• The strength of material resulting from the use of different catalysts.


• Designing a business system using a simulator. We might be designing the layout of an
assembly line, evaluating the number of aircraft in a fleet, or finding the best locations of
warehouses in a logistics network.
• Evaluating the parameters of a policy for stocking inventory or buying stock.
• Controlling robots moving in a static but unknown environment.

2) Dynamic unobservable parameters - Now we are trying to learn the value of parameters
that are evolving over time. These come in two flavors:

2a) Exogenous, uncontrollable process - These are problems where environmental pa-
rameters will evolve over time due to a purely exogenous source:
• Demand for hotel rooms as a function of price in changing market conditions (where
our decisions do not affect the market).

• Robots moving in an uncertain environment that is changing due to weather.
• Finding the best path through a congested network after a major road has been closed
due to construction forcing people to explore new routes.
• Finding the best price for a ridesharing fleet to balance drivers and riders. The best
price evolves as the number of drivers and riders changes over the course of the day.
2b) Controllable process - These are problems where the controlling agent makes decisions
that directly or indirectly affect environmental parameters:
• Equipment maintenance - We perform inspections to determine the state of the ma-
chine, and then perform repairs which changes the state of the equipment.
• Medical treatments - We test for a disease, and then treat using drugs which changes
the progression of the disease.
• Managing a utility truck after a storm, where the truck is both observing and repairing
damaged lines, while we control the truck.
• Invasive plant or animal species management - We perform inspections, and imple-
ment steps to mitigate further spread of the invasive species.
• Spread of the flu - We can take samples from the population, and then administer flu
vaccinations to reduce incidence of the disease.

Class (1) represents our pure learning problems (which we first touched on in section 2.3), often
referred to as multiarmed bandits, although the arms (choices) may be continuous and/or vector
valued. The important characteristic of class (1) is that our underlying problem (the environment)
is assumed to be static. Also, we may be learning in the field, where we are interested in optimizing
cumulative rewards (this is the classic bandit setting) or in a laboratory, where we are only interested
in the final performance.

Class (2) covers problems where the environment is changing over time. Class (2a) covers prob-
lems where the environment is evolving exogenously. This has been widely studied under the umbrella
of “restless bandits.” The modeling of these systems is similar to class (1). Class (2b) arises when our
decisions directly or indirectly affect the parameters that cannot be observed, which is the domain
of POMDPs. We are going to illustrate this with a problem to treat flu in a population.

Earlier we noted that Markov decision problems are often represented by the tuple (S, X , P, r)
(we use X for action space, but standard notation in this community is to use a ∈ A for action).
The POMDP community extends this representation by representing the POMDP as the tuple
(S, X , P, r, W obs , P obs ) where S, X , P and r are as they were with the basic MDP, W obs is the space
of possible observations that can be made of the environment, and P obs is the “observation function”

where

\[ P^{obs}(w^{obs}|s) = \text{Prob}[W^{obs} = w^{obs} \mid s], \]

which is the probability we observe outcome $W^{obs} = w^{obs}$ when the unobservable state is $S = s$.

The notation for modeling POMDPs is not standard. Different authors may use Z or Y for the
observation, and may use Z, Y or Ω for the space of outcomes. Some will use O for outcome and O
for the “observation function.” Our choice of P obs (wobs |s) (which is not standard) helps us to avoid
the use of “O” for notation, and makes it clear that it is the probability of making an observation,
which parallels our one-step transition matrix P which describes the evolution of the “state” s.

Remark: There is a bit of confusion in the modeling of uncertainty in POMDPs. The tuple
(S, X , P, r, W obs , P obs ) represents uncertainty in both the transition matrix P , and then through
the pair (W obs , P obs ) which captures both observations and the probability of making an observa-
tion. Recall that we represent the transition function in equation (1) using $s' = S^M(s, x, w)$ where
W = w is our “exogenous information.” This is the exogenous information (that is, the random
inputs) that drives the evolution of our “physical system” with state S = s. We use this random
variable to compute our one-step transition matrix from the transition function using

\[ p(s'|s,x) = \mathbb{E}_W \big\{ \mathbb{1}_{\{s' = S^M(s,x,W)\}} \big\}. \qquad (29) \]

The random variables W and W obs may be the same. For example, we may have a queueing system
where W is the random number of customers arriving, which we are allowed to observe. The unknown
parameter s might be the arrival rate of customers, which we can estimate using W . In other settings
W and W obs may be completely different. For example, W might be the random transmission of
disease in a population, while W obs is the outcome of random samples.
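A small sketch (ours; the bounded-inventory example is purely illustrative) of how equation (29) connects a transition function to a one-step transition matrix, estimating the expectation by Monte Carlo:

```python
import numpy as np

def transition_matrix(S_M, states, x, sample_W, n_samples=10_000, seed=0):
    """Estimate p(s'|s,x) = E_W { 1{s' = S^M(s,x,W)} } by sampling W, as in (29)."""
    rng = np.random.default_rng(seed)
    index = {s: i for i, s in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for s in states:
        for _ in range(n_samples):
            W = sample_W(rng)
            P[index[s], index[S_M(s, x, W)]] += 1.0 / n_samples
    return P

# Toy example: a bounded inventory with random demand W and replenishment decision x.
states = list(range(6))
S_M = lambda s, x, W: min(5, max(0, s + x - W))
sample_W = lambda rng: rng.poisson(1.0)
print(np.round(transition_matrix(S_M, states, x=1, sample_W=sample_W), 3))
```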

What is important is that the standard modeling representation for POMDPs is as a single,
extended problem. The POMDP framework ensures that the choice of action a (decision x in our
notation) is not allowed to see the state of the system, but it does assume that the transition function
P and the observation function P obs are both known. Later we are going to offer a different approach
for modeling POMDPs. Before we do this, it is going to help to have an actual example in mind.

6 A learning problem: protecting against the flu

We are going to use the problem of protecting a population against the flu as an illustrative exam-
ple. It will start as a learning problem with an unknown but controllable parameter, which is the
prevalence of the flu in the population. We will use this to illustrate different classes of policies, after
which we will propose several extensions.

6.1 A static model

Let µ be the prevalence of the flu in the population (that is, the fraction of the population that has
come down with the flu). In a static problem where we have an unknown parameter µ, we make
observations using

\[ W_{t+1} = \mu + \varepsilon_{t+1}, \qquad (30) \]

where the noise $\varepsilon_{t+1} \sim N(0, \sigma_W^2)$ is what keeps us from observing $\mu$ perfectly.

We express our belief about $\mu$ by assuming that $\mu \sim N(\bar{\mu}_t, \bar{\sigma}_t^2)$. Since we fix the assumption of
normality, we express our belief about $\mu$ as $B_t = (\bar{\mu}_t, \bar{\sigma}_t^2)$. We are again going to express uncertainty
using $\beta_t = 1/\bar{\sigma}_t^2$, which is the precision of our estimate of $\mu$, and $\beta^W = 1/\sigma_W^2$, which is the precision
of our observation noise $\varepsilon_{t+1}$.

We need to estimate the number of people with the disease by running tests, which produces the
noisy estimate $W_{t+1}$. We represent the decision to run a test by the decision variable $x_t^{obs}$, where

\[ x_t^{obs} = \begin{cases} 1 & \text{if we observe the process and obtain } W_{t+1}, \\ 0 & \text{if no observation is made.} \end{cases} \]

If $x_t^{obs} = 1$, then we observe $W_{t+1}$, which we can use to update our belief about $\mu$ using

\[ \bar{\mu}_{t+1} = \frac{\beta_t \bar{\mu}_t + \beta^W W_{t+1}}{\beta_t + \beta^W}, \qquad (31) \]
\[ \beta_{t+1} = \beta_t + \beta^W. \qquad (32) \]

If $x_t^{obs} = 0$, then $\bar{\mu}_{t+1} = \bar{\mu}_t$ and $\beta_{t+1} = \beta_t$.

For this problem our state variable is our belief about µ, which we write

\[ S_t = B_t = (\bar{\mu}_t, \beta_t). \]

If this was our problem, it would be an instance of a one-armed bandit. We might assess a cost for
making an observation, along with a cost of uncertainty. For example, assume we have the following
costs:

c^{obs} = the cost of sampling the population to estimate the number of people infected with the flu,
C^{unc}(S_t) = the cost of uncertainty = c^{unc} \bar{\sigma}_t,
C(S_t, x_t) = c^{obs} x_t^{obs} + C^{unc}(S_t).

Using this information, we can put this model in our canonical framework as follows:

State variables St = (µ̄t , βt ).

Decision variables xt = xobs_t, determined by our policy X obs(St) (to be determined later).

Exogenous information Wt+1, which is our noisy estimate of how many people have the flu from equation (30) (and we only obtain this if xobs_t = 1).

Transition function Equations (31) and (32).

Objective function We would write our objective as

max_π E { Σ_{t=0}^{T} C(St, xt) | S0 }.   (33)

We now need a policy X obs(St) to determine xobs_t. We can use any of the four classes of policies described in section 3. We sketch examples of policies in section 8 below.
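
To make the pieces of the canonical model concrete, the sketch below simulates the model forward in time under a generic observation policy and accumulates the costs appearing in the objective (33). All numerical values (the prior belief, the noise level, and the costs cobs and cunc) are illustrative assumptions.

```python
# A minimal sketch of simulating the canonical model above for a given observation policy X_obs.
import numpy as np

def simulate(X_obs, mu_truth=0.08, T=50, sigma_W=0.02, c_obs=1.0, c_unc=10.0, seed=0):
    rng = np.random.default_rng(seed)
    beta_W = 1.0 / sigma_W**2
    mu_bar, beta = 0.05, 1.0 / 0.03**2                  # initial belief S_0 = (mu_bar_0, beta_0)
    total_cost = 0.0
    for t in range(T):
        x_obs = X_obs(mu_bar, beta)                     # the policy maps S_t = (mu_bar_t, beta_t) to {0, 1}
        total_cost += c_obs * x_obs + c_unc * np.sqrt(1.0 / beta)      # C(S_t, x_t)
        if x_obs == 1:
            W = mu_truth + rng.normal(0.0, sigma_W)                    # equation (30)
            mu_bar = (beta * mu_bar + beta_W * W) / (beta + beta_W)    # equation (31)
            beta = beta + beta_W                                       # equation (32)
    return total_cost

# Example: evaluate a simple rule that observes whenever the precision is still low.
cost = simulate(lambda mu_bar, beta: int(beta < 1.0 / 0.01**2))
```

Tuning a policy then amounts to searching over its parameters to optimize this simulated objective.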

6.2 Variations of our flu model

We are going to present a series of variations of our flu model to bring out different modeling issues:

• A time-varying model

• A time-varying model with drift

• A dynamic model with a controllable truth

• A flu model with a resource constraint and exogenous state

• A spatial model

These variations are designed to bring out the modeling issues that arise when we have an evolving
truth (with known dynamics), an evolving truth with unknown dynamics (the drift), an unknown
truth that we can control (or influence), followed by problems that introduce the dimension of having
a known and controllable physical state.

6.2.1 A time-varying model

If the true prevalence of the flu is evolving exogenously (as we would expect in this application),
then we would write the true parameter as depending on time, µt , which might evolve according to

µt+1 = max{0, µt + εµt+1},   (34)

where εµt+1 ∼ N(0, σε,2) describes how our truth is evolving. If the truth evolves with zero mean and known variance σε,2, our belief state is the same as it was with a static truth (that is, St = (µ̄t, βt)). What does change is the transition function, which now has to reflect both the noise of an observation εt+1 as well as the uncertainty in the evolution of the truth, captured by εµt+1.

Remark: When µ was a constant, we did not have a problem referring to it as a parameter, whereas
the state of our system is the belief which evolves over time (state variables should only include
information that changes over time). When µ is changing over time, in which case we write it as µt ,
then it is more natural to think of the value of µt as the state of the system, but not observable to
the controller. For this reason, many authors would refer to µt as a hidden state. However, we still
have the belief about µt , which creates some confusion: What is the state variable? We are going to
resolve this confusion below.

6.2.2 A time-varying model with drift

Now assume that

εµt+1 ∼ N(δ, σε,2).

If δ ≠ 0, then it means that µt is drifting higher or lower (for the moment, we are going to assume
that δ is a constant). We do not know δ, so we would assign a belief such as

δ ∼ N (δ̄t , σ̄tδ,2 ).

Again let the precision be given by βtδ = 1/σ̄tδ,2 .

We might update our estimate of our belief about δ using

δ̂t+1 = Wt+1 − Wt .

Now we can update our estimate of the mean and variance of our belief about δ using

δ̄t+1 = (βδt δ̄t + βW δ̂t+1) / (βδt + βW),   (35)
βδt+1 = βδt + βW.   (36)

In this case, our state variable becomes

St = Bt = ((µ̄t, βt), (δ̄t, βδt)).


Here, we are modeling only the belief about µt , while µt itself is just a dynamically varying parameter.
This changes in the next example.
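
A minimal sketch of the drift update (35)-(36) is given below; as above, δ̂t+1 = Wt+1 − Wt serves as the noisy observation of the drift.

```python
# A minimal sketch of the drift-belief update in equations (35)-(36).
def update_drift_belief(delta_bar, beta_delta, W_next, W_prev, beta_W):
    delta_hat = W_next - W_prev                       # noisy observation of the drift
    delta_bar_next = (beta_delta * delta_bar + beta_W * delta_hat) / (beta_delta + beta_W)  # (35)
    beta_delta_next = beta_delta + beta_W             # (36)
    return delta_bar_next, beta_delta_next
```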

6.2.3 A dynamic model with a controllable truth

Now consider what happens when our decisions might actually change the truth µt . Let

xvac_t = the number of vaccination shots we administer in the region.

We assume that the vaccination shots reduce the presence of the disease by θvac for each vaccinated patient, which is xvac_t. We are going to assume that the decision made at time t is not implemented until time t + 1. This gives us the following equation for the truth

µt+1 = max{0, µt − θvac xvac_t−1 + εµt+1}.   (37)

We express our belief about the presence of the disease by assuming that it is Gaussian, where µt ∼ N(µ̄t, σt²). Again letting the precision be βt = 1/σt², our belief state is Bt = (µ̄t, βt), with transition equations similar to those given in equations (9) and (10) but adjusted by our belief about what our decision is doing. If we make an observation (that is, if xobs_t = 1), then

µ̄t+1 = (βt (µ̄t − θvac xvac_t−1) + βW Wt+1) / (βt + βW),   (38)
βt+1 = βt + βW.   (39)

If xobs_t = 0, then µ̄t+1 = µ̄t − θvac xvac_t−1, and βt+1 = βt.
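
A minimal sketch of the adjusted update (38)-(39) is shown below; theta_vac is the controller's belief about the effect of one vaccination, and the names are ours.

```python
# A minimal sketch of the belief update in equations (38)-(39): the prior mean is first
# shifted by the believed effect of last period's vaccinations, theta_vac * x_vac_prev.
def update_belief_with_vaccination(mu_bar, beta, x_vac_prev, theta_vac, W, beta_W, x_obs):
    prior_mean = mu_bar - theta_vac * x_vac_prev
    if x_obs == 1:
        mu_bar_next = (beta * prior_mean + beta_W * W) / (beta + beta_W)   # equation (38)
        beta_next = beta + beta_W                                          # equation (39)
    else:
        mu_bar_next, beta_next = prior_mean, beta
    return mu_bar_next, beta_next
```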

This setting introduces a modeling challenge: Is the state µt ? Or is it the belief (µ̄t , βt )? When
µt was static or evolved exogenously, it seemed clear that the state was our belief about µt . However,
now that we can control µt , it seems more natural to view µt as the state. This problem is an instance
of a partially observable Markov decision problem. Later we are going to review how the POMDP
community models these problems, and offer a different approach.

This problem has an unobservable state that is controllable. The next two problems will introduce
the dimension of combining both observable and unobservable states that are both controllable.

6.2.4 A flu model with a resource constraint and exogenous state

Now imagine that we have a limited number of vaccinations that we can administer. Let R0 be the number of vaccinations we have available. Our vaccinations xvac_t have to be drawn from this inventory. We might also introduce a decision xinv_t to add to our inventory (at a cost). This means our inventory evolves according to

Rt+1 = Rt + xinv_t−1 − xvac_t−1,

where we require xvac_t−1 ≤ Rt. We still have our decision of whether to observe the environment xobs_t, so our decision variables are

xt = (xinv_t, xvac_t, xobs_t).

While we are at it, we might as well include information about the weather, such as the temperature Ittemp and humidity Ithum, which can contribute to the spread of the flu. We would model these in our "other information" variable

It = (Ittemp, Ithum).

Our state variable becomes

St = (Rt, (Ittemp, Ithum), (µ̄t, βt)).   (40)

We now have a combination of a controllable physical state Rt that we can observe perfectly, exoge-
nous environmental information It = (Ittemp, Ithum), and the belief state Bt = (µ̄t, βt) which captures
our distribution of belief about the controllable state µt that we cannot observe.

Note how quickly our solvable two-dimensional problem just became a much larger five-dimensional
problem. This is a big issue if we are trying to use Bellman’s equation, but only if we are using a
lookup table representation of the value function (otherwise, we do not care).

6.2.5 A spatial model

Imagine that we have to allocate our supply of flu vaccines over a set of regions I. For this problem, we have a truth µti and belief (µ̄ti, βti) for each region i ∈ I. Next assume that xvac_ti is the number of vaccines allocated to region i, which is subject to the constraint

Σ_{i∈I} xvac_ti ≤ Rt.   (41)

Our inventory Rt now evolves according to

Rt+1 = Rt + xinv_t − Σ_{i∈I} xvac_ti.

The coupling constraint (41) prevents us from solving for each region independently. This produces the state variable

St = (Rt, (µ̄ti, βti)i∈I).   (42)

What we have done with this extension is to create a state variable that is potentially very high
dimensional, since spatial problems may easily range from hundreds to thousands of regions.

6.2.6 Notes

The problems in this section were chosen to bring out a series of modeling issues. We made the
transition from learning a static parameter µ to learning a dynamic parameter µt with no drift, and
then a problem with an unknown drift. All three of these problems involved state variables that
described our beliefs about unknown parameters.

Then, we made the truth µt controllable, which is when we found that we could model the system
where the state was µt (which is not observable), or the belief about µt . This problem falls in the
domain of the POMDP community, where µt would be the state of our system, and then we have a
belief about this state.

We then closed with two problems that combined the controllable but unobservable parameter
µt , along with controllable but observable parameter Rt .

We are now going to describe how the POMDP community approaches problems with control-
lable, but unobservable, states. After this, we are going to present a perspective that introduces a
fresh way of thinking about these problems that opens up new solution approaches that draw on the
four classes of policies described in section 3.

6.3 The POMDP perspective

The POMDP community approaches the controllable version of our flu problem by viewing it as a
dynamic program with state µt and action xt that controls (or at least influences) this state. Viewed
from this perspective, µt is the state of the system. Any reference to “the state” refers to the current
value of µt . In our resource constrained system, we would add Rt to the state variable giving us
St = (Rt , µt ), but for now we are going to focus on the unconstrained problem.

The community then shifts to the idea of modeling the belief about µt , and then introduces the
“belief MDP” where the belief is the state (instead of µt ).

The problem with these two versions of a Markov decision process is that there is not a clear
model of who knows what. There is also the confusion of a “state” s (sometimes called the physical
state), and our belief b(s) giving the probability that we are in state s, where b(s) is its own state
variable!! This issue arises not just in who has access to the value of µt, but also in who has information about
the transition function. In section 7, we are going to offer a new model that resolves the confusion
about these two perspectives.

To help us present the POMDP perspective, we are going to make three assumptions:

A1 Our state space (that is, the possible values of µt ) is discrete, which means we can write St ∈ S =
{s1 , . . . , sK }. Note that this state space can become quite large when we have more than one
dimension, as occurred with our other models (think about our spatially distributed problem).

A2 We are solving the problem in steady state.

A3 We can compute the one-step transition matrix p(s′|s, x), which is the probability that we transition to S = s′ given that we are in state s and take action x. It is important to remember that p(s′|s, x) is computed using

p(s′|s, x) = ES EW|S {1{St+1 = s′ = S M (s,x,W)} | St = s}.

The first expectation ES captures our uncertainty about the state S, while the second expectation EW|S captures the noise in the observation of S. Computing the one-step transition matrix p(s′|s, x) means we need to know both the transition function S M (s, x, W) and the probability distribution for W.

Table 1 presents the notation we use in our model, which parallels the notation for our original model.
The structure of our model, however, is completely standard. Our model follows the presentation in
Ross et al. (2008).

With these assumptions, we can formulate the familiar form of Bellman's equations for discrete states and actions

V(s) = max_x ( C(s, x) + Σ_{s′∈S} p(s′|s, x) V(s′) ).   (43)

The problem with solving Bellman’s equation (43) to determine actions is that the controller
determining x is not able to see the state s. The POMDP community addresses this by creating a

Physical (unobservable) system
  St = s = µt       Physical (unobservable) state s (e.g. µt)
  xt                Decision (made by the controller) that acts on St
  Wt+1              Exogenous information impacting the physical state St
  S M (s, x, w)     Transition function for the physical state St = s given xt = x and Wt+1 = w
  p(s′|s, x)        Prob[St+1 = s′ | St = s, xt = x]
Controller system
  b(s)              Probability (belief) we are in physical (unobservable) state s
  W obs             Noisy observation of the physical state St = s
  𝒲obs              Space of outcomes of W obs
  P obs(w|s)        Probability of observing W obs = w given St = s

Table 1: Table of notation for POMDPs

belief b(s) for each state s ∈ S. At any point in time, we can only be in one state, which means

Σ_{s∈S} b(s) = 1.

The POMDP literature then creates what is known as the belief MDP in terms of the belief state
vector b = (b(s1 ), . . . , b(sK )) = (b1 , . . . , bK ). This is a dynamic program whose state is given by the
continuous vector b. We next introduce the transition function for the belief vector b given by

B M (b, x, W ) = The transition function that gives the probability vector b′ = B M (b, x, W )
when the current belief vector (the prior) is b, we make decision x and then
observe the random variable W , which is a noisy observation of µt if we have
chosen to make an observation.

The function B M (b, x, W ) returns a vector b′ that has an element b′(s) for each physical state s. We
will write B M (b, x, W )(s) to refer to element s of the vector returned by B M (b, x, W ).

The belief transition function is an exercise in Bayes' theorem. Let bt(s) be the probability we are in state St = s (this is our prior) at time t. We assume we have access to the distribution

P obs(wobs|s) = the probability that we observe W obs = wobs if we are in state s.

Assume we are in state St = s, take action xt and observe Wobs_t+1 = wobs. The updated belief

distribution bt+1(s′) would then be

bt+1(s′|bt, xt, Wobs_t+1 = wobs) = B M (bt, xt, Wobs_t+1 = wobs)(s′)
 = Prob[St+1 = s′ | bt, xt, Wobs_t+1 = wobs]
 = Prob[Wobs_t+1 = wobs | bt, xt = x, St+1 = s′] Prob[St+1 = s′ | bt, xt] / Prob[Wobs_t+1 = wobs | bt, xt]   (44)
 = ( Prob[Wobs_t+1 = wobs | bt, xt, St+1 = s′] Σ_{s∈S} Prob[St+1 = s′ | St = s, bt, xt] Prob[St = s | bt, xt] ) / Prob[Wobs_t+1 = wobs | bt, xt]   (45)
 = ( Prob[Wobs_t+1 = wobs | St+1 = s′] Σ_{s∈S} Prob[St+1 = s′ | St = s, xt] bt(s) ) / Prob[Wobs_t+1 = wobs | bt, xt]   (46)
 = P obs(wobs|s′) Σ_{s∈S} P(s′|s, xt) bt(s) / P obs(wobs|bt, xt).   (47)

Equation (44) is a straightforward application of Bayes' theorem, where all probabilities are conditioned on the decision xt and the prior bt (which has the effect of capturing history). Equation (45) handles the transition from conditioning on the belief b(s) that St = s, to the state St+1 = s′ from which observations are made. The remaining equations reduce (45) by recognizing when conditioning on bt(s) does not matter, and substituting in the names of the variables for the different probabilities.
We compute the denominator in equation (44) using

Prob[Wobs_t+1 = wobs | bt, xt] = Σ_{s′∈S} Prob[Wobs_t+1 = wobs | St+1 = s′] Σ_{s∈S} Prob[St+1 = s′ | St = s, xt] Prob[St = s | bt, xt]   (49)
 = Σ_{s′∈S} P obs(wobs|s′) Σ_{s∈S} P(s′|s, xt) bt(s).   (50)

Equation (47) is fairly straightforward to compute as long as the state space S is not too large (actually, it has to be fairly small), the observation probability distribution P obs(wobs|S = s, x) is known, and the one-step transition matrix P(s′|s, x) is known. Knowledge of P obs(wobs|S = s, x) requires an understanding of the structure of the process of observing the unknown system. For example, if we are sampling the population to learn about who has the flu, we might use a binomial sampling distribution to capture the probability that we sample someone with the flu. Knowledge of the one-step transition matrix P(s′|s, x), of course, requires an understanding of the underlying dynamics of the physical system.

This said, note that we have three summations over the state space to compute a single value of bt+1(s′). This has to be repeated for each s′ ∈ S, and it has to be computed for each action xt and observation Wobs_t+1. That is a lot of nested loops. The problem is that we are modeling two transitions: the evolution of the state St, and the evolution of the belief vector bt(s). This would not be an issue if we were just simulating the two systems. Equations (47) and (50) require computing expectations to find the transition probabilities for both the physical state St and the belief state bt.
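
By contrast, applying the belief transition to a single sampled observation is inexpensive. The sketch below computes B M (b, x, wobs) from equation (47), with the normalizer from equation (50); the arrays P (the one-step transition matrices) and P_obs (the observation distributions) are assumed known, as in assumption A3.

```python
# A minimal sketch of the belief transition B^M(b, x, w_obs) in equation (47).
# P[x] is a K-by-K array with P[x][s, s_next] = p(s'|s, x); P_obs[w] is a length-K
# array with P_obs[w][s_next] = P^obs(w|s').
import numpy as np

def belief_transition(b, x, w_obs, P, P_obs):
    predicted = P[x].T @ b                     # sum_s p(s'|s,x) b(s), for each s'
    unnormalized = P_obs[w_obs] * predicted    # multiply by P^obs(w_obs|s')
    return unnormalized / unnormalized.sum()   # the denominator is equation (50)
```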

The POMDP community then approaches solving this dynamic program through Bellman's equation (for steady state problems) that can be written

V(bt) = max_x ( C(bt, x) + E{V(B M (bt, x, Wobs_t+1)) | bt, x} ).

It is better to expand the expectation operator over the actual random variables that are involved. Assume that we are in physical state St with belief vector bt. The expectation would then be written

V(bt) = max_x ( C(bt, x) + E_{St|bt} E_{St+1|St} E_{Wobs_t+1|St+1} {V(B M (bt, x, Wobs)) | bt, x} ).   (51)

Here, E_{St|bt} integrates over the state space for St using the belief distribution bt(s). E_{St+1|St} takes the expectation over St+1 given St. Finally, E_{Wobs_t+1|St+1} integrates over the space of observations given that we are in state St+1. These expectations would be computed using

V(bt) = max_x ( C(bt, x) + Σ_{s∈S} bt(s) Σ_{s′∈S} p(s′|s, x) Σ_{wobs∈𝒲obs} P obs(wobs|s, x) V(B M (bt, x, wobs)) ).   (52)

If equation (52) can be solved, then the policy for making decisions for the controller is given by

X∗(bt) = arg max_x ( C(bt, x) + Σ_{s∈S} bt(s) Σ_{wobs∈𝒲obs} P obs(wobs|s, x) V(B M (bt, x, wobs)) ).

As if the computations behind these equations were not daunting enough, we need to also realize that we are combining the decisions of the controller with a knowledge of the dynamics (captured by the transition matrix) of the physical system. The one-step transition matrix requires knowledge of both the transition function S M (s, x, W) and the distribution of W, which will not always be known to the controller. We are going to illustrate a setting where the transition function is not known to the controller below.

A challenge here is that even though we have discretized the unobservable state (µ for this problem), the vector b is continuous. However, Bellman's equation using the state b has some nice properties that the research community has exploited. Just the same, it is still limited to problems where the state space S of the unobservable system is relatively small. Keep in mind that small problems can easily produce state spaces of 10,000, and we could never execute these equations with a state space that large.

We need to emphasize that while the POMDP community typically turns to Bellman’s equation
to solve a sequential decision problem, in practice this will rarely be computationally feasible. This
realization has motivated the development of a series of approximation strategies which are summa-
rized in Ross et al. (2008). One class of approximations that has attracted considerable attention is
to restrict belief states to a sampled set B̂.


State variables Stenv = (µt, δ) (we include the drift δ, even if it is not changing).
Decision variables There are no decisions.
Exogenous information Wenv_t+1 = εµt+1.
Transition function Senv_t+1 = S M,env(Stenv, Wenv_t+1), which includes equation (34) describing the evolution of µt.
Objective function Since there are no decisions, we do not have an objective function.

Figure 2: The canonical model of the environment.

We urge readers to formulate the problem in terms of the canonical model given in section 2,
and then to consider all four classes of policies described in section 3. Note that all four classes of
policies are widely used for sequential decision problems, including those without a belief state (such
as resource allocation problems), pure learning problems (where the only state variable is the belief
state), as well as hybrids, such as the resource constrained problems. Section 8 sketches examples
of each of the four classes of policies, none of which suffer from the curses of dimensionality that we
encounter in equation (47).

7 A two-agent model of the flu application

We now offer a completely different approach for modeling POMDPs using the context of our flu
application. We start by presenting a two-agent model of the POMDP for the flu problem. We then
turn our attention to knowledge of the transition function itself, rather than just parameters.

7.1 A two-agent formulation of the POMDP

There are two perspectives that we can take in any POMDP: one from the perspective of the envi-
ronment, and one from the perspective of the controller that makes decisions:

The environment perspective The environment (sometimes called the “ground truth”) knows
µt , but cannot make any decisions (nor does it do any learning).

The controller perspective The controller makes decisions that affect the environment, but is
not able to see µt . Instead, the controller only has access to the belief about µt .

The model of the environment agent is given in figure 2. The model of the controlling agent is given
in figure 3.

It is best to think of the two perspectives as agents, each working in their own world. There is
the “environment agent” which does not make decisions, and the “controlling agent” which makes


State variables Stcont = ((µ̄t, βt), (δ̄t, βδt)).
Decision variables xt = (xvac_t, xobs_t).
Exogenous information Wcont_t+1, which is our noisy estimate of how many people have the flu (and we only obtain this if xobs_t = 1).
Transition function Scont_t+1 = S M,cont(Stcont, xt, Wcont_t+1), which consists of equations (38) and (39).
Objective function We can write this in different ways. Assuming we are implementing this in a field situation, we want to optimize cumulative reward. Let:

cobs = The cost of sampling the population to estimate the number of people infected with the flu,
Cvac(µ̄t) = the cost we assess when we think that the number of infected people is µ̄t.

Now let Ccont(St, xt) = cobs xobs_t + Cvac(µ̄t) be the cost at time t when we are in state St and make decision xt (note that xvac_t impacts St+1). Finally, we want to optimize

max_π E { Σ_{t=0}^{T} Ccont(St, Xπt(St)) | S0 }.   (53)

Figure 3: The canonical model of the controlling agent.

     Stenv                             Stcont                              Description
1)   (µt)                              (µ̄t, βt)                           Static, unknown truth
2)   (µt), (Ittemp, Ithum)             (Rt, (Ittemp, Ithum), (µ̄t, βt))    Resource constrained with exogenous information
3)   (µt, δ)                           ((µ̄t, βt), (δ̄t, βδt))             Dynamic model with uncertain drift
4)   (µt, xvac_t−1, θvac)              (µ̄t, βt)                           Dynamic model with a controllable truth
5)   ((µti)i∈I, xvac_t−1, θvac)        (Rt, (µ̄t, βt))                     Resource constrained model
6)   ((µti)i∈I, xvac_t−1, θvac)        (Rt, (µ̄ti, βti)i∈I)                Spatially distributed model

Table 2: Environmental state variables and controller state variables for different models.

decisions, and performs learning about the environment that cannot be observed (such as µt ). Once
we have identified our two agents, we need to define what is known by each agent. This begins with
who knows what about parameters such as µt , but it does not stop there.

Table 2 shows the environmental state and controlling state variables for each of the variations
of our flu problem that we presented in section 6.2. A few observations are useful:

• The two-agent perspective means we have two systems. The environment agent is a simple
system with no decisions, but with access to µt and the dynamics of how vaccinations affect µt .
The state of the system for the environment agent is Stenv which includes µt . The state for the
system for the controlling agent, Stcont , is the belief about µt , along with any other information
known to the controlling agent such as Rt . The two systems are completely distinct, beyond
the ability to communicate.

• In model 2, we model the temperature Ittemp and humidity Ithum as state variables for both the

environment, which presumably would control changes to these variables, and the controlling
agent, since we have assumed that the controlling agent is able to observe these perfectly. We
could, of course, insist that the controller can only observe these through imperfect instruments,
in which case they would be handled in the same way we handle µt .

• Normally a state variable St should only include information that changes over time (otherwise
the information would go in the initial state S0 ). For this presentation, we included infor-
mation such as the drift δ (model 3) and the effect of vaccinations on the prevalence of the
flu θvac (model 4) in the environmental state variables to indicate information known to the
environment but not to the controlling agent.

• In model 4, we include the decision xvac_t−1 in the state variable for the environment. We assume that the controlling agent makes the decision to vaccinate xvac_t−1 at time t − 1, which is then communicated to the environment (which is how it gets added to Stenv) and is then implemented during time period t. The information arrives to the environment through the exogenous information variable Wtenv.

• For models 5 and 6, we see how quickly we can go from two or three dimensions, to hundreds
or thousands of dimensions. The spatially distributed model cannot be solved using standard
discrete representations of state spaces, but approximate dynamic programming has been used
for very high-dimensional resource allocation problems (see Simao et al. (2009) and Bouzaiene-
Ayari et al. (2016)).

In addition to modeling what each agent knows, we have to model communication. This will
become an important issue when we model multiple controlling agents which we address in section 9.
For our problem with a single controlling agent and a passive environment, there are only two types
of communication: 1) the ability of the controlling agent to observe the environment (with noise)
and 2) the communication of the decision xvac_t to the environment.

It is not hard to see that any learning problem can (and we claim should) be presented using this
“two-agent” perspective.

Below we are going to illustrate a setting where the controlling agent does not know the transition
function. When we move to multiagent systems, we will add the dimension of learning the policies
of other agents.

7.2 Transition functions for two-agent model

Our two-agent model has focused on what each agent knows (the state variable), but there is another
dimension that deserves a closer look, which is the transition function. Assume that the true model
describing the evolution of µt (known only to the environment) is

µt+1 = θµ0 µt + θµ24 µt−24 + (θtemp0 Ut + θtemp1 Ut−1 + θtemp2 Ut−2) − (θvac1 xvac_t−1 + θvac2 (xvac_t−1)²) + εµt+1,   (54)

where

Ut = (max{0, Ittemp − Ithreshold})²

and Ithreshold is a threshold temperature (say, 25 degrees F) below which colds and sneezing begin to spread the flu. The inclusion of temperature over the current and two previous time periods captures the lag in the onset of the flu due to cold temperatures.

For certain classes of policies, the controlling agent needs to develop its own model of the evolution
of the flu. The controlling agent would not know the true dynamics in equation (54) and might instead
use the following time-series model for the observed number of flu cases Wt :

Wt+1 = θW0 Wt + θW1 Wt−1 + θW2 Wt−2 − θvac xvac_t−1 + εW_t+1.   (55)

Our model in equation (55) is a reasonable time-series model for the sequence of observations
W1 , . . . , Wt to predict Wt+1 . There are, however, several errors in this model:

• The controlling agent is using observations Wt , Wt−1 and Wt−2 while the environment uses µt ,
which is not observable to the POMDP.

• The controlling agent did not realize there was a 24-hour lag in the development of the flu.

• The controlling agent is ignoring the effect of temperature.

• The controlling agent is not properly capturing the effect of vaccinations on infections.

Just the same, ignorance is bliss and our controlling agent moves forward with his best effort at
modeling the evolution of the flu. Assume that the model is a reasonable fit of the data. We suspect
that a careful examination of the errors (they should be independent and identically distributed)
might fail a proper statistical test, but it is also possible that we cannot reject the hypothesis that
the errors do satisfy the appropriate conditions. This does not mean that the model is correct - it
just means that we do not have the data to reject it.

Now imagine that a graduate student is writing a simulator for the flu model, and assume that
there is only one person writing the code (which is typically what happens in practice). Our erstwhile
graduate student will create the true transition equation (54). When she goes to create the transition
model used by the controller, she would create the best approximation possible given the information

she was allowed to use, but she would know immediately that there are a number of errors in her
approximation. This would allow her to declare that this model is “non-Markovian,” but it is only
because she is using her knowledge of the true model.

It is useful to repeat the famous quote of George Box that “all models are wrong, but some are
useful.” We suspect that most (if not all) of the errors that are inherent in any statistical model
would allow us to show, given enough data, that the errors εt are not independent across time,
although the errors may be small enough that, given our dataset, we cannot reject the hypothesis
that they are independent. This supports our second claim

Claim 2: All models of real problems are (possibly) non-Markovian.

While Claim 2 seems to contradict Claim 1 (all properly modeled systems are Markovian), the conflict
boils down to the interpretation of the model. When we claim that all properly modeled systems
are Markovian (such as equation (55)), we were addressing the tendency of some in the community
to represent Wt as the “state” of our process, when in fact the real state for our assumed model for
the controller is (Wt, Wt−1, Wt−2, xvac_t) (some would say that this is history dependent).

The observation that the controller model (55) is non-Markovian, on the other hand, arises only
because the modeler is able to cheat and see the true model, which would never happen in a real
system. This is why the two-agent formulation is important: We need to create a model of the
environment that is known only to the environment. This includes not only the value of variables
such as µt , but also the structure of the system model (54).

Thus, given the true dynamics in (54), it is possible to insist that the controlling agent’s model of
the environment in (55) is non-Markovian, but this statement uses information that the controlling
agent would never have in practice. This means that the model in (55) is Markovian not because it
is truly Markovian, but because we assume it is.

Recognizing the difference between the truth and a model raises the philosophical question of
what we mean by statements such as “the model is Markovian” or “the solution is optimal.” We
suspect that few would disagree with the position that these statements only make sense relative to
a model of a real problem, rather than the real problem itself. Thus, if we are given an observation
of, say, prices p0 , p1 , . . . , pt , it would not make any sense to say whether this series is Markovian. We
would have to create a model of the process, just as we did for the evolution of the flu in equation
(55), and then work with this model. However, it is possible to train policies using historical data,
but this is just what we are doing with our time series equation (55). The trained model is just an
approximation, which means that even if we could find an optimal policy, it is only optimal relative
to the approximation. We ask the reader to keep this in mind in the next section when we transition
to designing policies.

8 Designing policies for the flu problem

Once we formulate our models of each agent, we need to design policies for the controlling agent. The
creation of effective, high-quality policies can be a major project. What we want to do is to sketch
examples of each of the four classes of policies to help reinforce why it is important to understand
all four classes.

8.1 Policy function approximations

Policy function approximations are analytic functions that map states to actions. Of the four classes
of policies, this is the only class that does not involve an imbedded optimization problem.

For our flu problem, it is common to use the structure of the problem to identify simple functions
for making decisions. For example, we might use the following rule for determining whether to make
an observation of the environment:

Xpfa−obs(St|θobs) = { 1 if σ̄t/µ̄t ≥ θobs,                   (56)
                     0 otherwise.

The policy captures the intuition that we want to make an observation when the level of uncertainty
(captured by the standard deviation of our estimate of the true prevalence), relative to the mean, is
over some threshold. The parameter θobs has to be tuned, which we do using the objective function
(33). A nice feature of the tunable parameter is that it is unitless.

To determine xvac_t, we might set µvac as a target infection level, and then vaccinate at a level that we believe (or hope?) will get us down to the target. To do this, first compute

ζt(θvac) = (1/θvac) max{0, µ̄t − µvac}.

We can view ζt as the distance to our goal µvac. This calculation ignores the uncertainty in our estimate µ̄t, so instead we might want to use

ζt(θζ) = max{0, µ̄t + θζ σ̄t − µvac}.

This policy is saying that µt might be as large as µ̄t + θζ σ̄t, where θζ is a tunable parameter. Now our policy for xvac would be

Xpfa−vac(St|θvac, θζ) = (1/θvac) ζt(θζ).   (57)

Using our policy X obs(St), we can write our policy for xt = (xvac_t, xobs_t) as

XPFA(St|θ) = (Xpfa−vac(St|θvac, θζ), Xpfa−obs(St|θobs)).




where θ = (θvac , θobs , θζ ). This policy would have to be tuned in the objective function (53). This
policy could then be compared to that obtained by approximating Bellman’s equation.
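
A minimal sketch of this PFA is shown below; the threshold rule implements (56), the vaccination rule implements (57), and θ = (θvac, θobs, θζ) together with the target µvac are the quantities that would be tuned.

```python
# A minimal sketch of the policy function approximation X^PFA in equations (56)-(57).
import numpy as np

def X_pfa(mu_bar, beta, theta_vac, theta_obs, theta_zeta, mu_vac):
    sigma_bar = np.sqrt(1.0 / beta)
    x_obs = 1 if sigma_bar / mu_bar >= theta_obs else 0            # equation (56)
    zeta = max(0.0, mu_bar + theta_zeta * sigma_bar - mu_vac)      # risk-adjusted distance to the target
    x_vac = zeta / theta_vac                                       # equation (57)
    return x_vac, x_obs
```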

An alternative approach for designing a policy function approximation is to assume that it is represented by a linear model

XPFA(St|θ) = Σ_{f∈F} θf φf(St).

Parametric functions are easy to estimate, but they require that we have some intuition into the
structure of the policy. An alternative is to use a neural network, where θ is the vector of weights on the links
in the graph of the neural network. It is important to keep in mind that neural networks tend to be
very high dimensional (θ may have thousands of dimensions), and they may not replicate obvious
properties. Either way, we would tune θ using the objective function in (53).

8.2 Cost function approximations

We are going to illustrate CFAs using the spatially distributed flu vaccination problem, where we
assume we are allowed to observe just one region x ∈ I at a time (we have just one inspection team).
Assume that we can only treat one region at a time, where we are always going to treat the region
that has the highest estimated prevalence of the flu.

We do not know µtx, but at time t assume that we have an estimate µ̄tx for the prevalence of the flu in region x ∈ I, where we assume that µx ∼ N(µ̄tx, σ̄tx²). We use this belief to decide which region to vaccinate, which we describe using the policy

Xvac(St) = arg max_{x∈I} µ̄tx.

We then have to decide which region to observe. We can approach this problem as a multiarmed
bandit problem, where we have to decide which region (“arm” in bandit lingo) to observe. The most
popular class of policies for learning problems in the computer science community is known as upper
confidence bounding for multiarmed bandit problems. One class of UCB policy is interval estimation, which would choose the region x that solves

Xobs−IE(St|θIE) = arg max_{x∈I} ( µ̄tx + θIE σ̄tx ),   (58)

where σ̄tx is the standard deviation of the estimate µ̄tx.

The policy X obs−IE (St |θIE ) is a form of parametric cost function approximation: it requires
solving an imbedded optimization problem, and there is no explicit effort to approximate the impact

of a decision now on the future. It is easy to compute, but θIE has to be tuned. To do this, we need
an objective function. Note that we are going to tune the policy in a simulator, which means we
have access to µtx for all x ∈ I.

Let xobs_t = Xobs−IE(St|θIE) be the region we choose to observe given what we know in St. This gives us the observation

Wt+1,xobs_t = µt,xobs_t + εt+1,

where ε ∼ N(0, σW²). We would then use this observation to update the estimates µ̄t,xobs_t using the updating equations (9)-(10).

It is important to remember that the true prevalence µtx is changing over time as a result of our
policy of observation and vaccination, so we are going to refer to it as µπtx (θIE ), where the observation
policy is parameterized by θIE .

We are learning in the field, which means we want to minimize the prevalence of the flu over
all regions, over time. Since we are using a simulator to evaluate policies, we would evaluate our
performance using the true level of flu prevalence, given by

Fπ(θIE) = ES0 { Σ_{t=0}^{T} Σ_{x∈I} µπtx(θIE) | S0 }.   (59)

We then need to tune our policy by solving

min_{θIE} Fπ(θIE).
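
The sketch below implements the interval estimation policy (58) inside a simple simulator and tunes θIE by grid search on the simulated objective (59). The dynamics of the truth µtx inside the simulator are illustrative assumptions, not a model prescribed here.

```python
# A minimal sketch of the interval estimation CFA (58) and simulation-based tuning of theta_IE.
import numpy as np

def X_obs_IE(mu_bar, sigma_bar, theta_IE):
    return int(np.argmax(mu_bar + theta_IE * sigma_bar))        # equation (58)

def F_pi(theta_IE, n_regions=5, T=100, sigma_W=0.02, theta_vac=0.01, seed=0):
    rng = np.random.default_rng(seed)
    beta_W = 1.0 / sigma_W**2
    mu = rng.uniform(0.05, 0.20, n_regions)                     # hidden truth (known to the simulator only)
    mu_bar = np.full(n_regions, 0.10)                           # beliefs about each region
    beta = np.full(n_regions, 1.0 / 0.05**2)
    total = 0.0
    for t in range(T):
        x_vac = int(np.argmax(mu_bar))                          # treat the worst-looking region
        mu[x_vac] = max(0.0, mu[x_vac] - theta_vac)
        x_obs = X_obs_IE(mu_bar, np.sqrt(1.0 / beta), theta_IE) # choose one region to observe
        W = mu[x_obs] + rng.normal(0.0, sigma_W)
        mu_bar[x_obs] = (beta[x_obs] * mu_bar[x_obs] + beta_W * W) / (beta[x_obs] + beta_W)
        beta[x_obs] += beta_W
        total += mu.sum()                                       # contribution to equation (59)
    return total

# Tune theta_IE by grid search, averaging over simulations.
best_theta = min(np.linspace(0.0, 3.0, 13),
                 key=lambda th: np.mean([F_pi(th, seed=s) for s in range(20)]))
```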

8.3 Policies based on value functions

Any sequential decision problem with a properly defined state variable can be solved using Bellman's equation:

Vt(St) = max_x ( C(St, xt) + E{Vt+1(St+1) | St, xt} ),

which gives us the policy

XVFA(St) = arg max_x ( C(St, xt) + E{Vt+1(St+1) | St, xt} ).

In practice we cannot compute Vt (St ), so we resort to methods that approximate the value function,
following one of the styles given in equations (14)-(17).

The use of approximate value functions has been recognized for a wide range of dynamic pro-
gramming and stochastic control problems. However, it has been largely overlooked for problems
with a belief state, with the notable exception of the literature on Gittins indices (Gittins et al.
2011), which reduces high-dimensional belief states (the beliefs across an entire set of arms) down to
a series of dynamic programs with one per arm.

In principle, approximate dynamic programming can be applied to even high-dimensional problems, including those with belief states, by replacing the value function Vt(St) with a statistical model such as

Vt(St) ≈ V̄t(St|θ) = Σ_{f∈F} θf φf(St),

where (φf(St))f∈F is a set of features. Alternatively, we might approximate V̄t(St) using a neural network.

We note that we might write our policy as

XVFA(St|θ) = arg max_x ( C(St, x) + Σ_{f∈F} θf φf(St, x) ),   (60)

where (φf(St, x))f∈F is a set of features involving both St and x. For example, we might design something like

XVFA(St|θ) = arg max_x ( C(St, x) + θt0 + θt1 µ̄t + θt2 µ̄t² + θt3 σ̄t + θt4 βt σ̄t ).

There are a variety of strategies for fitting θ that have been developed under the umbrella of approximate dynamic programming (Powell 2011) and reinforcement learning (Sutton & Barto 2018).
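
A minimal sketch of a policy of the form (60) is shown below, assuming the coefficients θ have already been fitted; the feature set and the discrete set of candidate decisions are illustrative.

```python
# A minimal sketch of the VFA policy in equation (60) with assumed features phi(S_t, x).
import numpy as np

def X_vfa(mu_bar, beta, theta, candidate_x, contribution):
    sigma_bar = np.sqrt(1.0 / beta)
    def q_value(x):
        phi = np.array([1.0, mu_bar, mu_bar**2, sigma_bar, x, x * mu_bar])   # assumed features
        return contribution(mu_bar, beta, x) + theta @ phi                   # C(S_t, x) + sum_f theta_f phi_f(S_t, x)
    return max(candidate_x, key=q_value)
```

In practice, θ would be fitted iteratively by one of the approximate dynamic programming or reinforcement learning algorithms cited above.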

8.4 Direct lookahead policy

Direct lookahead policies involve solving an approximate lookahead model that we previously gave
in equation (18), but repeat it here for convenience:

XDLA_t(St) = arg max_{xt} ( C(St, xt) + Ẽ { max_{π̃} Ẽ { Σ_{t′=t+1}^{T} C(S̃tt′, X̃π̃(S̃tt′)) | S̃t,t+1 } | St, xt } ).   (61)

The problem with direct lookahead policies is that they require solving a stochastic optimization prob-
lem (to solve the original stochastic optimization problem). To make it tractable, we can introduce
various approximations. Some that are relevant for our problem setting could be:

1) Use a deterministic approximation. These are effective for pure resource allocation problems (Google Maps uses a deterministic lookahead to find the best path to the destination over a stochastic graph), but seem unlikely to work well for learning problems.

2) Use a parameterized policy for π̃. We could use any of the policies suggested above as our
lookahead policy. We would then also have to use Monte Carlo sampling to approximate the
expectations.

3) We can solve a simplified Markov decision process.

4) We could approximate the lookahead using Monte Carlo tree search.

We are going to illustrate the third approach. We start with model 3 of the flu problem, which
requires the state variable

Stcont = ((µ̄t, βt), (δ̄t, βδt)).




We might be able to do a reasonable job of solving a dynamic program with a two-dimensional state
variable (using discretization), but not a four-dimensional state. One approximation strategy is to
fix the belief about the drift δ by holding (δ̄t , βtδ ) constant. This means that we continue to model
the true δ with uncertainty, but we ignore the fact that we can continue to learn and update the
belief. This means the state variable S̃tt′ in the lookahead model is given by

S̃tt′ = (µ̃tt′, β̃tt′).

Assuming that we can discretize the two-dimensional state, we can solve the lookahead model
using classical backward dynamic programming on this approximate model (we could do this in
steady state, or over a finite horizon, which makes more sense). Solving this model will give us exact value functions Ṽtt′(S̃tt′) for our approximate lookahead model, from which we can then find the decision to make now given by

Xπt(St) = arg max_x ( C(St, x) + E{Ṽt,t+1(S̃t,t+1) | St} ).   (62)

Then, we implement xt = Xtπ (St ), step forward to t + 1, observe Wt+1 , update to state St+1 and
repeat the process.
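
The sketch below illustrates this strategy on a deliberately simplified lookahead model: the drift belief δ̄ is held fixed, an observation is assumed every period of the lookahead (so the precision schedule is deterministic), µ̄ is discretized, and the illustrative costs are minimized. Everything in the sketch is an assumption made for illustration, not a prescribed formulation.

```python
# A minimal sketch of solving a simplified lookahead model by backward dynamic programming
# over a discretized belief mean, then returning the decision to implement now, as in (62).
import numpy as np

def lookahead_decision(mu_bar_0, beta_0, delta_bar, H=10, theta_vac=0.02, sigma_W=0.02,
                       c_vac=1.0, c_unc=50.0, n_mu=101, n_x=11, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    beta_W = 1.0 / sigma_W**2
    mu_grid = np.linspace(0.0, 0.5, n_mu)             # discretized belief mean
    x_grid = np.linspace(0.0, 5.0, n_x)               # candidate vaccination levels
    V = np.zeros(n_mu)                                # terminal value of the lookahead model
    x_now = 0.0
    for h in reversed(range(H)):                      # backward through the lookahead horizon
        beta_h = beta_0 + h * beta_W                  # assume one observation per lookahead period
        V_new, best_x = np.empty(n_mu), np.empty(n_mu)
        for i, mu_bar in enumerate(mu_grid):
            q = np.empty(n_x)
            for k, x in enumerate(x_grid):
                prior = max(0.0, mu_bar - theta_vac * x + delta_bar)
                W = prior + rng.normal(0.0, np.sqrt(1.0 / beta_h + 1.0 / beta_W), n_samples)
                mu_next = (beta_h * prior + beta_W * W) / (beta_h + beta_W)
                j = np.clip(np.searchsorted(mu_grid, mu_next), 0, n_mu - 1)
                q[k] = c_vac * x + c_unc * mu_bar + V[j].mean()
            V_new[i], best_x[i] = q.min(), x_grid[int(np.argmin(q))]
        V = V_new
        if h == 0:
            x_now = best_x[int(np.abs(mu_grid - mu_bar_0).argmin())]
    return x_now
```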

8.5 A hybrid policy

We have two types of decisions: whether to observe, xobs_t, and how many to vaccinate, xvac_t. We can combine them into a single, two-dimensional decision xt = (xobs_t, xvac_t) and then think of enumerating

all possible actions. However, we can also use hybrids. For example, we could use the policy function in equation (56) for xobs_t, but then turn to any of the four classes of policies for xvac_t. This not only reduces the dimensionality of the problem, but might help if we feel that we have confidence in the function for xobs_t but are less confident designing a function for xvac_t.

8.6 Notes

The POMDP literature focuses on who knows what about the parameter µt , but it appears to
completely overlook knowledge of the system model S M (St , xt , Wt+1 ). We illustrated this by giving
the true system model (known to the environment) in equation (54), and then described a plausible
approximation created by the controller in equation (55). The POMDP literature ignores this issue
entirely, and implicitly assumes that the controller has access to the system model through its
knowledge of the one-step transition matrix.

Both classes of lookahead policies (policies based on VFAs and DLAs) need an explicit model of
the future. The controller would have to use his own estimate of the transition function, rather than
the true transition function known only to the environment.

The policies in the policy search class (the PFAs and CFAs) do not explicitly depend on the
transition function, but both involve tunable parameters that have to be tuned. If these are being
tuned offline in a simulator, then this simulator would also have to use the approximate transition
function known to the controller.

It appears that the only way to avoid using the controller’s version of the transition function is
to do online tuning of a PFA or CFA, which means we are doing tuning in the field.

9 Multiagent systems

We can extend our “two-agent” formulation of learning problems (with a controlling agent and an
environment agent) to general multiagent problems with multiple controlling agents. We begin by
defining our different classes of agents:

• The environment agent - This agent cannot make any decisions, or perform any learning (that
is, anything that implies intelligence). This is the agent that would know the truth about
unknown parameters that we are trying to learn, or which performs the modeling of physical
systems that are being observed by other agents. Controlling agents are, however, able to
change the environment.

• Controlling agents - These are agents that make decisions that act on other agents, or the
environment agent. Controlling agents may communicate information to other controlling

and/or learning agents. One form of control is when one agent wants to influence the decision
that falls under the control of another agent. This has to be done by creating an incentive for
the other agent to choose a decision aligned with the wishes of the first agent.

• Learning agents - These agents do not make any decisions, but can observe and perform learning
(about the ground truth and/or other controlling agents), and communicate beliefs to other
agents.

• Combined controlling/learning agents - These agents perform learning through observations of


the ground truth or other agents, as well as making decisions that act on the ground truth or
other agents.

We are not requiring that controlling agents be homogeneous. Each agent can control specific types
of decisions, and can work at different levels (e.g. leader/follower).

We are going to model multiagent systems by simply extending the notation for our basic canon-
ical model, just as we did for our two-agent POMDP. In fact, we are going to use this model to
represent each agent. We do this using the following notation

Q = The set of agents, which we index by q ∈ Q.
Stq = The state variable for agent q (this captures everything known by agent q) at time t.
xtqq′ = A decision made by agent q that acts on agent q′ at time t.
xtq = (xtqq′)q′∈Q.
Wt+1,qq′ = Information arriving to agent q from agent q′ between t and t + 1. This can include the actions of other agents q′ acting on q. These actions are communicated through Wt+1,qq′ and captured by St+1,q.
Wt+1,q = (Wt+1,qq′)q′∈Q.
Cq(Stq, xtq) = Cost/contribution generated by agent q.
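
A minimal sketch (with class and field names of our own choosing) of how this per-agent notation might be organized in a simulation is given below.

```python
# A minimal sketch of organizing the per-agent notation above for a multiagent simulation.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

@dataclass
class Agent:
    name: str                                            # the agent index q
    state: Dict[str, Any]                                # S_tq: everything known by agent q at time t
    policy: Callable[[Dict[str, Any]], Dict[str, Any]]   # maps S_tq to x_tq = (x_tqq')_{q' in Q}
    inbox: Dict[str, Any] = field(default_factory=dict)  # W_{t+1,qq'}: information arriving from q'

def step(agents):
    # Each agent makes its decisions x_tq using only its own state S_tq ...
    decisions = {agent.name: agent.policy(agent.state) for agent in agents}
    # ... and the decisions acting on agent q arrive as exogenous information W_{t+1,qq'}.
    for agent in agents:
        agent.inbox = {q: x.get(agent.name) for q, x in decisions.items() if q != agent.name}
```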

Just as we created a new approach for modeling POMDPs using a two-agent formulation, we are
going to propose that we can handle any multiagent system of arbitrary complexity by modeling the
behavior of each individual agent using our standard modeling framework. There are, of course,
some issues that arise that we need to address.

• Active observations - We saw this above with the decision xobs_t of whether to observe the
environment, which we assume comes at a cost. Now we add the dimension of actively observing
other agents. For example, a navy ship has to make the decision whether to turn on its radar to
observe another ship, which simultaneously reveals the location of the ship sending the radar.

• Modeling communication - We have to model the act of sending information in Stq to another
agent q 0 . The information may be sent accurately, or with some combination of noise and bias.

• Receiving information - If information is sent from q′ to q, agent q has to update her own beliefs, which has to reflect the confidence that agent q has in the information coming from agent q′.

• Communication architecture - We have to decide who can communicate what to whom. It


will generally not be the case in more complex systems that any agent can (or would) send
everything in their state vector Stq to every other agent. We may have coordinating agents
that communicate with everyone, make decisions and then send these decisions (in some form)
to other agents.

An important dimension of multiple controlling agents is the formation of beliefs about the
behavior of other agents. This introduces the issue of modeling how other agents make decisions. As
with any statistical model, it will be an approximation, since we will have to formulate our belief
about how another agent is making decisions. This will likely be in the form of a parametric model
with tunable parameters, which we will need to tune through a sequence of observations of the actions
of other agents (who may be responding to our actions). However, coming up with this model means
that we have to assess how we think the other agent is making decisions (and how smart they may
be).

We feel that these issues will make it difficult to have a discussion about optimal policies. We
can find an optimal policy for an agent given the assumed models of the transition functions and
policies of other agents, but given the difference between assumed functions and true ones, an agent’s
optimal policy is not going to be optimal for the system.

A thorough treatment of multiagent systems is beyond the scope of this chapter, but we hope
this hints at how our framework can be extended to more complex problems.

References
Bellman, R. E. (1957), Dynamic Programming, Princeton University Press, Princeton, N.J.
Bouzaiene-Ayari, B., Cheng, C., Das, S., Fiorillo, R. & Powell, W. B. (2016), ‘From single com-
modity to multiattribute models for locomotive optimization: A comparison of optimal integer
programming and approximate dynamic programming’, Transportation Science 50(2), 1–24.
Cassandras, C. G. & Lafortune, S. (2008), Introduction to Discrete Event Systems, 2 edn, Springer,
New York.
Cinlar, E. (2011), Probability and Stochastics, Springer, New York.
Gittins, J., Glazebrook, K. D. & Weber, R. R. (2011), Multi-Armed Bandit Allocation Indices, John
Wiley & Sons, New York.
Kirk, D. E. (2004), Optimal Control Theory: An introduction, Dover, New York.
Lazaric, A. (2019), Introduction to Reinforcement Learning, tutorial.
URL: http://tinyurl.com/lazaricRLtutorial
Lewis, F. L. & Vrabie, D. (2012), Design Optimal Adaptive Controllers, 3 edn, John Wiley & Sons,
Hoboken, NJ.
Powell, W. B. (2011), Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2
edn, John Wiley & Sons.
Powell, W. B. (2019a), ‘A unified framework for stochastic optimization’, European Journal of Op-
erational Research 275(3), 795–821.
URL: https://doi.org/10.1016/j.ejor.2018.07.014
Powell, W. B. (2019b), ‘From Reinforcement Learning to Optimal Control : A unified framework for
sequential decisions’, arXiv .
Powell, W. B. (2020), Reinforcement Learning and Stochastic Optimization: A unified framework for
sequential decisions, Princeton NJ.
URL: jungle.princeton.edu
Powell, W. B. & Meisel, S. (2016), ‘Tutorial on Stochastic Optimization in Energy - Part II: An
Energy Storage Illustration’, IEEE Transactions on Power Systems 31(2).
Puterman, M. L. (2005), Markov Decision Processes, 2nd edn, John Wiley and Sons, Hoboken, NJ.
Ross, S., Pineau, J., Paquet, S. & Chaib-draa, B. (2008), ‘Online planning algorithms for POMDPs’,
Journal of Artificial Intelligence Research 32, 663–704.
Simao, H. P., Day, J., George, A. P., Gifford, T., Powell, W. B. & Nienow, J. (2009), ‘An Approxi-
mate Dynamic Programming Algorithm for Large-Scale Fleet Management: A Case Application’,
Transportation Science 43(2), 178–197.
Sontag, E. (1998), Mathematical Control Theory, 2nd edn, Springer, New York.
Sutton, R. S. & Barto, A. G. (2018), Reinforcement Learning: An Introduction, 2nd edn, MIT Press,
Cambridge, MA.
