
Non-maximizing policies that fulfill multi-criterion aspirations in expectation

Simon Dima¹ [0009-0003-7815-8238], Simon Fischer² [0009-0000-7261-3031], Jobst Heitzig³ [0000-0002-0442-8077], and Joss Oliver⁴ [0009-0008-6333-7598]

¹ École Normale Supérieure, Paris, France, [email protected]
² Independent Researcher, Cologne, Germany
³ Potsdam Institute for Climate Impact Research, Potsdam, Germany, [email protected]
⁴ Independent Researcher, London, UK, [email protected]

Abstract. In dynamic programming and reinforcement learning, the policy for the sequential decision making of an agent in a stochastic environment is usually determined by expressing the goal as a scalar reward function and seeking a policy that maximizes the expected total reward. However, many goals that humans care about naturally concern multiple aspects of the world, and it may not be obvious how to condense those into a single reward function. Furthermore, maximization suffers from specification gaming, where the obtained policy achieves a high expected total reward in an unintended way, often taking extreme or nonsensical actions.

Here we consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants to be maximized. We assume the task of the agent is to ensure that the vector of expected totals of the evaluation metrics falls into some given convex set, called the aspiration set. Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible. It has complexity linear in the number of possible state–action–successor triples and polynomial in the number of evaluation metrics. Moreover, the explicitly non-maximizing nature of the chosen policy and goals yields additional degrees of freedom, which can be used to apply heuristic safety criteria to the choice of actions. We discuss several such safety criteria that aim to steer the agent towards more conservative behavior.

Keywords: multi-objective decision-making · planning · Markov decision processes · AI safety · satisficing · convex geometry

1 Introduction

In typical reinforcement learning (RL) and dynamic programming problems an


agent is trained or programmed to solve tasks encoded by a single real-valued
reward function that it shall maximize. However, many tasks are not easily

expressed by such a function [12], human preferences are hard to learn and may
not be easy to aggregate across stakeholders [5], and maximizing a misspecified
objective may fall prey to reward hacking [1] and Goodhart’s law [11], leading
to unintended side-effects and potentially harmful consequences.
In this work, we study a particular aspiration-based approach to agent design.
We assume an existing task-specific world model in the form of a fully observed
Markov Decision Process (MDP), where the task is not encoded by a reward
function but instead a multi-criterion evaluation function and a bounded, con-
vex subset of its range, called an aspiration set, that can be thought of as an
“instruction” [4] to the agent. Aspiration-type goals can also naturally arise from
subtasks in complex environments even if the overall goal is to maximize some
objective, when the complexity requires a hierarchical decision-making approach
whose highest level selects subtasks that turn into aspiration sets for lower hierarchical levels.
In our version of aspiration-based agents, the goal is to make the expected value of the Total with respect to this evaluation function fall within the aspiration set, and to select from this set according to certain performance and safety criteria. The agent does so step-wise, exploiting recursion equations similar to the Bellman equation. Thus our approach is like multi-objective reinforcement learning
(MORL), with a primary aspiration-based objective and at least one secondary
objective incorporated via action-selection criteria [16]. Unlike MORL, the com-
ponents of the evaluation function (called evaluation metrics) are not objectives
in the sense of targets for maximization. Rather, an aspiration formulated w.r.t.
several evaluation metrics might correspond to a single objective (e.g., “make
a cup of tea”). Also, at no point does an aspiration-based agent aggregate the
evaluation metrics into a single value. Instead, any trade-offs are built into the
aspiration set itself, similar to what [6] call a “safety specification”. For example,
aspiring to buy a total of 10 oranges and/or apples for at most EUR 1 per item
could be encoded with the aspiration set {(o, a, c) : o, a ≥ 0; c ≤ o + a = 10}.
A set-up similar to ours was used in [9], which applied reinforcement learning to find a policy whose expected discounted reward vector lies inside a convex set. Instead of a reinforcement learning perspective, we use a model-based planning perspective and design an algorithm that explicitly calculates a policy for solving the task, based on a model of the environment.⁵ Also, the
approach in [9] is concerned with guaranteed bounds for the distance between
received rewards and the convex constraints in terms of the number of iterations,
whereas we focus on guaranteeing aspiration satisfaction in a fixed number of
computational steps, providing a verifiable guarantee in the sense of [6].
Other agent designs that follow a non-maximization-goal-based approach in-
clude quantilizers, decision transformers and active inference. Quantilizers are
agents that use random actions in the top n% of a “base distribution” over ac-
tions, sorted by expected return [13]. The goal for decision transformers is to
make the expected return equal a particular value Rtarget [3]. The goal for active
inference agents is to produce a particular probability distribution of observa-
⁵ Nevertheless, our algorithm can also be straightforwardly adapted to learning.

tions [14]. While the goal space for quantilizers and decision transformers, being
based on a single real-valued function, is often too restricted for many applica-
tions, that of active inference agents (all probability distributions) appears too
wide for the formal study of many aspects of aspiration-based decision-making.
Our approach is of intermediary complexity.
An important consideration in this work, to ensure tractability in large envi-
ronments, is also the computational complexity in the number of actions, states,
and evaluation metrics. We will see that for our algorithm, the preparation of an
episode has linear complexity in the number of possible state–action–successor
transitions and (conjectured and numerically confirmed) linear average complex-
ity in the number of evaluation metrics, and then the additional per-time-step
complexity of the policy is linear in the number of actions, constant in the num-
ber of states, and polynomial in the number of evaluation metrics.
Our work also affects the emerging AI safety/alignment field, which views un-
intended consequences from maximization, e.g., reward hacking and Goodhart’s
law, as a major source of risk once agentic AI systems become very capable [1].

2 Preliminaries

Environment. An environment E = (S, s0 , S⊤ , A, T ) is a finite Markov Decision


Process without a reward function, consisting of a finite state space S, an initial
state s0 ∈ S, a nonempty subset S⊤ ⊆ S of terminal states, a nonempty finite
action space A, and a function T : (S \ S⊤ ) × A → ∆(S \ {s0 }) specifying
transition probabilities: T (s, a)(s′ ) is the probability that taking action a from
state s leads to state s′ . We assume that the environment is acyclic, i.e., that it is
impossible to reach a given state again after leaving it. We fix some environment
E and write s′ ∼ s, a to denote that s′ is distributed according to T (s, a).

Policy. A (memory-based) policy is given by some nonempty finite set M of


memory states internal to the agent, an initial memory state m0 ∈ M and a
function π : M × (S \ S⊤ ) → ∆(A × M) that maps each possible combination
of memory state m ∈ M and (environment) state s ∈ S \ S⊤ to a probability
distribution over combinations of actions a ∈ A and successor memory states
m′ ∈ M. Let ΠM be the set of all policies with memory space M. The special
class of Markovian or memoryless policies is obtained when M is a singleton.
Policies which are both Markovian and deterministic are called pure Markov
policies, and amount to a function (S \ S⊤ ) → A. We denote by Π 0 the set of
Markovian policies and by Π p the set of all pure Markov policies.

Evaluation, Delta, Total. A (multi-criterion) evaluation function for the envi-


ronment is a function f : (S \ S⊤ ) × A × (S \ {s0 }) → Rd where d ≥ 1. The
quantity f (s, a, s′ ) is called the Delta received under transition (s, a) → s′ . It
represents by how much certain evaluation metrics change when the agent takes
action a in state s and the successor state is s′ . Let us fix f for the rest of the pa-
per. The (received) Total of a trajectory h = (m0 , s0 , a1 , m1 , s1 , . . . , aT , mT , sT )

is then the cumulative Delta received along the trajectory,

τ(h) = \sum_{t=1}^{T} f(s_{t-1}, a_t, s_t).   (1)

Value functions. Given a policy π and an evaluation function f , the state value
function V π : M × S → Rd is defined as the expected Total accumulated in
future steps while following policy π. In particular, V π (m0 , s0 ) is the expected
Total of the whole trajectory: V π (m0 , s0 ) = E(τ ). Likewise, we define the action
value function Qπ : S × A × M → Rd . These satisfy the Bellman equations

V^π(m, s) = \mathbb{E}_{a, m' \sim π(m, s)}\bigl(Q^π(s, a, m')\bigr),   (2)
Q^π(s, a, m) = \mathbb{E}_{s' \sim s, a}\bigl(f(s, a, s') + V^π(m, s')\bigr).   (3)

They can be calculated by backwards induction since the environment is finite


and acyclic, with the base case V π (m, s) = 0 for terminal states s ∈ S⊤ . For
memoryless policies, we will elide the argument m since M is a singleton.
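For readers who prefer code, a minimal Python sketch of this backwards induction for a memoryless policy is given below. The tabular interfaces T(s, a) (a dict of successor probabilities), f(s, a, s2) (the Delta vector), actions(s) and policy(s) are our own illustrative assumptions, not the interfaces of the accompanying code.

import numpy as np

def evaluate_policy(states_rev_topo, terminal_states, actions, T, f, policy, d):
    """Backwards induction for V^pi and Q^pi in a finite acyclic MDP with
    vector-valued Deltas f(s, a, s') in R^d and a memoryless policy."""
    V = {s: np.zeros(d) for s in terminal_states}          # base case: V^pi(s) = 0
    Q = {}
    for s in states_rev_topo:                              # successors processed first
        if s in terminal_states:
            continue
        for a in actions(s):
            # Q^pi(s, a) = E_{s'~s,a}[ f(s, a, s') + V^pi(s') ]   (cf. Eq. (3))
            Q[(s, a)] = sum(p * (f(s, a, s2) + V[s2]) for s2, p in T(s, a).items())
        # V^pi(s) = E_{a~pi(s)}[ Q^pi(s, a) ]                     (cf. Eq. (2))
        V[s] = sum(pa * Q[(s, a)] for a, pa in policy(s).items())
    return V, Q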
The Delta and Total are analogous to reward and return in an MDP with a
reward function, and the value functions V and Q are defined as usual, albeit
with vector instead of scalar arithmetic. However, while we do assume that the
evaluation metrics represent some aspects relevant to the task at hand, we do
not assume that they represent a form of utility which it is always desirable
to increase. Accordingly, the agent’s goal will be specified by the user not as a
maximization task, but rather as a set of linear constraints on the expected sums
of the evaluation metrics, which we call an aspiration.

3 Fulfilling aspirations
Aspirations, feasibility. An (initial) aspiration is a convex polytope E0 ⊂ Rd ,
representing the values of the expected total Eτ which are considered acceptable.
We say that a policy π fulfills the aspiration when it satisfies V π (m0 , s0 ) ∈ E0 .
To answer the question of whether it is possible to fulfill a given aspiration, we
introduce feasibility sets. The state-feasibility set of s ∈ S is the set of possible
values for the expected future Total from s, under any memory-based policy:
V(s) = {V π (m, s) | M finite set; m0 , m ∈ M; π ∈ ΠM }; likewise, we define the
action-feasibility set Q(s, a). It is straightforward to verify that V(s ∈ S⊤) = {0}, and that the following recursive equations hold for s ∉ S⊤:

Q(s, a) = \mathbb{E}_{s' \sim s, a}\bigl(f(s, a, s') + V(s')\bigr) = \sum_{s'} T(s, a)(s') \cdot \bigl(f(s, a, s') + V(s')\bigr),   (4)
V(s) = \bigcup_{p \in ∆(A)} \mathbb{E}_{a \sim p}\, Q(s, a) = \bigcup_{p \in ∆(A)} \sum_{a \in A} p(a)\, Q(s, a).   (5)

In this, we use set arithmetic: rX + r′ X ′ = {rx + r′ x′ | x ∈ X , x′ ∈ X ′ } for


r, r′ ∈ R and X , X ′ ⊂ Rd . It is clear that feasibility sets are convex polytopes.

Aspiration propagation. Algorithm scheme 1 shows a general manner of fulfilling


a feasible initial aspiration, starting from a given state or state-action pair. It
memorizes and updates aspirations E and Ea , initially equalling E0 . The agent

Algorithm 1 General scheme for fulfilling feasible aspiration sets


1: procedure FulfillStateAspiration(s ∈ S \ S⊤ , nonempty E ⊆ V(s))
2: Find suitable Ea ⊆ Q(s, a) for all a ∈ A, and p ∈ ∆(A), s.t. Ea∼p (Ea ) ⊆ E.
3: Draw action a from distribution p and do FulfillActionAspiration(s, a, Ea ).
4: procedure FulfillActionAspiration(s ∈ S, a ∈ A, nonempty Ea ⊆ Q(s, a))
5: Find suitable Es′ ⊆ V(s′ ) for all s′ ∈ S s.t. Es′ ∼s,a (f (s, a, s′ ) + Es′ ) ⊆ Ea .
6: Execute action a and observe successor state s′ .
7: If s′ is terminal, stop; else do FulfillStateAspiration(s′ , Es′ ).

alternates between being in a certain state s with a state-aspiration E ⊆ V(s),


and being in a state s and having chosen, but not yet performed, an action a,
with an action-aspiration Ea ⊆ Q(s, a). Although this algorithm is written as
two mutually recursive functions, it can be formally implemented by a memory-
based policy that memorizes the current aspiration E or Ea .
The way the aspiration set E is propagated between steps is the key part. The
two directions of aspiration propagation are slightly different: in state-aspiration
to action-aspiration propagation, shown on line 2, the agent may choose the
probability distribution (p) over actions, whereas in action-aspiration to state-
aspiration propagation, shown on line 5, the next state is determined by the
environment with fixed probabilities (T ).
The correctness of algorithm 1 follows from the requirements of lines 2 and
5; that these are possible to fulfill is a consequence of equations (5) and (4).
Feasibility of aspirations is maintained as an invariant.
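A minimal Python sketch of this scheme follows; it only fixes the control flow of Algorithm 1 and leaves the two propagation steps as pluggable functions. The names choose_action_aspirations and propagate_to_state are placeholders for whatever propagation rule (e.g. the one of Section 4) is used, and env is an assumed sampler interface with step and is_terminal.

import random

def fulfill_state_aspiration(s, E, env, choose_action_aspirations, propagate_to_state):
    """Lines 1-3 of Algorithm 1: pick action-aspirations E_a and a mixture p with
    E_{a~p}(E_a) contained in E, then draw an action from p."""
    p, E_a = choose_action_aspirations(s, E)        # p: {a: prob}, E_a: {a: aspiration set}
    a = random.choices(list(p.keys()), weights=list(p.values()))[0]
    return fulfill_action_aspiration(s, a, E_a[a], env,
                                     choose_action_aspirations, propagate_to_state)

def fulfill_action_aspiration(s, a, Ea, env, choose_action_aspirations, propagate_to_state):
    """Lines 4-7 of Algorithm 1: execute a, observe s', and recurse with a
    state-aspiration E_s' such that E_{s'~s,a}(f(s,a,s') + E_s') is contained in Ea."""
    s2 = env.step(s, a)                             # execute the action, observe successor
    if env.is_terminal(s2):
        return s2
    return fulfill_state_aspiration(s2, propagate_to_state(s, a, Ea, s2), env,
                                    choose_action_aspirations, propagate_to_state)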
To implement this scheme, we have to specify how to perform aspiration prop-
agation. The procedure used to select action-aspirations Ea and state-aspirations
Es′ should preferably allow some control over how the size of these sets changes
over time. On one hand, preventing aspiration sets from shrinking too fast preserves a wider range of acceptable behaviors in later steps⁶, but on the other
hand, keeping the aspiration sets somewhat smaller than the feasibility sets also
provides immediate freedom in the choice of the next action, as detailed in sec-
tion 4.2. An additional challenge is posed by the complex shape of the feasibility
sets, which we must handle in a tractable way.

Approximating feasibility sets by simplices. Let conv(X) denote the convex hull
of any set X ⊆ Rd . Given any tuple R of d + 1 memoryless reference policies
π1 , . . . , πd+1 ∈ Π 0 , we define reference simplices in evaluation-space as V R (s) =
conv{V π (s) | π ∈ R} and QR (s, a) = conv{Qπ (s, a) | π ∈ R}. It is immediate
that these are subsets of the convex feasibility sets V(s) resp. Q(s, a), and that
V^R(s) ⊆ \bigcup_{p \in ∆(A)} \mathbb{E}_{a \sim p}\bigl(Q^R(s, a)\bigr),   (6)
Q^R(s, a) ⊆ \mathbb{E}_{s' \sim s, a}\bigl(f(s, a, s') + V^R(s')\bigr).   (7)

⁶ However, algorithm 1 remains correct even if the aspiration sets shrink to singletons.

These imply that we can replace every occurrence of V and Q in algorithm 1


with V R resp. QR , obtaining a correct algorithm to guarantee fulfillment of any
initial aspiration E0 provided it intersects the reference simplex V R (s0 ).
It turns out that the latter can be guaranteed by a proper choice of reference
policies, and that we can always use pure Markov policies for this:
Lemma 1. For any state s, we have V(s) = conv{V π (s) | π ∈ Π p }.
Proof. Any memory-based policy admits a Markovian policy with the same occupancy measure and hence the same expected Total [7]. Hence V(s) = {V^π(s) | π ∈ Π⁰}. A convex polytope is the convex hull of its vertices, so V(s) = conv{V^π(s) | ∃y ∈ R^d, π = \arg\max_{π' ∈ Π⁰}(y · V^{π'}(s))}. Finally, such maximizing policies may be taken to be deterministic, which concludes the argument. ⊓⊔

As a consequence, for any aspiration set E intersecting the feasibility set V(s0 ),
there exists a tuple R of pure Markov reference policies such that V R (s0 )∩E ̸= ∅.
Section 5 describes a heuristic algorithm for finding such reference policies.
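As a brute-force illustration of Lemma 1 (viable only for very small environments, and using the same illustrative tabular interfaces as the sketch in Section 2), one can enumerate all pure Markov policies and collect their values; the convex hull of the returned points is exactly V(s0).

import itertools
import numpy as np

def feasibility_set_vertices(nonterminal_states_rev_topo, terminal_states, actions, T, f, d, s0):
    """Brute force for Lemma 1: return {V^pi(s0) : pi pure Markov}.
    The convex hull of these points is the feasibility set V(s0)."""
    points = []
    all_action_lists = [list(actions(s)) for s in nonterminal_states_rev_topo]
    for choice in itertools.product(*all_action_lists):
        pi = dict(zip(nonterminal_states_rev_topo, choice))   # one pure Markov policy
        V = {s: np.zeros(d) for s in terminal_states}
        for s in nonterminal_states_rev_topo:                 # successors processed first
            a = pi[s]
            V[s] = sum(p * (f(s, a, s2) + V[s2]) for s2, p in T(s, a).items())
        points.append(V[s0])
    return np.array(points)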
We now turn to explaining a way to enact the aspiration-propagation steps
needed in lines 2 and 5 of algorithm 1, based on shifting and shrinking.

4 Propagating aspirations
4.1 Propagating action-aspirations to state-aspirations
To implement algorithm schema 1, we first focus on line 5 in procedure Ful-
fillActionAspiration, which is the easier part. Given a state-action pair
s, a and an action-aspiration set Ea ⊆ QR (s, a), we must construct nonempty
state-aspiration sets Es′ ⊆ V R (s′ ) for all possible successor states s′ , such that
Es′∼s,a (f(s, a, s′) + Es′) ⊆ Ea. We assume that all reference simplices are nondegenerate, i.e. have full dimension d. This is almost surely the case if there are
sufficiently many actions and these have enough possible consequences.

Tracing maps. Under this assumption, we define tracing maps ρs,a,s′ from the
reference simplex QR (s, a) to V R (s′ ). Since domain and codomain are simplices,
we can choose ρs,a,s′ to be the unique affine linear map that maps vertices to
vertices, ρs,a,s′ (Qπi (s, a)) = V πi (s′ ). For any point e ∈ QR (s, a), it follows from
equation (2) that Es′ ∼s,a (f (s, a, s′ ) + ρs,a,s′ (e)) = e. Accordingly, to propagate
aspirations of the form Ea = {e}, it is sufficient to just set Es′ = {ρs,a,s′ (e)}. How-
ever, for general subsets Ea ⊆ QR (s, a), the set Es′ ∼s,a (f (s, a, s′ ) + ρs,a,s′ (Ea )) is
in general strictly larger than Ea , and hence we map Ea in a different way.
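Before turning to that different way of propagating whole sets, note that since ρ_{s,a,s'} maps the vertices Q^{π_i}(s, a) to the vertices V^{π_i}(s'), applying it to a point amounts to computing the point's barycentric coordinates with respect to the first simplex and recombining the vertices of the second simplex with the same coordinates. A small self-contained numpy sketch (our own helper functions, assuming nondegenerate simplices):

import numpy as np

def barycentric(point, vertices):
    """Barycentric coordinates of `point` w.r.t. a nondegenerate simplex
    given by its d+1 `vertices` (array of shape (d+1, d))."""
    d = vertices.shape[1]
    A = np.vstack([vertices.T, np.ones(d + 1)])      # (d+1) x (d+1) system
    b = np.append(point, 1.0)
    return np.linalg.solve(A, b)                     # coefficients summing to 1

def tracing_map(point, Q_vertices, V_vertices):
    """rho_{s,a,s'}: the affine map sending Q^{pi_i}(s,a) to V^{pi_i}(s')."""
    lam = barycentric(point, Q_vertices)
    return lam @ V_vertices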
For this, choose an arbitrary “anchor” point e ∈ Ea ; here we let e be the
average of the vertices, C(Ea ), but any other standard type of center would also
work (e.g. analytic center, center of mass/centroid, Chebyshev center). Now, let
X_{s'} = E_a − e + ρ_{s,a,s'}(e) be shifted copies of the action-aspiration. We would like to use the X_{s'} as state-aspirations, and indeed they have the property that

\mathbb{E}_{s' \sim s,a}(f(s, a, s') + X_{s'}) = \mathbb{E}_{s' \sim s,a}(f(s, a, s') + ρ_{s,a,s'}(e) − e + E_a)   (8)
                                       = e − e + \mathbb{E}_{s' \sim s,a}(E_a) = E_a,  as E_a is convex.   (9)

This is almost what we want, but it might be that Xs′ is not a subset of V R (s′ ). To
rectify this, we opt for a shrinking approach, setting Es′ = rs′ ·(Ea −e)+ρs,a,s′ (e)
for the largest rs′ ∈ [0, 1] such that the result is a subset of V R (s′ ). As Es′ does
not depend on any other successor state s′′ ̸= s′ , we can wait until the true s′ is
known and only compute Es′ for that s′ , which saves computation time.
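Because barycentric coordinates are affine, the largest admissible shrinking factor r_{s'} can even be computed without a linear program: for every vertex v of E_a, the coordinates of ρ_{s,a,s'}(e) + r(v − e) with respect to V^R(s') are affine in r, and we only need the largest r ∈ [0, 1] that keeps all of them nonnegative. A self-contained sketch under the same assumptions as above (nondegenerate simplices; helper names are ours, and rho_e denotes ρ_{s,a,s'}(e)):

import numpy as np

def barycentric(point, vertices):
    d = vertices.shape[1]
    A = np.vstack([vertices.T, np.ones(d + 1)])
    return np.linalg.solve(A, np.append(point, 1.0))

def shrink_factor(Ea_vertices, e, rho_e, V_ref_vertices, eps=1e-12):
    """Largest r in [0, 1] such that r*(E_a - e) + rho(e) fits inside V^R(s').
    Containment is checked per vertex via barycentric coordinates, which are
    affine in r, so each coordinate gives a simple ratio bound on r."""
    base = barycentric(rho_e, V_ref_vertices)        # >= 0 since rho(e) lies in V^R(s')
    r = 1.0
    for v in Ea_vertices:
        slope = barycentric(rho_e + (v - e), V_ref_vertices) - base
        for b_i, s_i in zip(base, slope):
            if s_i < -eps:                           # this coordinate decreases with r
                r = min(r, b_i / (-s_i))
    return max(r, 0.0)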

Proposition 1. Given all values V^{π_i}(s) and Q^{π_i}(s, a) and both the k_c ≥ d + 1 constraints and k_v ≥ d + 1 vertices defining E_a, the shrinking version of action-to-state aspiration propagation has time complexity O([k_c^{1.5} d + (d k_v)^{1.5}] L), where L is the precision parameter defined in [15]. If E_a is a simplex, this is O(d^3 L).

Proof. A linear program (LP) with m constraints and n variables has time complexity O(f(m, n) L), where f(m, n) = (m + n)^{1.5} n and L is a precision parameter [15]. C(E_a) can be calculated with an LP with m = k_c + 1 and n = d + 1. Finding the convex coefficients for ρ_{s,a,s'} requires solving a system of d + 1 linear equations, needing time O(d^ω), where 2 ≤ ω < 2.5 is the exponent of matrix multiplication. Finding the shrinking factor r is an LP with m = (d + 1) k_v and n = 1, and then computing the constraints and vertices of E_{s'} is O(d(k_c + k_v)). In all, this gives a time complexity of O(L f(k_c, d) + d^ω + L f(d k_v, 1) + d(k_c + k_v)) ≤ O(L(k_c + d)^{1.5} d + d^ω + L(d k_v)^{1.5}) ≤ O(L(k_c^{1.5} d + (d k_v)^{1.5})). ⊓⊔

4.2 Choosing actions and action-aspirations


This is the core of our construction. In state s with state-aspiration E, the policy
probabilistically selects an action a and action-aspiration Ea as follows:
(i) For each of the d + 1 vertices V πi (s) of the state’s reference simplex
V R (s), find a directional action set Ai ⊆ A containing those actions whose
reference simplices QR (s, a) “lie between” E and the vertex V πi (s).
(ii) From the full action set A0 = A and from each directional action set Ai
(i = 1 . . . d+1) independently, use some arbitrary, potentially probabilistic
procedure to select one element, giving candidate actions a0 , . . . , ad+1 .
(iii) For each candidate ai , compute an action-aspiration Eai by shifting and
shrinking the state-aspiration E into the reference simplex QR (s, ai ).
(iv) Compute a probability distribution p ∈ ∆({0, . . . , d + 1}) that makes
Ei∼p Eai ⊆ E and has as large a p0 as possible.
(v) Finally execute candidate action ai with probability pi (i = 0 . . . d + 1)
and memorize its action-aspiration Eai .
We now describe steps (i)–(iv) in detail. Algorithm 2 has the corresponding
pseudocode, and Figure 1 illustrates the involved geometric operations.

(i) Directional action sets. Compute the average of the vertices of E, x = C(E),
and the “shape” E ′ = E − x. For i = 0, put Ai = A. For i > 0, let Xi be the
segment from x to the vertex V πi (s) and check for each action a ∈ A whether
its reference simplex QR (s, a) intersects Xi , using linear programming. Let Ai
be the set of those a, which is nonempty since πi (s) ∈ Ai by definition of V πi (s).
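The check whether Q^R(s, a) intersects the segment X_i can be phrased as a small feasibility LP: the segment from x to V^{π_i}(s) meets the simplex iff some convex combination of the simplex's vertices equals some point x + t(V^{π_i}(s) − x) with t ∈ [0, 1]. A scipy sketch (our own helper; one of several possible formulations):

import numpy as np
from scipy.optimize import linprog

def segment_hits_simplex(x, v, Q_vertices):
    """Feasibility LP: does the segment from x to v intersect conv(Q_vertices)?
    Variables are t in [0, 1] and convex weights lambda over the d+1 vertices."""
    Q = np.asarray(Q_vertices)
    d, k = len(x), Q.shape[0]
    A_eq = np.zeros((d + 1, 1 + k))
    A_eq[:d, 0] = v - x                 # x + t*(v - x) = sum_k lambda_k * q_k
    A_eq[:d, 1:] = -Q.T
    A_eq[d, 1:] = 1.0                   # sum_k lambda_k = 1
    b_eq = np.concatenate([-x, [1.0]])
    bounds = [(0.0, 1.0)] + [(0.0, None)] * k
    res = linprog(np.zeros(1 + k), A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.success

# A_i (for i > 0) then collects all actions whose reference simplex meets the segment:
#   A_i = [a for a in A if segment_hits_simplex(x, V_vertex_i, QR_vertices[a])]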

Algorithm 2 Action selection (“shrinking” variant)

Require: reference simplex vertices V^{π_i}(s), Q^{π_i}(s, a); state s; state-aspiration E; shrinking schedule r_max(s).
1: x ← C(E); E′ ← E − x   ▷ center, shape
2: for i = 0, . . . , d + 1 do
3:   A_i ← DirectionalActionSet(i)
4:   a_i ← SampleCandidateAction(A_i)   ▷ any (probabilistic) choice from A_i
5:   E_{a_i} ← ShrinkAspiration(a_i, i)   ▷ action-aspiration
6: solve linear program p_0 = max!, \sum_{i=0}^{d+1} p_i E_{a_i} ⊆ E, for p ∈ ∆({0, . . . , d + 1})
7: sample i ∼ p and execute a_i
8: memorize m ← (s, a_i, E_{a_i})

9: procedure DirectionalActionSet(i)
10:   if i = 0, return A
11:   X_i ← conv{x, V^{π_i}(s)}   ▷ segment from x to the ith vertex
12:   return {a ∈ A | Q^R(s, a) ∩ X_i ≠ ∅}   ▷ actions with feasible points on X_i
13: procedure ShrinkAspiration(a_i, i)
14:   v ← V^{π_i}(s) if i > 0, else C(Q^R(s, a_i))   ▷ vertex or center of target
15:   y ← v − x   ▷ shifting direction
16:   r ← max{r ∈ [0, r_max(s)] | ∃ℓ ≥ 0, ℓy + x + rE′ ⊆ Q^R(s, a_i)}   ▷ size of the largest shifted and shrunk copy of E that fits into Q^R(s, a_i)
17:   ℓ ← min{ℓ ≥ 0 | ℓy + x + rE′ ⊆ Q^R(s, a_i)}   ▷ shortest shifting distance that makes it fit
18:   return ℓy + x + rE′   ▷ shift and shrink

(ii) Candidate actions. For each i = 0 . . . d + 1, use an arbitrary, possibly prob-


abilistic procedure to select a candidate action ai ∈ Ai . Since A0 = A, any
possible action might be among the candidate actions. This freedom can be
used to improve the policy in terms of the evaluation metrics, e.g., by choosing
actions that are expected to lead to a rather low variance of the received Total.
It can also be used to incorporate additional action-selection criteria that are un-
related to these evaluation metrics, to increase overall safety, e.g., by preferring
actions that avoid unnecessary side-effects. We’ll discuss this in Section 6.

(iii) Action-aspirations. Given direction i and action a_i, we now aim to select a large subset E_{a_i} ⊆ Q^R(s, a_i) that fits into a shifted version z_i + E′ of the state-aspiration's shape. We determine a direction y towards which to shift E: if i = 0, towards the average of the vertices, C(Q^R(s, a_i)), otherwise towards the reference vertex V^{π_i}(s). This is lines 14 and 15 of Alg. 2. As before, we use shrinking: find the largest shrinking factor r ∈ [0, r_max(s)] for which there is a shifting distance ℓ ≥ 0 so that ℓy + x + rE′ ⊆ Q^R(s, a_i), using a linear program with two variables r, ℓ, and then find the smallest such ℓ for that r using another linear program. The “shrinking schedule” r_max(s) ∈ [0, 1] might be used to enforce some amount of shrinking to increase the freedom for choosing the action mixture, which is the next step.
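The two small linear programs for r and ℓ can be set up via barycentric coordinates with respect to Q^R(s, a_i), which are affine in both variables. The following scipy sketch (helper names and interfaces are ours; it assumes the ray from x in direction y meets Q^R(s, a_i), which holds by construction of y) first maximizes r and then minimizes ℓ:

import numpy as np
from scipy.optimize import linprog

def barycentric_map(simplex_vertices):
    """Return a function computing barycentric coordinates w.r.t. the simplex."""
    d = simplex_vertices.shape[1]
    A = np.vstack([simplex_vertices.T, np.ones(d + 1)])
    return lambda p: np.linalg.solve(A, np.append(p, 1.0))

def shift_and_shrink(E_vertices, x, y, Q_vertices, r_max):
    """Largest r in [0, r_max] and then smallest shift length l such that
    l*y + x + r*(E - x) is contained in the simplex conv(Q_vertices)."""
    bary = barycentric_map(np.asarray(Q_vertices))
    b0 = bary(x)
    Ly = bary(x + y) - b0                       # linear response to the shift direction
    rows, rhs = [], []
    for v in np.asarray(E_vertices):
        Lw = bary(v) - b0                       # linear response to this vertex of E'
        # need b0 + l*Ly + r*Lw >= 0 componentwise, i.e. -r*Lw - l*Ly <= b0
        rows.append(np.column_stack([-Lw, -Ly]))
        rhs.append(b0)
    A_ub, b_ub = np.vstack(rows), np.concatenate(rhs)
    res_r = linprog([-1.0, 0.0], A_ub=A_ub, b_ub=b_ub,
                    bounds=[(0.0, r_max), (0.0, None)], method="highs")   # maximize r
    r = res_r.x[0]
    res_l = linprog([0.0, 1.0], A_ub=A_ub, b_ub=b_ub,
                    bounds=[(r, r), (0.0, None)], method="highs")         # then minimize l
    l = res_l.x[1]
    return r, l, [l * y + x + r * (v - x) for v in np.asarray(E_vertices)]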

(iv) Suitable mixture of candidate actions. We next find probabilities p0 , . . . , pd+1


for the candidate actions so that the corresponding mixture of aspirations Eai

[Figure 1 omitted: a 2-d illustration showing the state reference simplex V^R(s) with vertices V_i(s) = Q_i(s, π_i(s)), the action reference simplices Q^R(s, a_i), the state-aspiration E with center x, the segments X_i, and the resulting action-aspirations E_{a_i}.]

Fig. 1. Construction of action-aspirations E_a from state-aspiration E and reference simplices V^R(s) and Q^R(s, a) by shifting and shrinking. See main text for details.

is a subset of E, \sum_{i=0}^{d+1} p_i E_{a_i} ⊆ E. We show below that this equation has a solution. Because we want the action a_0 that was chosen freely from the whole action set A to be as likely as possible, we maximize its probability p_0. This is done in line 6 of Algorithm 2 using linear programming. Note that the smaller the sets E_a, the looser the set inclusion constraint and thus the larger p_0. We can influence the latter via the shrinking schedule r_max(s), for example by putting r_max(s) = (1 − 1/T(s))^{1/d} where T(s) is the remaining number of time steps at state s, which would reduce the amount of freedom (in terms of the volume of the aspiration set) linearly and end in a point aspiration for terminal states.
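One convenient way to encode the set-inclusion constraint of line 6 is via support functions: a weighted Minkowski sum \sum_i p_i E_{a_i} is contained in E = {z : Cz ≤ h} iff for every facet normal c_k of E, \sum_i p_i \max_{v ∈ E_{a_i}} c_k · v ≤ h_k, which is linear in p. A scipy sketch of the resulting LP (our own formulation and helper names; it assumes E is given by its constraints and each E_{a_i} by its vertices):

import numpy as np
from scipy.optimize import linprog

def mixture_probabilities(E_normals, E_offsets, aspiration_vertex_lists):
    """Maximize p_0 subject to the weighted Minkowski sum  sum_i p_i * E_{a_i}
    being contained in E = {z : C z <= h} and p lying in the probability simplex.
    Containment is encoded via support functions: for each facet (c_k, h_k) of E,
    sum_i p_i * max_{v in E_{a_i}} c_k . v  <=  h_k, which is linear in p."""
    C, h = np.asarray(E_normals), np.asarray(E_offsets)
    n = len(aspiration_vertex_lists)                     # number of candidates (d + 2)
    S = np.array([[np.max(np.asarray(V) @ c) for V in aspiration_vertex_lists] for c in C])
    objective = np.zeros(n)
    objective[0] = -1.0                                  # maximize p_0
    res = linprog(objective, A_ub=S, b_ub=h,
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return res.x

Lemma 2 below guarantees that this LP is feasible.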
Lemma 2. The linear program in line 6 of Algorithm 2 has a solution.
Proof. Because x ∈ V^R(s), there are convex coefficients p' with \sum_{i=1}^{d+1} p'_i (V^{π_i}(s) − x) = 0. As each shifting vector z_{a_i} is a positive multiple of V^{π_i}(s) − x, there are also convex coefficients p with \sum_{i=1}^{d+1} p_i z_{a_i} = 0. Since E_{a_i} ⊆ E + z_{a_i} and E is convex, we then have

\sum_i p_i E_{a_i} ⊆ \sum_i p_i (E + z_{a_i}) ⊆ E + \sum_i p_i z_{a_i} = E. ⊓⊔   (10)
Proposition 2. Given all values V^{π_i}(s) and Q^{π_i}(s, a) and both the k_c ≥ d + 1 many constraints and k_v ≥ d + 1 many vertices defining E, this part of the construction (Algorithm 2) has time complexity O([k_v^{1.5} d^{2.5} |A| + (k_v k_c)^{1.5} d] L). If E is a simplex, this is O(d^4 |A| L).

Proof. Using the notation from Prop. 1, computing x is O(f(k_v, d) L). For each i, a, verifying whether a ∈ A_i and computing r, ℓ is done by LPs with m ≤ O(d k_v) constraints and n ≤ 2 variables, giving O(d |A| f(d k_v, 2) L) = O(d^{2.5} k_v^{1.5} |A| L). The LP for calculating p has m = d + 4 + k_v k_c and n = d + 2, hence complexity O((k_v k_c)^{1.5} d L). The other arithmetic operations are of lower complexity. ⊓⊔


5 Determining appropriate reference policies

First, find some feasible aspiration point x ∈ E0 ∩ V(s0 ) (e.g., using binary
search). We now aim to find policies whose values are likely to contain x in their
convex hull, by using backwards induction and greedily minimizing the angle
between the vector Eτ − x and a suitable direction in evaluation space.
More precisely: Pick an arbitrary direction (unit vector) y1 ∈ Rd , e.g., uni-
formly at random. Then, for k = 1, 2, . . . until stopping:

1. Let π_k be the pure Markov policy π defined by backwards induction as

   π(s) = \arg\max_{a \in A} \frac{y_k^{\top}(Q^π(s, a) − x)}{\lVert Q^π(s, a) − x \rVert_2}.   (11)

Let vk = V πk (s0 ) be the resulting candidate vertex for our reference simplex.
If k ≥ d + 1, run a primal-sparse linear program solver [18] to determine if
x ∈ conv{v1 , . . . , vk }. If so, the solver will return a basic feasible solution,
i.e. d + 1 many vertices that contain x in their convex hull. Let the policies
corresponding to the vertices that are part of the basic feasible solution be
our reference policies and stop. Otherwise, continue to step 2 below:
2. Let e_k = (x − v_k)/‖x − v_k‖_2 be the unit vector in the direction from v_k to x.
3. Let y_{k+1} = \sum_{i=1}^{k} e_i / k be the average of all those directions. Assuming that x ∈ V(s_0) and because of the hyperplane separation theorem, choosing directions like this ensures that the algorithm doesn't loop with v_{l+1} = v_l for some l. Also, note that to check x ∈ conv{v_1, . . . , v_k, v_{k+1}} it is sufficient to check v_{k+1} − x ∈ cone{x − v_1, . . . , x − v_k}, the cone generated by the vectors pointing from the vertices towards x. So hunting for a policy whose value lies approximately in the direction y_{k+1} from x gives us a good chance of finding a vertex in the aforementioned cone; a code sketch of this search loop is given below.
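A condensed Python sketch of the search loop follows (our own simplification, using the same illustrative tabular model interfaces as before). For brevity it checks membership of x in the convex hull with a dense LP and returns all candidate vertices once x lies in their hull; the procedure above instead extracts a basic feasible solution with exactly d + 1 vertices using a primal-sparse solver [18].

import numpy as np
from scipy.optimize import linprog

def greedy_reference_policy_value(y, x, states_rev_topo, terminal_states, actions, T, f, d):
    """Backwards induction for Eq. (11): in each state pick the action whose
    Q-vector, seen from the aspiration point x, points most in direction y."""
    V = {s: np.zeros(d) for s in terminal_states}
    for s in states_rev_topo:
        if s in terminal_states:
            continue
        best_val, best_q = -np.inf, None
        for a in actions(s):
            q = sum(p * (f(s, a, s2) + V[s2]) for s2, p in T(s, a).items())
            val = y @ (q - x) / (np.linalg.norm(q - x) + 1e-12)
            if val > best_val:
                best_val, best_q = val, q
        V[s] = best_q
    return V

def in_convex_hull(x, points):
    """LP feasibility check: is x a convex combination of the given points?"""
    P = np.asarray(points)
    A_eq = np.vstack([P.T, np.ones(len(P))])
    b_eq = np.append(x, 1.0)
    res = linprog(np.zeros(len(P)), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * len(P), method="highs")
    return res.success

def find_reference_vertices(x, s0, states_rev_topo, terminal_states, actions, T, f, d,
                            max_iter=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    y = rng.normal(size=d)
    y /= np.linalg.norm(y)                               # arbitrary initial direction y_1
    vertices, directions = [], []
    for _ in range(max_iter):
        V = greedy_reference_policy_value(y, x, states_rev_topo, terminal_states,
                                          actions, T, f, d)
        v = V[s0]                                        # candidate vertex v_k
        vertices.append(v)
        if len(vertices) >= d + 1 and in_convex_hull(x, vertices):
            return vertices                              # x in conv{v_1, ..., v_k}: done
        directions.append((x - v) / np.linalg.norm(x - v))
        y = np.mean(directions, axis=0)                  # y_{k+1} = average direction
    return None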

We were not able to prove any complexity bounds for this part, but performed
numerical experiments with random binary tree-shaped environments of various
time horizons, with only two actions (making the directions vk −x deviate rather
much from y and thus presenting a kind of worst-case scenario) and two possible
successor states per step, and uniformly random transition probabilities and
Deltas in [0, 1]^d. These suggest that the expected number of iterations of the
above is O(d), which we thus conjecture to be also the case for other sufficiently
well-behaved random environments. Indeed, even if the policies π_k were chosen uniformly at random (rather than targeted like here) and the corresponding points V^{π_k}(s_0) were distributed uniformly in all directions around x (which is a plausible uninformative prior assumption), then one can show easily (using [17]) that the expected number of iterations would be exactly 2d + 1.⁷
⁷ According to [17], the probability that we need exactly k ≥ d + 1 iterations is f(k−1) − f(k) with f(k) = 2^{1−k} \sum_{ℓ=0}^{d−1} \binom{k−1}{ℓ}. Hence \mathbb{E}k = \sum_{k=d+1}^{∞} k (f(k−1) − f(k)) = (d+1) f(d) + \sum_{k=d+1}^{∞} f(k). It is then reasonably simple to prove by induction that \sum_{k=d+1}^{∞} f(k) = d, and hence \mathbb{E}k = 2d + 1.

6 Selection of candidate actions


As we have seen in section 4.2 (ii), when choosing actions, we still have many
remaining degrees of freedom. Thus, we can use additional criteria to choose
actions while still fulfilling the aspirations. We discuss a few candidate criteria
here which are related either to gaining information, improving performance, or
reducing potential safety-related impacts of implementing the policy.
For many of the criteria, there are myopic versions, which only rely on quan-
tities that are already available at each step in the algorithms presented so far,
or farsighted versions which depend on the continuation policy and thus have to
be specifically computed recursively via Bellman-style equations.

Information-related criteria. If the used world model is imperfect, one might


want the agent to aim to gain knowledge by exploration, e.g. by considering
some measure of expected information gain such as the evidence lower bound.

6.1 Performance-related criteria


For now, the task of the agent in this paper has been given by specifying aspira-
tion sets for the expected total of the evaluation function. It is natural to consider
extensions of this approach to further properties of the trajectory distribution,
e.g. by specifying that the variance of the total should be small.
A simple, myopic approach to reducing variance is preferring actions and
action-aspirations that are somehow close to the state aspiration E, e.g. by choos-
ing action-aspirations where the Hausdorff distance dH (Ea , E) is small. A more
principled, farsighted approach would be choosing actions and action-aspirations
such that the variance of the resulting total is small. Based on equation (2), the
variance can be computed from the total raw second moment M^π as

M^π(s, a, E_a) = \mathbb{E}_{s', E_{s'} \sim s, a, E_a}\Bigl( \lVert f(s, a, s') \rVert_2^2 + 2 f(s, a, s') \cdot V^π(s', E_{s'}) + \mathbb{E}_{a', E_{a'} \sim π(s', E_{s'})} M^π(s', a', E_{a'}) \Bigr),   (12)
Var(s, a, E_a) = M^π(s, a, E_a) − \lVert Q^π(s, a, E_a) \rVert_2^2.   (13)
Note that computing this farsighted metric requires knowing the continuation
policy π, for which algorithm 2 does not suffice in its current form as it only
samples actions. It is however easy to convert it to an algorithm for computing
the whole local policy π(s, E), which is described in the Supplement.

6.2 Safety-related criteria


As mentioned in the introduction, unintended consequences of optimization can be a source of safety problems, thus we suggest not to use any of the criteria introduced in this section as maximization/minimization goals that completely determine the chosen actions; instead, they can be combined into a loss for a softmin action selection policy π_i(a ∈ A_i) ∝ \exp(−β \sum_j α_j g_j(a)), where the g_j(a) are the individual criteria. Indeed, in analogy to quantilizers, choosing among adequate actions at random can by itself be considered a useful safety measure, as a random action is very unlikely to be special in a harmful way.
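A minimal sketch of such a softmin selection (our own helper; the criteria g_j are passed as callables, and the weights α_j and inverse temperature β are free parameters):

import numpy as np

def softmin_select(candidate_actions, criteria, alphas, beta, rng=None):
    """Pick one of the adequate actions with probability proportional to
    exp(-beta * sum_j alpha_j * g_j(a)), combining several loss-like criteria."""
    rng = np.random.default_rng() if rng is None else rng
    losses = np.array([sum(alpha * g(a) for alpha, g in zip(alphas, criteria))
                       for a in candidate_actions])
    logits = -beta * (losses - losses.min())     # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return candidate_actions[rng.choice(len(candidate_actions), p=p)]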

Disordering potential. Our first safety criterion is related to the idea of “fail
safety”, and (somewhat more loosely) to “power seeking”. More precisely, it aims
to prevent the agent from moving the system into a state from which it could
make the environment’s state trajectory become very unpredictable (bring the
environment into “disorder”) because of an internal failure, or if it wanted to. We
define the disordering potential at a state to be the Shannon entropy H π (st ) of
the stochastic state trajectory S>t = (St+1 , St+2 , . . . ) that would arise from the
policy π which maximizes that entropy:

H^π(s_t) := \mathbb{E}_{s_{>t} | s_t, π}(− \log \Pr(s_{>t} | s_t, π)).   (14)

It is straightforward to compute this quantity using the Bellman-type equations

H^π(s) = 1_{s ∉ S_⊤} \mathbb{E}_{a ∼ π(s)}(− \log π(s)(a) + H^π(s, a)),   (15)
H^π(s, a) = \mathbb{E}_{s' ∼ s, a}(− \log T(s, a)(s') + H^π(s')).   (16)

To find the maximally disordering policy π, we assume π(s') and thus H^π(s') is already known for all potential successors s' of s. Then H^π(s, a) is also known for all a, and to find p_a = π(s)(a) we need to maximize f(p) = \sum_a p_a (H^π(s, a) − \log p_a) subject to \sum_a p_a = 1. Using Lagrange multipliers, we find that for all a, ∂_{p_a} f(p) = H^π(s, a) − \log p_a − 1 = λ for some constant λ, hence p_a ∝ \exp(H^π(s, a)) is a softmax policy w.r.t. future expected Shannon entropy. Therefore

π(s)(a) = p_a = \exp(H^π(s, a)) / Z,   Z = \sum_a \exp(H^π(s, a)),   (17)
H^π(s) = \log Z = \log \sum_a \exp(H^π(s, a)).   (18)
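Because of the log-sum-exp form of Eqs. (17)-(18), the disordering potential of every state can be computed in one backwards sweep. A sketch in Python under the same illustrative tabular interfaces as before (assuming T(s, a) only lists successors with positive probability):

import numpy as np
from scipy.special import logsumexp

def disordering_potential(states_rev_topo, terminal_states, actions, T):
    """One backwards sweep for Eqs. (15)-(18): H^pi(s) = log sum_a exp(H^pi(s, a))
    under the maximally disordering (entropy-softmax) policy."""
    H = {s: 0.0 for s in terminal_states}
    for s in states_rev_topo:
        if s in terminal_states:
            continue
        H_sa = []
        for a in actions(s):
            # H^pi(s, a) = E_{s'~s,a}[ -log T(s,a)(s') + H^pi(s') ]        (Eq. (16))
            H_sa.append(sum(p * (-np.log(p) + H[s2]) for s2, p in T(s, a).items()))
        H[s] = float(logsumexp(H_sa))                                      # Eq. (18)
    return H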

Deviation from default policy. If we have access to a default policy π 0 (e.g. a


policy that was learned by observing humans or other agents performing similar
tasks), we might want to choose actions in a way that is similar to this default
policy. An easy way to measure this is by using the Kullback–Leibler divergence
from the default policy π 0 to the agent’s policy π. Given that we do not know
the local policy π(s) yet when we decide how to choose the action in the state
s, we use an estimate p̂a (e.g. p̂a = 1/(2 + d)) instead to compute the expected
total Kullback–Leibler divergence like

KLdiv(s, p̂_a, a, E_a) = \log(p̂_a / π^0(s)(a)) + \mathbb{E}_{s' ∼ s, a} KLdiv(s', g(s, a, E_a, s')),   (19)
KLdiv(s, E) = 1_{s ∉ S_⊤} \mathbb{E}_{(a, E_a) ∼ π(s, E)} KLdiv(s, π(s, E)(a), a, E_a),   (20)

where g : (s, a, E_a, s') ↦ E_{s'} implements action-to-state aspiration propagation.

7 Discussion and conclusion


7.1 Special cases
A single evaluation metric. It is natural to ask what our algorithm reduces
to in the single-criterion case d = 1. The reference simplices can then sim-
ply be taken to be the intervals V R (s) = [V πmin (s), V πmax (s)] and QR (s, a) =

[Qπmin (s, a), Qπmax (s, a)], where πmin , πmax are the minimizing and maximizing
policies for the single evaluation metric. Aspiration sets are also intervals, and
action-aspirations Ea are constructed by shifting the state-aspiration E upwards
or downwards into QR (s, a) and shrinking it to that interval if necessary. To
maximize p_0, the linear program for p will assign zero probability to that “directional” action a_1 or a_2 whose E_a lies in the same direction from E as E_{a_0} does. In other words, the agent will mix between the “freely” chosen action a_0 and a suitable amount of a “counteracting” action a_1 or a_2 in the other direction.
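For d = 1, shifting and shrinking reduce to simple interval arithmetic. A minimal sketch of this step (our own minimal-shift variant, ignoring the shrinking schedule r_max):

def shift_and_shrink_interval(E, Q):
    """d = 1 special case: shift the state-aspiration interval E = (lo, hi)
    into the reference interval Q = (qlo, qhi), shrinking it if it is too wide."""
    lo, hi = E
    qlo, qhi = Q
    width = min(hi - lo, qhi - qlo)          # shrink only if necessary
    if lo < qlo:                             # shift upwards
        return (qlo, qlo + width)
    if hi > qhi:                             # shift downwards
        return (qhi - width, qhi)
    return (lo, hi)                          # already fits

For instance, shift_and_shrink_interval((2.0, 5.0), (0.0, 4.0)) returns (1.0, 4.0): the aspiration is shifted downwards into the reference interval without needing to shrink.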

Relationship to satisficing. A subcase of the d = 1 case is when the upper


bound of the initial state-aspiration interval coincides with the maximal possible
value, E0 = [e, V πmax (s0 )], i.e., when the goal is to achieve an expected Total of
at least e. The agent then starts out as a form of “satisficer” [10]. However,
due to the shrinking of aspirations over time, aspiration sets of later states s′
might no longer be of the same form but might end at values strictly lower
than V πmax (s′ ) if the interval [V πmin (s′ ), V πmax (s′ )] is wider than the interval
[Qπmin (s, a), Qπmax (s, a)]. In other words, even an initial satisficer can turn into
a “proper” aspiration-based agent in our algorithm that avoids maximization in
more situations than a satisficer would. In particular, also the form of satisficing
known as “quantilization” [13], where all feasible expected Totals above some
threshold get positive probability, is not a special case of our algorithm. One
can however change the algorithm to quantilization behaviour by constructing
successor state aspirations differently, by simply applying the tracing map to the
interval, Es′ = ρs,a,s′ [Ea ] (which is not feasible for d > 1).

Probabilities of desired or undesired events. Another special case is when d > 1


but the d evaluation metrics are simply indicator functions for certain events.
E.g., assume all Deltas are zero except when reaching a terminal state s′ ∈ S⊤ ,
in which case fi (s, a, s′ ) = 1(s′ ∈ Ei ) for some subset of desirable or undesirable
states Ei ⊆ S⊤ . If the first k ≤ d many events are desirable in the sense that we
want each probability Pr(Ei ) to be ≥ α for some α < 1, and the other d−k many
events are undesirable in the sense that we want each probability Pr(Ej ) to be
≤ β for some β > 0, then we can encode this goal as the initial aspiration set
E_0 = [α, 1]^k × [0, β]^{d−k}. Note that the different events need not be independent or
mutually exclusive, as long as the aspiration is feasible. Aspirations of this type
might be especially natural in combination with methods of inductive reasoning
and belief revision that are also based on this type of encoding [8]. This could
eventually be useful for a “provably safe” approach to AI [6].

7.2 Relationship to reinforcement learning


Even though we formulated our approach in a planning framework where the
environment’s transition probabilities are known and simple enough to admit
dynamic programming, it is clear from Eq. (11) that the required reference poli-
cies π and corresponding reference vertices V π (s), Qπ (s, a) can in principle also
be approximated by reinforcement learning techniques such as (deep) expected

SARSA in more complex environments or environments that are given only as


samplers without access to transition probabilities. For the single-criterion case,
preliminary results from numerical experiments suggest that this is indeed a viable approach.⁸ Future work should explore this further and also consider using
approximate dynamic programming methods (e.g., [2]).
If the expected number of learning passes needed to find the necessary reference policies is indeed O(d) as conjectured (see end of Section 5)⁹, our approach might turn out to have much lower average complexity than the alternative reinforcement learning approach to convex aspirations from [9], which appears to require up to O(ε^{−2}) many learning passes to achieve an error of less than ε.

7.3 Invariance under reparameterization

For many applications there will be several possible parameterizations of the d-dimensional evaluation space into d different evaluation metrics, so the question
arises which parts of our approach are invariant under which types of reparam-
eterizations of evaluation space. It is easy to see that all parts are invariant
under affine transformations, except for the algorithm for finding reference poli-
cies which is only invariant under orthogonal transformations since it makes use
of angles, and except for certain safety criteria such as total variance.

Acknowledgments. We thank the members of the SatisfIA project, AI Safety Camp,


the Supervised Program for Alignment Research, and the organizers of the Virtual AI
Safety Unconference.

Supplementary Materials. This article is accompanied by a supplementary text containing alternative versions of the main algorithm, and a supplementary video illustrating the evolution of action-aspirations over a sample episode with d = 2. Python code is available at https://doi.org/10.5281/zenodo.13221511.

Disclosure of Interests. The authors have no competing interests to declare.

CRediT author statement. Authors are listed in alphabetical ordering and have
contributed equally. Simon Dima: Formal analysis, Writing - Original Draft, Writing -
Review & Editing. Simon Fischer: Formal analysis, Writing - Original Draft, Writing -
Review & Editing. Jobst Heitzig: Conceptualization, Methodology, Software, Writing -
Original Draft, Writing - Review & Editing, Supervision. Joss Oliver: Formal analysis,
Writing - Original Draft, Writing - Review & Editing.

⁸ Note however that some safety-related action selection criteria, especially those based on information-theoretic concepts, require access to transition probabilities which would then have to be learned in addition to the reference simplices.
⁹ Farsighted action selection criteria would require an additional learning pass to also learn the actual policy and the resulting action evaluations.

References
1. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016)
2. Bonet, B., Geffner, H.: Solving POMDPs: RTDP-Bel vs. point-based algorithms. In: IJCAI. pp. 1641–1646. Pasadena CA (2009)
3. Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., et al.: Decision transformer: Reinforcement learning via sequence modeling (2021)
4. Clymer, J., et al.: Generalization analogies (GENIES): A testbed for generalizing AI oversight to hard-to-measure domains. arXiv preprint arXiv:2311.07723 (2023)
5. Conitzer, V., Freedman, R., Heitzig, J., Holliday, W.H., Jacobs, B.M., Lambert, N., Mossé, M., Pacuit, E., Russell, S., et al.: Social choice for AI alignment: Dealing with diverse human feedback. arXiv preprint arXiv:2404.10271 (2024)
6. Dalrymple, D., Skalse, J., Bengio, Y., Russell, S., Tegmark, M., Seshia, S., Omohundro, S., Szegedy, C., et al.: Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624 (2024)
7. Feinberg, E.A., Sonin, I.: Notes on equivalent stationary policies in Markov decision processes with total rewards. Math. Methods Oper. Res. 44(2), 205–221 (1996). https://doi.org/10.1007/BF01194331
8. Kern-Isberner, G., Spohn, W.: Inductive reasoning, conditionals, and belief dynamics. Journal of Applied Logics 2631(1), 89 (2024)
9. Miryoosefi, S., Brantley, K., Daumé, H., Dudík, M., Schapire, R.E.: Reinforcement learning with convex constraints. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019)
10. Simon, H.A.: Rational choice and the structure of the environment. Psychological Review 63(2), 129 (1956)
11. Skalse, J.M.V., Farrugia-Roberts, M., Russell, S., Abate, A., Gleave, A.: Invariance in policy optimisation and partial identifiability in reward learning. In: International Conference on Machine Learning. pp. 32033–32058. PMLR (2023)
12. Subramani, R., Williams, M., et al.: On the expressivity of objective-specification formalisms in reinforcement learning. arXiv preprint arXiv:2310.11840 (2023)
13. Taylor, J.: Quantilizers: A safer alternative to maximizers for limited optimization (2015), https://intelligence.org/files/QuantilizersSaferAlternative.pdf
14. Tschantz, A., et al.: Reinforcement learning through active inference (2020)
15. Vaidya, P.: Speeding-up linear programming using fast matrix multiplication. In: 30th Annual Symposium on Foundations of Computer Science. pp. 332–337 (1989)
16. Vamplew, P., Foale, C., Dazeley, R., Bignold, A.: Potential-based multiobjective reinforcement learning approaches to low-impact agents for AI safety. Engineering Applications of Artificial Intelligence 100, 104186 (2021)
17. Wendel, J.G.: A problem in geometric probability. Mathematica Scandinavica 11(1), 109–111 (1962)
18. Yen, I.E.H., Zhong, K., Hsieh, C.J., Ravikumar, P.K., Dhillon, I.S.: Sparse linear programming via primal and dual augmented coordinate descent. Advances in Neural Information Processing Systems 28 (2015)
Supplement to: Non-maximizing policies that fulfill multi-criterion aspirations in expectation

Simon Dima¹ [0009-0003-7815-8238], Simon Fischer² [0009-0000-7261-3031], Jobst Heitzig³ [0000-0002-0442-8077], and Joss Oliver⁴ [0009-0008-6333-7598]

¹ École Normale Supérieure, Paris, France, [email protected]
² Independent Researcher, Cologne, Germany
³ Potsdam Institute for Climate Impact Research, Potsdam, Germany, [email protected]
⁴ Independent Researcher, London, UK, [email protected]

1 Clipping instead of shrinking

1.1 While propagating action-aspirations to state-aspirations

In section 4.1 of the main text, we define shifted copies Xs′ of the action-
aspiration E, which would be appropriate candidates for subsequent state-aspi-
rations if not for the fact that they are not necessarily subsets of V R (s′ ). In the
main text, we resolve this by shrinking Xs′ until it fits.
An alternative approach is clipping, where we calculate the sets Xs′ and
simply set Es′ = Xs′ ∩ V R (s′ ).

1.2 While choosing actions and action-aspirations

In section 4.2.(iii) of the main text, we shift the aspiration set in a certain
direction y and shrink it until it fits into the action feasibility reference simplices
QR (s, ai ).
Clipping is also an alternative here, as illustrated in figure 1. To use clipping
instead of shrinking, algorithm 2 of the main text can be modified by replacing
the procedure ShrinkAspiration with the procedure ClipAspiration defined
here:

Algorithm 1 Clipping variation for algorithm 2 of main text.

1: procedure ClipAspiration(a_i, i)
2:   v ← V^{π_i}(s) if i > 0, else C(Q^R(s, a_i))   ▷ vertex or center of target
3:   y ← v − x   ▷ shifting direction
4:   r ← max{r ≥ 0 : x − ry ∈ E}   ▷ distance from the boundary point b of E to x
5:   ℓ ← min{ℓ ≥ 0 : x + ℓy ∈ Q^R(s, a_i)}   ▷ distance from x to the reference simplex
6:   ℓ′ ← max{ℓ′ ≥ 0 : x + ℓ′y ∈ Q^R(s, a_i)}   ▷ distance from x to the far boundary of the reference simplex
7:   z ← min(r + ℓ, ℓ′) y   ▷ shifting vector
8:   return (z + E) ∩ Q^R(s, a_i)   ▷ shift and clip (nonempty)

The direction y is determined as in the shrinking variant. Next, we determine


by linear programming the outermost point b ∈ E lying on the ray from x in the
direction −y. E is shifted in direction y, stopping either when the shifted bound-
ary point b (and thus generally a large part of E) enters the reference simplex
QR (s, ai ), or when the shifted center x exits QR (s, ai ). The appropriate shifting
distance can be computed using linear programming. This is done in lines 5 to 7
of Algorithm 1. Finally, the shifted copy of E is clipped to QR (s, ai ) by uniting
their H-representations to get Eai , i.e., consider the set of all hyperplanes on
which a facet of either E or QR (s, ai ) lies. Then the vertices of Eai are computed
from the resulting H-representation. The sets Eai are nonempty because the ray
from x in direction y intersects QR (s, ai ).
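As a sketch of how the clipped set could be computed with standard tools, one can convert both sets to H-representation, stack the constraints, and recover the vertices from a halfspace intersection. The helper below is our own illustration rather than the accompanying implementation; it assumes the clipped set is full-dimensional, so that a strictly interior point (here the Chebyshev center) exists.

import numpy as np
from scipy.optimize import linprog
from scipy.spatial import ConvexHull, HalfspaceIntersection

def clip_shifted_aspiration(E_vertices, shift, Q_vertices):
    """Vertices of (shift + E) intersected with conv(Q_vertices), both given by
    their vertices, via the union of their H-representations."""
    halfspaces = np.vstack([ConvexHull(np.asarray(E_vertices) + shift).equations,
                            ConvexHull(np.asarray(Q_vertices)).equations])
    A, b = halfspaces[:, :-1], halfspaces[:, -1]         # constraints A z + b <= 0
    # Chebyshev center of the intersection as a strictly interior point
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    res = linprog(np.r_[np.zeros(A.shape[1]), -1.0],     # maximize the inradius
                  A_ub=np.hstack([A, norms]), b_ub=-b,
                  bounds=[(None, None)] * A.shape[1] + [(0.0, None)], method="highs")
    interior_point = res.x[:-1]
    return HalfspaceIntersection(halfspaces, interior_point).intersections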

[Figure 1 omitted: a 2-d illustration analogous to Fig. 1 of the main text, additionally showing the boundary point b and the clipped action-aspirations E_{a_i}.]

Fig. 1. Construction of action-aspirations E_a from state-aspiration E and reference simplices V^R(s) and Q^R(s, a) by shifting and clipping. See main text for details.

1.3 Advantages and disadvantages of clipping

Clipping instead of shrinking has the advantage that it produces strictly larger
propagated aspiration sets, which might be desirable as discussed in section 3 of
the main text.
However, clipping changes the shape of aspiration sets, adding up to d +
1 defining hyperplanes at each step, and requiring recomputation of the set
of vertices which may become combinatorially large. Therefore, we expect the
complexity of the clipping variant to be worse than shrinking, though we have
not studied it formally.

2 Alternate version of action selection/local policy computation

Some action selection criteria which we may wish to use are “farsighted” and
require knowing the future behavior of the policy as well. Algorithm 2 of the
main text does not provide this information, as it samples candidate actions
ai first before determining the probabilities with which to mix them. The al-
gorithm 2 presented here is an adaptation which does provide the full local
policy before sampling from it to choose the action taken. It does so by replac-
ing the sampling in lines 4 and 7 of the original algorithm by a computation
of the respective probabilities. Instead of solving the linear program in line 6
of main-text algorithm 2 for each of the at most |A|^{d+2} many candidate action combinations (a_0, . . . , a_{d+1}), line 7 uses a single linear program invocation manipulating the average action-aspirations for each direction, \mathbb{E}_{a_i ∼ π_i} E_{a_i}. This ensures complexity remains linear in |A|. A solution to this linear program exists since \mathbb{E}_{a_i ∼ π_i} E_{a_i} ⊆ E_s + \mathbb{E}_{a_i ∼ π_i} z_{a_i} and \mathbb{E}_{a_i ∼ π_i} z_{a_i} is a positive multiple of V^{π_i}(s) − x; the proof of Lemma 2 of the main text is readily adapted. Finally, the loop in lines 8 to 11 is necessary as one and the same action may lie in two distinct directional action sets A_i ≠ A_j.

Algorithm 2 Local policy computation (“clip” and “shrink” variants)

Require: Reference simplex vertices V^{π_i}(s), Q^{π_i}(s, a); state s; state-aspiration E_s.
1: x ← C(E_s); E′ ← E_s − x   ▷ center, shape
2: for i = 0, . . . , d + 1 do
3:   A_i ← DirectionalActionSet(i)
4:   π_i ← CandidateActionDistribution(A_i)   ▷ any probability distribution on A_i
5:   for a_i ∈ A_i do
6:     E_{a_i} ← ClipAspiration(a_i, i) or ShrinkAspiration(a_i, i)   ▷ action-aspiration
7: solve linear program p_0 = max!, \sum_{i=0}^{d+1} p_i \mathbb{E}_{a_i ∼ π_i} E_{a_i} ⊆ E_s, for p ∈ ∆({0, . . . , d + 1})
8: initialize π ≡ 0
9: for i = 0, . . . , d + 1 do   ▷ collect probabilities across directions
10:   for a_i ∈ A_i do
11:     π(a_i, E_{a_i}) ← π(a_i, E_{a_i}) + p_i π_i(a_i)
return π   ▷ local policy

3 Video of an example run

In the video accessible at the anonymous download link https://mega.nz/file/gUgTzZqB#Co21aSoKQAo7PjeR9_kzxzquHdjlNav0qjp9wKX3NRU, we illustrate the propagation of aspirations from states to actions in one episode of a simple gridworld environment.
In that environment, the agent starts at the central position (3,3) in a 5-by-5
rectangular grid, can move up, down, left, right, or pass in each of 10 time steps,

and gets a Delta f (s, a, s′ ) = g(s′ ) that only depends on its next position s′ on
the grid. The values g(s′ ) are iid 2-d standard normal random values.
In each time-step, the black dashed triangle is the current state’s reference
simplex V R (s), the blue triangles are the five candidate actions’ reference sim-
plices QR (s, a), the red square is the state-aspiration E, the dotted lines connect
its center, the red dot, with the vertices of V R (s) or with the centers of the sets
QR (s, a), and the green squares are the resulting action-aspirations Ea .
The coordinate system is “moving along” with the received Delta, so that aspirations of consecutive steps can be compared more easily. In other words, the received Delta R_t = \sum_{t'=0}^{t-1} f(s_{t'}, a_{t'+1}, s_{t'+1}) is added to all vertices so that, e.g., the vertices of the black dashed triangle are shown at positions R_t + V^{π_i}(s_t) rather than V^{π_i}(s_t).
This run uses the linear volume shrinking schedule r_max(s) = (1 − 1/T(s))^{1/d}.
As one can see, the state-aspirations indeed shrink smoothly towards a final
point aspiration, which spends the initial amount of “freedom” rather evenly
across states. This way, the agent manages to avoid drift in this case, so that
the eventual point aspiration is still inside the initial aspiration set.

4 Numerical evidence for reference policy selection complexity
In order to study the number of trials needed to find the d + 1 directions that
define suitable reference policies for the definition of reference simplices, we
performed numerical experiments with binary-tree-shaped environments of the
following type.
Each environment has 10 time steps, each state has two actions, and each action can lead to two different successor states, leading to 4^{10} many terminal states. Deltas f(s, a, s') are iid uniform variates in [0, 1]^d. The initial aspiration is (5, . . . , 5).
For each d ∈ {1, . . . , 8}, we perform 1000 independent simulations and find
that the average number k of trials until E0 ∈ conv{v1 , . . . , vk } is below 2d + 1
for all tested d.
