
Non-maximizing policies that fulfill multi-criterion aspirations in expectation

Simon Dima¹ [0009-0003-7815-8238], Simon Fischer² [0009-0000-7261-3031], Jobst Heitzig³ [0000-0002-0442-8077], and Joss Oliver⁴ [0009-0008-6333-7598]

¹ École Normale Supérieure, Paris, France, [email protected]
² Independent Researcher, Cologne, Germany
³ Potsdam Institute for Climate Impact Research, Potsdam, Germany, [email protected]
⁴ Independent Researcher, London, UK, [email protected]

Abstract. In dynamic programming and reinforcement learning, the policy for the sequential decision making of an agent in a stochastic environment is usually determined by expressing the goal as a scalar reward function and seeking a policy that maximizes the expected total reward. However, many goals that humans care about naturally concern multiple aspects of the world, and it may not be obvious how to condense those into a single reward function. Furthermore, maximization suffers from specification gaming, where the obtained policy achieves a high expected total reward in an unintended way, often taking extreme or nonsensical actions.

Here we consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants to be maximized. We assume the task of the agent is to ensure that the vector of expected totals of the evaluation metrics falls into some given convex set, called the aspiration set. Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible. It has complexity linear in the number of possible state–action–successor triples and polynomial in the number of evaluation metrics. Moreover, the explicitly non-maximizing nature of the chosen policy and goals yields additional degrees of freedom, which can be used to apply heuristic safety criteria to the choice of actions. We discuss several such safety criteria that aim to steer the agent towards more conservative behavior.

Keywords: multi-objective decision-making · planning · Markov decision processes · AI safety · satisficing · convex geometry

1 Introduction

In typical reinforcement learning (RL) and dynamic programming problems an


agent is trained or programmed to solve tasks encoded by a single real-valued
reward function that it shall maximize. However, many tasks are not easily

expressed by such a function [12], human preferences are hard to learn and may
not be easy to aggregate across stakeholders [5], and maximizing a misspecified
objective may fall prey to reward hacking [1] and Goodhart’s law [11], leading
to unintended side-effects and potentially harmful consequences.
In this work, we study a particular aspiration-based approach to agent design.
We assume an existing task-specific world model in the form of a fully observed
Markov Decision Process (MDP), where the task is not encoded by a reward
function but instead a multi-criterion evaluation function and a bounded, con-
vex subset of its range, called an aspiration set, that can be thought of as an
“instruction” [4] to the agent. Aspiration-type goals can also naturally arise from
subtasks in complex environments even if the overall goal is to maximize some
objective, when the complexity requires a hierarchical decision-making approach
whose highest level selects subtasks that turn into aspiration sets for lower hierarchical levels.
In our version of aspiration-based agents, the goal is to make the expected value of the Total with respect to this evaluation function fall within the aspiration set, and to select from this set according to certain performance and safety criteria. The agent does so step-wise, exploiting recursion equations similar to the Bellman equation. Thus our approach is like multi-objective reinforcement learning
(MORL), with a primary aspiration-based objective and at least one secondary
objective incorporated via action-selection criteria [16]. Unlike MORL, the com-
ponents of the evaluation function (called evaluation metrics) are not objectives
in the sense of targets for maximization. Rather, an aspiration formulated w.r.t.
several evaluation metrics might correspond to a single objective (e.g., “make
a cup of tea”). Also, at no point does an aspiration-based agent aggregate the
evaluation metrics into a single value. Instead, any trade-offs are built into the
aspiration set itself, similar to what [6] call a “safety specification”. For example,
aspiring to buy a total of 10 oranges and/or apples for at most EUR 1 per item
could be encoded with the aspiration set {(o, a, c) : o, a ≥ 0; c ≤ o + a = 10}.
A set-up similar to ours was used in [9], which applied reinforcement learning to find a policy whose expected discounted reward vector lies inside a convex set. Instead of a reinforcement learning perspective, we use a model-based planning perspective and design an algorithm that explicitly calculates a policy for solving the task, based on a model of the environment.⁵ Also, the
approach in [9] is concerned with guaranteed bounds for the distance between
received rewards and the convex constraints in terms of the number of iterations,
whereas we focus on guaranteeing aspiration satisfaction in a fixed number of
computational steps, providing a verifiable guarantee in the sense of [6].
Other agent designs that follow a non-maximization-goal-based approach in-
clude quantilizers, decision transformers and active inference. Quantilizers are
agents that use random actions in the top n% of a “base distribution” over ac-
tions, sorted by expected return [13]. The goal for decision transformers is to
make the expected return equal a particular value Rtarget [3]. The goal for active
inference agents is to produce a particular probability distribution of observa-
⁵ Nevertheless, our algorithm can also be straightforwardly adapted to learning.

tions [14]. While the goal space for quantilizers and decision transformers, being
based on a single real-valued function, is often too restricted for many applica-
tions, that of active inference agents (all probability distributions) appears too
wide for the formal study of many aspects of aspiration-based decision-making.
Our approach is of intermediary complexity.
An important consideration in this work, to ensure tractability in large envi-
ronments, is also the computational complexity in the number of actions, states,
and evaluation metrics. We will see that for our algorithm, the preparation of an
episode has linear complexity in the number of possible state–action–successor
transitions and (conjectured and numerically confirmed) linear average complex-
ity in the number of evaluation metrics, and then the additional per-time-step
complexity of the policy is linear in the number of actions, constant in the num-
ber of states, and polynomial in the number of evaluation metrics.
Our work also affects the emerging AI safety/alignment field, which views un-
intended consequences from maximization, e.g., reward hacking and Goodhart’s
law, as a major source of risk once agentic AI systems become very capable [1].

2 Preliminaries

Environment. An environment E = (S, s0 , S⊤ , A, T ) is a finite Markov Decision


Process without a reward function, consisting of a finite state space S, an initial
state s0 ∈ S, a nonempty subset S⊤ ⊆ S of terminal states, a nonempty finite
action space A, and a function T : (S \ S⊤ ) × A → ∆(S \ {s0 }) specifying
transition probabilities: T (s, a)(s′ ) is the probability that taking action a from
state s leads to state s′ . We assume that the environment is acyclic, i.e., that it is
impossible to reach a given state again after leaving it. We fix some environment
E and write s′ ∼ s, a to denote that s′ is distributed according to T (s, a).

Policy. A (memory-based) policy is given by some nonempty finite set M of


memory states internal to the agent, an initial memory state m0 ∈ M and a
function π : M × (S \ S⊤ ) → ∆(A × M) that maps each possible combination
of memory state m ∈ M and (environment) state s ∈ S \ S⊤ to a probability
distribution over combinations of actions a ∈ A and successor memory states
m′ ∈ M. Let ΠM be the set of all policies with memory space M. The special
class of Markovian or memoryless policies is obtained when M is a singleton.
Policies which are both Markovian and deterministic are called pure Markov
policies, and amount to a function (S \ S⊤ ) → A. We denote by Π 0 the set of
Markovian policies and by Π p the set of all pure Markov policies.

Evaluation, Delta, Total. A (multi-criterion) evaluation function for the envi-


ronment is a function f : (S \ S⊤ ) × A × (S \ {s0 }) → Rd where d ≥ 1. The
quantity f (s, a, s′ ) is called the Delta received under transition (s, a) → s′ . It
represents by how much certain evaluation metrics change when the agent takes
action a in state s and the successor state is s′ . Let us fix f for the rest of the pa-
per. The (received) Total of a trajectory h = (m0 , s0 , a1 , m1 , s1 , . . . , aT , mT , sT )

is then the cumulative Delta received along the trajectory,

τ(h) = \sum_{t=1}^{T} f(s_{t-1}, a_t, s_t).   (1)

Value functions. Given a policy π and an evaluation function f , the state value
function V π : M × S → Rd is defined as the expected Total accumulated in
future steps while following policy π. In particular, V π (m0 , s0 ) is the expected
Total of the whole trajectory: V π (m0 , s0 ) = E(τ ). Likewise, we define the action
value function Qπ : S × A × M → Rd . These satisfy the Bellman equations

V^π(m, s) = \mathbb{E}_{a, m' \sim π(m, s)}\bigl(Q^π(s, a, m')\bigr),   (2)
Q^π(s, a, m) = \mathbb{E}_{s' \sim s, a}\bigl(f(s, a, s') + V^π(m, s')\bigr).   (3)

They can be calculated by backwards induction since the environment is finite


and acyclic, with the base case V π (m, s) = 0 for terminal states s ∈ S⊤ . For
memoryless policies, we will elide the argument m since M is a singleton.
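For readers who prefer code, a minimal Python sketch of this backwards induction for a memoryless policy is given below. The tabular interfaces T(s, a) (a dict of successor probabilities), f(s, a, s2) (the Delta vector), actions(s) and policy(s) are our own illustrative assumptions, not the interfaces of the accompanying code.

import numpy as np

def evaluate_policy(states_rev_topo, terminal_states, actions, T, f, policy, d):
    """Backwards induction for V^pi and Q^pi in a finite acyclic MDP with
    vector-valued Deltas f(s, a, s') in R^d and a memoryless policy."""
    V = {s: np.zeros(d) for s in terminal_states}          # base case: V^pi(s) = 0
    Q = {}
    for s in states_rev_topo:                              # successors processed first
        if s in terminal_states:
            continue
        for a in actions(s):
            # Q^pi(s, a) = E_{s'~s,a}[ f(s, a, s') + V^pi(s') ]   (cf. Eq. (3))
            Q[(s, a)] = sum(p * (f(s, a, s2) + V[s2]) for s2, p in T(s, a).items())
        # V^pi(s) = E_{a~pi(s)}[ Q^pi(s, a) ]                     (cf. Eq. (2))
        V[s] = sum(pa * Q[(s, a)] for a, pa in policy(s).items())
    return V, Q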
The Delta and Total are analogous to reward and return in an MDP with a
reward function, and the value functions V and Q are defined as usual, albeit
with vector instead of scalar arithmetic. However, while we do assume that the
evaluation metrics represent some aspects relevant to the task at hand, we do
not assume that they represent a form of utility which it is always desirable
to increase. Accordingly, the agent’s goal will be specified by the user not as a
maximization task, but rather as a set of linear constraints on the expected sums
of the evaluation metrics, which we call an aspiration.

3 Fulfilling aspirations
Aspirations, feasibility. An (initial) aspiration is a convex polytope E0 ⊂ Rd ,
representing the values of the expected total Eτ which are considered acceptable.
We say that a policy π fulfills the aspiration when it satisfies V π (m0 , s0 ) ∈ E0 .
To answer the question of whether it is possible to fulfill a given aspiration, we
introduce feasibility sets. The state-feasibility set of s ∈ S is the set of possible
values for the expected future Total from s, under any memory-based policy:
V(s) = {V π (m, s) | M finite set; m0 , m ∈ M; π ∈ ΠM }; likewise, we define the
action-feasibility set Q(s, a). It is straightforward to verify that V(s ∈ S⊤) = {0}, and that the following recursive equations hold for s ∉ S⊤:

Q(s, a) = \mathbb{E}_{s' \sim s, a}\bigl(f(s, a, s') + V(s')\bigr) = \sum_{s'} T(s, a)(s') \cdot \bigl(f(s, a, s') + V(s')\bigr),   (4)
V(s) = \bigcup_{p \in ∆(A)} \mathbb{E}_{a \sim p}\, Q(s, a) = \bigcup_{p \in ∆(A)} \sum_{a \in A} p(a)\, Q(s, a).   (5)

In this, we use set arithmetic: rX + r′ X ′ = {rx + r′ x′ | x ∈ X , x′ ∈ X ′ } for


r, r′ ∈ R and X , X ′ ⊂ Rd . It is clear that feasibility sets are convex polytopes.

Aspiration propagation. Algorithm scheme 1 shows a general manner of fulfilling


a feasible initial aspiration, starting from a given state or state-action pair. It
memorizes and updates aspirations E and Ea , initially equalling E0 . The agent

Algorithm 1 General scheme for fulfilling feasible aspiration sets


1: procedure FulfillStateAspiration(s ∈ S \ S⊤ , nonempty E ⊆ V(s))
2: Find suitable Ea ⊆ Q(s, a) for all a ∈ A, and p ∈ ∆(A), s.t. Ea∼p (Ea ) ⊆ E.
3: Draw action a from distribution p and do FulfillActionAspiration(s, a, Ea ).
4: procedure FulfillActionAspiration(s ∈ S, a ∈ A, nonempty Ea ⊆ Q(s, a))
5: Find suitable Es′ ⊆ V(s′ ) for all s′ ∈ S s.t. Es′ ∼s,a (f (s, a, s′ ) + Es′ ) ⊆ Ea .
6: Execute action a and observe successor state s′ .
7: If s′ is terminal, stop; else do FulfillStateAspiration(s′ , Es′ ).

alternates between being in a certain state s with a state-aspiration E ⊆ V(s),


and being in a state s and having chosen, but not yet performed, an action a,
with an action-aspiration Ea ⊆ Q(s, a). Although this algorithm is written as
two mutually recursive functions, it can be formally implemented by a memory-
based policy that memorizes the current aspiration E or Ea .
The way the aspiration set E is propagated between steps is the key part. The
two directions of aspiration propagation are slightly different: in state-aspiration
to action-aspiration propagation, shown on line 2, the agent may choose the
probability distribution (p) over actions, whereas in action-aspiration to state-
aspiration propagation, shown on line 5, the next state is determined by the
environment with fixed probabilities (T ).
The correctness of algorithm 1 follows from the requirements of lines 2 and
5; that these are possible to fulfill is a consequence of equations (5) and (4).
Feasibility of aspirations is maintained as an invariant.
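A minimal Python sketch of this scheme follows; it only fixes the control flow of Algorithm 1 and leaves the two propagation steps as pluggable functions. The names choose_action_aspirations and propagate_to_state are placeholders for whatever propagation rule (e.g. the one of Section 4) is used, and env is an assumed sampler interface with step and is_terminal.

import random

def fulfill_state_aspiration(s, E, env, choose_action_aspirations, propagate_to_state):
    """Lines 1-3 of Algorithm 1: pick action-aspirations E_a and a mixture p with
    E_{a~p}(E_a) contained in E, then draw an action from p."""
    p, E_a = choose_action_aspirations(s, E)        # p: {a: prob}, E_a: {a: aspiration set}
    a = random.choices(list(p.keys()), weights=list(p.values()))[0]
    return fulfill_action_aspiration(s, a, E_a[a], env,
                                     choose_action_aspirations, propagate_to_state)

def fulfill_action_aspiration(s, a, Ea, env, choose_action_aspirations, propagate_to_state):
    """Lines 4-7 of Algorithm 1: execute a, observe s', and recurse with a
    state-aspiration E_s' such that E_{s'~s,a}(f(s,a,s') + E_s') is contained in Ea."""
    s2 = env.step(s, a)                             # execute the action, observe successor
    if env.is_terminal(s2):
        return s2
    return fulfill_state_aspiration(s2, propagate_to_state(s, a, Ea, s2), env,
                                    choose_action_aspirations, propagate_to_state)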
To implement this scheme, we have to specify how to perform aspiration prop-
agation. The procedure used to select action-aspirations Ea and state-aspirations
Es′ should preferably allow some control over how the size of these sets changes
over time. On one hand, preventing aspiration sets from shrinking too fast preserves a wider range of acceptable behaviors in later steps⁶, but on the other
hand, keeping the aspiration sets somewhat smaller than the feasibility sets also
provides immediate freedom in the choice of the next action, as detailed in sec-
tion 4.2. An additional challenge is posed by the complex shape of the feasibility
sets, which we must handle in a tractable way.

Approximating feasibility sets by simplices. Let conv(X) denote the convex hull
of any set X ⊆ Rd . Given any tuple R of d + 1 memoryless reference policies
π1 , . . . , πd+1 ∈ Π 0 , we define reference simplices in evaluation-space as V R (s) =
conv{V π (s) | π ∈ R} and QR (s, a) = conv{Qπ (s, a) | π ∈ R}. It is immediate
that these are subsets of the convex feasibility sets V(s) resp. Q(s, a), and that
V^R(s) ⊆ \bigcup_{p \in ∆(A)} \mathbb{E}_{a \sim p}\bigl(Q^R(s, a)\bigr),   (6)
Q^R(s, a) ⊆ \mathbb{E}_{s' \sim s, a}\bigl(f(s, a, s') + V^R(s')\bigr).   (7)

⁶ However, algorithm 1 remains correct even if the aspiration sets shrink to singletons.

These imply that we can replace every occurrence of V and Q in algorithm 1


with V R resp. QR , obtaining a correct algorithm to guarantee fulfillment of any
initial aspiration E0 provided it intersects the reference simplex V R (s0 ).
It turns out that the latter can be guaranteed by a proper choice of reference
policies, and that we can always use pure Markov policies for this:
Lemma 1. For any state s, we have V(s) = conv{V π (s) | π ∈ Π p }.
Proof. Any memory-based policy admits a Markovian policy with the same occupancy measure and hence the same expected Total [7]. Hence V(s) = {V^π(s) | π ∈ Π⁰}. A convex polytope is the convex hull of its vertices, so V(s) = conv{V^π(s) | ∃y ∈ R^d, π = \arg\max_{π' ∈ Π⁰}(y · V^{π'}(s))}. Finally, such maximizing policies may be taken to be deterministic, which concludes the argument. ⊓⊔

As a consequence, for any aspiration set E intersecting the feasibility set V(s0 ),
there exists a tuple R of pure Markov reference policies such that V R (s0 )∩E ̸= ∅.
Section 5 describes a heuristic algorithm for finding such reference policies.
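As a brute-force illustration of Lemma 1 (viable only for very small environments, and using the same illustrative tabular interfaces as the sketch in Section 2), one can enumerate all pure Markov policies and collect their values; the convex hull of the returned points is exactly V(s0).

import itertools
import numpy as np

def feasibility_set_vertices(nonterminal_states_rev_topo, terminal_states, actions, T, f, d, s0):
    """Brute force for Lemma 1: return {V^pi(s0) : pi pure Markov}.
    The convex hull of these points is the feasibility set V(s0)."""
    points = []
    all_action_lists = [list(actions(s)) for s in nonterminal_states_rev_topo]
    for choice in itertools.product(*all_action_lists):
        pi = dict(zip(nonterminal_states_rev_topo, choice))   # one pure Markov policy
        V = {s: np.zeros(d) for s in terminal_states}
        for s in nonterminal_states_rev_topo:                 # successors processed first
            a = pi[s]
            V[s] = sum(p * (f(s, a, s2) + V[s2]) for s2, p in T(s, a).items())
        points.append(V[s0])
    return np.array(points)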
We now turn to explaining a way to enact the aspiration-propagation steps
needed in lines 2 and 5 of algorithm 1, based on shifting and shrinking.

4 Propagating aspirations
4.1 Propagating action-aspirations to state-aspirations
To implement algorithm schema 1, we first focus on line 5 in procedure Ful-
fillActionAspiration, which is the easier part. Given a state-action pair
s, a and an action-aspiration set Ea ⊆ QR (s, a), we must construct nonempty
state-aspiration sets Es′ ⊆ V R (s′ ) for all possible successor states s′ , such that
Es′∼s,a (f(s, a, s′) + Es′) ⊆ Ea. We assume that all reference simplices are nondegenerate, i.e. have full dimension d. This is almost surely the case if there are
sufficiently many actions and these have enough possible consequences.

Tracing maps. Under this assumption, we define tracing maps ρs,a,s′ from the
reference simplex QR (s, a) to V R (s′ ). Since domain and codomain are simplices,
we can choose ρs,a,s′ to be the unique affine linear map that maps vertices to
vertices, ρs,a,s′ (Qπi (s, a)) = V πi (s′ ). For any point e ∈ QR (s, a), it follows from
equation (2) that Es′ ∼s,a (f (s, a, s′ ) + ρs,a,s′ (e)) = e. Accordingly, to propagate
aspirations of the form Ea = {e}, it is sufficient to just set Es′ = {ρs,a,s′ (e)}. How-
ever, for general subsets Ea ⊆ QR (s, a), the set Es′ ∼s,a (f (s, a, s′ ) + ρs,a,s′ (Ea )) is
in general strictly larger than Ea , and hence we map Ea in a different way.
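Before turning to that different way of propagating whole sets, note that since ρ_{s,a,s'} maps the vertices Q^{π_i}(s, a) to the vertices V^{π_i}(s'), applying it to a point amounts to computing the point's barycentric coordinates with respect to the first simplex and recombining the vertices of the second simplex with the same coordinates. A small self-contained numpy sketch (our own helper functions, assuming nondegenerate simplices):

import numpy as np

def barycentric(point, vertices):
    """Barycentric coordinates of `point` w.r.t. a nondegenerate simplex
    given by its d+1 `vertices` (array of shape (d+1, d))."""
    d = vertices.shape[1]
    A = np.vstack([vertices.T, np.ones(d + 1)])      # (d+1) x (d+1) system
    b = np.append(point, 1.0)
    return np.linalg.solve(A, b)                     # coefficients summing to 1

def tracing_map(point, Q_vertices, V_vertices):
    """rho_{s,a,s'}: the affine map sending Q^{pi_i}(s,a) to V^{pi_i}(s')."""
    lam = barycentric(point, Q_vertices)
    return lam @ V_vertices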
For this, choose an arbitrary “anchor” point e ∈ Ea ; here we let e be the
average of the vertices, C(Ea ), but any other standard type of center would also
work (e.g. analytic center, center of mass/centroid, Chebyshev center). Now, let
X_{s'} = E_a − e + ρ_{s,a,s'}(e) be shifted copies of the action-aspiration. We would like to use the X_{s'} as state-aspirations, and indeed they have the property that

\mathbb{E}_{s' \sim s,a}(f(s, a, s') + X_{s'}) = \mathbb{E}_{s' \sim s,a}(f(s, a, s') + ρ_{s,a,s'}(e) − e + E_a)   (8)
                                       = e − e + \mathbb{E}_{s' \sim s,a}(E_a) = E_a,  as E_a is convex.   (9)

This is almost what we want, but it might be that Xs′ is not a subset of V R (s′ ). To
rectify this, we opt for a shrinking approach, setting Es′ = rs′ ·(Ea −e)+ρs,a,s′ (e)
for the largest rs′ ∈ [0, 1] such that the result is a subset of V R (s′ ). As Es′ does
not depend on any other successor state s′′ ̸= s′ , we can wait until the true s′ is
known and only compute Es′ for that s′ , which saves computation time.
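Because barycentric coordinates are affine, the largest admissible shrinking factor r_{s'} can even be computed without a linear program: for every vertex v of E_a, the coordinates of ρ_{s,a,s'}(e) + r(v − e) with respect to V^R(s') are affine in r, and we only need the largest r ∈ [0, 1] that keeps all of them nonnegative. A self-contained sketch under the same assumptions as above (nondegenerate simplices; helper names are ours, and rho_e denotes ρ_{s,a,s'}(e)):

import numpy as np

def barycentric(point, vertices):
    d = vertices.shape[1]
    A = np.vstack([vertices.T, np.ones(d + 1)])
    return np.linalg.solve(A, np.append(point, 1.0))

def shrink_factor(Ea_vertices, e, rho_e, V_ref_vertices, eps=1e-12):
    """Largest r in [0, 1] such that r*(E_a - e) + rho(e) fits inside V^R(s').
    Containment is checked per vertex via barycentric coordinates, which are
    affine in r, so each coordinate gives a simple ratio bound on r."""
    base = barycentric(rho_e, V_ref_vertices)        # >= 0 since rho(e) lies in V^R(s')
    r = 1.0
    for v in Ea_vertices:
        slope = barycentric(rho_e + (v - e), V_ref_vertices) - base
        for b_i, s_i in zip(base, slope):
            if s_i < -eps:                           # this coordinate decreases with r
                r = min(r, b_i / (-s_i))
    return max(r, 0.0)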

Proposition 1. Given all values V^{π_i}(s) and Q^{π_i}(s, a) and both the k_c ≥ d + 1 constraints and k_v ≥ d + 1 vertices defining E_a, the shrinking version of action-to-state aspiration propagation has time complexity O([k_c^{1.5} d + (d k_v)^{1.5}] L), where L is the precision parameter defined in [15]. If E_a is a simplex, this is O(d^3 L).

Proof. A linear program (LP) with m constraints and n variables has time complexity O(f(m, n) L), where f(m, n) = (m + n)^{1.5} n and L is a precision parameter [15]. C(E_a) can be calculated with an LP with m = k_c + 1 and n = d + 1. Finding the convex coefficients for ρ_{s,a,s'} requires solving a system of d + 1 linear equations, needing time O(d^ω), where 2 ≤ ω < 2.5 is the exponent of matrix multiplication. Finding the shrinking factor r is an LP with m = (d + 1) k_v and n = 1, and then computing the constraints and vertices of E_{s'} is O(d(k_c + k_v)). In all, this gives a time complexity of O(L f(k_c, d) + d^ω + L f(d k_v, 1) + d(k_c + k_v)) ≤ O(L(k_c + d)^{1.5} d + d^ω + L(d k_v)^{1.5}) ≤ O(L(k_c^{1.5} d + (d k_v)^{1.5})). ⊓⊔

4.2 Choosing actions and action-aspirations


This is the core of our construction. In state s with state-aspiration E, the policy
probabilistically selects an action a and action-aspiration Ea as follows:
(i) For each of the d + 1 vertices V πi (s) of the state’s reference simplex
V R (s), find a directional action set Ai ⊆ A containing those actions whose
reference simplices QR (s, a) “lie between” E and the vertex V πi (s).
(ii) From the full action set A0 = A and from each directional action set Ai
(i = 1 . . . d+1) independently, use some arbitrary, potentially probabilistic
procedure to select one element, giving candidate actions a0 , . . . , ad+1 .
(iii) For each candidate ai , compute an action-aspiration Eai by shifting and
shrinking the state-aspiration E into the reference simplex QR (s, ai ).
(iv) Compute a probability distribution p ∈ ∆({0, . . . , d + 1}) that makes
Ei∼p Eai ⊆ E and has as large a p0 as possible.
(v) Finally execute candidate action ai with probability pi (i = 0 . . . d + 1)
and memorize its action-aspiration Eai .
We now describe steps (i)–(iv) in detail. Algorithm 2 has the corresponding
pseudocode, and Figure 1 illustrates the involved geometric operations.

(i) Directional action sets. Compute the average of the vertices of E, x = C(E),
and the “shape” E ′ = E − x. For i = 0, put Ai = A. For i > 0, let Xi be the
segment from x to the vertex V πi (s) and check for each action a ∈ A whether
its reference simplex QR (s, a) intersects Xi , using linear programming. Let Ai
be the set of those a, which is nonempty since πi (s) ∈ Ai by definition of V πi (s).
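The check whether Q^R(s, a) intersects the segment X_i can be phrased as a small feasibility LP: the segment from x to V^{π_i}(s) meets the simplex iff some convex combination of the simplex's vertices equals some point x + t(V^{π_i}(s) − x) with t ∈ [0, 1]. A scipy sketch (our own helper; one of several possible formulations):

import numpy as np
from scipy.optimize import linprog

def segment_hits_simplex(x, v, Q_vertices):
    """Feasibility LP: does the segment from x to v intersect conv(Q_vertices)?
    Variables are t in [0, 1] and convex weights lambda over the d+1 vertices."""
    Q = np.asarray(Q_vertices)
    d, k = len(x), Q.shape[0]
    A_eq = np.zeros((d + 1, 1 + k))
    A_eq[:d, 0] = v - x                 # x + t*(v - x) = sum_k lambda_k * q_k
    A_eq[:d, 1:] = -Q.T
    A_eq[d, 1:] = 1.0                   # sum_k lambda_k = 1
    b_eq = np.concatenate([-x, [1.0]])
    bounds = [(0.0, 1.0)] + [(0.0, None)] * k
    res = linprog(np.zeros(1 + k), A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.success

# A_i (for i > 0) then collects all actions whose reference simplex meets the segment:
#   A_i = [a for a in A if segment_hits_simplex(x, V_vertex_i, QR_vertices[a])]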

Algorithm 2 Action selection (“shrinking” variant)

Require: reference simplex vertices V^{π_i}(s), Q^{π_i}(s, a); state s; state-aspiration E; shrinking schedule r_max(s).
1: x ← C(E); E′ ← E − x   ▷ center, shape
2: for i = 0, . . . , d + 1 do
3:   A_i ← DirectionalActionSet(i)
4:   a_i ← SampleCandidateAction(A_i)   ▷ any (probabilistic) choice from A_i
5:   E_{a_i} ← ShrinkAspiration(a_i, i)   ▷ action-aspiration
6: solve linear program p_0 = max!, \sum_{i=0}^{d+1} p_i E_{a_i} ⊆ E, for p ∈ ∆({0, . . . , d + 1})
7: sample i ∼ p and execute a_i
8: memorize m ← (s, a_i, E_{a_i})

9: procedure DirectionalActionSet(i)
10:   if i = 0, return A
11:   X_i ← conv{x, V^{π_i}(s)}   ▷ segment from x to the ith vertex
12:   return {a ∈ A | Q^R(s, a) ∩ X_i ≠ ∅}   ▷ actions with feasible points on X_i
13: procedure ShrinkAspiration(a_i, i)
14:   v ← V^{π_i}(s) if i > 0, else C(Q^R(s, a_i))   ▷ vertex or center of target
15:   y ← v − x   ▷ shifting direction
16:   r ← max{r ∈ [0, r_max(s)] | ∃ℓ ≥ 0, ℓy + x + rE′ ⊆ Q^R(s, a_i)}   ▷ size of the largest shifted and shrunk copy of E that fits into Q^R(s, a_i)
17:   ℓ ← min{ℓ ≥ 0 | ℓy + x + rE′ ⊆ Q^R(s, a_i)}   ▷ shortest shifting distance that makes it fit
18:   return ℓy + x + rE′   ▷ shift and shrink

(ii) Candidate actions. For each i = 0 . . . d + 1, use an arbitrary, possibly prob-


abilistic procedure to select a candidate action ai ∈ Ai . Since A0 = A, any
possible action might be among the candidate actions. This freedom can be
used to improve the policy in terms of the evaluation metrics, e.g., by choosing
actions that are expected to lead to a rather low variance of the received Total.
It can also be used to incorporate additional action-selection criteria that are un-
related to these evaluation metrics, to increase overall safety, e.g., by preferring
actions that avoid unnecessary side-effects. We’ll discuss this in Section 6.

(iii) Action-aspirations. Given direction i and action a_i, we now aim to select a large subset E_{a_i} ⊆ Q^R(s, a_i) that fits into a shifted version z_i + E′ of the state-aspiration's shape. We determine a direction y towards which to shift E: if i = 0, towards the average of the vertices, C(Q^R(s, a_i)), otherwise towards the reference vertex V^{π_i}(s). This is lines 14 and 15 of Alg. 2. As before, we use shrinking: find the largest shrinking factor r ∈ [0, r_max(s)] for which there is a shifting distance ℓ ≥ 0 so that ℓy + x + rE′ ⊆ Q^R(s, a_i), using a linear program with two variables r, ℓ, and then find the smallest such ℓ for that r using another linear program. The “shrinking schedule” r_max(s) ∈ [0, 1] might be used to enforce some amount of shrinking to increase the freedom for choosing the action mixture, which is the next step.
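The two small linear programs for r and ℓ can be set up via barycentric coordinates with respect to Q^R(s, a_i), which are affine in both variables. The following scipy sketch (helper names and interfaces are ours; it assumes the ray from x in direction y meets Q^R(s, a_i), which holds by construction of y) first maximizes r and then minimizes ℓ:

import numpy as np
from scipy.optimize import linprog

def barycentric_map(simplex_vertices):
    """Return a function computing barycentric coordinates w.r.t. the simplex."""
    d = simplex_vertices.shape[1]
    A = np.vstack([simplex_vertices.T, np.ones(d + 1)])
    return lambda p: np.linalg.solve(A, np.append(p, 1.0))

def shift_and_shrink(E_vertices, x, y, Q_vertices, r_max):
    """Largest r in [0, r_max] and then smallest shift length l such that
    l*y + x + r*(E - x) is contained in the simplex conv(Q_vertices)."""
    bary = barycentric_map(np.asarray(Q_vertices))
    b0 = bary(x)
    Ly = bary(x + y) - b0                       # linear response to the shift direction
    rows, rhs = [], []
    for v in np.asarray(E_vertices):
        Lw = bary(v) - b0                       # linear response to this vertex of E'
        # need b0 + l*Ly + r*Lw >= 0 componentwise, i.e. -r*Lw - l*Ly <= b0
        rows.append(np.column_stack([-Lw, -Ly]))
        rhs.append(b0)
    A_ub, b_ub = np.vstack(rows), np.concatenate(rhs)
    res_r = linprog([-1.0, 0.0], A_ub=A_ub, b_ub=b_ub,
                    bounds=[(0.0, r_max), (0.0, None)], method="highs")   # maximize r
    r = res_r.x[0]
    res_l = linprog([0.0, 1.0], A_ub=A_ub, b_ub=b_ub,
                    bounds=[(r, r), (0.0, None)], method="highs")         # then minimize l
    l = res_l.x[1]
    return r, l, [l * y + x + r * (v - x) for v in np.asarray(E_vertices)]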

(iv) Suitable mixture of candidate actions. We next find probabilities p0 , . . . , pd+1


for the candidate actions so that the corresponding mixture of aspirations Eai

[Figure 1 omitted: a 2-d illustration showing the state reference simplex V^R(s) with vertices V_i(s) = Q_i(s, π_i(s)), the action reference simplices Q^R(s, a_i), the state-aspiration E with center x, the segments X_i, and the resulting action-aspirations E_{a_i}.]

Fig. 1. Construction of action-aspirations E_a from state-aspiration E and reference simplices V^R(s) and Q^R(s, a) by shifting and shrinking. See main text for details.

is a subset of E, \sum_{i=0}^{d+1} p_i E_{a_i} ⊆ E. We show below that this equation has a solution. Because we want the action a_0 that was chosen freely from the whole action set A to be as likely as possible, we maximize its probability p_0. This is done in line 6 of Algorithm 2 using linear programming. Note that the smaller the sets E_a, the looser the set inclusion constraint and thus the larger p_0. We can influence the latter via the shrinking schedule r_max(s), for example by putting r_max(s) = (1 − 1/T(s))^{1/d} where T(s) is the remaining number of time steps at state s, which would reduce the amount of freedom (in terms of the volume of the aspiration set) linearly and end in a point aspiration for terminal states.
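One convenient way to encode the set-inclusion constraint of line 6 is via support functions: a weighted Minkowski sum \sum_i p_i E_{a_i} is contained in E = {z : Cz ≤ h} iff for every facet normal c_k of E, \sum_i p_i \max_{v ∈ E_{a_i}} c_k · v ≤ h_k, which is linear in p. A scipy sketch of the resulting LP (our own formulation and helper names; it assumes E is given by its constraints and each E_{a_i} by its vertices):

import numpy as np
from scipy.optimize import linprog

def mixture_probabilities(E_normals, E_offsets, aspiration_vertex_lists):
    """Maximize p_0 subject to the weighted Minkowski sum  sum_i p_i * E_{a_i}
    being contained in E = {z : C z <= h} and p lying in the probability simplex.
    Containment is encoded via support functions: for each facet (c_k, h_k) of E,
    sum_i p_i * max_{v in E_{a_i}} c_k . v  <=  h_k, which is linear in p."""
    C, h = np.asarray(E_normals), np.asarray(E_offsets)
    n = len(aspiration_vertex_lists)                     # number of candidates (d + 2)
    S = np.array([[np.max(np.asarray(V) @ c) for V in aspiration_vertex_lists] for c in C])
    objective = np.zeros(n)
    objective[0] = -1.0                                  # maximize p_0
    res = linprog(objective, A_ub=S, b_ub=h,
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return res.x

Lemma 2 below guarantees that this LP is feasible.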
Lemma 2. The linear program in line 6 of Algorithm 2 has a solution.
Proof. Because x ∈ V^R(s), there are convex coefficients p' with \sum_{i=1}^{d+1} p'_i (V^{π_i}(s) − x) = 0. As each shifting vector z_{a_i} is a positive multiple of V^{π_i}(s) − x, there are also convex coefficients p with \sum_{i=1}^{d+1} p_i z_{a_i} = 0. Since E_{a_i} ⊆ E + z_{a_i} and E is convex, we then have

\sum_i p_i E_{a_i} ⊆ \sum_i p_i (E + z_{a_i}) ⊆ E + \sum_i p_i z_{a_i} = E. ⊓⊔   (10)
Proposition 2. Given all values V^{π_i}(s) and Q^{π_i}(s, a) and both the k_c ≥ d + 1 many constraints and k_v ≥ d + 1 many vertices defining E, this part of the construction (Algorithm 2) has time complexity O([k_v^{1.5} d^{2.5} |A| + (k_v k_c)^{1.5} d] L). If E is a simplex, this is O(d^4 |A| L).

Proof. Using the notation from Prop. 1, computing x is O(f(k_v, d) L). For each i, a, verifying whether a ∈ A_i and computing r, ℓ is done by LPs with m ≤ O(d k_v) constraints and n ≤ 2 variables, giving O(d |A| f(d k_v, 2) L) = O(d^{2.5} k_v^{1.5} |A| L). The LP for calculating p has m = d + 4 + k_v k_c and n = d + 2, hence complexity O((k_v k_c)^{1.5} d L). The other arithmetic operations are of lower complexity. ⊓⊔


5 Determining appropriate reference policies

First, find some feasible aspiration point x ∈ E0 ∩ V(s0 ) (e.g., using binary
search). We now aim to find policies whose values are likely to contain x in their
convex hull, by using backwards induction and greedily minimizing the angle
between the vector Eτ − x and a suitable direction in evaluation space.
More precisely: Pick an arbitrary direction (unit vector) y1 ∈ Rd , e.g., uni-
formly at random. Then, for k = 1, 2, . . . until stopping:

1. Let π_k be the pure Markov policy π defined by backwards induction as

   π(s) = \arg\max_{a \in A} \frac{y_k^{\top}(Q^π(s, a) − x)}{\lVert Q^π(s, a) − x \rVert_2}.   (11)

Let vk = V πk (s0 ) be the resulting candidate vertex for our reference simplex.
If k ≥ d + 1, run a primal-sparse linear program solver [18] to determine if
x ∈ conv{v1 , . . . , vk }. If so, the solver will return a basic feasible solution,
i.e. d + 1 many vertices that contain x in their convex hull. Let the policies
corresponding to the vertices that are part of the basic feasible solution be
our reference policies and stop. Otherwise, continue to step 2 below:
2. Let e_k = (x − v_k)/‖x − v_k‖_2 be the unit vector in the direction from v_k to x.
3. Let y_{k+1} = \sum_{i=1}^{k} e_i / k be the average of all those directions. Assuming that x ∈ V(s_0) and because of the hyperplane separation theorem, choosing directions like this ensures that the algorithm doesn't loop with v_{l+1} = v_l for some l. Also, note that to check x ∈ conv{v_1, . . . , v_k, v_{k+1}} it is sufficient to check v_{k+1} − x ∈ cone{x − v_1, . . . , x − v_k}, the cone generated by the vectors pointing from the vertices towards x. So hunting for a policy whose value lies approximately in the direction y_{k+1} from x gives us a good chance of finding a vertex in the aforementioned cone; a code sketch of this search loop is given below.
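A condensed Python sketch of the search loop follows (our own simplification, using the same illustrative tabular model interfaces as before). For brevity it checks membership of x in the convex hull with a dense LP and returns all candidate vertices once x lies in their hull; the procedure above instead extracts a basic feasible solution with exactly d + 1 vertices using a primal-sparse solver [18].

import numpy as np
from scipy.optimize import linprog

def greedy_reference_policy_value(y, x, states_rev_topo, terminal_states, actions, T, f, d):
    """Backwards induction for Eq. (11): in each state pick the action whose
    Q-vector, seen from the aspiration point x, points most in direction y."""
    V = {s: np.zeros(d) for s in terminal_states}
    for s in states_rev_topo:
        if s in terminal_states:
            continue
        best_val, best_q = -np.inf, None
        for a in actions(s):
            q = sum(p * (f(s, a, s2) + V[s2]) for s2, p in T(s, a).items())
            val = y @ (q - x) / (np.linalg.norm(q - x) + 1e-12)
            if val > best_val:
                best_val, best_q = val, q
        V[s] = best_q
    return V

def in_convex_hull(x, points):
    """LP feasibility check: is x a convex combination of the given points?"""
    P = np.asarray(points)
    A_eq = np.vstack([P.T, np.ones(len(P))])
    b_eq = np.append(x, 1.0)
    res = linprog(np.zeros(len(P)), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * len(P), method="highs")
    return res.success

def find_reference_vertices(x, s0, states_rev_topo, terminal_states, actions, T, f, d,
                            max_iter=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    y = rng.normal(size=d)
    y /= np.linalg.norm(y)                               # arbitrary initial direction y_1
    vertices, directions = [], []
    for _ in range(max_iter):
        V = greedy_reference_policy_value(y, x, states_rev_topo, terminal_states,
                                          actions, T, f, d)
        v = V[s0]                                        # candidate vertex v_k
        vertices.append(v)
        if len(vertices) >= d + 1 and in_convex_hull(x, vertices):
            return vertices                              # x in conv{v_1, ..., v_k}: done
        directions.append((x - v) / np.linalg.norm(x - v))
        y = np.mean(directions, axis=0)                  # y_{k+1} = average direction
    return None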

We were not able to prove any complexity bounds for this part, but performed
numerical experiments with random binary tree-shaped environments of various
time horizons, with only two actions (making the directions vk −x deviate rather
much from y and thus presenting a kind of worst-case scenario) and two possible
successor states per step, and uniformly random transition probabilities and
Deltas in [0, 1]^d. These suggest that the expected number of iterations of the
above is O(d), which we thus conjecture to be also the case for other sufficiently
well-behaved random environments. Indeed, even if the policies π_k were chosen uniformly at random (rather than targeted like here) and the corresponding points V^{π_k}(s_0) were distributed uniformly in all directions around x (which is a plausible uninformative prior assumption), then one can show easily (using [17]) that the expected number of iterations would be exactly 2d + 1.⁷
⁷ According to [17], the probability that we need exactly k ≥ d + 1 iterations is f(k−1) − f(k) with f(k) = 2^{1−k} \sum_{ℓ=0}^{d−1} \binom{k−1}{ℓ}. Hence \mathbb{E}k = \sum_{k=d+1}^{∞} k (f(k−1) − f(k)) = (d+1) f(d) + \sum_{k=d+1}^{∞} f(k). It is then reasonably simple to prove by induction that \sum_{k=d+1}^{∞} f(k) = d, and hence \mathbb{E}k = 2d + 1.

6 Selection of candidate actions


As we have seen in section 4.2 (ii), when choosing actions, we still have many
remaining degrees of freedom. Thus, we can use additional criteria to choose
actions while still fulfilling the aspirations. We discuss a few candidate criteria
here which are related either to gaining information, improving performance, or
reducing potential safety-related impacts of implementing the policy.
For many of the criteria, there are myopic versions, which only rely on quan-
tities that are already available at each step in the algorithms presented so far,
or farsighted versions which depend on the continuation policy and thus have to
be specifically computed recursively via Bellman-style equations.

Information-related criteria. If the used world model is imperfect, one might


want the agent to aim to gain knowledge by exploration, e.g. by considering
some measure of expected information gain such as the evidence lower bound.

6.1 Performance-related criteria


For now, the task of the agent in this paper has been given by specifying aspira-
tion sets for the expected total of the evaluation function. It is natural to consider
extensions of this approach to further properties of the trajectory distribution,
e.g. by specifying that the variance of the total should be small.
A simple, myopic approach to reducing variance is preferring actions and
action-aspirations that are somehow close to the state aspiration E, e.g. by choos-
ing action-aspirations where the Hausdorff distance dH (Ea , E) is small. A more
principled, farsighted approach would be choosing actions and action-aspirations
such that the variance of the resulting total is small. Based on equation (2), the
variance can be computed from the total raw second moment M^π as

M^π(s, a, E_a) = \mathbb{E}_{s', E_{s'} \sim s, a, E_a}\Bigl( \lVert f(s, a, s') \rVert_2^2 + 2 f(s, a, s') \cdot V^π(s', E_{s'}) + \mathbb{E}_{a', E_{a'} \sim π(s', E_{s'})} M^π(s', a', E_{a'}) \Bigr),   (12)
Var(s, a, E_a) = M^π(s, a, E_a) − \lVert Q^π(s, a, E_a) \rVert_2^2.   (13)
Note that computing this farsighted metric requires knowing the continuation
policy π, for which algorithm 2 does not suffice in its current form as it only
samples actions. It is however easy to convert it to an algorithm for computing
the whole local policy π(s, E), which is described in the Supplement.

6.2 Safety-related criteria


As mentioned in the introduction, unintended consequences of optimization can be a source of safety problems, thus we suggest not to use any of the criteria introduced in this section as maximization/minimization goals that completely determine the chosen actions; instead, they can be combined into a loss for a softmin action selection policy π_i(a ∈ A_i) ∝ \exp(−β \sum_j α_j g_j(a)), where the g_j(a) are the individual criteria. Indeed, in analogy to quantilizers, choosing among adequate actions at random can by itself be considered a useful safety measure, as a random action is very unlikely to be special in a harmful way.
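A minimal sketch of such a softmin selection (our own helper; the criteria g_j are passed as callables, and the weights α_j and inverse temperature β are free parameters):

import numpy as np

def softmin_select(candidate_actions, criteria, alphas, beta, rng=None):
    """Pick one of the adequate actions with probability proportional to
    exp(-beta * sum_j alpha_j * g_j(a)), combining several loss-like criteria."""
    rng = np.random.default_rng() if rng is None else rng
    losses = np.array([sum(alpha * g(a) for alpha, g in zip(alphas, criteria))
                       for a in candidate_actions])
    logits = -beta * (losses - losses.min())     # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return candidate_actions[rng.choice(len(candidate_actions), p=p)]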

Disordering potential. Our first safety criterion is related to the idea of “fail
safety”, and (somewhat more loosely) to “power seeking”. More precisely, it aims
to prevent the agent from moving the system into a state from which it could
make the environment’s state trajectory become very unpredictable (bring the
environment into “disorder”) because of an internal failure, or if it wanted to. We
define the disordering potential at a state to be the Shannon entropy H π (st ) of
the stochastic state trajectory S>t = (St+1 , St+2 , . . . ) that would arise from the
policy π which maximizes that entropy:

H^π(s_t) := \mathbb{E}_{s_{>t} | s_t, π}(− \log \Pr(s_{>t} | s_t, π)).   (14)

It is straightforward to compute this quantity using the Bellman-type equations

H^π(s) = 1_{s ∉ S_⊤} \mathbb{E}_{a ∼ π(s)}(− \log π(s)(a) + H^π(s, a)),   (15)
H^π(s, a) = \mathbb{E}_{s' ∼ s, a}(− \log T(s, a)(s') + H^π(s')).   (16)

To find the maximally disordering policy π, we assume π(s') and thus H^π(s') is already known for all potential successors s' of s. Then H^π(s, a) is also known for all a, and to find p_a = π(s)(a) we need to maximize f(p) = \sum_a p_a (H^π(s, a) − \log p_a) subject to \sum_a p_a = 1. Using Lagrange multipliers, we find that for all a, ∂_{p_a} f(p) = H^π(s, a) − \log p_a − 1 = λ for some constant λ, hence p_a ∝ \exp(H^π(s, a)) is a softmax policy w.r.t. future expected Shannon entropy. Therefore

π(s)(a) = p_a = \exp(H^π(s, a)) / Z,   Z = \sum_a \exp(H^π(s, a)),   (17)
H^π(s) = \log Z = \log \sum_a \exp(H^π(s, a)).   (18)
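Because of the log-sum-exp form of Eqs. (17)-(18), the disordering potential of every state can be computed in one backwards sweep. A sketch in Python under the same illustrative tabular interfaces as before (assuming T(s, a) only lists successors with positive probability):

import numpy as np
from scipy.special import logsumexp

def disordering_potential(states_rev_topo, terminal_states, actions, T):
    """One backwards sweep for Eqs. (15)-(18): H^pi(s) = log sum_a exp(H^pi(s, a))
    under the maximally disordering (entropy-softmax) policy."""
    H = {s: 0.0 for s in terminal_states}
    for s in states_rev_topo:
        if s in terminal_states:
            continue
        H_sa = []
        for a in actions(s):
            # H^pi(s, a) = E_{s'~s,a}[ -log T(s,a)(s') + H^pi(s') ]        (Eq. (16))
            H_sa.append(sum(p * (-np.log(p) + H[s2]) for s2, p in T(s, a).items()))
        H[s] = float(logsumexp(H_sa))                                      # Eq. (18)
    return H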

Deviation from default policy. If we have access to a default policy π 0 (e.g. a


policy that was learned by observing humans or other agents performing similar
tasks), we might want to choose actions in a way that is similar to this default
policy. An easy way to measure this is by using the Kullback–Leibler divergence
from the default policy π 0 to the agent’s policy π. Given that we do not know
the local policy π(s) yet when we decide how to choose the action in the state
s, we use an estimate p̂a (e.g. p̂a = 1/(2 + d)) instead to compute the expected
total Kullback–Leibler divergence like

KLdiv(s, p̂_a, a, E_a) = \log(p̂_a / π^0(s)(a)) + \mathbb{E}_{s' ∼ s, a} KLdiv(s', g(s, a, E_a, s')),   (19)
KLdiv(s, E) = 1_{s ∉ S_⊤} \mathbb{E}_{(a, E_a) ∼ π(s, E)} KLdiv(s, π(s, E)(a), a, E_a),   (20)

where g : (s, a, E_a, s') ↦ E_{s'} implements action-to-state aspiration propagation.

7 Discussion and conclusion


7.1 Special cases
A single evaluation metric. It is natural to ask what our algorithm reduces
to in the single-criterion case d = 1. The reference simplices can then sim-
ply be taken to be the intervals V R (s) = [V πmin (s), V πmax (s)] and QR (s, a) =

[Qπmin (s, a), Qπmax (s, a)], where πmin , πmax are the minimizing and maximizing
policies for the single evaluation metric. Aspiration sets are also intervals, and
action-aspirations Ea are constructed by shifting the state-aspiration E upwards
or downwards into QR (s, a) and shrinking it to that interval if necessary. To
maximize p_0, the linear program for p will assign zero probability to that “directional” action a_1 or a_2 whose E_a lies in the same direction from E as E_{a_0} does. In other words, the agent will mix between the “freely” chosen action a_0 and a suitable amount of a “counteracting” action a_1 or a_2 in the other direction.
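For d = 1, shifting and shrinking reduce to simple interval arithmetic. A minimal sketch of this step (our own minimal-shift variant, ignoring the shrinking schedule r_max):

def shift_and_shrink_interval(E, Q):
    """d = 1 special case: shift the state-aspiration interval E = (lo, hi)
    into the reference interval Q = (qlo, qhi), shrinking it if it is too wide."""
    lo, hi = E
    qlo, qhi = Q
    width = min(hi - lo, qhi - qlo)          # shrink only if necessary
    if lo < qlo:                             # shift upwards
        return (qlo, qlo + width)
    if hi > qhi:                             # shift downwards
        return (qhi - width, qhi)
    return (lo, hi)                          # already fits

For instance, shift_and_shrink_interval((2.0, 5.0), (0.0, 4.0)) returns (1.0, 4.0): the aspiration is shifted downwards into the reference interval without needing to shrink.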

Relationship to satisficing. A subcase of the d = 1 case is when the upper


bound of the initial state-aspiration interval coincides with the maximal possible
value, E0 = [e, V πmax (s0 )], i.e., when the goal is to achieve an expected Total of
at least e. The agent then starts out as a form of “satisficer” [10]. However,
due to the shrinking of aspirations over time, aspiration sets of later states s′
might no longer be of the same form but might end at values strictly lower
than V πmax (s′ ) if the interval [V πmin (s′ ), V πmax (s′ )] is wider than the interval
[Qπmin (s, a), Qπmax (s, a)]. In other words, even an initial satisficer can turn into
a “proper” aspiration-based agent in our algorithm that avoids maximization in
more situations than a satisficer would. In particular, also the form of satisficing
known as “quantilization” [13], where all feasible expected Totals above some
threshold get positive probability, is not a special case of our algorithm. One
can however change the algorithm to quantilization behaviour by constructing
successor state aspirations differently, by simply applying the tracing map to the
interval, Es′ = ρs,a,s′ [Ea ] (which is not feasible for d > 1).

Probabilities of desired or undesired events. Another special case is when d > 1


but the d evaluation metrics are simply indicator functions for certain events.
E.g., assume all Deltas are zero except when reaching a terminal state s′ ∈ S⊤ ,
in which case fi (s, a, s′ ) = 1(s′ ∈ Ei ) for some subset of desirable or undesirable
states Ei ⊆ S⊤ . If the first k ≤ d many events are desirable in the sense that we
want each probability Pr(Ei ) to be ≥ α for some α < 1, and the other d−k many
events are undesirable in the sense that we want each probability Pr(Ej ) to be
≤ β for some β > 0, then we can encode this goal as the initial aspiration set
E_0 = [α, 1]^k × [0, β]^{d−k}. Note that the different events need not be independent or
mutually exclusive, as long as the aspiration is feasible. Aspirations of this type
might be especially natural in combination with methods of inductive reasoning
and belief revision that are also based on this type of encoding [8]. This could
eventually be useful for a “provably safe” approach to AI [6].

7.2 Relationship to reinforcement learning


Even though we formulated our approach in a planning framework where the
environment’s transition probabilities are known and simple enough to admit
dynamic programming, it is clear from Eq. (11) that the required reference poli-
cies π and corresponding reference vertices V π (s), Qπ (s, a) can in principle also
be approximated by reinforcement learning techniques such as (deep) expected

SARSA in more complex environments or environments that are given only as


samplers without access to transition probabilities. For the single-criterion case,
preliminary results from numerical experiments suggest that this is indeed a viable approach.⁸ Future work should explore this further and also consider using
approximate dynamic programming methods (e.g., [2]).
If the expected number of learning passes needed to find the necessary reference policies is indeed O(d) as conjectured (see end of Section 5)⁹, our approach might turn out to have much lower average complexity than the alternative reinforcement learning approach to convex aspirations from [9], which appears to require up to O(ε^{−2}) many learning passes to achieve an error of less than ε.

7.3 Invariance under reparameterization

For many applications there will be several possible parameterizations of the d-dimensional evaluation space into d different evaluation metrics, so the question
arises which parts of our approach are invariant under which types of reparam-
eterizations of evaluation space. It is easy to see that all parts are invariant
under affine transformations, except for the algorithm for finding reference poli-
cies which is only invariant under orthogonal transformations since it makes use
of angles, and except for certain safety criteria such as total variance.

Acknowledgments. We thank the members of the SatisfIA project, AI Safety Camp,


the Supervised Program for Alignment Research, and the organizers of the Virtual AI
Safety Unconference.

Supplementary Materials. This article is accompanied by a supplementary text containing alternative versions of the main algorithm, and a supplementary video illustrating the evolution of action-aspirations over a sample episode with d = 2. Python code is available at https://doi.org/10.5281/zenodo.13221511.

Disclosure of Interests. The authors have no competing interests to declare.

CRediT author statement. Authors are listed in alphabetical ordering and have
contributed equally. Simon Dima: Formal analysis, Writing - Original Draft, Writing -
Review & Editing. Simon Fischer: Formal analysis, Writing - Original Draft, Writing -
Review & Editing. Jobst Heitzig: Conceptualization, Methodology, Software, Writing -
Original Draft, Writing - Review & Editing, Supervision. Joss Oliver: Formal analysis,
Writing - Original Draft, Writing - Review & Editing.

⁸ Note however that some safety-related action selection criteria, especially those based on information-theoretic concepts, require access to transition probabilities which would then have to be learned in addition to the reference simplices.
⁹ Farsighted action selection criteria would require an additional learning pass to also learn the actual policy and the resulting action evaluations.

References
1. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016)
2. Bonet, B., Geffner, H.: Solving POMDPs: RTDP-Bel vs. point-based algorithms. In: IJCAI. pp. 1641–1646. Pasadena CA (2009)
3. Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., et al.: Decision transformer: Reinforcement learning via sequence modeling (2021)
4. Clymer, J., et al.: Generalization analogies (GENIES): A testbed for generalizing AI oversight to hard-to-measure domains. arXiv preprint arXiv:2311.07723 (2023)
5. Conitzer, V., Freedman, R., Heitzig, J., Holliday, W.H., Jacobs, B.M., Lambert, N., Mossé, M., Pacuit, E., Russell, S., et al.: Social choice for AI alignment: Dealing with diverse human feedback. arXiv preprint arXiv:2404.10271 (2024)
6. Dalrymple, D., Skalse, J., Bengio, Y., Russell, S., Tegmark, M., Seshia, S., Omohundro, S., Szegedy, C., et al.: Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. arXiv preprint arXiv:2405.06624 (2024)
7. Feinberg, E.A., Sonin, I.: Notes on equivalent stationary policies in Markov decision processes with total rewards. Math. Methods Oper. Res. 44(2), 205–221 (1996). https://doi.org/10.1007/BF01194331
8. Kern-Isberner, G., Spohn, W.: Inductive reasoning, conditionals, and belief dynamics. Journal of Applied Logics 2631(1), 89 (2024)
9. Miryoosefi, S., Brantley, K., Daumé, H., Dudík, M., Schapire, R.E.: Reinforcement learning with convex constraints. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems (2019)
10. Simon, H.A.: Rational choice and the structure of the environment. Psychological Review 63(2), 129 (1956)
11. Skalse, J.M.V., Farrugia-Roberts, M., Russell, S., Abate, A., Gleave, A.: Invariance in policy optimisation and partial identifiability in reward learning. In: International Conference on Machine Learning. pp. 32033–32058. PMLR (2023)
12. Subramani, R., Williams, M., et al.: On the expressivity of objective-specification formalisms in reinforcement learning. arXiv preprint arXiv:2310.11840 (2023)
13. Taylor, J.: Quantilizers: A safer alternative to maximizers for limited optimization (2015), https://intelligence.org/files/QuantilizersSaferAlternative.pdf
14. Tschantz, A., et al.: Reinforcement learning through active inference (2020)
15. Vaidya, P.: Speeding-up linear programming using fast matrix multiplication. In: 30th Annual Symposium on Foundations of Computer Science. pp. 332–337 (1989)
16. Vamplew, P., Foale, C., Dazeley, R., Bignold, A.: Potential-based multiobjective reinforcement learning approaches to low-impact agents for AI safety. Engineering Applications of Artificial Intelligence 100, 104186 (2021)
17. Wendel, J.G.: A problem in geometric probability. Mathematica Scandinavica 11(1), 109–111 (1962)
18. Yen, I.E.H., Zhong, K., Hsieh, C.J., Ravikumar, P.K., Dhillon, I.S.: Sparse linear programming via primal and dual augmented coordinate descent. Advances in Neural Information Processing Systems 28 (2015)
Supplement to: Non-maximizing policies that fulfill multi-criterion aspirations in expectation

Simon Dima¹ [0009-0003-7815-8238], Simon Fischer² [0009-0000-7261-3031], Jobst Heitzig³ [0000-0002-0442-8077], and Joss Oliver⁴ [0009-0008-6333-7598]

¹ École Normale Supérieure, Paris, France, [email protected]
² Independent Researcher, Cologne, Germany
³ Potsdam Institute for Climate Impact Research, Potsdam, Germany, [email protected]
⁴ Independent Researcher, London, UK, [email protected]

1 Clipping instead of shrinking

1.1 While propagating action-aspirations to state-aspirations

In section 4.1 of the main text, we define shifted copies Xs′ of the action-
aspiration E, which would be appropriate candidates for subsequent state-aspi-
rations if not for the fact that they are not necessarily subsets of V R (s′ ). In the
main text, we resolve this by shrinking Xs′ until it fits.
An alternative approach is clipping, where we calculate the sets Xs′ and
simply set Es′ = Xs′ ∩ V R (s′ ).

1.2 While choosing actions and action-aspirations

In section 4.2.(iii) of the main text, we shift the aspiration set in a certain
direction y and shrink it until it fits into the action feasibility reference simplices
QR (s, ai ).
Clipping is also an alternative here, as illustrated in figure 1. To use clipping
instead of shrinking, algorithm 2 of the main text can be modified by replacing
the procedure ShrinkAspiration with the procedure ClipAspiration defined
here:

Algorithm 1 Clipping variation for algorithm 2 of main text.

1: procedure ClipAspiration(a_i, i)
2:   v ← V^{π_i}(s) if i > 0, else C(Q^R(s, a_i))   ▷ vertex or center of target
3:   y ← v − x   ▷ shifting direction
4:   r ← max{r ≥ 0 : x − ry ∈ E}   ▷ distance from the boundary point b of E to x
5:   ℓ ← min{ℓ ≥ 0 : x + ℓy ∈ Q^R(s, a_i)}   ▷ distance from x to the reference simplex
6:   ℓ′ ← max{ℓ′ ≥ 0 : x + ℓ′y ∈ Q^R(s, a_i)}   ▷ distance from x to the far boundary of the reference simplex
7:   z ← min(r + ℓ, ℓ′) y   ▷ shifting vector
8:   return (z + E) ∩ Q^R(s, a_i)   ▷ shift and clip (nonempty)

The direction y is determined as in the shrinking variant. Next, we determine


by linear programming the outermost point b ∈ E lying on the ray from x in the
direction −y. E is shifted in direction y, stopping either when the shifted bound-
ary point b (and thus generally a large part of E) enters the reference simplex
QR (s, ai ), or when the shifted center x exits QR (s, ai ). The appropriate shifting
distance can be computed using linear programming. This is done in lines 5 to 7
of Algorithm 1. Finally, the shifted copy of E is clipped to QR (s, ai ) by uniting
their H-representations to get Eai , i.e., consider the set of all hyperplanes on
which a facet of either E or QR (s, ai ) lies. Then the vertices of Eai are computed
from the resulting H-representation. The sets Eai are nonempty because the ray
from x in direction y intersects QR (s, ai ).
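As a sketch of how the clipped set could be computed with standard tools, one can convert both sets to H-representation, stack the constraints, and recover the vertices from a halfspace intersection. The helper below is our own illustration rather than the accompanying implementation; it assumes the clipped set is full-dimensional, so that a strictly interior point (here the Chebyshev center) exists.

import numpy as np
from scipy.optimize import linprog
from scipy.spatial import ConvexHull, HalfspaceIntersection

def clip_shifted_aspiration(E_vertices, shift, Q_vertices):
    """Vertices of (shift + E) intersected with conv(Q_vertices), both given by
    their vertices, via the union of their H-representations."""
    halfspaces = np.vstack([ConvexHull(np.asarray(E_vertices) + shift).equations,
                            ConvexHull(np.asarray(Q_vertices)).equations])
    A, b = halfspaces[:, :-1], halfspaces[:, -1]         # constraints A z + b <= 0
    # Chebyshev center of the intersection as a strictly interior point
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    res = linprog(np.r_[np.zeros(A.shape[1]), -1.0],     # maximize the inradius
                  A_ub=np.hstack([A, norms]), b_ub=-b,
                  bounds=[(None, None)] * A.shape[1] + [(0.0, None)], method="highs")
    interior_point = res.x[:-1]
    return HalfspaceIntersection(halfspaces, interior_point).intersections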

[Figure 1 omitted: a 2-d illustration analogous to Fig. 1 of the main text, additionally showing the boundary point b and the clipped action-aspirations E_{a_i}.]

Fig. 1. Construction of action-aspirations E_a from state-aspiration E and reference simplices V^R(s) and Q^R(s, a) by shifting and clipping. See main text for details.

1.3 Advantages and disadvantages of clipping

Clipping instead of shrinking has the advantage that it produces strictly larger
propagated aspiration sets, which might be desirable as discussed in section 3 of
the main text.
However, clipping changes the shape of aspiration sets, adding up to d +
1 defining hyperplanes at each step, and requiring recomputation of the set
of vertices which may become combinatorially large. Therefore, we expect the
complexity of the clipping variant to be worse than shrinking, though we have
not studied it formally.

2 Alternate version of action selection/local policy computation

Some action selection criteria which we may wish to use are “farsighted” and
require knowing the future behavior of the policy as well. Algorithm 2 of the
main text does not provide this information, as it samples candidate actions
ai first before determining the probabilities with which to mix them. The al-
gorithm 2 presented here is an adaptation which does provide the full local
policy before sampling from it to choose the action taken. It does so by replac-
ing the sampling in lines 4 and 7 of the original algorithm by a computation
of the respective probabilities. Instead of solving the linear program in line 6
of main-text algorithm 2 for each of the at most |A|^{d+2} many candidate action combinations (a_0, . . . , a_{d+1}), line 7 uses a single linear program invocation manipulating the average action-aspirations for each direction, \mathbb{E}_{a_i ∼ π_i} E_{a_i}. This ensures complexity remains linear in |A|. A solution to this linear program exists since \mathbb{E}_{a_i ∼ π_i} E_{a_i} ⊆ E_s + \mathbb{E}_{a_i ∼ π_i} z_{a_i} and \mathbb{E}_{a_i ∼ π_i} z_{a_i} is a positive multiple of V^{π_i}(s) − x; the proof of Lemma 2 of the main text is readily adapted. Finally, the loop in lines 8 to 11 is necessary as one and the same action may lie in two distinct directional action sets A_i ≠ A_j.

Algorithm 2 Local policy computation (“clip” and “shrink” variants)

Require: Reference simplex vertices V^{π_i}(s), Q^{π_i}(s, a); state s; state-aspiration E_s.
1: x ← C(E_s); E′ ← E_s − x   ▷ center, shape
2: for i = 0, . . . , d + 1 do
3:   A_i ← DirectionalActionSet(i)
4:   π_i ← CandidateActionDistribution(A_i)   ▷ any probability distribution on A_i
5:   for a_i ∈ A_i do
6:     E_{a_i} ← ClipAspiration(a_i, i) or ShrinkAspiration(a_i, i)   ▷ action-aspiration
7: solve linear program p_0 = max!, \sum_{i=0}^{d+1} p_i \mathbb{E}_{a_i ∼ π_i} E_{a_i} ⊆ E_s, for p ∈ ∆({0, . . . , d + 1})
8: initialize π ≡ 0
9: for i = 0, . . . , d + 1 do   ▷ collect probabilities across directions
10:   for a_i ∈ A_i do
11:     π(a_i, E_{a_i}) ← π(a_i, E_{a_i}) + p_i π_i(a_i)
return π   ▷ local policy

3 Video of an example run

In the video accessible at the anonymous download link https://mega.nz/file/gUgTzZqB#Co21aSoKQAo7PjeR9_kzxzquHdjlNav0qjp9wKX3NRU, we illustrate the propagation of aspirations from states to actions in one episode of a simple gridworld environment.
In that environment, the agent starts at the central position (3,3) in a 5-by-5
rectangular grid, can move up, down, left, right, or pass in each of 10 time steps,

and gets a Delta f (s, a, s′ ) = g(s′ ) that only depends on its next position s′ on
the grid. The values g(s′ ) are iid 2-d standard normal random values.
In each time-step, the black dashed triangle is the current state’s reference
simplex V R (s), the blue triangles are the five candidate actions’ reference sim-
plices QR (s, a), the red square is the state-aspiration E, the dotted lines connect
its center, the red dot, with the vertices of V R (s) or with the centers of the sets
QR (s, a), and the green squares are the resulting action-aspirations Ea .
The coordinate system is “moving along” with the received Delta, so that aspirations of consecutive steps can be compared more easily. In other words, the received Delta R_t = \sum_{t'=0}^{t-1} f(s_{t'}, a_{t'+1}, s_{t'+1}) is added to all vertices so that, e.g., the vertices of the black dashed triangle are shown at positions R_t + V^{π_i}(s_t) rather than V^{π_i}(s_t).
This run uses the linear volume shrinking schedule r_max(s) = (1 − 1/T(s))^{1/d}.
As one can see, the state-aspirations indeed shrink smoothly towards a final
point aspiration, which spends the initial amount of “freedom” rather evenly
across states. This way, the agent manages to avoid drift in this case, so that
the eventual point aspiration is still inside the initial aspiration set.

4 Numerical evidence for reference policy selection complexity
In order to study the number of trials needed to find the d + 1 directions that
define suitable reference policies for the definition of reference simplices, we
performed numerical experiments with binary-tree-shaped environments of the
following type.
Each environment has 10 time steps, each state has two actions, and each action can lead to two different successor states, leading to 4^{10} many terminal states. Deltas f(s, a, s') are iid uniform variates in [0, 1]^d. The initial aspiration is (5, . . . , 5).
For each d ∈ {1, . . . , 8}, we perform 1000 independent simulations and find
that the average number k of trials until E0 ∈ conv{v1 , . . . , vk } is below 2d + 1
for all tested d.
