Non-Maximizing Policies That Fulfill Multi-Criterion Aspirations in Expectation
³ Potsdam Institute for Climate Impact Research, Potsdam, Germany, [email protected]
⁴ Independent Researcher, London, UK, [email protected]
1 Introduction
expressed by such a function [12], human preferences are hard to learn and may
not be easy to aggregate across stakeholders [5], and maximizing a misspecified
objective may fall prey to reward hacking [1] and Goodhart’s law [11], leading
to unintended side-effects and potentially harmful consequences.
In this work, we study a particular aspiration-based approach to agent design.
We assume an existing task-specific world model in the form of a fully observed
Markov Decision Process (MDP), where the task is not encoded by a reward
function but instead by a multi-criterion evaluation function and a bounded, convex
subset of its range, called an aspiration set, that can be thought of as an
“instruction” [4] to the agent. Aspiration-type goals can also naturally arise from
subtasks in complex environments even if the overall goal is to maximize some
objective, when the complexity requires a hierarchical decision-making approach
whose highest level selects subtasks that turn into aspiration sets for lower
hierarchical levels.
In our version of aspiration-based agents, the goal is to make the expected
value of the total with respect to this evaluation function fall within the aspi-
ration set, and select from this set according to certain performance and safety
criteria. They do so step-wise, exploiting recursion equations similar to the Bell-
man equation. Thus our approach is like multi-objective reinforcement learning
(MORL), with a primary aspiration-based objective and at least one secondary
objective incorporated via action-selection criteria [16]. Unlike MORL, the com-
ponents of the evaluation function (called evaluation metrics) are not objectives
in the sense of targets for maximization. Rather, an aspiration formulated w.r.t.
several evaluation metrics might correspond to a single objective (e.g., “make
a cup of tea”). Also, at no point does an aspiration-based agent aggregate the
evaluation metrics into a single value. Instead, any trade-offs are built into the
aspiration set itself, similar to what [6] call a “safety specification”. For example,
aspiring to buy a total of 10 oranges and/or apples for at most EUR 1 per item
could be encoded with the aspiration set {(o, a, c) : o, a ≥ 0; c ≤ o + a = 10}.
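As a concrete illustration (not from the paper), such an aspiration set can be represented by a handful of linear constraints and tested for membership; the following minimal Python sketch uses purely illustrative names.

import numpy as np

# Aspiration set {(o, a, c) : o, a >= 0, o + a = 10, c <= o + a}, written as A x <= b
# (the equality is encoded as two opposite inequalities).
A = np.array([
    [-1.0,  0.0, 0.0],   # -o <= 0          (o >= 0)
    [ 0.0, -1.0, 0.0],   # -a <= 0          (a >= 0)
    [ 1.0,  1.0, 0.0],   #  o + a <= 10
    [-1.0, -1.0, 0.0],   # -(o + a) <= -10  (o + a >= 10)
    [-1.0, -1.0, 1.0],   #  c - (o + a) <= 0 (at most EUR 1 per item on average)
])
b = np.array([0.0, 0.0, 10.0, -10.0, 0.0])

def in_aspiration(x, tol=1e-9):
    """Check whether an expected-Total vector x = (o, a, c) lies in the polytope."""
    return bool(np.all(A @ x <= b + tol))

print(in_aspiration(np.array([6.0, 4.0, 9.5])))   # True: 10 items for EUR 9.50
print(in_aspiration(np.array([6.0, 4.0, 11.0])))  # False: more than EUR 1 per item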
A similar set-up to ours is used in [9], which applies reinforcement
learning to find a policy whose expected discounted reward vector lies inside
a convex set. Instead of a reinforcement learning perspective, we use a model-
based planning perspective and design an algorithm that explicitly calculates a
policy for solving the task, based on a model of the environment.⁵ Also, the
approach in [9] is concerned with guaranteed bounds for the distance between
received rewards and the convex constraints in terms of the number of iterations,
whereas we focus on guaranteeing aspiration satisfaction in a fixed number of
computational steps, providing a verifiable guarantee in the sense of [6].
Other agent designs that follow a non-maximization-goal-based approach in-
clude quantilizers, decision transformers, and active inference. Quantilizers are
agents that choose actions at random from the top n% of a "base distribution" over
actions, sorted by expected return [13]. The goal for decision transformers is to
make the expected return equal a particular value R_target [3]. The goal for active
inference agents is to produce a particular probability distribution of observations [14].
While the goal space for quantilizers and decision transformers, being based on a single
real-valued function, is often too restricted for many applications, that of active
inference agents (all probability distributions) appears too wide for the formal study of
many aspects of aspiration-based decision-making. Our approach is of intermediary complexity.

⁵ Nevertheless, our algorithm can also be straightforwardly adapted to learning.
An important consideration in this work, to ensure tractability in large envi-
ronments, is also the computational complexity in the number of actions, states,
and evaluation metrics. We will see that for our algorithm, the preparation of an
episode has linear complexity in the number of possible state–action–successor
transitions and (conjectured and numerically confirmed) linear average complex-
ity in the number of evaluation metrics, and then the additional per-time-step
complexity of the policy is linear in the number of actions, constant in the num-
ber of states, and polynomial in the number of evaluation metrics.
Our work is also relevant to the emerging AI safety/alignment field, which views un-
intended consequences of maximization, e.g., reward hacking and Goodhart's
law, as a major source of risk once agentic AI systems become very capable [1].
2 Preliminaries
Value functions. Given a policy π and an evaluation function f , the state value
function V π : M × S → Rd is defined as the expected Total accumulated in
future steps while following policy π. In particular, V π (m0 , s0 ) is the expected
Total of the whole trajectory: V π (m0 , s0 ) = E(τ ). Likewise, we define the action
value function Q^π : S × A × M → R^d. These satisfy the corresponding Bellman equations.
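To make these recursions concrete, the following minimal sketch computes V^π and Q^π by backward induction for a memoryless policy on a finite-horizon model; the dictionary-based data structures and the function name are illustrative assumptions, not the paper's implementation.

import numpy as np

def policy_values(states, actions, T, f, policy, terminal, d, horizon):
    """Backward induction for V^pi(s) and Q^pi(s, a) in R^d over a fixed horizon.

    T[s][a] is a dict {s_next: probability}, f(s, a, s_next) returns a length-d
    numpy array (the per-step evaluation), policy[s] is a dict {a: probability},
    and terminal(s) says whether s is terminal.
    """
    V = {s: np.zeros(d) for s in states}   # no future Total after the last step
    Q = {}
    for _ in range(horizon):               # sweep backwards in time
        Q = {s: {a: sum(p * (f(s, a, s2) + V[s2]) for s2, p in T[s][a].items())
                 for a in actions}
             for s in states if not terminal(s)}
        V = {s: (np.zeros(d) if terminal(s)
                 else sum(p * Q[s][a] for a, p in policy[s].items()))
             for s in states}
    return V, Q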
3 Fulfilling aspirations
Aspirations, feasibility. An (initial) aspiration is a convex polytope E0 ⊂ Rd ,
representing the values of the expected total Eτ which are considered acceptable.
We say that a policy π fulfills the aspiration when it satisfies V π (m0 , s0 ) ∈ E0 .
To answer the question of whether it is possible to fulfill a given aspiration, we
introduce feasibility sets. The state-feasibility set of s ∈ S is the set of possible
values for the expected future Total from s, under any memory-based policy:
V(s) = {V π (m, s) | M finite set; m0 , m ∈ M; π ∈ ΠM }; likewise, we define the
action-feasibility set Q(s, a). It is straightforward to verify that V(s ∈ S⊤ ) = {0},
and that the following recursive equations hold for s ∉ S⊤:

Q(s, a) = E_{s′∼s,a}(f(s, a, s′) + V(s′)) = Σ_{s′} T(s, a)(s′)·(f(s, a, s′) + V(s′)),   (4)
V(s) = ∪_{p∈Δ(A)} E_{a∼p} Q(s, a) = ∪_{p∈Δ(A)} Σ_{a∈A} p(a) Q(s, a).   (5)
Approximating feasibility sets by simplices. Let conv(X) denote the convex hull
of any set X ⊆ R^d. Given any tuple R of d + 1 memoryless reference policies
π1, . . . , π_{d+1} ∈ Π^0, we define reference simplices in evaluation-space as VR(s) =
conv{V^π(s) | π ∈ R} and QR(s, a) = conv{Q^π(s, a) | π ∈ R}. It is immediate
that these are subsets of the convex feasibility sets V(s) resp. Q(s, a), and that

VR(s) ⊆ ∪_{p∈Δ(A)} E_{a∼p} QR(s, a),   (6)
QR(s, a) ⊆ E_{s′∼s,a}(f(s, a, s′) + VR(s′)).   (7)
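Since the reference simplices are spanned by d + 1 vertices, membership of a point can be decided by solving for its barycentric coordinates. A minimal sketch, assuming nondegenerate simplices stored as numpy vertex arrays (the helper names are illustrative):

import numpy as np

def barycentric_coords(vertices, x):
    """Weights lam with sum(lam) = 1 and vertices.T @ lam = x.

    `vertices` has shape (d+1, d), one row per vertex of a nondegenerate simplex;
    x lies in the simplex iff all returned coordinates are nonnegative.
    """
    d1 = np.asarray(vertices).shape[0]
    A = np.vstack([np.asarray(vertices).T, np.ones(d1)])   # (d+1) x (d+1) system
    return np.linalg.solve(A, np.append(x, 1.0))

def in_simplex(vertices, x, tol=1e-9):
    return bool(np.all(barycentric_coords(vertices, x) >= -tol))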
⁶ However, Algorithm 1 remains correct even if the aspiration sets shrink to singletons.
4 Propagating aspirations
4.1 Propagating action-aspirations to state-aspirations
To implement algorithm schema 1, we first focus on line 5 in procedure
FulfillActionAspiration, which is the easier part. Given a state-action pair
s, a and an action-aspiration set Ea ⊆ QR (s, a), we must construct nonempty
state-aspiration sets Es′ ⊆ V R (s′ ) for all possible successor states s′ , such that
Es′ ∼s,a (f (s, a, s′ ) + Es′ ) ⊆ Ea . We assume that all reference simplices are nonde-
generate, i.e. have full dimension d. This is almost surely the case if there are
sufficiently many actions and these have enough possible consequences.
Tracing maps. Under this assumption, we define tracing maps ρs,a,s′ from the
reference simplex QR (s, a) to V R (s′ ). Since domain and codomain are simplices,
we can choose ρs,a,s′ to be the unique affine linear map that maps vertices to
vertices, ρs,a,s′ (Qπi (s, a)) = V πi (s′ ). For any point e ∈ QR (s, a), it follows from
equation (2) that Es′ ∼s,a (f (s, a, s′ ) + ρs,a,s′ (e)) = e. Accordingly, to propagate
aspirations of the form Ea = {e}, it is sufficient to just set Es′ = {ρs,a,s′ (e)}. How-
ever, for general subsets Ea ⊆ QR (s, a), the set Es′ ∼s,a (f (s, a, s′ ) + ρs,a,s′ (Ea )) is
in general strictly larger than Ea , and hence we map Ea in a different way.
For this, choose an arbitrary “anchor” point e ∈ Ea ; here we let e be the
average of the vertices, C(Ea ), but any other standard type of center would also
work (e.g. analytic center, center of mass/centroid, Chebyshev center). Now, let
Xs′ = Ea − e + ρs,a,s′(e) be shifted copies of the action-aspiration. We would like
to use the Xs′ as state-aspirations, and indeed they have the property that

E_{s′∼s,a}(f(s, a, s′) + Xs′) = E_{s′∼s,a}(f(s, a, s′) + ρs,a,s′(e) − e + Ea)   (8)
                              = e − e + E_{s′∼s,a}(Ea) = Ea,  as Ea is convex.   (9)
This is almost what we want, but it might be that Xs′ is not a subset of V R (s′ ). To
rectify this, we opt for a shrinking approach, setting Es′ = rs′ ·(Ea −e)+ρs,a,s′ (e)
for the largest rs′ ∈ [0, 1] such that the result is a subset of V R (s′ ). As Es′ does
not depend on any other successor state s′′ ̸= s′ , we can wait until the true s′ is
known and only compute Es′ for that s′ , which saves computation time.
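The whole propagation step can be sketched compactly, reusing the barycentric helper from above and assuming the simplex vertices Q^{πi}(s, a) and V^{πi}(s′) and the vertices of Ea are given as numpy arrays (function names are illustrative, not the paper's implementation):

import numpy as np

def trace_point(Q_vertices, V_vertices, e):
    """Tracing map rho_{s,a,s'}: carry barycentric weights from QR(s,a) to VR(s')."""
    lam = barycentric_coords(Q_vertices, e)
    return np.asarray(V_vertices).T @ lam

def shrink_factor(V_vertices, Ea_vertices, e, rho_e):
    """Largest r in [0, 1] with r*(Ea - e) + rho_e contained in the simplex VR(s').

    Containment only needs to hold at the vertices of Ea, and barycentric
    coordinates are affine in r, so each constraint gives a closed-form bound.
    """
    r = 1.0
    base = barycentric_coords(V_vertices, rho_e)          # coordinates at r = 0
    for w in Ea_vertices:
        slope = barycentric_coords(V_vertices, (w - e) + rho_e) - base  # r = 1 minus r = 0
        for a_j, b_j in zip(base, slope):
            if b_j < 0:                                   # this coordinate shrinks as r grows
                r = min(r, a_j / (-b_j))
    return max(r, 0.0)

def propagate_action_to_state(Q_vertices, V_vertices, Ea_vertices, e):
    rho_e = trace_point(Q_vertices, V_vertices, e)
    r = shrink_factor(V_vertices, Ea_vertices, e, rho_e)
    return r * (np.asarray(Ea_vertices) - e) + rho_e      # vertices of E_{s'}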
Proposition 1. Given all values V^{πi}(s) and Q^{πi}(s, a) and both the k_c ≥ d + 1
constraints and k_v ≥ d + 1 vertices defining Ea, the shrinking version of action-
to-state aspiration propagation has time complexity O([k_c^{1.5} d + (d k_v)^{1.5}] L),
where L is the precision parameter defined in [15]. If Ea is a simplex, this is O(d^3 L).

Proof. A linear program (LP) with m constraints and n variables has time complexity
O(f(m, n) L), where f(m, n) = (m + n)^{1.5} n and L is a precision parameter [15].
C(Ea) can be calculated with an LP with m = k_c + 1 and n = d + 1. Finding the
convex coefficients for ρs,a,s′ requires solving a system of d + 1 linear equations,
needing time O(d^ω), where 2 ≤ ω < 2.5 is the exponent of matrix multiplication.
Finding the shrinking factor r is an LP with m = (d + 1) k_v and n = 1, and then
computing the constraints and vertices of Es′ is O(d(k_c + k_v)). In all, this gives a
time complexity of O(L f(k_c, d) + d^ω + L f(d k_v, 1) + d(k_c + k_v)) ≤
O(L (k_c + d)^{1.5} d + d^ω + L (d k_v)^{1.5}) ≤ O(L (k_c^{1.5} d + (d k_v)^{1.5})). ⊓⊔
(i) Directional action sets. Compute the average of the vertices of E, x = C(E),
and the “shape” E ′ = E − x. For i = 0, put Ai = A. For i > 0, let Xi be the
segment from x to the vertex V πi (s) and check for each action a ∈ A whether
its reference simplex QR (s, a) intersects Xi , using linear programming. Let Ai
be the set of those a, which is nonempty since πi (s) ∈ Ai by definition of V πi (s).
9:  procedure DirectionalActionSet(i)
10:      if i = 0, return A
11:      Xi ← conv{x, V^{πi}(s)}   ▷ segment from x to the ith vertex
12:      return {a ∈ A | QR(s, a) ∩ Xi ≠ ∅}   ▷ actions with feasible points on Xi
13:  procedure ShrinkAspiration(ai, i)
14:      v ← V^{πi}(s) if i > 0, else C(QR(s, ai))   ▷ vertex or center of target
15:      y ← v − x   ▷ shifting direction
16:      r ← max{r ∈ [0, rmax(s)] | ∃ℓ ≥ 0 : ℓy + x + rE′ ⊆ QR(s, ai)}   ▷ size of largest shifted and shrunk copy of E that fits into QR(s, ai)
17:      ℓ ← min{ℓ ≥ 0 | ℓy + x + rE′ ⊆ QR(s, ai)}   ▷ shortest shift that makes it fit
18:      return ℓy + x + rE′   ▷ shift and shrink
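The optimizations in lines 16 and 17 can be written as two small linear programs in the variables (r, ℓ), once the target simplex QR(s, ai) is available in halfspace form A_q u ≤ b_q and E′ is given by its vertices; the sketch below (illustrative names, scipy's generic LP solver in place of the one analysed in Prop. 2) checks containment at the vertices of E′.

import numpy as np
from scipy.optimize import linprog

def shrink_aspiration(x, Eprime_vertices, y, A_q, b_q, r_max):
    """ShrinkAspiration (lines 14-18): largest r, then smallest shift l along y."""
    rows, rhs = [], []
    for w in Eprime_vertices:
        for a_j, b_j in zip(A_q, b_q):
            # a_j . (l*y + x + r*w) <= b_j, linear in the two unknowns (r, l)
            rows.append([a_j @ w, a_j @ y])
            rhs.append(b_j - a_j @ x)
    rows, rhs = np.array(rows), np.array(rhs)
    res_r = linprog(c=[-1.0, 0.0], A_ub=rows, b_ub=rhs,
                    bounds=[(0.0, r_max), (0.0, None)])        # maximize r
    r = res_r.x[0]
    res_l = linprog(c=[1.0], A_ub=rows[:, 1:], b_ub=rhs - rows[:, 0] * r,
                    bounds=[(0.0, None)])                      # minimize l for that r
    l = res_l.x[0]
    return l * np.asarray(y) + np.asarray(x) + r * np.asarray(Eprime_vertices)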
[Figure: construction of the directional action-aspirations Eai — the segments Xi connect the center x of the state-aspiration E to the vertices Vi(s) = Qi(s, πi(s)) of the state reference simplex VR(s), and each Eai is a shifted and shrunk copy of E lying inside the corresponding action reference simplex QR(s, ai) or QR(s, πi(s)).]
The mixing probabilities p = (p0, . . . , p_{d+1}) are then chosen such that the resulting
mixture of action-aspirations is a subset of Es, Σ_{i=0}^{d+1} pi Eai ⊆ Es. We show below that
this equation has a solution. Because we want the action a0 that was chosen freely from
the whole action set A to be as likely as possible, we maximize its probability p0. This is
done in line 6 of Algorithm 2 using linear programming. Note that the smaller the sets Ea,
the looser the set inclusion constraint and thus the larger p0. We can influence the latter
via the shrinking schedule rmax(s), for example by putting rmax(s) = (1 − 1/T(s))^{1/d},
where T(s) is the remaining number of time steps at state s, which would reduce the amount
of freedom (in terms of the volume of the aspiration set) linearly and end in a point
aspiration for terminal states.
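A minimal sketch of the line-6 linear program, under the simplifying assumption that each candidate action-aspiration has the form Eai = ci + ri·E′ with the common shape E′ = E − x (as produced by ShrinkAspiration) and that E is given by halfspaces A_E u ≤ b_E. Then Σ_i pi Eai = (Σ_i pi ci) + (Σ_i pi ri)E′, so the inclusion in E can be imposed at the vertices of E′; the encoding and names below are illustrative, not the paper's exact LP.

import numpy as np
from scipy.optimize import linprog

def mixing_probabilities(centers, radii, Eprime_vertices, A_E, b_E):
    """Maximize p0 over probability vectors p with sum_i p_i * Ea_i contained in E."""
    n = len(radii)                              # number of candidate actions (d + 2)
    cons_A, cons_b = [], []
    for w in Eprime_vertices:                   # one constraint block per vertex of E'
        # A_E @ (sum_i p_i c_i + (sum_i p_i r_i) w) <= b_E, linear in p
        cons_A.append(np.column_stack([A_E @ (centers[i] + radii[i] * w) for i in range(n)]))
        cons_b.append(b_E)
    res = linprog(c=-np.eye(n)[0],              # maximize p_0
                  A_ub=np.vstack(cons_A), b_ub=np.concatenate(cons_b),
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * n)
    return res.x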
Lemma 2. The linear program in line 6 of Algorithm 2 has a solution.
Proof. Because x ∈ VR(s), there are convex coefficients p′ with Σ_{i=1}^{d+1} p′_i (V^{πi}(s) − x) = 0.
As each shifting vector z_i is a positive multiple of V^{πi}(s) − x, there are
also convex coefficients p with Σ_{i=1}^{d+1} p_i z_{ai} = 0. Since Eai ⊆ Es + z_{ai} and Es is
convex, we then have

Σ_i p_i Eai ⊆ Σ_i p_i (Es + z_{ai}) ⊆ Es + Σ_i p_i z_{ai} = Es.   (10)   ⊓⊔
Proposition 2. Given all values V^{πi}(s) and Q^{πi}(s, a) and both the k_c ≥ d + 1
many constraints and k_v ≥ d + 1 many vertices defining E, this part of the
construction (Algorithm 2) has time complexity O([k_v^{1.5} d^{2.5} |A| + (k_v k_c)^{1.5} d] L).
If E is a simplex, this is O(d^4 |A| L).

Proof. Using the notation from Prop. 1, computing x is O(f(k_v, d) L). For each
i, a, verifying whether a ∈ Ai and computing r, ℓ is done by LPs with m ≤ O(d k_v)
constraints and n ≤ 2 variables, giving O(d |A| f(d k_v, 2) L) = O(d^{2.5} k_v^{1.5} |A| L).
The LP for calculating p has m = d + 4 + k_v k_c and n = d + 2, hence complexity
O((k_v k_c)^{1.5} d L). The other arithmetic operations are of lower complexity. ⊓⊔
First, find some feasible aspiration point x ∈ E0 ∩ V(s0 ) (e.g., using binary
search). We now aim to find policies whose values are likely to contain x in their
convex hull, by using backwards induction and greedily minimizing the angle
between the vector Eτ − x and a suitable direction in evaluation space.
More precisely: Pick an arbitrary direction (unit vector) y1 ∈ R^d, e.g., uniformly
at random. Then, for k = 1, 2, . . . until stopping (a code sketch follows this list):
1. Using backwards induction, greedily compute a policy πk for which the angle between
Eτ − x and the direction yk is (approximately) minimized, and let vk = V^{πk}(s0) be the
resulting candidate vertex for our reference simplex. If k ≥ d + 1, run a primal-sparse
linear program solver [18] to determine whether x ∈ conv{v1, . . . , vk}. If so, the solver
will return a basic feasible solution, i.e. d + 1 many vertices that contain x in their
convex hull. Let the policies corresponding to the vertices that are part of the basic
feasible solution be our reference policies and stop. Otherwise, continue to step 2 below.
2. Let ek = (x − vk )/||x − vk ||2 be the unit vector in the direction from vk to x.
3. Let y_{k+1} = Σ_{i=1}^{k} e_i / k be the average of all those directions. Assuming that
x ∈ V(s0) and because of the hyperplane separation theorem, choosing directions
like this ensures that the algorithm doesn't loop with v_{l+1} = v_l for
some l. Also, note that to check x ∈ conv{v1, . . . , vk, v_{k+1}} it is sufficient
to check v_{k+1} − x ∈ cone{x − v1, · · · , x − vk}, the cone generated by the
negatives of the vertices centred at x. So hunting for a policy whose value lies
approximately in the direction y_{k+1} from x gives us a good chance of finding
a vertex in the aforementioned cone.
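A minimal sketch of this loop, assuming a helper greedy_policy_value(y, x) that performs the backwards-induction step and returns the candidate vertex V^{πk}(s0) for direction y; that helper, and the use of scipy's generic LP in place of the primal-sparse solver of [18], are illustrative assumptions.

import numpy as np
from scipy.optimize import linprog

def in_hull(points, x):
    """Convex weights expressing x in conv(points), or None if x is outside (LP feasibility)."""
    P = np.asarray(points)                       # shape (k, d)
    k = P.shape[0]
    A_eq = np.vstack([P.T, np.ones(k)])          # sum_i w_i p_i = x and sum_i w_i = 1
    res = linprog(np.zeros(k), A_eq=A_eq, b_eq=np.append(x, 1.0), bounds=[(0, 1)] * k)
    return res.x if res.success else None

def find_reference_vertices(x, d, greedy_policy_value, rng, max_iter=1000):
    y = rng.normal(size=d)
    y /= np.linalg.norm(y)                       # arbitrary initial unit direction y_1
    vertices, directions = [], []
    for _ in range(max_iter):
        v = greedy_policy_value(y, x)            # candidate vertex v_k = V^{pi_k}(s_0)
        vertices.append(v)
        if len(vertices) >= d + 1 and in_hull(vertices, x) is not None:
            return vertices                      # x is contained; keep a basic subset of d+1
        directions.append((x - v) / np.linalg.norm(x - v))   # e_k: unit vector from v_k to x
        y = np.mean(directions, axis=0)          # y_{k+1}: average of all directions so far
    raise RuntimeError("no containing set of vertices found")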
We were not able to prove any complexity bounds for this part, but performed
numerical experiments with random binary tree-shaped environments of various
time horizons, with only two actions (making the directions vk −x deviate rather
much from y and thus presenting a kind of worst-case scenario) and two possible
successor states per step, and uniformly random transition probabilities and
Deltas ∈ [0, 1]^d. These suggest that the expected number of iterations of the
above is O(d), which we thus conjecture to be also the case for other sufficiently
well-behaved random environments. Indeed, even if the policies πk were chosen
uniformly at random (rather than targeted like here) and the corresponding
points V πk (s0 ) were distributed uniformly in all directions around x (which is a
plausible uninformative prior assumption), then one can show easily (using [17])
that the expected number of iterations would be exactly 2d + 1.⁷
⁷ According to [17], the probability that we need exactly k ≥ d + 1 iterations is
f(k − 1) − f(k), with f(k) = 2^{1−k} Σ_{ℓ=0}^{d−1} (k−1 choose ℓ). Hence
E[k] = Σ_{k=d+1}^{∞} k (f(k−1) − f(k)) = (d + 1) f(d) + Σ_{k=d+1}^{∞} f(k). It is then
reasonably simple to prove by induction that Σ_{k=d+1}^{∞} f(k) = d, and hence E[k] = 2d + 1.
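As a quick numerical plausibility check of this 2d + 1 figure (not part of the paper), one can sample directions uniformly on the sphere and count how many points are needed before the origin lies in their convex hull, reusing the in_hull helper sketched above:

import numpy as np

def iterations_until_containment(d, rng, max_points=200):
    """Number of uniformly random unit vectors needed to contain the origin in their hull."""
    points = []
    while len(points) < max_points:
        v = rng.normal(size=d)
        points.append(v / np.linalg.norm(v))
        if len(points) >= d + 1 and in_hull(points, np.zeros(d)) is not None:
            return len(points)
    return max_points

rng = np.random.default_rng(0)
for d in (2, 3, 4):
    mean = np.mean([iterations_until_containment(d, rng) for _ in range(2000)])
    print(d, mean)   # empirically close to 2*d + 1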
Disordering potential. Our first safety criterion is related to the idea of “fail
safety”, and (somewhat more loosely) to “power seeking”. More precisely, it aims
to prevent the agent from moving the system into a state from which it could
make the environment’s state trajectory become very unpredictable (bring the
environment into “disorder”) because of an internal failure, or if it wanted to. We
define the disordering potential at a state to be the Shannon entropy H π (st ) of
the stochastic state trajectory S>t = (St+1 , St+2 , . . . ) that would arise from the
policy π which maximizes that entropy:
H^π(s) = 1_{s∉S⊤} E_{a∼π(s)}(− log π(s)(a) + H^π(s, a)),   (15)
H^π(s, a) = E_{s′∼s,a}(− log T(s, a)(s′) + H^π(s′)).   (16)
To find the maximally disordering policy π, we assume π(s′) and thus H^π(s′) is
already known for all potential successors s′ of s. Then H^π(s, a) is also known
for all a, and to find p_a = π(s)(a) we need to maximize f(p) = Σ_a p_a (H^π(s, a) −
log p_a) subject to Σ_a p_a = 1. Using Lagrange multipliers, we find that for all a,
∂_{p_a} f(p) = H^π(s, a) − log p_a − 1 = λ for some constant λ, hence p_a ∝ exp(H^π(s, a))
is a softmax policy w.r.t. future expected Shannon entropy. Therefore

π(s)(a) = p_a = exp(H^π(s, a))/Z,   Z = Σ_a exp(H^π(s, a)),   (17)
H^π(s) = log Z = log Σ_a exp(H^π(s, a)).   (18)
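A minimal sketch of computing the disordering potential by backward induction on a finite-horizon model, using the same illustrative dictionary-based structures as the earlier value-function sketch and scipy's logsumexp for eq. (18):

import numpy as np
from scipy.special import logsumexp

def disordering_potential(states, actions, T, terminal, horizon):
    """H^pi(s) for the maximally disordering (entropy-softmax) policy, eqs. (15)-(18)."""
    H = {s: 0.0 for s in states}                 # zero entropy after the last step
    for _ in range(horizon):
        H_new = {}
        for s in states:
            if terminal(s):
                H_new[s] = 0.0
                continue
            # H^pi(s, a) = E_{s'~s,a}[ -log T(s,a)(s') + H^pi(s') ], eq. (16)
            Hsa = [sum(p * (-np.log(p) + H[s2]) for s2, p in T[s][a].items())
                   for a in actions]
            H_new[s] = logsumexp(Hsa)            # eq. (18): log sum_a exp(H^pi(s, a))
        H = H_new
    return H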
KLdiv(s, p̂_a, a, Ea) = log(p̂_a / π⁰(s)(a)) + E_{s′∼s,a} KLdiv(s′, g(s, a, Ea, s′)),   (19)
KLdiv(s, E) = 1_{s∉S⊤} E_{(a,Ea)∼π(s,E)} KLdiv(s, π(s, E)(a), a, Ea),   (20)
In the special case of a single evaluation metric (d = 1), the reference simplices are
intervals QR(s, a) = [Q^{πmin}(s, a), Q^{πmax}(s, a)], where πmin, πmax are the minimizing
and maximizing policies for the single evaluation metric. Aspiration sets are also
intervals, and action-aspirations Ea are constructed by shifting the state-aspiration E
upwards or downwards into QR(s, a) and shrinking it to that interval if necessary. To
maximize p_{a0}, the linear program for p will assign zero probability to that "directional"
action a1 or a2 whose Ea lies in the same direction from E as E_{a0} does.
In other words, the agent will mix between the "freely" chosen action a0 and a
suitable amount of a "counteracting" action a1 or a2 in the other direction.
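To illustrate the d = 1 case, here is a small sketch (intervals as (lo, hi) pairs, illustrative names) that maximizes the probability of the freely chosen action a0 while keeping the mixture of action-aspirations inside the state-aspiration interval:

import numpy as np
from scipy.optimize import linprog

def mix_1d(E, Ea_candidates):
    """Maximize p0 such that sum_i p_i * Ea_i (interval arithmetic) lies inside E."""
    lo, hi = E
    los = np.array([ea[0] for ea in Ea_candidates])
    his = np.array([ea[1] for ea in Ea_candidates])
    n = len(Ea_candidates)
    res = linprog(c=-np.eye(n)[0],                       # maximize p_0
                  A_ub=np.vstack([-los, his]),           # sum p_i lo_i >= lo, sum p_i hi_i <= hi
                  b_ub=np.array([-lo, hi]),
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * n)
    return res.x

# E = (4, 6); a0 aims high, a1 counteracts downwards, a2 points the same way as a0.
print(mix_1d((4.0, 6.0), [(5.5, 7.0), (2.0, 3.0), (6.0, 8.0)]))   # approx. [0.75, 0.25, 0.0]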
CRediT author statement. Authors are listed in alphabetical order and have
contributed equally. Simon Dima: Formal analysis, Writing - Original Draft, Writing -
Review & Editing. Simon Fischer: Formal analysis, Writing - Original Draft, Writing -
Review & Editing. Jobst Heitzig: Conceptualization, Methodology, Software, Writing -
Original Draft, Writing - Review & Editing, Supervision. Joss Oliver: Formal analysis,
Writing - Original Draft, Writing - Review & Editing.
⁸ Note however that some safety-related action selection criteria, especially those based
on information-theoretic concepts, require access to transition probabilities, which
would then have to be learned in addition to the reference simplices.
⁹ Farsighted action selection criteria would require an additional learning pass to also
learn the actual policy and the resulting action evaluations.
References
1. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Con-
crete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016)
2. Bonet, B., Geffner, H.: Solving POMDPs: RTDP-Bel vs. point-based algorithms.
In: IJCAI. pp. 1641–1646. Pasadena CA (2009)
3. Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P.,
et al.: Decision transformer: Reinforcement learning via sequence modeling (2021)
4. Clymer, J., et al.: Generalization analogies (GENIES): A testbed for generalizing
AI oversight to hard-to-measure domains. arXiv preprint arXiv:2311.07723 (2023)
5. Conitzer, V., Freedman, R., Heitzig, J., Holliday, W.H., Jacobs, B.M., Lambert,
N., Mossé, M., Pacuit, E., Russell, S., et al.: Social choice for AI alignment: Dealing
with diverse human feedback. arXiv preprint arXiv:2404.10271 (2024)
6. Dalrymple, D., Skalse, J., Bengio, Y., Russell, S., Tegmark, M., Seshia, S., Omohun-
dro, S., Szegedy, C., et al.: Towards guaranteed safe AI: A framework for ensuring
robust and reliable AI systems. arXiv preprint arXiv:2405.06624 (2024)
7. Feinberg, E.A., Sonin, I.: Notes on equivalent stationary policies in Markov decision
processes with total rewards. Math. Methods Oper. Res. 44(2), 205–221 (1996).
https://doi.org/10.1007/BF01194331
8. Kern-Isberner, G., Spohn, W.: Inductive reasoning, conditionals, and belief dy-
namics. Journal of Applied Logics 2631(1), 89 (2024)
9. Miryoosefi, S., Brantley, K., Daumé, H., Dudík, M., Schapire, R.E.: Reinforce-
ment learning with convex constraints. In: Proceedings of the 33rd International
Conference on Neural Information Processing Systems (2019)
10. Simon, H.A.: Rational choice and the structure of the environment. Psychological
review 63(2), 129 (1956)
11. Skalse, J.M.V., Farrugia-Roberts, M., Russell, S., Abate, A., Gleave, A.: Invariance
in policy optimisation and partial identifiability in reward learning. In: Interna-
tional Conference on Machine Learning. pp. 32033–32058. PMLR (2023)
12. Subramani, R., Williams, M., et al.: On the expressivity of objective-specification
formalisms in reinforcement learning. arXiv preprint arXiv:2310.11840 (2023)
13. Taylor, J.: Quantilizers: A safer alternative to maximizers for limited optimization
(2015), https://intelligence.org/files/QuantilizersSaferAlternative.pdf
14. Tschantz, A., et al.: Reinforcement learning through active inference (2020)
15. Vaidya, P.: Speeding-up linear programming using fast matrix multiplication. In:
30th Annual Symposium on Foundations of Computer Science. pp. 332–337 (1989)
16. Vamplew, P., Foale, C., Dazeley, R., Bignold, A.: Potential-based multiobjective
reinforcement learning approaches to low-impact agents for AI safety. Engineering
Applications of Artificial Intelligence 100, 104186 (2021)
17. Wendel, J.G.: A problem in geometric probability. Mathematica Scandinavica
11(1), 109–111 (1962)
18. Yen, I.E.H., Zhong, K., Hsieh, C.J., Ravikumar, P.K., Dhillon, I.S.: Sparse linear
programming via primal and dual augmented coordinate descent. Advances in
Neural Information Processing Systems 28 (2015)
Supplement to: Non-maximizing policies that
fulfill multi-criterion aspirations in expectation
In section 4.1 of the main text, we define shifted copies Xs′ of the action-
aspiration Ea, which would be appropriate candidates for subsequent state-aspi-
rations if not for the fact that they are not necessarily subsets of VR(s′). In the
main text, we resolve this by shrinking Xs′ until it fits.
An alternative approach is clipping, where we calculate the sets Xs′ and
simply set Es′ = Xs′ ∩ V R (s′ ).
In section 4.2.(iii) of the main text, we shift the aspiration set in a certain
direction y and shrink it until it fits into the action feasibility reference simplices
QR (s, ai ).
Clipping is also an alternative here, as illustrated in figure 1. To use clipping
instead of shrinking, algorithm 2 of the main text can be modified by replacing
the procedure ShrinkAspiration with the procedure ClipAspiration defined
here:
[Figure 1: the clipping variant — the shifted copies Xi of the state-aspiration E are intersected (clipped) with the action reference simplices QR(s, ai) or QR(s, πi(s)) inside VR(s), whose vertices are Vi(s) = Qi(s, πi(s)), to obtain the action-aspirations Eai.]
Clipping instead of shrinking has the advantage that it produces strictly larger
propagated aspiration sets, which might be desirable as discussed in section 3 of
the main text.
However, clipping changes the shape of aspiration sets, adding up to d +
1 defining hyperplanes at each step, and requiring recomputation of the set
of vertices which may become combinatorially large. Therefore, we expect the
complexity of the clipping variant to be worse than shrinking, though we have
not studied it formally.
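A minimal sketch of clipping in halfspace representation (illustrative; it assumes both polytopes are given as A u ≤ b systems and that a strictly interior point of the intersection, e.g. the traced anchor point, is available for scipy's vertex recovery, which is where the combinatorial growth mentioned above appears):

import numpy as np
from scipy.spatial import HalfspaceIntersection

def clip(A1, b1, A2, b2, interior_point):
    """Intersection of {u : A1 u <= b1} and {u : A2 u <= b2}.

    The clipped set is described simply by concatenating the constraint systems;
    recovering its (possibly many) vertices is the expensive part.
    """
    A = np.vstack([A1, A2])
    b = np.concatenate([b1, b2])
    halfspaces = np.hstack([A, -b.reshape(-1, 1)])   # scipy format: rows [A, -b], A u - b <= 0
    vertices = HalfspaceIntersection(halfspaces, np.asarray(interior_point, float)).intersections
    return (A, b), vertices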
Some action selection criteria which we may wish to use are "farsighted" and
require knowing the future behavior of the policy as well. Algorithm 2 of the
main text does not provide this information, as it samples candidate actions
ai first before determining the probabilities with which to mix them. The al-
gorithm 2 presented here is an adaptation which does provide the full local
policy before sampling from it to choose the action taken. It does so by replac-
ing the sampling in lines 4 and 7 of the original algorithm by a computation
of the respective probabilities. Instead of solving the linear program in line 6
of main-text algorithm 2 for each of the at most |A|^{d+2} many candidate ac-
tion combinations (a0, . . . , a_{d+1}), line 7 uses a single linear program invocation
manipulating the average action-aspirations for each direction, E_{ai∼πi} Eai. This
ensures complexity remains linear in |A|. A solution to this linear program ex-
ists since E_{ai∼πi} Eai ⊆ Es + E_{ai∼πi} z_{ai} and E_{ai∼πi} z_{ai} is a positive multiple of
V^{πi}(s) − x; the proof of Lemma 2 of the main text is readily adapted. Finally,
the loop in lines 8 to 11 is necessary as the same action may lie in two distinct
directional action sets Ai ≠ Aj.
In this example, the agent moves on a grid and gets a Delta f(s, a, s′) = g(s′) that only
depends on its next position s′ on the grid. The values g(s′) are iid 2-d standard normal random values.
In each time-step, the black dashed triangle is the current state’s reference
simplex V R (s), the blue triangles are the five candidate actions’ reference sim-
plices QR (s, a), the red square is the state-aspiration E, the dotted lines connect
its center, the red dot, with the vertices of V R (s) or with the centers of the sets
QR (s, a), and the green squares are the resulting action-aspirations Ea .
The coordinate system is "moving along" with the received Delta, so that
aspirations of consecutive steps can be compared more easily. In other words,
the received Delta R_t = Σ_{t′=0}^{t−1} f(s_{t′}, a_{t′}, s_{t′+1}) is added to all vertices so that,
e.g., the vertices of the black dashed triangle are shown at positions R_t + V^{πi}(s_t)
rather than V^{πi}(s_t).

This run uses the linear volume shrinking schedule rmax(s) = (1 − 1/T(s))^{1/d}.
As one can see, the state-aspirations indeed shrink smoothly towards a final
point aspiration, which spends the initial amount of “freedom” rather evenly
across states. This way, the agent manages to avoid drift in this case, so that
the eventual point aspiration is still inside the initial aspiration set.