Multi-Agent Learning Dynamics
Abstract
In this paper we survey the basics of Reinforcement Learning and (Evolutionary) Game Theory,
applied to the field of Multi-Agent Systems. The paper consists of three parts. We start with an
overview of the fundamentals of Reinforcement Learning. Next we summarize the most important
aspects of Evolutionary Game Theory. Finally, we discuss the state-of-the-art of Multi-Agent
Reinforcement Learning and the mathematical connection with Evolutionary Game Theory.
1 Introduction
In this paper we describe the basics of Reinforcement Learning and Evolutionary Game Theory,
applied to the field of Multi-Agent Systems. The uncertainty inherent to the Multi-Agent
environment implies that an agent needs to learn from, and adapt to, this environment to be
successful. Indeed, it is impossible to foresee all situations an agent can encounter beforehand.
Therefore, learning and adaptiveness become crucial for the successful application of Multi-agent
systems to contemporary technological challenges such as routing in telecommunication networks, e-commerce,
RoboCup, etc. Reinforcement Learning (RL) is an established and profound theoretical
framework for learning in stand-alone or single-agent systems. Yet, extending RL to multi-agent
systems (MAS) does not guarantee the same theoretical grounding. As long as the environment
an agent experiences is Markov², and the agent can experiment enough, RL guarantees
convergence to the optimal strategy. In a MAS, however, the reinforcement an agent receives
may depend on the actions taken by the other agents present in the system. Hence, the Markov
property no longer holds, and as such the guarantees of convergence are lost.
In the light of the above problem it is important to fully understand the dynamics of
reinforcement learning and the effect of exploration in MAS. For this aim we review Evolutionary
Game Theory (EGT) as a solid basis for understanding learning and constructing new learning
algorithms. The Replicator Equations will appear to be an interesting model to study learning
in various settings. This model consists of a system of differential equations describing how a
population (or a probability distribution) of strategies evolves over time, and plays a central role
in biological and economical models.
In Section 2 we summarize the fundamentals of Reinforcement Learning. More precisely, we
discuss policy and value iteration methods, RL as a stochastic approximation technique and some
convergence issues. We also discuss distributed RL in this section. Next we discuss the basic concepts
of traditional and evolutionary game theory in Section 3. We provide definitions and examples of
the most basic concepts such as Nash equilibrium, Pareto optimality, Evolutionary Stable Strategies
and the Replicator Equations. We also discuss the relationship between EGT and RL. Section 4
is dedicated to Multi-Agent Reinforcement Learning. We discuss some possible approaches, their
advantages and limitations. More precisely, we will describe the joint action space approach,
independent learners, informed agents and an EGT approach. Finally, we conclude in Section 5.

¹ Note that as of October 1st 2005, the first author will move to the University of Maastricht, Institute
for Knowledge and Agent Technology (IKAT), The Netherlands. His corresponding address will change
to [email protected]
² The Markov property states that only the present state is relevant for the future behavior of the learning
process. Knowledge of the history of the process does not add any new information.
2 Reinforcement Learning

Reinforcement learning (RL) finds its roots in animal learning. It is well known that we can
teach an animal to respond in a desired way by rewarding and punishing it appropriately. For
example we can train a dog to detect drugs in people’s luggage at customs by rewarding it each
time it responds correctly and punishing it otherwise. Based on this external feedback signal the
dog adapts to the desired behavior. More generally, the objective of a reinforcement learner is to
discover a policy, i.e. a mapping from situations to actions, that maximizes the reinforcement
it receives. The reinforcement is a scalar value which is usually negative to express a punishment,
and positive to indicate a reward. Unlike supervised learning techniques, reinforcement learning
methods do not assume the presence of a teacher who is able to judge the action taken in a
particular situation. Instead the learner finds out what the best actions are by trying them out
and by evaluating the consequences of the actions by itself. For many problems the consequences
of an action do not become apparent immediately after performing the action, but only after a
number of other actions have been taken. In other words, the selected action may not only affect
the immediate reward/punishment the learner receives, but also the reinforcement it might get in
subsequent situations, i.e. the delayed rewards and punishments. Originally, RL was considered
to be single-agent learning. All events the agent has no control over are considered to be part
of the environment. In this section we consider the single-agent setting; in Section 4 we discuss
different approaches to multi-agent RL.
which are called the transition probabilities. Now, given any state s and action a, together
with any next state s′, the expected value of the next reward is

R^a_{ss′} = E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′}   (2)
It is important to note here that we assume the Markov property holds. This allows us
to determine the optimal action based on the observation of the current state only. Below we
introduce the two approaches in DP, policy iteration and value iteration, and introduce their RL
counterparts.
It is well known that the values V^{π*}(s), with π* the optimal policy, are the solutions of the Bellman
optimality equation given below:

V*(s) = max_a Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V*(s′)]   (5)
To locally improve the policy in a given state s, the best action a is looked for based on the
current state values Vk (s). So π is improved in state s, by updating π(s) into the action that
maximises the right hand side of equation 7, yielding an updated policy π. The policy iteration
algorithm is given below:
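For concreteness, the following is a minimal Python sketch of the policy iteration scheme just described (an illustration rather than the original listing), assuming a tabular transition model P[a][s][s'] and reward model R[a][s][s'] are available:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Minimal policy iteration sketch for a tabular MDP.

    P[a][s][s'] : transition probabilities, R[a][s][s'] : expected rewards.
    (The array layout is an assumption made for this illustration.)
    """
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)                      # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V as a linear system
        P_pi = P[policy, np.arange(n_states), :]                # (S, S)
        R_pi = (P_pi * R[policy, np.arange(n_states), :]).sum(axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy action with respect to the current state values
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)    # (A, S)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V                                     # policy is stable: optimal
        policy = new_policy
```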
[Figure: the evaluation unit / action unit architecture. The process delivers its state and an external reinforcement r; the evaluation unit converts r into an internal reinforcement signal, which drives the action unit that maps the process' state to an action a.]
Here ζ is a positive constant determining the rate of change. This updating rule is the so-called
temporal difference, TD(0), method of [Sut88]. As stated above, the goal of the evaluation unit
is to transform the environmental reinforcement signal r into a more informative internal signal
r̂. To generate the internal reinforcement, differences in the predictions between two successive
states are used. If the process moves from a state with a prediction of lower reinforcement into a
state with a prediction of higher reinforcement, the internal reinforcement signal will reward the
action that caused the move. In [Bar83] it is proposed to use the following internal reinforcement
signal:

r̂(t) = r(t) + γ V(s_{t+1}) − V(s_t)   (9)
Given the current state, the action unit produces the action that will be applied next. Many
different approaches exist for implementing this action unit. If the action unit contains a mapping
from states to actions, the action that will be applied to the system can be generated by a two
step process. In the first step the most promising action is generated, i.e. the action to which
the state is mapped. This action is then modified by means of a stochastic modifier S. This second
step is necessary to allow exploration of alternative actions. Actions that are ”close” to the action
that was generated in the first step, are more likely to be the outcome of this second step. This
approach is often used if the action set is a continuum. If an action that was applied to the system
turned out to be better than expected, i.e. the internal reinforcement signal is positive, then the
mapping will be ”shifted” towards this action. If the action set is discrete, a table representation
can be used. Then the table maps states to probabilities of actions to be selected for a particular
state, and the probabilities are updated directly. Below we discuss a simple mechanism to update
these probabilities.
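To illustrate the interplay of the evaluation unit and the action unit described above, here is a small Python sketch (our own illustration, not the implementation of [Bar83]); the learning rates zeta and eta and the exact probability update are illustrative choices:

```python
import numpy as np

class EvaluationActionSketch:
    """Evaluation unit (TD(0) critic) plus a tabular action unit."""

    def __init__(self, n_states, n_actions, gamma=0.95, zeta=0.1, eta=0.05):
        self.V = np.zeros(n_states)                       # state-value predictions
        self.probs = np.full((n_states, n_actions), 1.0 / n_actions)
        self.gamma, self.zeta, self.eta = gamma, zeta, eta

    def act(self, s):
        # Stochastic action selection allows exploration of alternative actions.
        return np.random.choice(len(self.probs[s]), p=self.probs[s])

    def update(self, s, a, r, s_next):
        # Internal reinforcement, equation (9): r_hat = r + gamma*V(s') - V(s)
        r_hat = r + self.gamma * self.V[s_next] - self.V[s]
        # Evaluation unit: TD(0) update of the prediction for state s
        self.V[s] += self.zeta * r_hat
        # Action unit: shift probability mass toward the action if r_hat is positive
        self.probs[s, a] += self.eta * r_hat * (1.0 - self.probs[s, a])
        self.probs[s] = np.clip(self.probs[s], 1e-6, None)
        self.probs[s] /= self.probs[s].sum()              # renormalize to a distribution
        return r_hat
```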
During the learning process these probabilities are updated based on responses from the environment. We
consider LA to be a method for solving RL problems in a policy iteration fashion.
The term Learning Automaton was introduced in the work of Narendra
and Thathachar in 1974 [Nar74]. Since then there has been a lot of development in the field
and a number of survey papers and books on this topic have been published, to cite a few:
[Tha02, Nar89, Nar74].
In Figure 2 a Learning Automaton is illustrated in its most general form.
The automaton tries to determine an optimal action out of a set of possible actions to perform.
Let us now first zoom in on the environment part of Figure 2. This part is illustrated in Figure
3. The environment responds to the input action α by producing an output β. The output also
belongs to a set of possible outcomes, i.e. {0, 1}, which is probabilistically related to the set of
inputs through the environment vector c.
The environment is represented by a triple {α, c, β}, where α represents a finite action set, β
represents the response set of the environment, and c is a vector of penalty probabilities, where
each component ci corresponds to an action αi .
The response β from the environment can take on 2 values β1 or β2 . Often they are chosen
to be 0 and 1, where 1 is associated with a penalty response (a failure) and 0 with a reward (a
success).
Now, the penalty probabilities c_i can be defined as

c_i = P{β(t) = 1 | α(t) = α_i},   i = 1, . . . , r

Consequently, c_i is the probability that action α_i will result in a penalty response. If these
probabilities are constant, the environment is called stationary.
Several models are recognized by the response set of the environment. Models in which the
response β can only take 2 values are called P-models. Models which allow a finite number of
values in the fixed interval [0, 1] are called Q-models. When β is a continuous random variable in
the fixed interval [0, 1], the model is called S-model.
Having considered the environment of the LA model of Figure 2, we now zoom in on the
automaton itself. More precisely, Figure 4 illustrates this.
The automaton is represented by a set of states φ = {φ1 , ..., φs }. As opposed to the environ-
ment, β becomes the input and α the output. This implicitly defines a function F : φ × β → φ,
mapping the current state and input into the next state, and a function H : φ × β → α, mapping
the current state and current input into the current output. In this text we will use p as the
probability vector over the possible actions of the automaton which corresponds to the function
H.
Summarizing, this brings us to the definition of a Learning Automaton. More precisely, it is
defined by a quintuple {α, β, F, p, T} for which α is the action or output set {α_1, α_2, . . . , α_r} of
the automaton, β is a random variable in the interval [0, 1], F is the state transition function,
p is the action probability vector of the automaton or agent and T denotes an update scheme.
The output α of the automaton is actually the input to the environment. The input β of the
automaton is the output of the environment, which is modeled through the penalty probabilities c_i
with c_i = P[β | α_i], i = 1 . . . r over the actions.
The automaton can be either stochastic or deterministic: the former's output function H is
composed of probabilities based on the environment's response, whilst the latter has a fixed
mapping between the internal state and the action to be performed.
Further sub-division of classification occurs when considering the transition or updating
function F which determines the next state of the automaton given its current state and the
response from the environment. If this is fixed then the automaton is a fixed structure deterministic
or a fixed structure stochastic automaton.
However if the updating function is variable, allowing for the transition function to be modified
so that choosing the operations or actions changes after each iteration, then the automaton is
a variable structure deterministic or a variable structure stochastic automaton. In this paper we
are mainly concerned with the variable structure stochastic automata, which have the potential
of greater flexibility and therefore performance. Such an automaton A at timestep t is defined as:
A(t) = {α, β, p, T (α, β, p)}
where we have an action set α with r actions, an environment response set β and a probability
set p containing r probabilities, each being the probability of performing every action possible
in the current internal automaton state. The function T is the reinforcement algorithm which
modifies the action probability vector p with respect to the performed action and the received
response. The new probability vector can therefore be written as:
p(t + 1) = T {α, β, p(t)}
with t the timestep.
Next we summarize the different update schemes.
The most important update schemes are linear reward-penalty, linear reward-inaction and
linear reward-ε-penalty. The philosophy of those schemes is essentially to increase the probability
of an action when it results in a success and to decrease it when the response is a failure. The
general update algorithm is given by:

p_i(t + 1) ← p_i(t) + a(1 − β(t))(1 − p_i(t)) − b β(t) p_i(t)   (11)

if α_i is the action taken at time t, and

p_j(t + 1) ← p_j(t) − a(1 − β(t)) p_j(t) + b β(t)[(r − 1)^{-1} − p_j(t)]   (12)

if α_j ≠ α_i.
The constants a and b in ]0, 1[ are the reward and penalty parameters respectively. When a = b
the algorithm is referred to as linear reward-penalty (L_{R−P}), when b = 0 it is referred to as linear
reward-inaction (L_{R−I}), and when b is taken much smaller than a it is called linear reward-ε-penalty (L_{R−εP}).
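To make the update scheme concrete, here is a small Python sketch of a variable structure learning automaton with the linear update of equations (11) and (12); setting b = 0 yields the reward-inaction scheme (the parameter values are illustrative):

```python
import numpy as np

class LinearLearningAutomaton:
    """Variable-structure LA with the linear reward-penalty update (11)-(12)."""

    def __init__(self, n_actions, a=0.1, b=0.0):
        self.p = np.full(n_actions, 1.0 / n_actions)   # action probability vector
        self.a, self.b, self.r = a, b, n_actions       # b = 0 -> reward-inaction

    def choose(self):
        return np.random.choice(self.r, p=self.p)

    def update(self, i, beta):
        """i: index of the action taken, beta: response (0 = reward, 1 = penalty)."""
        a, b, r, p = self.a, self.b, self.r, self.p
        for j in range(r):
            if j == i:   # equation (11)
                p[j] += a * (1 - beta) * (1 - p[j]) - b * beta * p[j]
            else:        # equation (12)
                p[j] += -a * (1 - beta) * p[j] + b * beta * (1.0 / (r - 1) - p[j])
        p /= p.sum()     # guard against rounding drift
```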
Since the mapping from V_k to V_{k+1} is a contraction mapping, the V_k(s) values converge in the limit to
the optimal values V*(s). In practice the updating is stopped when the changes become very
small and the corresponding optimal policy no longer changes. Since value iteration in
DP assumes a transition model of the system is available, the optimal policy π* can be obtained
using the equation below:

π*(s) = argmax_a Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V(s′)]   (15)
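A minimal Python sketch of value iteration as described above (illustrative, assuming a tabular model with arrays P[a][s][s'] and R[a][s][s']):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration sketch: P[a][s][s'] transition probs, R[a][s][s'] rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup, equation (5)
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)   # shape (A, S)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:                    # stop when changes become small
            break
        V = V_new
    policy = Q.argmax(axis=0)                                   # greedy policy, equation (15)
    return V, policy
```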
The Q*(s, a) are equal to the expected return of taking action a in state s, and from then on
behaving according to the optimal policy π*, i.e.

Q*(s, a) = Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V*(s′)]   (17)
In the same way as in value iteration of DP, the Q-values are iteratively updated. But since in
RL we don't have a model of the environment, we know neither the P^a_{ss′} nor the R^a_{ss′}; therefore
stochastic approximation is used, yielding the well known Q-learning updating rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t) ]
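A minimal tabular Q-learning sketch of the update rule above (illustrative; the ε-greedy exploration and the gym-like env interface with reset() and step() are assumptions, not part of the original text):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning sketch with epsilon-greedy exploration (assumed scheme)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False           # assumed gym-like interface
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)      # assumed to return (state, reward, done)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
```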
Not all RL techniques come with a proof of convergence. Especially the policy iteration approaches
often lack such a proof. The Learning Automata, introduced above as an example of one stage
policy iteration, do have a proof of convergence. A single LA that uses the reward-inaction
updating scheme is guaranteed to converge [Nar89], the same is true for a set of independent LA
(see Section 4).
The value iteration approach, Q-learning, has also been proved to converge if applied in a Markovian
environment, provided some very reasonable assumptions hold, such as appropriate settings
for α (see Section 2.2.2). The Markov property is really crucial: as soon as it no longer
holds, the guarantee of convergence is lost. This however does not mean that RL cannot be
applied in non-Markovian environments, but care has to be taken.
Q_{k+1}(s, a) = Q_k(s, a) + α_{(s,a)}(k) [ (F Q_k)(s, a) − Q_k(s, a) + W_k(s, a) ]   (20)

with (F Q_k)(s, a) = E[ r(s, a) + γ Σ_{s′} P^a_{ss′} max_{a′} Q_k(s′, a′) ], where Q_k is the vector containing all
Q_k(s, a) values. α_{(s,a)}(k) = 0 if Q_{k+1}(s, a) is not updated at time step k + 1, otherwise α_{(s,a)}(k) ∈
]0, 1] obeying the restrictions stated above. And

W_k(s, a) = ( r_k(s, a) + γ max_{a′} Q_k(s′, a′) ) − E[ r(s, a) + γ Σ_{s′} P^a_{ss′} max_{a′} Q_k(s′, a′) ]   (21)
2.3 Distributed RL
Since the proof of Tsitsiklis allows Q-values to be updated asynchronously and based on
outdated information, it is rather straightforward to come up with a parallel or distributed
version of Q-learning.
Assume we subdivide the state space into different regions. In each region an agent gets the
responsibility of updating the corresponding Q-values by exploring its region. Agents explore
their own region and make updates, in their own copy of the table of Q-values, to the Q-values
that belong to their region. As long as they make transitions within their own region, they can
apply the usual updating rule. If, however, they make a transition to another region, with Q-values
that are the responsibility of another agent, they should not directly communicate with that
other agent to ask for the particular Q-value, but use the information they have in their own
copy of the table, i.e. use outdated information, and steer the exploration so as to get back to their
own region. Since outdated information needs to be refreshed from time to time, the agents should
communicate periodically and distribute the Q-values for which they are responsible.
Since this approach does not endanger the Markov property, the proof of Tsitsiklis
can be applied, and convergence is still assured. While this approach can be considered as a MAS,
we prefer to refer to it as a distributed or parallel version of Q-learning. The approach has been
successfully applied to the problem of Call Admission Control in telecommunications [Ste97]³.
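A rough Python sketch of this idea (illustrative only; the region assignment, synchronization scheme and interfaces are our own assumptions and not details from [Ste97]):

```python
import numpy as np

class RegionAgent:
    """One agent responsible for the Q-values of its own region of the state space."""

    def __init__(self, region, n_states, n_actions, alpha=0.1, gamma=0.95):
        self.region = set(region)                  # states this agent is responsible for
        self.Q = np.zeros((n_states, n_actions))   # local copy of the full Q-table
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, a, r, s_next):
        # s lies in our region; Q[s_next] may be outdated if s_next belongs elsewhere.
        target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])

    def sync(self, agents):
        # Periodic communication: pull fresh Q-values for states owned by other agents.
        for other in agents:
            if other is not self:
                for s in other.region:
                    self.Q[s] = other.Q[s]
```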
choices and the decisions of other agents. Different economic situations lead to different rational
strategies for the players involved.
When John Nash made his contributions to the theory of games at Princeton, in the late 1940's and early
1950's, the impact was enormous. The impact of the developments in Game Theory expressed
itself especially in the field of economics, where its concepts played an important role in for
instance the study of international trade, bargaining, the economics of information and the
organization of corporations. But also in other disciplines such as social and natural sciences
the importance of Game Theory became clear, examples are: studies of legislative institutions,
voting behavior, warfare, international conflicts, and evolutionary biology.
However, von Neumann and Morgenstern had only managed to define an equilibrium concept
for 2-person zero-sum games. Zero-sum games correspond to situations of pure competition,
whatever one player wins, must be lost by another. John Nash addressed the case of competition
with mutual gain by defining best-reply functions and using Kakutani's fixed-point theorem⁵. The
main results of his work were the development of the Nash Equilibrium and the Nash Bargaining
Solution concept.
Despite the great usefulness of the Nash equilibrium concept, the assumptions traditional
game theory makes, like hyper-rational players that correctly anticipate the other players in
an equilibrium, made game theory stagnate for quite some time [Wei96, Gin00, Sam97]. A lot
of refinements of Nash equilibria came along (for instance trembling hand perfection), which
made it hard to choose the appropriate equilibrium in a particular situation. Almost any Nash
equilibrium could be justified in terms of some particular refinement. This made clear that the
static Nash concept did not reflect the (dynamic) real world where people do not make decisions
under hyper-rationality assumptions.
This is where evolutionary game theory originated. More precisely, John Maynard Smith
adopted the idea of evolution from biology [May73, May82]. He applied Game Theory (GT) to
Biology, which made him relax some of the premises of GT. Under these biological circumstances,
it becomes impossible to judge what choices are the most rational ones. The question now becomes
how a player can learn to optimize its behavior and maximize its return. This learning process is
the core of evolution in Biology.
These new ideas led Maynard Smith and Price to the concept of Evolutionary Stable Strategies (ESS),
a special case of the Nash condition. In contrast to GT, EGT is descriptive and starts from
more realistic views of the game and its players. Here the game is no longer played exactly once
by rational players who know all the details of the game, such as each others preferences over
outcomes. Instead EGT assumes that the game is played repeatedly by players randomly drawn
from large populations, uninformed of the preferences of the opponent players.
Evolutionary Game Theory offers a solid basis for rational decision making in an uncertain
world; it describes how individuals make decisions and interact in complex environments in the
real world. Modeling learning agents in the context of Multi-agent Systems requires insight into
the type and form of interactions with the environment and other agents in the system. Usually,
these agents are modeled similar to the different players in a standard game theoretical model.
In other words, these agents are assumed to have complete knowledge of the environment, the ability to
correctly anticipate the opposing player (hyper-rationality) and the knowledge that the optimal strategy in
the environment is always the same (static Nash equilibrium). The intuition that in the real world
people are not completely knowledgeable and hyper-rational players and that an equilibrium can
change dynamically led to the development of evolutionary game theory.
Before introducing the most elementary concepts from (Evolutionary) Game Theory we
summarize some well known examples of strategic interaction in the next section.
⁵ Kakutani's fixed-point theorem goes as follows. Consider X a nonempty set and F a point-to-set map from
X to subsets of X. Now, if F is continuous, X is compact and convex, and for each x in X, F(x) is
nonempty and convex, then F has a fixed point. Applying this theorem (and thus checking its conditions) to
the best-response function proves the existence of a Nash equilibrium.
A =
      D   C
  D   1   5
  C   0   3

Table 1 Matrix (A) defines the payoff for the row player for the Prisoner's dilemma. Strategy D is
Defect and strategy C is Cooperate.
B =
      D   C
  D   1   0
  C   5   3

Table 2 Matrix (B) defines the payoff for the column player for the Prisoner's dilemma. Strategy D is
Defect and strategy C is Cooperate.
A =
      F   O
  F   2   0
  O   0   1

Table 3 Matrix (A) defines the payoff for the row player for the Battle of the sexes. Strategy F is
choosing Football and strategy O is choosing the Opera.
B =
      F   O
  F   1   0
  O   0   2

Table 4 Matrix (B) defines the payoff for the column player for Battle of the sexes. Strategy F is
choosing Football and strategy O is choosing the Opera.
In this game two children each hold a penny and independently choose which side of the coin
to show (Heads or Tails). The first child wins if both coins show the same side, otherwise child
2 wins. This is an example of a zero-sum game as can be seen from the payoff Tables 5 and 6.
Whatever is lost by one player, must be won by the other player.
A =
      H    T
  H   1   -1
  T  -1    1

Table 5 Matrix (A) defines the payoff for the row player for the matching pennies game. Strategy H is
playing Head and strategy T is playing Tail.
B =
      H    T
  H  -1    1
  T   1   -1

Table 6 Matrix (B) defines the payoff for the column player for the matching pennies game. Strategy
H is playing Head and strategy T is playing Tail.
Defining a game more formally, we restrict ourselves to the 2-player 2-action game. Nevertheless,
an extension to n-player n-action games is straightforward, but examples in the n-player case
do not show the same illustrative strength as in the 2-player case. A game G = (S1, S2, P1, P2)
is defined by the payoff functions P1, P2 and the strategy sets S1 for the first player and S2
for the second player. In the 2-player 2-strategy case, the payoff functions P1 : S1 × S2 → ℝ and
P2 : S1 × S2 → ℝ are defined by the payoff matrices, A for the first player and B for the second
player, see Table 7. The payoff tables A, B define the instantaneous rewards. Element aij is the
reward the row-player (player 1) receives for choosing pure strategy si from set S1 when the
column-player (player 2) chooses the pure strategy sj from set S2 . Element bij is the reward for
the column-player for choosing the pure strategy sj from set S2 when the row-player chooses pure
strategy si from set S1 .
The family of 2 × 2 games is usually classified in three subclasses, as follows [Red01],
A =  a11  a12        B =  b11  b12
     a21  a22             b21  b22

Table 7 The left matrix (A) defines the payoff for the row player, the right matrix (B) defines the payoff
for the column player.
Subclass 1: if (a11 − a21 )(a12 − a22 ) > 0 or (b11 − b12 )(b21 − b22 ) > 0, at least one of the 2
players has a dominant strategy, therefore there is just 1 strict equilibrium.
Subclass 2: if (a11 − a21 )(a12 − a22 ) < 0,(b11 − b12 )(b21 − b22 ) < 0, and (a11 − a21 )(b11 − b12 ) >
0, there are 2 pure equilibria and 1 mixed equilibrium.
Subclass 3: if (a11 − a21 )(a12 − a22 ) < 0,(b11 − b12 )(b21 − b22 ) < 0, and (a11 − a21 )(b11 − b12 ) <
0, there is just 1 mixed equilibrium.
The first subclass includes those type of games where each player has a dominant strategy6 , as for
instance the prisoner’s dilemma. However it includes a larger collection of games since only one
of the players needs to have a dominant strategy. In the second subclass neither of the players has a
dominant strategy (e.g. battle of the sexes). But both players receive the highest payoff by both
playing their first or second strategy. This is expressed in the condition (a11 − a21 )(b11 − b12 ) > 0.
The third subclass only differs from the second in the fact that the players do not receive their
highest payoff by both playing the first or the second strategy (e.g. matching pennies game). This
is expressed by the condition (a11 − a21 )(b11 − b12 ) < 0.
Formally, a Nash equilibrium is defined as follows. When 2 players play the strategy profile s =
(si , sj ) belonging to the product set S1 × S2 then s is a Nash equilibrium if P1 (si , sj ) ≥ P1 (sx , sj )
∀x ∈ {1, ..., n} and P2 (si , sj ) ≥ P2 (si , sx ) ∀x ∈ {1, ..., m} 7 .
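To make the classification and the Nash condition concrete, here is a small Python sketch (illustrative, not from the paper) that classifies a 2 × 2 game into the three subclasses above and lists its pure-strategy Nash equilibria:

```python
def classify_2x2(A, B):
    """Classify a 2x2 game (payoff matrices A for row, B for column) per [Red01]."""
    ra = (A[0][0] - A[1][0]) * (A[0][1] - A[1][1])     # (a11-a21)(a12-a22)
    rb = (B[0][0] - B[0][1]) * (B[1][0] - B[1][1])     # (b11-b12)(b21-b22)
    if ra > 0 or rb > 0:
        return "subclass 1: at least one dominant strategy, one strict equilibrium"
    cross = (A[0][0] - A[1][0]) * (B[0][0] - B[0][1])  # (a11-a21)(b11-b12)
    if cross > 0:
        return "subclass 2: two pure equilibria and one mixed equilibrium"
    return "subclass 3: one mixed equilibrium"

def pure_nash_equilibria(A, B):
    """Pure Nash equilibria: each strategy is a best reply to the other."""
    eqs = []
    for i in range(2):
        for j in range(2):
            row_ok = A[i][j] >= max(A[k][j] for k in range(2))
            col_ok = B[i][j] >= max(B[i][k] for k in range(2))
            if row_ok and col_ok:
                eqs.append((i, j))
    return eqs

# Prisoner's dilemma (Tables 1 and 2): defect/defect is the unique pure equilibrium.
A = [[1, 5], [0, 3]]
B = [[1, 0], [5, 3]]
print(classify_2x2(A, B))          # subclass 1
print(pure_nash_equilibria(A, B))  # [(0, 0)] i.e. (Defect, Defect)
```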
small number of the total population. If the reproductive success of the new strategy is smaller
than the original one, it will not overrule the original strategy and will eventually disappear. In
this case we say that the strategy is evolutionary stable against this new appearing strategy. More
generally, we say a strategy is an Evolutionary Stable strategy if it is robust against evolutionary
pressure from any appearing mutant strategy.
Formally an ESS is defined as follows. Suppose that a large population of agents is programmed
to play the (mixed) strategy s, and suppose that this population is invaded by a small number
of agents playing strategy s′. The population share of agents playing this mutant strategy is
ε ∈ ]0, 1[. When an individual plays the game against a randomly chosen agent, the chance that
it plays against a mutant is ε and against a non-mutant 1 − ε. The expected payoff for
the first player, being a non-mutant, is:

P(s, (1 − ε)s + εs′) = (1 − ε)P(s, s) + εP(s, s′)
{ESS} ⊂ {N E}
The condition for an ESS is more strict than the Nash condition. Intuitively this can be understood
as follows: as defined above a Nash equilibrium is a best reply against the strategies of the other
players. Now if a strategy s1 is an ESS then it is also a best reply against itself, and as such
optimal. If it was not optimal against itself there would have been a strategy s2 that would lead
to a higher payoff against s1 than s1 itself.
i.e.

P(s2, s2) = P(s1, s2)

If s2 does as well against itself as s1 does, then s2 earns at least as much against (1 − ε)s1 + εs2
as s1 does, and then s1 is no longer evolutionarily stable. To summarize, we now have the following 2
properties for an ESS s1:
1. P (s2 , s1 ) ≤ P (s1 , s1 ) ∀ s2
2. P (s2 , s1 ) = P (s1 , s1 ) =⇒ P (s2 , s2 ) < P (s1 , s2 ) ∀ s2 6= s1
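As a concrete illustration (our own sketch, not from the paper), the two ESS conditions can be checked mechanically for a pure strategy of a symmetric 2 × 2 game:

```python
def is_ess(A, i, eps=1e-9):
    """Check whether pure strategy i is an ESS of the symmetric game with payoff matrix A.

    A[x][y] is the payoff of playing x against y. Only pure mutants are tested here,
    which keeps the sketch short (a simplification; a full test also considers mixed mutants).
    """
    n = len(A)
    for j in range(n):
        if j == i:
            continue
        # Condition 1: i must be a best reply to itself.
        if A[j][i] > A[i][i] + eps:
            return False
        # Condition 2: if the mutant does equally well against i,
        # then i must do strictly better against the mutant than the mutant itself.
        if abs(A[j][i] - A[i][i]) <= eps and A[j][j] >= A[i][j] - eps:
            return False
    return True

# Prisoner's dilemma as a symmetric game (row player's matrix of Table 8):
A = [[1, 5], [0, 3]]        # strategies: 0 = Defect, 1 = Cooperate
print(is_ess(A, 0))         # True: Defect is evolutionarily stable
print(is_ess(A, 1))         # False: Cooperate can be invaded by Defect
```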
3.3.6 Examples
In this section we provide an example for each class of game described in Section 3.3.1 and
illustrate the Nash equilibrium concept and the Evolutionary Stable Strategy concept as well as
Pareto optimality.
For the first subclass we consider the prisoner's dilemma game. The strategic setup of this
game has been explained in Section 3.2. The payoffs of the game are repeated in Table 8. As one
can see, both players have one dominant strategy, more precisely defect.
A =  1  5        B =  1  0
     0  3             5  3

Table 8 Prisoner's dilemma: The left matrix (A) defines the payoff for the row player, the right one
(B) for the column player.
For both players, defecting is the dominant strategy and therefore always the best reply toward
any strategy of the opponent. So the Nash equilibrium in this game is for both players to defect.
Let's now determine whether this equilibrium is also an evolutionary stable strategy. Suppose
ε ∈ [0, 1] is the share of cooperators in the population. The expected payoff of a cooperator is
3ε + 0(1 − ε) and that of a defector is 5ε + 1(1 − ε). Since for all ε

5ε + (1 − ε) > 3ε,

defect is an ESS. So the number of defectors will always increase and the population will eventually
only consist of defectors. In Section 3.4 this dynamical process will be illustrated by the replicator
equations.
This equilibrium, which is both Nash and ESS, is not a Pareto optimal solution. This can be easily
seen if we look at the payoff tables. The combination (defect, defect) yields a payoff of (1, 1),
which is a smaller payoff for both players than the combination (cooperate, cooperate) which
yields a payoff of (3, 3). Moreover the combination (cooperate, cooperate) is a Pareto optimal
solution. However, if we apply the definition of Pareto optimality, then also (defect, cooperate)
and (cooperate, defect) are Pareto optimal. But both these Pareto optimal solutions do not
Pareto dominate the Nash equilibrium and therefore are not of interest to us. The combination
(cooperate, cooperate) is a Pareto optimal solution which Pareto dominates the Nash equilibrium.
For the second subclass we considered the battle of the sexes game [Gin00, Wei96]. In this
game there are 2 pure strategy Nash equilibria, i.e. (football, football) and (opera, opera), which
are both also evolutionary stable (as demonstrated in Section 3.4.4). There is also 1 mixed Nash
equilibrium, where the row player (the husband) plays football with 2/3 probability and
opera with 1/3 probability and the column player (the wife) plays opera with 2/3 probability
and football with 1/3 probability. However, this equilibrium is not an evolutionary stable one.
A =  2  0        B =  1  0
     0  1             0  2

Table 9 Battle of the sexes: The left matrix (A) defines the payoff for the row player, the right one (B)
for the column player.
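The mixed equilibrium quoted above follows from the standard indifference argument (a short verification, not part of the original text). If the husband plays football with probability p and the wife with probability q, each must make the other indifferent between the two actions:

p · 1 + (1 − p) · 0 = p · 0 + (1 − p) · 2  ⟹  p = 2/3
q · 2 + (1 − q) · 0 = q · 0 + (1 − q) · 1  ⟹  q = 1/3

so the husband plays football with probability 2/3 and the wife with probability 1/3 (opera with probability 2/3), as stated above.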
The third class consists of the games with a unique mixed equilibrium ((1/2, 1/2), (1/2, 1/2)).
For this category we used the game defined by the matrices in Table 10, i.e. matching pennies.
This equilibrium is not an evolutionary stable one. Typical for this class of games is that the
interior trajectories define closed orbits around the equilibrium point.
A =   1  −1        B =  −1   1
     −1   1              1  −1

Table 10 The left matrix (A) defines the payoff for the row player, the right one (B) for the column
player.
One way in which EGT proceeds is by constructing a dynamic process in which the proportions
of various strategies in a population evolve. Examining the expected value of this process gives
an approximation which is called the RD. An abstraction of an evolutionary process usually
combines two basic elements: selection and mutation. Selection favors some varieties over
others, while mutation provides variety in the population. The replicator dynamics highlight the
role of selection: they describe how systems consisting of different strategies change over time. They
are formalized as a system of differential equations. Each replicator (or genotype) represents one
(pure) strategy si. This strategy is inherited by all the offspring of the replicator. The general
form of a replicator dynamic is the following:

dx_i/dt = [(Ax)_i − x · Ax] x_i   (22)
In equation (22), xi represents the density of strategy si in the population, A is the payoff
matrix which describes the different payoff values each individual replicator receives when
interacting with other replicators in the population. The state of the population (x) can be
described as a probability vector x = (x1 , x2 , ..., xJ ) which expresses the different densities of all
the different types of replicators in the population. Hence (Ax)i is the payoff which replicator si
receives in a population with state x, and x · Ax describes the average payoff in the population.
The growth rate (dx_i/dt)/x_i of the population share using strategy si equals the difference between the
strategy's current payoff and the average payoff in the population. For further information we
refer the reader to [Wei96, Hof98].
This translates into the following replicator equations for the two populations:

dp_i/dt = [(Aq)_i − p · Aq] p_i   (23)

dq_i/dt = [(Bp)_i − q · Bp] q_i   (24)
As can be seen in equations (23) and (24), the growth rate of the types in each population is
now determined by the composition of the other population. Note that, when calculating the
rate of change using these systems of differential equations, two different payoff matrices (A and
B) are used for the two different players.
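As an illustration (not code from the paper), the two-population replicator equations (23) and (24) can be integrated with a simple Euler scheme; here for the prisoner's dilemma, with each payoff matrix written from the acting player's own point of view:

```python
import numpy as np

def replicator_step(p, q, A, B, dt=0.01):
    """One Euler step of the two-population replicator dynamics (23)-(24)."""
    dp = p * (A @ q - p @ A @ q)     # growth rate: own payoff minus population average
    dq = q * (B @ p - q @ B @ p)
    return p + dt * dp, q + dt * dq

A = np.array([[1.0, 5.0], [0.0, 3.0]])   # row player's payoffs (Table 8), strategy 0 = Defect
B = np.array([[1.0, 5.0], [0.0, 3.0]])   # column player's payoffs from its own point of view
                                          # (the transpose of matrix B in Table 8)
p = np.array([0.5, 0.5])                  # initial mixed strategies of both populations
q = np.array([0.5, 0.5])
for _ in range(5000):
    p, q = replicator_step(p, q, A, B)

print(p, q)   # both converge toward (1, 0): everyone defects, as the direction field shows
```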
where x is a mixed strategy in m-dimensional space (there are m pure strategies), and xi is
the probability with which strategy si is played. Calculating the RD for the unit vectors of this
space (putting all the weight on a particular pure strategy), yields zero. This is simply due to the
properties of the simplex ∆, where the sum of all population shares remains equal to 1 and no
population share can ever turn negative. So, if all pure strategies are present in the population
at any time, then they always have been and always will be present, and if a pure strategy is
absent from the population at any time, then it always has been and always will be absent8 . So,
this means that the pure strategies are rest points of the RD, but depending on the structure
of the game which is played these pure strategies do not need to be a Nash equilibrium. Hence
not every rest point of the RD is a Nash equilibrium. So the concept of dynamic equilibrium or
stationarity alone is not enough to have a better understanding of the RD.
For this reason the criterion of asymptotic stability was introduced, which provides a kind of local
test of dynamic robustness, local in the sense of small perturbations. For a formal definition
of asymptotic stability, we refer to [Hir74]. Here we give an intuitive definition. An equilibrium
is asymptotically stable if the following two conditions hold:
• Any solution path of the RD that starts sufficiently close to the equilibrium remains
arbitrarily close to it. This condition is called Liapunov stability.
• Any solution path that starts close enough to the equilibrium, converges to the equilibrium.
Now, if an equilibrium of the RD is asymptotically stable (i.e. robust to local perturbations)
then it is a Nash equilibrium. For a proof, the reader is referred to [Red01]. An interesting result
due to Hofbauer and Sigmund [Hof98] is the following: if s is an ESS, then the population state
x = s is asymptotically stable in the sense of the RD. For a proof see [Hof98, Red01]. So, by this
result we have a refinement of the asymptotically stable rest points of the RD and it
provides a way of selecting equilibria from the RD that show dynamic robustness.
⁸ Of course a solution orbit can evolve toward the boundary of the simplex as time goes to infinity, and
thus in the limit, when the distance to the boundary goes to zero, a pure strategy can disappear from
the population of strategies. For a more formal explanation, we refer the reader to [Wei96].
3.4.4 Examples
In this section we continue with the examples of Section 3.2 and the classification of games of
Section 3.3.1. We start with the Prisoner's Dilemma game (PD). In Figure 5 we plotted the
direction field of the replicator equations applied to the PD. A direction field is an elegant
tool to understand and illustrate a system of differential equations. The direction
fields presented here consist of a grid of arrows tangential to the solution curves of the system.
It is a graphical illustration of the vector field indicating the direction of the movement at every
point of the grid in the state space. Filling in the parameters for each game in equations 23 and
24 allowed us to plot this field.
Figure 5 The direction field of the RD of the prisoner’s dilemma using payoff Table 8.
The x-axis represents the probability with which the first player will play defect and the y-axis
represents the probability with which the second player will play defect. So the Nash equilibrium
and the ESS lie at coordinates (1, 1). As you can see from the field plot all the movement goes
toward this equilibrium.
Figure 6 illustrates the direction field diagram for the battle of the sexes game. As you may
recall from Section 3.3.6 this game has 2 pure Nash equilibria and 1 mixed Nash equilibrium.
These equilibria can be seen in the figure at coordinates (0, 0), (1, 1), (2/3, 1/3). The 2 pure
equilibria are ESS as well. This is also easy to verify from the plot, more precisely, any small
perturbation away from the equilibrium is led back to the equilibrium by the dynamics.
The mixed equilibrium, which is Nash, is not asymptotically stable, as is obvious
from the plot. From Section 3.3.6, we can now also conclude that this equilibrium is not
evolutionary stable either.
Figure 6 The direction field of the RD of the Battle of the sexes game using payoff Table 9.
Typical for the traditional game theoretic approach is to assume perfectly rational players (or
hyperrationality) who try to find the most rational strategy to play. These players have a perfect
knowledge of the environment and the payoff tables and they try to maximize their individual
payoff. These assumptions made by classical game theory just do not apply to the real world and
Multi-Agent settings in particular.
In contrast, EGT is descriptive and starts from more realistic views of the game and its
players. A game is not played only once, but repeatedly with changing opponents. Moreover,
the players are not completely informed, sometimes misinterpret each others’ actions, and
are not completely rational but also biologically and sociologically conditioned. Under these
circumstances, it becomes impossible to judge what choices are the most rational ones. The
question now becomes how a player can learn to optimize its behavior and maximize its return.
For this learning process, mathematical models are developed, e.g. replicator equations.
Summarizing the above we can say that EGT treats agents’ objectives as a matter of fact,
not logic, with a presumption that these objectives must be compatible with an appropriately
evolutionary dynamic [Gin00]. Evolutionary models do not predict self-interested behaviour; they
describe how agents can make decisions in complex environments, in which they interact with
other agents. In such complex environments software agents must be able to learn from their
environment and adapt to its non-stationarity.
The basic properties of a Multi-Agent System correspond exactly with those of EGT. First of
all, a MAS is made up of interactions between two or more agents, who each try to accomplish a
certain, possibly conflicting, goal. No agent has the guarantee of being completely informed about
the other agents' intentions or goals, nor has it the guarantee of being completely informed about
the complete state of the environment. Of great importance is that EGT offers us a solid basis to
understand dynamic iterative situations in the context of strategic games. A MAS has a typical
dynamical character, which makes it hard to model and brings along a lot of uncertainty. At this
stage EGT seems to offer us a helping hand in understanding these typical dynamical processes in
a MAS and modeling them in simple settings such as iterative games of two or more players.
the presence of other agents, who are possibly influencing the effects a single agent experiences,
can be completely ignored. Thus a single agent is learning as if the other agents are not around.
On the other hand, the presence of other agents can be modeled explicitly. This results in a
joint action space approach which recently received quite a lot of attention [Cla98, Hu99, Lit94].
period of time and then excluding actions from their private action space, so that the joint action
space gets considerably smaller and the agents are able to converge to a NE of the remaining
subgame. By repeatedly excluding actions, the agents are able to figure out the Pareto front and
decide on which combination of actions is preferable.
possible successful application of evolutionary game theoretic concepts and models in these
different fields becomes more and more apparent.
If the word evolution is used in the biological sense, then this means we are concerned with
environments in which behavior is genetically determined. Strategy selection then depends on the
reproductive success of their carriers, i.e. genes. Often, evolution is not intended to be understood
in a biological sense but rather as a learning process which we call cultural evolution [Bjo95].
Of course it is implicit and intuitive that there is an analogy between biological evolution and
learning. We can now look at this analogy at two different levels. First there is the individual
level. An individual decision maker usually has many ideas or strategies in his mind according to
which he can behave. Which one of these ideas dominates, and which ones are given less attention
depends on the experiences of the individual. We can regard such a set of ideas as a population of
possible behaviors. The changes which such a population undergoes in the individual’s mind can
be very similar to biological evolution. Secondly, it is possible that individual learning behavior
is different from biological evolution. An example is best response learning where individuals
adjust too rapidly to be similar to evolution. However, then it might be the case that at the
population level, consisting of different individuals, a process is operating analogous to biological
evolution. In this paper we describe the similarity between biological evolution and learning at
the individual level in a formal and experimental manner.
In this section we discuss, or merely point out, the results that make the relation between
Multi-Agent Reinforcement Learning and EGT explicit. Börgers and Sarin have shown how the
two fields are related in terms of dynamic behaviour, i.e. the relation between Cross learning
and the replicator dynamics. The replicator dynamics postulate gradual movement from worse
to better strategies. This is in contrast to classical Game Theory, which is a static theory and
does not prescribe the dynamics of adjustment to equilibrium. The main result of Börgers and
Sarin is that in an appropriately constructed continuous time limit, the Cross learning model
converges to the asymmetric, continuous time version of the replicator dynamics. The continuous
time limit is constructed in such a manner that each time interval sees many iterations of the
game, and that the adjustments that the players (or Cross learners) make between two iterations
of the game are very small. If the limit is constructed in this manner, the (stochastic) learning
process becomes in the limit deterministic. This limit process satisfies the system of differential
equations which characterizes the replicator dynamics. For more details see [Bör97]. We illustrate
this result with the prisoner's dilemma game. In Figure 7 we plotted the direction field of the
Replicator equations for the prisoner’s dilemma game and we also plotted the Cross learning
process for this same game.
Figure 7 Left: The direction field plot of the RD of the prisoner’s dilemma game. The x-axis represents
the probability with which the first player (or row player) plays defect, and the y-axis represents the
probability with which the second player (or column player) plays defect. The strong attractor and Nash
equilibrium of the game lies at coordinates (1, 1) as one can see in the plot. Right: The paths induced
by the Cross learning process of the prisoner’s dilemma game. The arrows point out the direction of the
learning process. These probabilities are now learned by the Cross learning algorithm.
For both players we plotted the probability of choosing their first strategy (in this case defect).
The x-axis represents the probability with which the row player plays defect, and the y-axis
represents the probability with which the column player plays this same strategy. As you can see,
the sample paths of the Cross learning process approximate the paths of the RD.
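Such sample paths can be reproduced by simulating Cross learning directly. The following Python sketch (our own illustration of the standard Cross update, with payoffs rescaled to [0, 1]) plays the prisoner's dilemma repeatedly:

```python
import numpy as np

A = np.array([[1.0, 5.0], [0.0, 3.0]]) / 5.0   # row player's payoffs scaled into [0, 1]
B = A.copy()                                    # the PD is symmetric, so the column player's
                                                # matrix from its own viewpoint equals A

def cross_update(p, i, reward):
    """Cross learning: shift probability toward action i, proportionally to the reward."""
    p = p - reward * p          # decrease all probabilities
    p[i] += reward              # and move the freed mass to the chosen action
    return p

rng = np.random.default_rng(0)
p = np.array([0.5, 0.5])        # row player: prob. of Defect, Cooperate
q = np.array([0.5, 0.5])        # column player
for _ in range(10000):
    i = rng.choice(2, p=p)
    j = rng.choice(2, p=q)
    p = cross_update(p, i, A[i, j])
    q = cross_update(q, j, B[j, i])

print(p, q)   # both drift toward (1, 0): Defect, mirroring the replicator field in Figure 7
```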
In previous work the authors have extended the results of Börgers and Sarin to popular
Reinforcement Learning (RL) models such as Learning Automata (LA) and Q-learning. In [Tuy02]
the authors have shown that the Cross learning model is a Learning Automaton with a linear
reward-inaction updating scheme. All details and experiments are available in [Tuy02].
Next, we continue with the mathematical relation between Q-learning and the Replicator
Dynamics. In [Tuy03b] we derived mathematically the dynamics of Boltzmann Q-learning.
We investigated here whether there is a possible relation with the evolutionary dynamics of
Evolutionary Game Theory. More precisely we constructed a continuous time limit of the Q-
learning model, where Q-values are interpreted as Boltzmann probabilities for the action selection,
in an analogous manner of Börgers and Sarin for Cross learning. We briefly summarize the findings
here. All details can be consulted in [Tuy03b]. The derivation has been restricted to a 2 player
situation for reasons of simplicity. Each agent (or player) has a probability vector over its action
set, more precisely x1, ..., xn over action set a1, ..., an for the first player and y1, ..., ym over
b1, ..., bm for the second player. Formally the Boltzmann distribution is described by

x_i(k) = e^{τ Q_{a_i}(k)} / Σ_{j=1}^{n} e^{τ Q_{a_j}(k)}

where x_i(k) is the probability of playing strategy i at time step k and τ is the temperature.
The temperature determines the degree of exploring different strategies. As the trade-off between
exploration-exploitation is very important in RL, it is important to set this parameter correctly.
Now suppose that we have payoff matrices A and B for the 2 players. Calculating the time limit
results in,
dx_i/dt = x_i α τ ((Ay)_i − x · Ay) + x_i α Σ_j x_j ln(x_j / x_i)   (25)

for the first player and analogously for the second player in

dy_i/dt = y_i α τ ((Bx)_i − y · Bx) + y_i α Σ_j y_j ln(y_j / y_i)   (26)
Comparing (25) or (26) with the RD in (22), we see that the first term of (25) or (26) is exactly
the RD and thus takes care of the selection mechanism, see [Wei96]. The mutation mechanism
for Q-learning is therefore left in the second term, which can be rewritten as:

x_i α [ Σ_j x_j ln(x_j) − ln(x_i) ]   (27)
In equation (27) we recognize 2 entropy terms, one over the entire probability distribution x,
and one over strategy xi . Relating entropy and mutation is not new. It is a well known fact
[Schneid00, Sta99] that mutation increases entropy. In [Sta99], it is stated that these concepts
parallel those of thermodynamics in the following sense: the selection mechanism is analogous
to energy and mutation to entropy. So generally speaking, mutations tend to increase entropy.
Exploration can be considered as the mutation concept, as both concepts take care of providing
variety.
Equations 25 and 26 now express the dynamics of both Q-learners in terms of Boltzmann
probabilities, from which the RD emerge.
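For illustration, equations (25) and (26) can be integrated numerically; the following sketch uses our own Euler discretization and parameter choices (α, τ, step size), with each payoff matrix again written from the acting player's own point of view:

```python
import numpy as np

def q_dynamics_step(x, y, A, B, alpha=0.01, tau=2.0, dt=0.1):
    """One Euler step of the Q-learning dynamics, equations (25)-(26)."""
    # Selection part: replicator-like term scaled by alpha * tau
    sel_x = x * alpha * tau * (A @ y - x @ A @ y)
    sel_y = y * alpha * tau * (B @ x - y @ B @ x)
    # Mutation part: entropy-like term of equation (27)
    mut_x = x * alpha * (np.sum(x * np.log(x)) - np.log(x))
    mut_y = y * alpha * (np.sum(y * np.log(y)) - np.log(y))
    x = x + dt * (sel_x + mut_x)
    y = y + dt * (sel_y + mut_y)
    return x / x.sum(), y / y.sum()     # renormalize against numerical drift

# Prisoner's dilemma, each matrix from the acting player's own perspective
A = np.array([[1.0, 5.0], [0.0, 3.0]])
B = np.array([[1.0, 5.0], [0.0, 3.0]])
x = np.array([0.5, 0.5])
y = np.array([0.5, 0.5])
for _ in range(5000):
    x, y = q_dynamics_step(x, y, A, B)
print(x, y)   # mostly defect, with some probability kept on cooperate by the mutation term
```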
In [Tuy03c] we answered the question whether it is possible to first define the dynamic behavior
in terms of Evolutionary Game Theory (EGT) and then develop the appropriate RL-algorithm.
For these reasons we extended the RD of EGT. We call it the Extended Replicator Dynamics.
All details on this work can be found in [Tuy03c]. The main result is that the extended dynamics
guarantee an evolutionary stable outcome in all types of 1-stage games.
Finally, in [Hoe04] the authors have shown how the EGT approach can be used for Dispersion
Games [Gre02]. In this cooperative game, n agents must learn to choose from k tasks using local
rewards and full utility is only achieved if each agent chooses a distinct task. We visualized the
learning process of the MAS and showed typical phenomena of distributed learning in a MAS.
Moreover, we showed how the derived fine tuning of parameter settings from the RD can support
application of the COllective INtelligence (COIN) framework of Wolpert et al. [Wol98, Wol99]
using dispersion games. COIN is a proven engineering approach for learning cooperative tasks
in MASs. Broadly speaking, COIN defines the conditions that an agent’s private utility function
has to meet to increase the probability that learning to optimize this function leads to increased
performance of the collective of agents. Thus, the challenge is to define suitable private utility
functions for the individual agents, given the performance of the collective. We showed that the
derived link between RD and RL predicts performance of the COIN framework and visualizes
the incentives provided in COIN toward cooperative behavior.
5 Final Remarks
In this survey paper we investigated Reinforcement Learning and Evolutionary Game Theory
in a Multi-Agent setting. We provided most of the fundamentals of RL and EGT and moreover
showed their remarkable similarities. We also discussed some of the excellent existing Multi-agent
RL algorithms today and gave a more detailed description of the Evolutionary Game Theoretic
approach of the authors. However, a lot of work still needs to be done and some problems remain
unresolved. In particular, overcoming the problems of incomplete information and large state spaces
in Multi-Agent Systems such as Sensor Webs is still hard. More precisely, under these
conditions it becomes impossible to learn models of other agents, to store information about them
and to use a lot of communication.
6 Acknowledgments
After writing this survey paper an acknowledgment is in order. We wish to thank our colleagues
of the Computational Modeling Lab of the Vrije Universiteit Brussel, Belgium. Most of all we
want to express our appreciation and gratitude to Katja Verbeeck and Maarten Peeters for their
important input and support in writing this article.
We also wish to express our gratitude to Dr. ir. Pieter-Jan 't Hoen of the CWI (Center for
Mathematics and Computer Science) in the Netherlands, especially for his input on the COIN
framework.
References
[And87]Anderson C.W., Strategy Learning with multilayer connectionist representations, Proceedings of
the 4th international conference on Machine Learning, pp. 103-114.
[Bar83]Barto A., Sutton R., and Anderson C., Neuronlike adaptive elements that can solve difficult
learning control problems, IEEE Transactions on Systems, Man and Cybernetics, Vol. 13, No. 5,pp.
834-846.
[Baz03]Bazzan A. L. C., Klugl Franziska, Learning to Behave Socially and Avoid the Braess Paradox In a
Commuting Scenario. Proceedings of the first international workshop on Evolutionary Game Theory
for Learning in MAS,july 14 2003, Melbourne Australia.
[Baz97]Bazzan A. L. C., A game-theoretic approach to coordination of traffic signal agents. PhD thesis,
Univ. of Karlsruhe, 1997.
[Bel62]Bellman R.E., and Dreyfuss S.E., Applied Dynamical Programming, Princeton University press.
[Ber76]Bertsekas, D.P., Dynamic Programming and Stochastic Control. Mathematics in Science and
Engineering, Vol. 125, Academic Press, 1976.
[Bjo95]Bjornerstedt J., and Weibull, J. Nash Equilibrium and Evolution by Imitation. The Rational
Foundations of Economic Behavior, (K. Arrow et al, ed.), Macmillan, 1995.
[Bör97]Börgers, T., Sarin, R., Learning through Reinforcement and Replicator Dynamics. Journal of
Economic Theory, Volume 77, Number 1, November 1997.
[Bus55]Bush, R. R., Mosteller, F., Stochastic Models for Learning, Wiley, New York, 1955.
[Cas94]Cassandra, A. R., Kaelbling, L. P., and Littman , M. L., Acting optimally in partially observable
stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence,
Seattle, WA, 1994.
[Cla98]Claus, C., Boutilier, C., The Dynamics of Reinforcement Learning in Cooperative Multi-Agent
Systems, Proceedings of the 15th international conference on artificial intelligence, p.746-752, 1998.
[Gin00]Gintis, C.M., Game Theory Evolving. University Press, Princeton, June 2000.
[Gre02]T. Grenager, R. Powers, and Y. Shoham. Dispersion games: general definitions and some specific
learning results. In AAAI 2002, 2002.
[Hir74]Hirsch, M.W., and Smale, S., Differential Equations, Dynamical Systems and Linear Algebra.
Academic Press, Inc, 1974.
[Hoe04]’t Hoen, P.J., Tuyls, K., Engineering Multi-Agent Reinforcement Learning using Evolutionary
Dynamics. Proceedings of the 15th European Conference on Machine Learning (ECML’04), LNAI
Volume 3201, 20-24 september 2004, Pisa, Italy.
[Hof98]Hofbauer, J., Sigmund, K., Evolutionary Games and Population Dynamics. Cambridge University
Press, November 1998.
[Hu99]Hu, J., Wellman, M.P., Multiagent reinforcement learning in stochastic games. Cambridge
University Press,November 1999.
[Jaf01]Jafari, C., Greenwald, A., Gondek, D. and Ercal, G., On no-regret learning, fictitious play, and
nash equilibrium. Proceedings of the Eighteenth International Conference on Machine Learning,p 223
- 226, 2001.
[Kae96]Kaelbling, L.P., Littman, M.L., Moore, A.W., Reinforcement Learning: A Survey. Journal of
Artificial Intelligence Research, 1996.
[Kap02]Kapetanakis S. and Kudenko D., Reinforcement Learning of Coordination in Cooperative Multi-
agent Systems, AAAI 2002.
[Kap04]Kapetanakis S., Independent Learning of Coordination in Cooperative Single-stage Games, PhD
dissertation, University of York, 2004.
[Kon04]Ville Kononen, Multiagent Reinforcement Learning in Markov Games: Asymmetric and Symmet-
ric approaches, PhD dissertation, Helsinki University of Technology, 2004.
[Lau00]Lauer M. and Riedmiller M., An algorithm for distributed reinforcement learning in cooperative
multi-agent systems, Proceedings of the seventeenth International Conference on Machine Learning,
2000.
[Lit94]Littman, M.L., Markov games as a framework for multi-agent reinforcement learning. Proceedings
of the Eleventh International Conference on Machine Learning, p 157 - 163, 1994.
[Loc98]Loch, J., Singh, S., Using eligibility traces to find the best memoryless policy in a partially
observable markov process. Proceedings of the fifteenth International Conference on Machine Learning,
San Francisco, 1998.
[May82]Maynard-Smith, J., Evolution and the Theory of Games. Cambridge University Press, December
1982.
[May73]Maynard Smith, J., Price, G.R., The logic of animal conflict. Nature, 146: 15-18, 1973.
[Muk01]R.Mukherjee and S.Sen, Towards a Pareto Optimal Solution in general-sum games, Working
Notes of Fifth Conference on Autonomous Agents, 2001, pages 21 - 28.
[Nar74]Narendra, K., Thathachar, M., Learning Automata: A Survey. IEEE Trans. Syst., Man, Cybern.,
Vol. SMC-14, pages 323-334, 1974.
[Nar89]Narendra, K., Thathachar, M., Learning Automata: An Introduction. Prentice-Hall, 1989.
[Now99a]Nowé, A., Parent, J., Verbeeck, K., Social agents playing a periodical policy. Proceedings of the
12th European Conference on Machine Learning, p 382 - 393, 2001.
[Now99b]Nowé A. and Verbeeck K., Distributed Reinforcement learning, Loadbased Routing a case study,
Notes of the Neural, Symbolic and Reinforcement Methods for sequence Learning Workshop at ijcai99,
1999, Stockholm, Sweden.
[Neu44]von Neumann, J., Morgenstern, O., Theory of Games and Economic Behaviour, Princeton
University Press, 1944.
[Osb94]Osborne J.O., Rubinstein A., A course in game theory. Cambridge, MA: MIT Press,1994.
[Pee03]Maarten Peeters, A Study of Reinforcement Learning Techniques for Cooperative Multi-Agent
Systems, Computational Modeling Lab, Vrije Universiteit Brussel, 2003.
[Pen98]Pendrith M.D., McGarity M.J., An analysis of direct reinforcement learning in non-Markovian
domains. Proceedings of the fifteenth International Conference on Machine Learning, San Fran-
cisco,1998.
[Per02]Perkins T.J., Pendrith M.D., On the Existence of Fixed Points for Q-learning and Sarsa in
Partially Observable Domains. Proceedings of the International Conference on Machine Learning
(ICML02),2002.
[Red01]Redondo, F.V., Game Theory and Economics, Cambridge University Press, 2001.
[Sam97]Samuelson, L. Evolutionary Games and Equilibrium Selection, MIT Press, Cambridge, MA, 1997.
[Schneid00]Schneider, T.D., Evolution of biological information. Journal of NAR, volume 28, pages 2794
- 2799, 2000.
[Sta99]Stauffer, D., Life, Love and Death: Models of Biological Reproduction and Aging. Institute for
Theoretical physics, Köln, Euroland, 1999.
[Ste97]Steenhaut K., Now A., Fakir M. and Dirkx E., Towards a Hardware Implementation of Reinforce-
ment Learning for Call Admission Control in Networks for Integrated Services. In the proceedings of
the International Workshop on Applications of Neural Networks and other Intelligent Techniques to
Telecommunications 3, Melbourne, 1997.
[Sut88]Sutton, R.S., Learning to Predict by the Methods of Temporal Differences, Machine Learning 3,
Kluwer Academic Publishers, Boston, pp. 9-44.
[Sut00]Sutton, R.S., Barto, A.G., Reinforcement Learning: An introduction. Cambridge, MA: MIT Press,
1998.
[Sto00]Stone P., Layered Learning in Multi-Agent Systems. Cambridge, MA: MIT Press, 2000.
[Tha02]Thathacher M.A.L., Sastry P.S., Varieties of Learning Automata: An Overview. IEEE Transac-
tions on Systems, Man, And Cybernetics-Part B: Cybernetics, Vol. 32, NO.6, 2002.
[Tse62]Tsetlin M.L., On the behavior of finite automata in random media. Autom. Remote Control, vol.
22, pages 1210-1219, 1962.
[Tse73]Tsetlin M.L., Theory and Modeling of Biological Systems. New York: Academic, 1973.
[Tsi93]Tsitsiklis, J.N., Asynchronous stochastic approximation and Q-learning. Internal Report from the
laboratory for Information and Decision Systems and the Operation Research Center, MIT 1993.
[Tuy02]Tuyls, K., Lenaerts, T., Verbeeck, K., Maes, S. and Manderick, B, Towards a Relation Between
Learning Agents and Evolutionary Dynamics. Proceedings of the Belgium-Netherlands Artificial
Intelligence Conference 2002 (BNAIC). KU Leuven, Belgium.
[Tuy03a]Tuyls, K., Verbeeck, K., and Maes, S. On a Dynamical Analysis of Reinforcement Learning in
Games: Emergence of Occam’s Razor. Lecture Notes in Artificial Intelligence, Multi-Agent Systems
and Applications III, Lecture Notes in AI 2691, (Central and Eastern European conference on Multi-
Agent Systems 2003). Prague, 16-18 june 2003, Czech Republic.
[Tuy03b]Tuyls, K., Verbeeck, K., and Lenaerts, T. A Selection-Mutation model for Q-learning in Multi-
Agent Systems. The ACM International Conference Proceedings Series, Autonomous Agents and
Multi-Agent Systems 2003. Melbourne, 14-18 July 2003, Australia.
[Tuy03c]Tuyls, K., Heytens, D., Nowé, A., and Manderick, B., Extended Replicator Dynamics as a
Key to Reinforcement Learning in Multi-Agent Systems. Proceedings of the European Conference
on Machine Learning’03, Lecture Notes in Artificial Intelligence. Cavtat-Dubrovnik, 22-26 september
2003, Croatia.
[Ver02]Verbeeck, K., Nowé, A., Lenaerts, T. and Parent, J., Learning to reach the Pareto Optimal
Nash Equilibrium as a Team. Proceedings of the 15th Australian Joint Conference on Artificial
Intelligence, pp. 407-418, Springer-Verlag, LNAI 2557, 2002.
[Wat92]Watkins, C. and Dayan, P., Q-learning. Machine Learning, 8(3):279-292, 1992.
[Wei96]Weibull, J.W., Evolutionary Game Theory, MIT Press 1996.
[Wei98]Weibull, J.W., What we have learned from Evolutionary Game Theory so far? Stockholm School
of Economics and I.U.I. may 7, 1998.
[Wei99]Weiss, G., Multiagent Systems. A Modern Approach to Distributed Artificial Intelligence. Edited
by Gerard Weiss Cambridge, MA: MIT Press. 1999.
[Wol98]Wolpert, D.H., Tumer, K., and Frank, J., Using Collective Intelligence to Route Internet Traffic.
Advances in Neural Information Processing Systems-11, pages 952–958. Denver, 1998.
[Wol99]Wolpert, D.H., Wheller, K.R., and Tumer, K., General principles of learning-based multi-agent
systems. Proceedings of the Third International Conference on Autonomous Agents (Agents’99), ACM
Press. Seattle, WA, USA, 1999.
[Woo02]Wooldridge, M., An Introduction to MultiAgent Systems. Published in February 2002 by John
Wiley, Sons, Chichester, England, 2002.