
Model-Based Stochastic Search for Large Scale Optimization of

Multi-Agent UAV Swarms


David D. Fan1, Evangelos A. Theodorou2, and John Reeder3

1 Inst. for Robotics and Intelligent Machines, Georgia Institute of Technology. [email protected]
2 Dept. of Aerospace Engineering, Georgia Institute of Technology. [email protected]
3 SPAWAR Systems Center Pacific, San Diego, CA, USA. [email protected]
† Video at https://goo.gl/dWvQi7

Abstract— Recent work from the reinforcement learning community has shown that Evolution Strategies are a fast and scalable alternative to other reinforcement learning methods. In this paper we show that Evolution Strategies are a special case of model-based stochastic search methods. This class of algorithms has nice asymptotic convergence properties and known convergence rates. We show how these methods can be used to solve both cooperative and competitive multi-agent problems in an efficient manner. We demonstrate the effectiveness of this approach on two complex multi-agent UAV swarm combat scenarios: where a team of fixed wing aircraft must attack a well-defended base, and where two teams of agents go head to head to defeat each other†.
I. INTRODUCTION
Reinforcement Learning is concerned with maximizing rewards from an environment through repeated interactions and trial and error. Such methods often rely on various approximations of the Bellman equation and include value function approximation, policy gradient methods, and more [1]. The Evolutionary Computation community, on the other hand, has developed a suite of methods for black box optimization and heuristic search [2]. Such methods have been used to optimize the structure of neural networks for vision tasks, for instance [3].

Fig. 1: The SCRIMMAGE multi-agent simulation environment. In this scenario, blue team fixed-wing agents attack red team quadcopter defenders. The two white lines indicate missed shots taken by the quadcopters at the blue team craft. We use the SCRIMMAGE platform to perform experiments on a variety of scenarios.

Recently, Salimans et al. have shown that a particular variant of evolutionary computation methods, termed Evolution Strategies (ES), is a fast and scalable alternative to other reinforcement learning approaches, solving the difficult humanoid MuJoCo task in 10 minutes [4]. ES are a class of algorithms with a long history in the evolutionary computation community, beginning in Germany in the 1960s (see [5], [6], [7], and for a review [8]). Salimans et al. argue that ES has several benefits over other reinforcement learning methods: 1) the need to backpropagate gradients through a policy is avoided, which opens up a wider class of policy parameterizations; 2) ES methods are massively parallelizable, which allows for scaling up learning to larger, more complex problems; 3) ES often finds policies which are more robust than other reinforcement learning methods; and 4) ES are better at assigning credit to changes in the policy over longer timescales, which enables solving tasks with longer time horizons and sparse rewards. In this work we leverage all four of these advantages by using ES to solve a problem with: 1) a more complex and decipherable policy architecture which allows for safety considerations; 2) a large-scale simulated environment with many interacting elements; 3) multiple sources of stochasticity including variations in initial conditions, disturbances, etc.; and 4) sparse rewards which only occur at the very end of a long episode.

A common critique of evolutionary computation algorithms is a lack of convergence analysis or guarantees. Of course, for problems with non-differentiable and non-convex objective functions, analysis will always be difficult. Nevertheless, we show that the Evolution Strategies algorithm used by [4] is a special case of a class of model-based stochastic search methods known as Gradient-Based Adaptive Stochastic Search (GASS) [9]. This class of methods generalizes many stochastic search methods such as the well-known Cross Entropy Method (CEM) [10], CMA-ES [11], etc. By casting a non-differentiable, non-convex optimization problem as a gradient descent problem, one can arrive at nice asymptotic convergence properties and known convergence rates [12].

With more confidence in the convergence of Evolution Strategies, we demonstrate how ES can be used to efficiently solve both cooperative and competitive large-scale multi-agent problems. Many approaches to solving multi-agent
problems rely on hand-designed and hand-tuned algorithms (see [13] for a review). One such example, distributed Model Predictive Control, relies on independent MPC controllers on each agent with some level of coordination between them [14], [15]. These controllers require hand-designing dynamics models, cost functions, feedback gains, etc. and require expert domain knowledge. Additionally, scaling these methods up to more complex problems continues to be an issue. Evolutionary algorithms have also been tried as a solution to multi-agent problems, usually with smaller, simpler environments and policies with low complexity [16], [17]. Recently, a hybrid approach combining MPC and the use of genetic algorithms to evolve the cost function for a hand-tuned MPC controller has been demonstrated for a UAV swarm combat scenario [18].

In this work we demonstrate the effectiveness of our approach on two complex multi-agent UAV swarm combat scenarios: where a team of fixed wing aircraft must attack a well-defended base, and where two teams of agents go head to head to defeat each other. Such scenarios have been previously considered in simulated environments with less fidelity and complexity [19], [18]. We leverage the computational efficiency and flexibility of the recently developed SCRIMMAGE multi-agent simulator for our experiments (Figure 1) [20]. We compare the performance of ES against the Cross Entropy Method. We also show for the competitive scenario how the policy learns over time to coordinate a strategy in response to an enemy learning to do the same. We make our code freely available for use (https://github.com/ddfan/swarm_evolve).

II. PROBLEM FORMULATION

We can pose our problem as the non-differentiable, non-convex optimization problem

θ* = arg max_{θ∈Θ} J(θ)    (1)

where Θ ⊂ R^n, a nonempty compact set, is the space of solutions, and J(θ) is a non-differentiable, non-convex real-valued objective function J : Θ → R. θ could be any combination of decision variables of our problem, including neural network weights, PID gains, hardware design parameters, etc. which affect the outcome of the returns J. For reinforcement learning problems θ usually represents the parameters of the policy and J is an implicit function of the sequential application of the policy to the environment. We first review how this problem can be solved using Gradient-Based Adaptive Stochastic Search methods and then show how the ES algorithm is a special case of these methods.

A. Gradient-Based Adaptive Stochastic Search

The goal of model-based stochastic search methods is to cast the non-differentiable optimization problem (1) as a differentiable one by specifying a probabilistic model (hence "model-based") from which to sample [12]. Let this model be p(θ|ω) = f(θ; ω), ω ∈ Ω, where ω is a parameter which defines the probability distribution (e.g. for Gaussian distributions, the distribution is fully parameterized by the mean and variance, ω = [μ, σ]). Then the expectation of J(θ) over the distribution f(θ; ω) will always be less than the optimal value of J, i.e.

∫_Θ J(θ) f(θ; ω) dθ ≤ J(θ*)    (2)

The idea of Gradient-based Adaptive Stochastic Search (GASS) is that one can perform a search in the space of parameters of the distribution Ω rather than Θ, for a distribution which maximizes the expectation in (2):

ω* = arg max_{ω∈Ω} ∫_Θ J(θ) f(θ; ω) dθ    (3)

Maximizing this expectation corresponds to finding a distribution which is maximally distributed around the optimal θ. However, unlike maximizing (1), this objective function can now be made continuous and differentiable with respect to ω. With some assumptions on the form of the distribution, the gradient with respect to ω can be pushed inside the expectation.

The GASS algorithm presented by [12] is applicable to the exponential family of probability densities:

f(θ; ω) = exp{ω⊤ T(θ) − φ(ω)}    (4)

where φ(ω) = ln ∫ exp(ω⊤ T(θ)) dθ and T(θ) is the vector of sufficient statistics. Since we are concerned with showing the connection with ES, which uses parameter perturbations sampled with Gaussian noise, we assume that f(θ; ω) is Gaussian. Furthermore, since we are concerned with learning a large number of parameters (i.e. weights in a neural network), we assume an independent Gaussian distribution over each parameter. Then T(θ) = [θ, θ²] ∈ R^{2n} and ω = [μ/σ², −1/(2σ²)] ∈ R^{2n}, where μ and σ are vectors of the mean and standard deviation corresponding to the distribution of each parameter, respectively.

Algorithm 1 Gradient-Based Adaptive Stochastic Search
Require: Learning rate α_k, sample size N_k, initial policy parameters ω_0 = [μ_0, σ_0²], smoothing function S(·), small constant γ > 0.
1: for k = 0, 1, ... do
2:   Sample θ_k^i ∼ f(θ; ω_k) i.i.d., i = 1, 2, ..., N_k.
3:   Compute returns w_k^i = S(J(θ_k^i)) for i = 1, ..., N_k.
4:   Compute variance terms V_k = V̂_k + γI, eq. (5), (6).
5:   Calculate the normalizer η = ∑_{i=1}^{N_k} w_k^i.
6:   Update ω_{k+1}:
7:     ω_{k+1} ← ω_k + α_k (1/η) V_k⁻¹ ∑_{i=1}^{N_k} w_k^i ( [θ_k^i; (θ_k^i)²] − [μ; σ² + μ²] )
8: end for

We present the GASS algorithm for this specific set of probability models (Algorithm 1), although the analysis for convergence holds for the more general exponential family of distributions.
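As a concrete illustration, the following is a minimal NumPy sketch of one iteration of Algorithm 1 for the independent-Gaussian model above. It is a sketch under stated assumptions, not the authors' implementation: the rank-based shaping function (one simple choice satisfying the requirements discussed in the following paragraphs), the default constants, and the recovery of (μ, σ²) from the natural parameters ω = [μ/σ², −1/(2σ²)] are our own choices, and J stands in for a black-box episode return.

import numpy as np

def gass_step(mu, sigma2, J, alpha=0.01, Nk=100, gamma=1e-3, rng=None):
    """One GASS iteration (Algorithm 1) for an independent-Gaussian search
    distribution over n parameters; J maps a parameter vector to a scalar return."""
    rng = np.random.default_rng() if rng is None else rng
    n = mu.shape[0]
    # Sample Nk candidate parameter vectors theta ~ N(mu, diag(sigma2)).
    theta = mu + np.sqrt(sigma2) * rng.standard_normal((Nk, n))        # (Nk, n)
    # Shaped returns w_i = S(J(theta_i)); a rank transformation is used here.
    returns = np.array([J(t) for t in theta])
    ranks = returns.argsort().argsort()
    w = (ranks + 1.0) / Nk                      # nondecreasing, bounded away from 0
    eta = w.sum()                               # normalizer (line 5)
    # Sufficient statistics T(theta) = [theta, theta^2] for each parameter.
    T = np.stack([theta, theta ** 2], axis=-1)                         # (Nk, n, 2)
    # Per-parameter 2x2 variance of T (eq. (6)), regularized: V = V_hat + gamma*I.
    Tm = T.mean(axis=0)
    Vhat = (np.einsum('kni,knj->nij', T, T) / (Nk - 1)
            - Nk / (Nk - 1) * np.einsum('ni,nj->nij', Tm, Tm))
    V = Vhat + gamma * np.eye(2)
    # Weighted deviation of T from its expectation E[T] = [mu, sigma^2 + mu^2].
    ET = np.stack([mu, sigma2 + mu ** 2], axis=-1)                     # (n, 2)
    g = np.einsum('k,kni->ni', w, T - ET)                              # (n, 2)
    # Natural-parameter update (line 7): omega <- omega + alpha/eta * V^-1 g.
    omega = np.stack([mu / sigma2, -0.5 / sigma2], axis=-1)            # (n, 2)
    omega = omega + alpha / eta * np.linalg.solve(V, g[..., None])[..., 0]
    # Recover (mu, sigma^2) from the updated natural parameters.
    sigma2_new = np.clip(-0.5 / omega[:, 1], 1e-8, None)
    mu_new = omega[:, 0] * sigma2_new
    return mu_new, sigma2_new

Calling gass_step repeatedly, with α_k and N_k scheduled as in the analysis below, is the iteration pattern to which the convergence results of this section apply.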


For each iteration k, the GASS algorithm involves drawing N_k samples of parameters θ_k^i ∼ f(θ; ω_k), i = 1, 2, ..., N_k. These parameters are then used to sample the return function J(θ_k^i). The returns are fed through a shaping function S(·) : R → R⁺ and then used to calculate an update on the model parameters ω_{k+1}.

The shaping function S(·) is required to be nondecreasing and bounded from above and below for bounded inputs, with the lower bound away from 0. Additionally, the set {arg max_{θ∈Θ} S(J(θ))} must be a nonempty subset of the set of solutions of the original problem {arg max_{θ∈Θ} J(θ)}. The shaping function can be used to adjust the exploration/exploitation trade-off or help deal with outliers when sampling. The original analysis of GASS assumes a more general form S_k(·) where S can change at each iteration. For simplicity we assume here it is deterministic and unchanging per iteration.

GASS can be considered a second-order gradient method and requires estimating the variance of the sampled parameters:

V̂_k = 1/(N_k − 1) ∑_{i=1}^{N_k} T(θ_k^i) T(θ_k^i)⊤ − 1/(N_k² − N_k) ( ∑_{i=1}^{N_k} T(θ_k^i) ) ( ∑_{i=1}^{N_k} T(θ_k^i) )⊤.    (5)

In practice, if the size of the parameter space Θ is large, as is the case in neural networks, this variance matrix will be of size 2n × 2n and will be costly to compute. In our work we approximate V̂_k with independent calculations of the variance on the parameters of each independent Gaussian. With a slight abuse of notation, consider θ̃_k^i as a scalar element of θ_k^i. We then have, for each scalar element θ̃_k^i, a 2 × 2 variance matrix:

V̂_k = 1/(N_k − 1) ∑_{i=1}^{N_k} [θ̃_k^i; (θ̃_k^i)²] [θ̃_k^i, (θ̃_k^i)²] − 1/(N_k² − N_k) ( ∑_{i=1}^{N_k} [θ̃_k^i; (θ̃_k^i)²] ) ( ∑_{i=1}^{N_k} [θ̃_k^i, (θ̃_k^i)²] ).    (6)

Theorem 1 shows that GASS produces a sequence of ω_k that converges to a limit set which specifies a set of distributions that maximize (3). Distributions in this set will specify how to choose θ* to ultimately maximize (1). As with most non-convex optimization algorithms, we are not guaranteed to arrive at the global maximum, but using probabilistic models and careful choice of the shaping function should help avoid early convergence into a suboptimal local maximum. The proof relies on casting the update rule in the form of a generalized Robbins-Monro algorithm (see [12], Thms 1 and 2). Theorem 1 also specifies convergence rates in terms of the number of iterations k, the number of samples per iteration N_k, and the learning rate α_k. In practice Theorem 1 implies the need to carefully balance the increase in the number of samples per iteration and the decrease in learning rate as iterations progress.

Assumption 1
i) The learning rate α_k > 0, α_k → 0 as k → ∞, and ∑_{k=0}^∞ α_k = ∞.
ii) The sample size N_k = N_0 k^ξ, where ξ > 0; also α_k and N_k jointly satisfy α_k/√N_k = O(k^{−β}).
iii) T(θ) is bounded on Θ.
iv) If ω* is a local maximum of (3), the Hessian of ∫_Θ J(θ) f(θ; ω) dθ is continuous and symmetric negative definite in a neighborhood of ω*.

Theorem 1
Assume that Assumption 1 holds. Let α_k = α_0/k^α for 0 < α < 1. Let N_k = N_0 k^{τ−α} where τ > 2α is a constant. Then the sequence {ω_k} generated by Algorithm 1 converges to a limit set w.p.1. with rate O(1/√(k^τ)).
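To make the balance required by Theorem 1 concrete, the following small sketch computes one admissible pair of schedules; the numerical constants are illustrative placeholders, not the settings used in our experiments.

import math

def gass_schedules(k, alpha0=0.1, N0=20, alpha=0.6, tau=1.5):
    """Schedules satisfying Theorem 1: alpha_k = alpha0 / k^alpha with 0 < alpha < 1,
    and N_k = N0 * k^(tau - alpha) with tau > 2*alpha. Constants are placeholders."""
    alpha_k = alpha0 / k ** alpha
    N_k = int(math.ceil(N0 * k ** (tau - alpha)))
    return alpha_k, N_k

# With alpha = 0.6 and tau = 1.5 (> 2 * alpha), the guaranteed rate is
# O(1/sqrt(k^tau)) = O(k^(-0.75)): the learning rate shrinks while the
# number of samples per iteration grows.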
B. Evolutionary Strategies

We now review the ES algorithm proposed by [4] and show how it is a first-order approximation of the GASS algorithm. The ES algorithm consists of the same two phases as GASS: 1) randomly perturb parameters with noise sampled from a Gaussian distribution; 2) calculate returns and calculate an update to the parameters. The algorithm is outlined in Algorithm 2. Once returns are calculated, they are sent through a function S(·) which performs fitness shaping [21]. Salimans et al. used a rank transformation function for S(·) which they argue reduced the influence of outliers at each iteration and helped to avoid local optima.

Algorithm 2 Evolution Strategies
Require: Learning rate α_k, noise standard deviation γ, initial policy parameters θ_0, smoothing function S(·).
1: for k = 0, 1, ... do
2:   Sample ε_1, ..., ε_{N_k} ∼ N(0, I_{n×n})
3:   Compute returns w_k^i = S(J(θ_k + γε_i)) for i = 1, ..., N_k
4:   Update θ_{k+1}:
5:     θ_{k+1} ← θ_k + α_k (1/(N_k γ)) ∑_{i=1}^{N_k} w_k^i ε_i
6: end for

It is clear that the ES algorithm is a sub-case of the GASS algorithm when the sampling distribution is a point distribution. We can also recover the ES algorithm by ignoring the variance terms on line 7 in Algorithm 1. Instead of the normalizing term η, ES uses the number of samples N_k. The small constant γ in GASS becomes the variance term in the ES algorithm. The update rule in Algorithm 2 involves multiplying the scaled returns by the noise, which is exactly θ_k^i − μ in Algorithm 1. We see that ES enjoys the same asymptotic convergence rates offered by the analysis of GASS. While GASS is a second-order method and ES is only a first-order method, in practice ES uses approximate second-order gradient descent methods which adapt the learning rate in order to speed up and stabilize learning. Examples of these methods include ADAM, RMSProp, SGD with momentum, etc., which have been shown to perform very well for neural networks. Therefore we can treat ES as a first-order approximation of the full second-order variance updates which GASS uses. In our experiments we use ADAM [22] to adapt the learning rate for each parameter.
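For concreteness, here is a minimal sketch of one iteration of Algorithm 2 with rank-based fitness shaping and an ADAM-style per-parameter step, reflecting the discussion above. It is an illustration rather than the authors' implementation: the centered-rank shaping, the hyperparameter defaults, and the plain ADAM update are assumptions, and J stands in for the episode return computed by the simulator.

import numpy as np

def es_step(theta, J, adam_state, alpha=0.01, gamma=0.02, Nk=300,
            beta1=0.9, beta2=0.999, adam_eps=1e-8, rng=None):
    """One ES iteration (Algorithm 2) with rank shaping and an ADAM-style
    per-parameter step size; adam_state = (m, v, t)."""
    rng = np.random.default_rng() if rng is None else rng
    # Perturb the current parameters with Gaussian noise of standard deviation gamma.
    noise = rng.standard_normal((Nk, theta.shape[0]))
    returns = np.array([J(theta + gamma * eps) for eps in noise])
    # Rank transformation S(.) reduces the influence of outlier returns;
    # centering the ranks is a common practical choice (an assumption here).
    ranks = returns.argsort().argsort()
    w = ranks / (Nk - 1.0) - 0.5
    # First-order update direction (line 5 of Algorithm 2).
    grad = (w[:, None] * noise).sum(axis=0) / (Nk * gamma)
    # ADAM adapts the effective learning rate of each parameter separately.
    m, v, t = adam_state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta + alpha * m_hat / (np.sqrt(v_hat) + adam_eps)   # gradient ascent on J
    return theta, (m, v, t)

A training run would initialize adam_state = (np.zeros_like(theta), np.zeros_like(theta), 0) and call es_step repeatedly, evaluating J by running simulation episodes (in parallel in our setting).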


Fig. 2: Diagram of the structured policy used by each agent to navigate and execute team behaviors. Nearby ally states
(which are fully observable) and enemy states (which are partially observable), base locations, and the agent’s own state
are fed into a neural network which produces a reference target in relative xyz coordinates. The reference target is fed into
the safety logic block which checks for invalid targets which would cause collisions with neighbors or the ground. It then
updates this reference target to produce a “safe” reference target which is fed to the PID controller. The PID controller
tracks the reference target and provides low-level controls for the agent (thrust, aileron, elevator, rudder).

As similarly reported in [4], when using adaptive learning rates we found little improvement over adapting the variance of the sampling distribution. We hypothesize that a first-order method with adaptive learning rates is sufficient for achieving good performance when optimizing neural networks. For other types of policy parameterizations, however, the full second-order treatment of GASS may be more useful. It is also possible to mix and match which parameters require a full variance update and which can be updated with a first-order approximate method. We use the rank transformation function for S(·) and keep N_k constant.

C. Learning Structured Policies for Multi-Agent Problems

Now that we are more confident about the convergence of the ES/GASS method, we show how ES can be used to optimize a complex policy in a large-scale multi-agent environment. We use the SCRIMMAGE multi-agent simulation environment [20] as it allows us to quickly and in parallel simulate complex multi-agent scenarios. We populate our simulation with 6DoF fixed-wing aircraft and quadcopters with dynamics models having 10 and 12 states, respectively. These dynamics models allow for full ranges of motion within realistic operating regimes. Stochastic disturbances in the form of wind and control noise are modeled as additive Gaussian noise. Ground and mid-air collisions can occur, which result in the aircraft being destroyed. We also incorporate a weapons module which allows for targeting and firing at an enemy within a fixed cone projecting from the aircraft's nose. The probability of a hit depends on the distance to the target and the total area presented by the target to the attacker. This area is based on the wireframe model of the aircraft and its relative pose. For more details, see our code and the SCRIMMAGE simulator documentation.

We consider the case where each agent uses its own policy to compute its own controls, but where the parameters of the policies are the same for all agents. This allows each agent to control itself in a decentralized manner, while allowing for beneficial group behaviors to emerge. Furthermore, we assume that friendly agents can communicate to share states with each other (see Figure 2). Because we have a large number of agents (up to 50 per team), to keep communication costs lower we only allow agents to share information locally, i.e. agents close to each other have access to each other's states. In our experiments we allow each agent to sense the states of the closest 5 friendly agents for a total of 5 × 10 = 50 incoming state messages.

Additionally, each agent is equipped with sensors to detect enemy agents. Full state observability is not available here; instead we assume that sensors are capable of sensing an enemy's relative position and velocity. In our experiments we assumed that each agent is able to sense the nearest 5 enemies for a total of 5 × 7 = 35 dimensions of enemy data (7 states = [relative xyz position, distance, and relative xyz velocities]). The sensors also provide information about home and enemy base relative headings and distances (an additional 8 states). With the addition of the agent's own state (9 states), the policy's observation input o(t) has a dimension of 102. These input states are fed into the agent's policy: a neural network f(o(t); θ) with 3 fully connected layers with sizes 200, 200, and 50, which outputs 3 numbers representing a desired relative heading [x_ref, y_ref, z_ref]. Each agent's neural network has more than 70,000 parameters. Each agent uses the same neural network parameters as its teammates, but since each agent encounters a different observation at each timestep, the output of each agent's neural network policy will be unique. It may also be possible to learn unique policies for each agent; we leave this for future work.
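The structured policy of Figure 2 can be sketched as below. The layer sizes and the 102-dimensional observation follow the description above; the activation function, the weight initialization, and the safety_check and pid_controller callables are hypothetical stand-ins for the collision-avoidance override and low-level controller discussed in the text.

import numpy as np

# Layer sizes from the text: 102-dim observation -> 200 -> 200 -> 50 -> 3 outputs.
SIZES = [102, 200, 200, 50, 3]

def init_policy(rng):
    """Flat parameter vector theta for the fully connected policy network
    (71,003 values for these sizes, consistent with 'more than 70,000 parameters')."""
    theta = []
    for fan_in, fan_out in zip(SIZES[:-1], SIZES[1:]):
        theta.append(rng.standard_normal(fan_in * fan_out) * np.sqrt(2.0 / fan_in))
        theta.append(np.zeros(fan_out))
    return np.concatenate(theta)

def policy_forward(theta, obs):
    """Map a 102-dim observation o(t) to a relative heading [x_ref, y_ref, z_ref].
    tanh hidden activations are an assumption; the text does not state the nonlinearity."""
    x, idx = obs, 0
    for i, (fan_in, fan_out) in enumerate(zip(SIZES[:-1], SIZES[1:])):
        W = theta[idx:idx + fan_in * fan_out].reshape(fan_in, fan_out)
        idx += fan_in * fan_out
        b = theta[idx:idx + fan_out]
        idx += fan_out
        x = x @ W + b
        if i < len(SIZES) - 2:      # hidden layers only; linear output layer
            x = np.tanh(x)
    return x

def act(theta, obs, safety_check, pid_controller):
    """Structured policy of Fig. 2: network reference -> safety logic -> PID."""
    ref = policy_forward(theta, obs)
    safe_ref = safety_check(ref)       # override references that would cause collisions
    return pid_controller(safe_ref)    # low-level controls: thrust, aileron, elevator, rudder

Because only the flat vector theta is perturbed and updated, the ES/GASS machinery of Section II applies unchanged to this structured policy.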


With safety being a large concern in UAV flight, we design the policy to take into account safety and control considerations. The relative heading output from the neural network policy is intended to be used by a PID controller to track the heading. The PID controller provides low-level control commands u(t) to the aircraft (thrust, aileron, elevator, rudder). However, to prevent cases where the neural network policy guides the aircraft into crashing into the ground or allies, etc., we override the neural network heading with an avoidance heading if the aircraft is about to collide with something. This helps to focus the learning process on how to intelligently interact with the environment and allies rather than learning how to avoid obvious mistakes. Furthermore, by designing the policy in a structured and interpretable way, it will be easier to take the learned policy directly from simulation into the real world. Since the neural network component of the policy does not produce low-level commands, it is invariant to different low-level controllers, dynamics, PID gains, etc. This aids in learning more transferable policies for real-world applications.

III. EXPERIMENTS

We consider two scenarios: a base attack scenario where a team of 50 fixed wing aircraft must attack an enemy base defended by 20 quadcopters, and a team competitive task where two teams concurrently learn to defeat each other. In both tasks we use the following reward function:

J = 10 × (#kills) + 50 × (#collisions with enemy base) − 1e−5 × (distance from enemy base at end of episode)    (7)

The reward function encourages air-to-air combat, as well as suicide attacks against the enemy base (e.g. a swarm of cheap, disposable drones carrying payloads). The last term encourages the aircraft to move towards the enemy during the initial phases of learning.
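A direct transcription of reward (7) is shown below; the weights are those given in the text, and the episode statistics would be collected from the simulation.

def episode_reward(n_kills, n_base_collisions, final_distance_to_base):
    """Reward (7) for one episode."""
    return (10.0 * n_kills
            + 50.0 * n_base_collisions
            - 1e-5 * final_distance_to_base)

# Example: 3 kills, 2 base collisions, and a final distance of 1500 from the
# enemy base gives 10*3 + 50*2 - 1e-5*1500 = 129.985.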
A. Base Attack Task

Fig. 3: Overhead view of the base attack task in progress. The goal of the blue fixed wing team (lower left) is to attack the red base (red dot, upper right) while avoiding or shooting down the red quadcopter guards which are surrounding the red base. The quadcopters are programmed to stay within a fixed distance from the base and to fire upon any nearby enemies.

(a) Scores from randomly perturbed parameter policies (training score)
(b) Scores from updated parameter policies (testing score)
Fig. 4: Scores per iteration for the base attack task. Top is scores earned by the randomly perturbed policies during the course of training (with parameters θ_k + γε_i), while bottom is scores earned by the newly updated policy parameters θ_k. Scores in the top are in general lower because they result from policies which are parameterized by randomly perturbed values. Red indicates scores earned by the Evolution Strategies algorithm, blue indicates the Cross Entropy Method. Bold line is the median, shaded areas are 25/75 quartile bounds. While improvement in scores plateaus after about 1500 iterations in both the ES and CEM methods, ES outperforms CEM by a healthy margin.

In this scenario a team of 50 fixed-wing aircraft must attack an enemy base defended by 20 quadcopters (Figure 3). The quadcopters use a hand-crafted policy where, in the absence of an enemy, they spread themselves out evenly to cover the base. In the presence of an enemy they target the closest enemy, match that enemy's altitude, and fire repeatedly. We used N_k = 300, γ = 0.02, a time step of 0.1 seconds, and a total episode length of 200 seconds. Initial positions of both teams were randomized in a fixed area at opposite ends of the arena. Training took two days with full parallelization on a machine equipped with a Xeon Phi CPU (244 threads).

We found that over the course of training the fixed-wing team learned a policy where they quickly form a V-formation and approach the base. Some aircraft suicide-attack the enemy base while others begin dog-fighting (see the supplementary video at https://goo.gl/dWvQi7). We also compared our implementation of the ES method against the well-known cross-entropy method (CEM). CEM performs significantly worse than ES
(Figure 4). We hypothesize this is because CEM throws out a significant fraction of sampled parameters and therefore obtains a worse estimate of the gradient of (3). Comparison against other full second-order methods such as CMA-ES or the full second-order GASS algorithm is unrealistic due to the large number of parameters in the neural network and the prohibitive computational difficulties with computing the covariances of those parameters.

B. Two Team Competitive Match

Fig. 5: Overhead view of a two team competitive match in progress. The goal of both teams is the same: to defeat all enemy planes and attack the enemy base while suffering minimum losses. Red lines show successful weapons hits.

(a) Scores from randomly perturbed parameter policies (training score)
(b) Scores from updated parameter policies (testing score)
Fig. 6: Scores per iteration for the team competition task. Top is scores earned by the randomly perturbed policies during the course of training (with parameters θ_k + γε_i), while bottom is scores earned by the newly updated policy parameters θ_k. Scores in the top are in general lower because they result from policies which are parameterized by randomly perturbed values. Red and blue curves show scores for team 1 and 2 respectively. Bold line is the median, shaded areas are 25/75 quartile bounds. The scores of both teams improve equally quickly and reach a Nash equilibrium where both teams annihilate one another.

The second scenario we consider is where two teams, each equipped with their own unique policies for their agents, learn concurrently to defeat their opponent (Figure 5). At each iteration, N_k = 300 simulations are spawned, each with a different random perturbation, and with each team having a different perturbation. The updates for each policy are calculated based on the scores received from playing the opponent's perturbed policies. The result is that each team learns to defeat a wide range of opponent behaviors at each iteration.
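The concurrent training scheme just described can be sketched as follows. run_match is a hypothetical wrapper around one SCRIMMAGE episode returning both teams' scores; for brevity the plain update of Algorithm 2 is applied to the raw scores, whereas the experiments also used rank shaping and ADAM as described earlier.

import numpy as np

def coevolution_iteration(theta_a, theta_b, run_match, alpha=0.01, gamma=0.02,
                          Nk=300, rng=None):
    """One iteration of concurrent two-team ES training."""
    rng = np.random.default_rng() if rng is None else rng
    noise_a = rng.standard_normal((Nk, theta_a.shape[0]))
    noise_b = rng.standard_normal((Nk, theta_b.shape[0]))
    scores_a = np.empty(Nk)
    scores_b = np.empty(Nk)
    for i in range(Nk):
        # Each simulation pits a perturbed team-1 policy against an
        # independently perturbed team-2 policy.
        scores_a[i], scores_b[i] = run_match(theta_a + gamma * noise_a[i],
                                             theta_b + gamma * noise_b[i])
    # Each team's update uses the scores it earned against the opponent's
    # perturbed policies, so it is trained against a range of opponent behaviors.
    theta_a = theta_a + alpha / (Nk * gamma) * (noise_a.T @ scores_a)
    theta_b = theta_b + alpha / (Nk * gamma) * (noise_b.T @ scores_b)
    return theta_a, theta_b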
We observed that the behavior of the two teams quickly approached a Nash equilibrium where both sides try to defeat the maximum number of opponent aircraft in order to prevent higher-scoring suicide attacks (see supplementary video). The end result is a stalemate with both teams annihilating each other, ending with tied scores (Figure 6). We hypothesize that more varied behavior could be learned by having each team compete against some past enemy team behaviors or by building a library of policies from which to select, as frequently discussed by the evolutionary computation community [23].

IV. CONCLUSION

We have shown that Evolution Strategies are applicable for learning policies with many tens or hundreds of thousands of parameters for a wide range of complex tasks in both the competitive and cooperative multi-agent setting. By showing the connection between ES and more well-understood model-based stochastic search methods, we are able to gain insight into algorithm design. When learning parameters for neural networks, first-order approximations of the covariance updates can be used effectively. Other types of parameters such as PID gains may benefit from the full covariance update, since the behavior of the system may be more sensitive to their perturbations. Future work will include experiments with optimizing these mixed parameterizations, e.g. optimizing both neural network weights and PID gains. Other directions of investigation could include optimizing unique policies for each agent in the team, optimizing heterogeneous teams of agents with different policies, or changing policies for different scenarios/objectives. Yet another direction would be comparing other evolutionary computation strategies for training neural networks, including methods which use a more diverse population [24], or more genetic algorithm-type heuristics [25].


REFERENCES
[1] Y. Li, “Deep Reinforcement Learning: An Overview,” ArXiv e-prints,
Jan. 2017.
[2] K. O. Stanley, B. D. Bryant, and R. Miikkulainen, “Real-time neu-
roevolution in the nero video game,” IEEE transactions on evolution-
ary computation, vol. 9, no. 6, pp. 653–668, 2005.
[3] O. J. Coleman et al., “Evolving neural networks for visual process-
ing,” Undergraduate Honours Thesis (Bachelor of Computer Science,
University of New South Wales School of Computer Science and
Engineering), 2010.
[4] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, “Evolution
Strategies as a Scalable Alternative to Reinforcement Learning,” ArXiv
e-prints, 2017.
[5] I. Rechenberg, “Cybernetic solution path of an experimental problem,”
Royal Aircraft Establishment Library Translation 1122, 1965.
[6] ——, “Evolutionsstrategien,” Simulationsmethoden in der Medizin und
Biologie, pp. 83–114, 1978.
[7] H.-P. Schwefel, Numerische Optimierung von Computer-Modellen
mittels der Evolutionsstrategie: mit einer vergleichenden Einführung
in die Hill-Climbing-und Zufallsstrategie. Birkhäuser, 1977.
[8] H.-G. Beyer and H.-P. Schwefel, “Evolution strategies–a comprehen-
sive introduction,” Natural computing, vol. 1, no. 1, pp. 3–52, 2002.
[9] J. Hu, “Model-based stochastic search methods,” in Handbook of
Simulation Optimization. Springer, 2015, pp. 319–340.
[10] S. Mannor, R. Rubinstein, and Y. Gat, “The cross entropy method for
fast policy search,” in Machine Learning-International Workshop Then
Conference-, vol. 20, no. 2, 2003, Conference Proceedings, p. 512.
[11] N. Hansen, “The CMA evolution strategy: A tutorial,” CoRR, vol.
abs/1604.00772, 2016.
[12] E. Zhou and J. Hu, “Gradient-based adaptive stochastic search for non-
differentiable optimization,” IEEE Transactions on Automatic Control,
vol. 59, no. 7, pp. 1818–1832, 2014.
[13] L. Panait and S. Luke, “Cooperative multi-agent learning: The state of
the art,” Autonomous Agents and Multi-Agent Systems, vol. 11, no. 3,
pp. 387–434, 2005.
[14] J. B. Rawlings and B. T. Stewart, “Coordinating multiple optimization-
based controllers: New opportunities and challenges,” Journal of
Process Control, vol. 18, no. 9, pp. 839–845, 2008.
[15] W. Al-Gherwi, H. Budman, and A. Elkamel, “Robust distributed model
predictive control: A review and recent developments,” The Canadian
Journal of Chemical Engineering, vol. 89, no. 5, pp. 1176–1190, 2011.
[16] G. B. Lamont, J. N. Slear, and K. Melendez, “UAV swarm mission
planning and routing using multi-objective evolutionary algorithms,” in
IEEE Symposium Computational Intelligence in Multicriteria Decision
Making, no. Mcdm, 2007, Conference Proceedings, pp. 10–20.
[17] A. R. Yu, B. B. Thompson, and R. J. Marks, “Competitive evolution of
tactical multiswarm dynamics,” IEEE Transactions on Systems, Man,
and Cybernetics Part A:Systems and Humans, vol. 43, no. 3, pp. 563–
569, 2013.
[18] D. D. Fan, E. Theodorou, and J. Reeder, “Evolving cost functions
for model predictive control of multi-agent uav combat swarms,” in
Proceedings of the Genetic and Evolutionary Computation Conference
Companion, ser. GECCO ’17. New York, NY, USA: ACM, 2017,
pp. 55–56.
[19] U. Gaerther, “UAV swarm tactics: an agent-based simulation and
Markov process analysis,” Ph.D. dissertation, Naval Postgraduate
School, 2015.
[20] K. J. DeMarco. (2018) SCRIMMAGE multi-agent robotics simulator.
[Online]. Available: http://www.scrimmagesim.org/
[21] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and
J. Schmidhuber, “Natural evolution strategies.” Journal of Machine
Learning Research, vol. 15, no. 1, pp. 949–980, 2014.
[22] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
tion,” arXiv preprint arXiv:1412.6980, 2014.
[23] K. O. Stanley and R. Miikkulainen, “Competitive coevolution through
evolutionary complexification,” Journal of Artificial Intelligence Re-
search, vol. 21, pp. 63–100, 2004.
[24] E. Conti, V. Madhavan, F. Petroski Such, J. Lehman, K. O. Stanley,
and J. Clune, “Improving Exploration in Evolution Strategies for Deep
Reinforcement Learning via a Population of Novelty-Seeking Agents,”
ArXiv e-prints, 2017.
[25] F. Petroski Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and
J. Clune, “Deep Neuroevolution: Genetic Algorithms Are a Competi-
tive Alternative for Training Deep Neural Networks for Reinforcement
Learning,” ArXiv e-prints, 2017.

