problems rely on hand-designed and hand-tuned algorithms (see [13] for a review). One such example, distributed Model Predictive Control, relies on independent MPC controllers on each agent with some level of coordination between them [14], [15]. These controllers require hand-designing dynamics models, cost functions, feedback gains, etc., and require expert domain knowledge. Additionally, scaling these methods up to more complex problems continues to be an issue. Evolutionary algorithms have also been tried as a solution to multi-agent problems, usually with smaller, simpler environments and policies with low complexity [16], [17]. Recently, a hybrid approach combining MPC and the use of genetic algorithms to evolve the cost function for a hand-tuned MPC controller has been demonstrated for a UAV swarm combat scenario [18].

In this work we demonstrate the effectiveness of our approach on two complex multi-agent UAV swarm combat scenarios: one where a team of fixed wing aircraft must attack a well-defended base, and one where two teams of agents go head to head to defeat each other. Such scenarios have been previously considered in simulated environments with less fidelity and complexity [19], [18]. We leverage the computational efficiency and flexibility of the recently developed SCRIMMAGE multi-agent simulator for our experiments (Figure 1) [20]. We compare the performance of ES against the Cross Entropy Method. We also show, for the competitive scenario, how the policy learns over time to coordinate a strategy in response to an enemy learning to do the same. We make our code freely available for use (https://github.com/ddfan/swarm_evolve).

II. PROBLEM FORMULATION

We can pose our problem as the non-differentiable, non-convex optimization problem

    θ* = arg max_{θ ∈ Θ} J(θ)    (1)

where Θ ⊂ R^n, a nonempty compact set, is the space of solutions, and J(θ) is a non-differentiable, non-convex real-valued objective function J : Θ → R. θ could be any combination of decision variables of our problem, including neural network weights, PID gains, hardware design parameters, etc., which affect the outcome of the returns J. For reinforcement learning problems θ usually represents the parameters of the policy and J is an implicit function of the sequential application of the policy to the environment. We first review how this problem can be solved using Gradient-Based Adaptive Stochastic Search methods and then show how the ES algorithm is a special case of these methods.

A. Gradient-Based Adaptive Stochastic Search

The goal of model-based stochastic search methods is to cast the non-differentiable optimization problem (1) as a differentiable one by specifying a probabilistic model (hence "model-based") from which to sample [12]. Let this model be p(θ|ω) = f(θ; ω), ω ∈ Ω, where ω is a parameter which defines the probability distribution (e.g. for Gaussian distributions, the distribution is fully parameterized by the mean and variance, ω = [μ, σ²]). Then the expectation of J(θ) over the distribution f(θ; ω) will always be less than or equal to the optimal value of J, i.e.

    ∫_Θ J(θ) f(θ; ω) dθ ≤ J(θ*)    (2)

The idea of Gradient-based Adaptive Stochastic Search (GASS) is that one can perform a search in the space of parameters of the distribution Ω rather than Θ, for a distribution which maximizes the expectation in (2):

    ω* = arg max_{ω ∈ Ω} ∫_Θ J(θ) f(θ; ω) dθ    (3)

Maximizing this expectation corresponds to finding a distribution whose mass is concentrated around the optimal θ. However, unlike maximizing (1), this objective function can now be made continuous and differentiable with respect to ω. With some assumptions on the form of the distribution, the gradient with respect to ω can be pushed inside the expectation.

The GASS algorithm presented by [12] is applicable to the exponential family of probability densities:

    f(θ; ω) = exp{ω⊤ T(θ) − φ(ω)}    (4)

where φ(ω) = ln ∫_Θ exp(ω⊤ T(θ)) dθ and T(θ) is the vector of sufficient statistics. Since we are concerned with showing the connection with ES, which uses parameter perturbations sampled with Gaussian noise, we assume that f(θ; ω) is Gaussian. Furthermore, since we are concerned with learning a large number of parameters (i.e. weights in a neural network), we assume an independent Gaussian distribution over each parameter. Then T(θ) = [θ, θ²] ∈ R^{2n} and ω = [μ/σ², −1/(2σ²)] ∈ R^{2n}, where μ and σ are vectors of the mean and standard deviation corresponding to the distribution of each parameter, respectively.

Algorithm 1 Gradient-Based Adaptive Stochastic Search
Require: Learning rate α_k, sample size N_k, initial policy parameters ω_0 = [μ_0, σ_0²], smoothing function S(·), small constant γ > 0.
1: for k = 0, 1, ... do
2:   Sample θ_k^i ∼ f(θ; ω_k) i.i.d., i = 1, 2, ..., N_k.
3:   Compute returns w_k^i = S(J(θ_k^i)) for i = 1, ..., N_k.
4:   Compute variance terms V_k = V̂_k + γI, eqs. (5), (6).
5:   Calculate the normalizer η = Σ_{i=1}^{N_k} w_k^i.
6:   Update ω_{k+1}:
7:     ω_{k+1} ← ω_k + α_k (1/η) V_k^{-1} Σ_{i=1}^{N_k} w_k^i ( [θ_k^i, (θ_k^i)²]⊤ − [μ, σ² + μ²]⊤ )
8: end for

We present the GASS algorithm for this specific set of probability models (Algorithm 1), although the analysis for convergence holds for the more general exponential family of distributions. For each iteration k, the GASS algorithm involves drawing N_k samples of parameters θ_k^i ∼ f(θ; ω_k) i.i.d., i = 1, 2, ..., N_k. These parameters are then used to sample the return function J(θ_k^i). The returns are fed through a shaping function S(·).
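To make the update in Algorithm 1 concrete for the independent-Gaussian case above, the following sketch implements one iteration with a rank-transformation shaping for S(·) (the choice noted later in the text). Since eqs. (5)-(6) defining V̂_k are not reproduced in this excerpt, a diagonal empirical-variance surrogate is used in their place; all names and constants here are our own illustration, not the released implementation.

```python
import numpy as np

def rank_shape(returns):
    """Rank-transformation shaping S(.): replace raw returns by their
    normalized ranks so the update is insensitive to the scale of J."""
    ranks = np.empty(len(returns))
    ranks[np.argsort(returns)] = np.arange(len(returns))
    return ranks / (len(returns) - 1)          # in [0, 1], larger = better return

def gass_iteration(mu, sigma, J, N_k=100, alpha=0.01, gamma=1e-8):
    """One iteration of Algorithm 1 for an independent Gaussian N(mu, sigma^2)
    over each of the n parameters, i.e. omega = [mu/sigma^2, -1/(2 sigma^2)]."""
    thetas = mu + sigma * np.random.randn(N_k, mu.size)       # step 2: sample theta_k^i
    w = rank_shape(np.array([J(t) for t in thetas]))          # step 3: shaped returns
    eta = w.sum()                                             # step 5: normalizer

    # Deviations of the sufficient statistics T(theta) = [theta, theta^2]
    # from their means under the current distribution, [mu, sigma^2 + mu^2].
    d1 = thetas - mu
    d2 = thetas ** 2 - (sigma ** 2 + mu ** 2)

    # Step 4 stand-in: diagonal surrogate for V_k = Vhat_k + gamma * I
    # (eqs. (5)-(6) are not shown in this excerpt).
    v1 = d1.var(axis=0) + gamma
    v2 = d2.var(axis=0) + gamma

    # Step 7: gradient step on the natural parameters omega.
    omega1 = mu / sigma ** 2 + (alpha / eta) * (w @ d1) / v1
    omega2 = -1.0 / (2 * sigma ** 2) + (alpha / eta) * (w @ d2) / v2
    omega2 = np.minimum(omega2, -1e-8)          # keep the precision negative

    sigma_new = np.sqrt(-1.0 / (2 * omega2))    # recover (mu, sigma) from omega
    mu_new = omega1 * sigma_new ** 2
    return mu_new, sigma_new
```

Freezing σ and replacing V_k with the identity leaves only a weighted sum of the perturbations applied to μ, which is the sense in which the ES update is treated as a special case of these methods.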
[Fig. 2 block diagram: Sensing → o(t) → Neural Network f(o(t); θ) → Safety Logic → PID → u(t), x(t)]
Fig. 2: Diagram of the structured policy used by each agent to navigate and execute team behaviors. Nearby ally states
(which are fully observable) and enemy states (which are partially observable), base locations, and the agent’s own state
are fed into a neural network which produces a reference target in relative xyz coordinates. The reference target is fed into
the safety logic block which checks for invalid targets which would cause collisions with neighbors or the ground. It then
updates this reference target to produce a “safe” reference target which is fed to the PID controller. The PID controller
tracks the reference target and provides low-level controls for the agent (thrust, aileron, elevator, rudder).
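To make the data flow in Figure 2 concrete, the sketch below chains the blocks for a single control step; the three callables stand in for the neural network, safety logic, and PID blocks and are placeholders of our own, not interfaces from the released code.

```python
def agent_step(obs, theta, neural_network, safety_logic, pid_controller):
    """One control step of the structured policy in Fig. 2: the network proposes a
    reference target in relative xyz coordinates, the safety block replaces targets
    that would collide with neighbors or the ground, and the PID controller turns
    the safe target into low-level commands (thrust, aileron, elevator, rudder)."""
    ref_target = neural_network(obs, theta)   # [x_ref, y_ref, z_ref]
    safe_target = safety_logic(ref_target)    # override invalid targets
    return pid_controller(safe_target)        # low-level controls u(t)
```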
[...] rates we found little improvement over adapting the variance of the sampling distribution. We hypothesize that a first order method with adaptive learning rates is sufficient for achieving good performance when optimizing neural networks. For other types of policy parameterizations, however, the full second-order treatment of GASS may be more useful. It is also possible to mix and match which parameters require a full variance update and which can be updated with a first-order approximate method. We use the rank transformation function for S(·) and keep N_k constant.

C. Learning Structured Policies for Multi-Agent Problems

Now that we are more confident about the convergence of the ES/GASS method, we show how ES can be used to optimize a complex policy in a large-scale multi-agent environment. We use the SCRIMMAGE multi-agent simulation environment [20] as it allows us to quickly simulate complex multi-agent scenarios in parallel. We populate our simulation with 6DoF fixed-wing aircraft and quadcopters with dynamics models having 10 and 12 states, respectively. These dynamics models allow for full ranges of motion within realistic operating regimes. Stochastic disturbances in the form of wind and control noise are modeled as additive Gaussian noise. Ground and mid-air collisions can occur, which result in the aircraft being destroyed. We also incorporate a weapons module which allows for targeting and firing at an enemy within a fixed cone projecting from the aircraft's nose. The probability of a hit depends on the distance to the target and the total area presented by the target to the attacker. This area is based on the wireframe model of the aircraft and its relative pose. For more details, see our code and the SCRIMMAGE simulator documentation.

We consider the case where each agent uses its own policy to compute its own controls, but where the parameters of the policies are the same for all agents. This allows each agent to control itself in a decentralized manner, while allowing for beneficial group behaviors to emerge. Furthermore, we assume that friendly agents can communicate to share states with each other (see Figure 2). Because we have a large number of agents (up to 50 per team), to keep communication costs lower we only allow agents to share information locally, i.e. agents close to each other have access to each other's states. In our experiments we allow each agent to sense the states of the closest 5 friendly agents, for a total of 5 × 10 = 50 incoming state dimensions.

Additionally, each agent is equipped with sensors to detect enemy agents. Full state observability is not available here; instead, we assume that sensors are capable of sensing an enemy's relative position and velocity. In our experiments we assume that each agent is able to sense the nearest 5 enemies, for a total of 5 × 7 = 35 dimensions of enemy data (7 states = [relative xyz position, distance, and relative xyz velocities]). The sensors also provide information about home and enemy base relative headings and distances (an additional 8 states). With the addition of the agent's own state (9 states), the policy's observation input o(t) has a dimension of 102. These input states are fed into the agent's policy: a neural network f(o(t); θ) with 3 fully connected layers of sizes 200, 200, and 50, which outputs 3 numbers representing a desired relative heading [x_ref, y_ref, z_ref]. Each agent's neural network has more than 70,000 parameters. Each agent uses the same neural network parameters as its teammates, but since each agent encounters a different observation at each timestep, the output of each agent's neural network policy will be unique. It may also be possible to learn unique policies for each agent; we leave this for future work.
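To make the observation layout and network size above concrete, the following sketch assembles the 102-dimensional input (5 allies × 10 states + 5 enemies × 7 states + 8 base-relative states + 9 own states) and a 200-200-50 fully connected policy with a 3-dimensional output. The tanh activations, initialization, and feature ordering are our assumptions rather than details from the released code.

```python
import numpy as np

# Dimensions described in the text (the ordering inside the vector is assumed).
N_ALLIES, ALLY_DIM = 5, 10     # nearest friendly agents, full state each
N_ENEMIES, ENEMY_DIM = 5, 7    # rel. xyz position, distance, rel. xyz velocity
BASE_DIM, OWN_DIM = 8, 9       # base headings/distances, agent's own state
OBS_DIM = N_ALLIES * ALLY_DIM + N_ENEMIES * ENEMY_DIM + BASE_DIM + OWN_DIM  # = 102

def init_policy(rng, sizes=(OBS_DIM, 200, 200, 50, 3)):
    """Fully connected policy f(o(t); theta) with hidden layers 200-200-50 and a
    3-dimensional output [x_ref, y_ref, z_ref]; 71,003 parameters with these sizes,
    consistent with the 'more than 70,000' quoted in the text."""
    theta = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        theta.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))  # weight matrix
        theta.append(np.zeros(n_out))                               # bias vector
    return theta

def policy(obs, theta):
    """Map one agent's observation o(t) to a desired relative heading."""
    h = obs
    for W, b in zip(theta[0::2], theta[1::2]):
        h = h @ W + b
        if W.shape[1] != 3:        # hidden layers only (activation is assumed)
            h = np.tanh(h)
    return h                       # [x_ref, y_ref, z_ref]

rng = np.random.RandomState(0)
theta = init_policy(rng)
assert sum(p.size for p in theta) > 70_000
ref_target = policy(rng.randn(OBS_DIM), theta)
```

Since every teammate shares θ, the search distribution in Algorithm 1 only has to cover these roughly 71k parameters once, regardless of team size.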
With safety being a large concern in UAV flight, we design the policy to take into account safety and control considerations. The relative heading output from the neural network policy is intended to be used by a PID controller.
When the reference target produced by the network would cause a collision with the ground or allies, etc., we override the neural network heading with a "safe" reference target before it is passed to the PID controller (see Figure 2).

III. EXPERIMENTS

We consider two scenarios: a base attack scenario where a team of 50 fixed wing aircraft must attack an enemy base defended by 20 quadcopters, and a team competitive task where two teams concurrently learn to defeat each other. In both tasks we use the following reward function:

    J = 10 × (#kills) + 50 × (#collisions with enemy base)
        − 1e-5 × (distance from enemy base at end of episode)    (7)

The reward function encourages air-to-air combat, as well as suicide attacks against the enemy base (e.g. a swarm of cheap, disposable drones carrying payloads). The last term encourages the aircraft to move towards the enemy during the initial phases of learning.
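As a sanity check, the reward in (7) maps directly to a per-episode score; the field names below are hypothetical, since the episode bookkeeping is handled inside the simulator.

```python
def episode_score(n_kills, n_base_collisions, dist_to_enemy_base):
    """Reward (7): air-to-air kills, suicide attacks on the enemy base, and a
    small shaping term that pulls agents toward the enemy base early in training.
    `dist_to_enemy_base` is measured at the end of the episode."""
    return 10.0 * n_kills + 50.0 * n_base_collisions - 1e-5 * dist_to_enemy_base
```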
(b) Scores from updated parameter policies (testing score)

Fig. 4: Scores per iteration for the base attack task. Top is scores earned by the randomly perturbed policies during the course of training (with parameters θ_k + γε_i), while bottom is scores earned by the newly updated policy parameters θ_k. Scores in the top are in general lower because they result from policies which are parameterized by randomly perturbed values. Red indicates scores earned by the Evolution Strategies algorithm, blue indicates the Cross Entropy Method. Bold line is the median, shaded areas are 25/75 quartile bounds. While improvement in scores plateaus after about 1500 iterations in both the ES and CEM methods, ES outperforms CEM by a healthy margin.

A. Base Attack Task

[...] the full second-order GASS algorithm is unrealistic due to the large number of parameters in the neural network and [...]
Fig. 5: Overhead view of a two team competitive match in progress. The goal of both teams is the same: to defeat all enemy planes and attack the enemy base while suffering minimum losses. Red lines show successful weapons hits.

The second scenario we consider is one where two teams, each equipped with their own unique policies for their agents, learn concurrently to defeat their opponent (Figure 5). At each iteration, N_k = 300 simulations are spawned, each with a different random perturbation, and with each team having a different perturbation. The updates for each policy are calculated based on the scores received from playing against the opponent's perturbed policies. The result is that each team learns to defeat a wide range of opponent behaviors at each iteration. We observed that the behavior of the two teams quickly approached a Nash equilibrium where both sides try to defeat the maximum number of opponent aircraft in order to prevent higher-scoring suicide attacks (see supplementary video). The end result is a stalemate with both teams annihilating each other, ending with tied scores (Figure 6). We hypothesize that more varied behavior could be learned by having each team compete against some past enemy team behaviors or by building a library of policies from which to select, as frequently discussed by the evolutionary computation community [23].
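A minimal sketch of this concurrent training scheme is given below, using an ES-style first-order step on each team's parameters and a centered-rank shaping of the returns; `simulate`, the noise scale, and the step size are illustrative placeholders, not values from the paper.

```python
import numpy as np

def centered_ranks(x):
    """Map raw scores to ranks scaled to [-0.5, 0.5], a common ES shaping choice."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r / (len(x) - 1) - 0.5

def competitive_iteration(theta1, theta2, simulate, sigma=0.02, alpha=0.01, N_k=300):
    """One concurrent update: each of the N_k rollouts pairs a perturbed copy of
    team 1's policy with a perturbed copy of team 2's policy, and each team is
    updated from the scores it earned against the opponent's perturbed policies.
    `simulate(theta_a, theta_b)` is a hypothetical call returning (score_a, score_b)."""
    eps1 = np.random.randn(N_k, theta1.size)
    eps2 = np.random.randn(N_k, theta2.size)
    scores = np.array([simulate(theta1 + sigma * eps1[i], theta2 + sigma * eps2[i])
                       for i in range(N_k)])                 # shape (N_k, 2)
    w1 = centered_ranks(scores[:, 0])                        # shaping of team 1 returns
    w2 = centered_ranks(scores[:, 1])                        # shaping of team 2 returns
    # First-order ES-style step on the policy means (variance update omitted).
    theta1 = theta1 + alpha / (N_k * sigma) * (w1 @ eps1)
    theta2 = theta2 + alpha / (N_k * sigma) * (w2 @ eps2)
    return theta1, theta2
```

Because every rollout pits freshly perturbed policies against each other, the shaped scores for each team already average over a spread of opponent behaviors, which is the robustness effect described above.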
(b) Scores from updated parameter policies (testing score)

Fig. 6: Scores per iteration for the team competition task. Top is scores earned by the randomly perturbed policies during the course of training (with parameters θ_k + γε_i), while bottom is scores earned by the newly updated policy parameters θ_k. Scores in the top are in general lower because they result from policies which are parameterized by randomly perturbed values. Red and blue curves show scores for team 1 and 2 respectively. Bold line is the median, shaded areas are 25/75 quartile bounds. The scores of both teams improve equally quickly and reach a Nash equilibrium where both teams annihilate one another.

IV. CONCLUSION

We have shown that Evolution Strategies are applicable for learning policies with many tens or hundreds of thousands of parameters for a wide range of complex tasks in both the competitive and cooperative multi-agent setting. By showing the connection between ES and more well-understood model-based stochastic search methods, we are able to gain insight into algorithm design. When learning parameters for neural networks, first-order approximations of the covariance updates can be used effectively. Other types of parameters such as PID gains may benefit from the full covariance update, since the behavior of the system may be more sensitive to their perturbations. Future work will include experiments with optimizing these mixed parameterizations, e.g. optimizing both neural network weights and PID gains. Other directions of investigation could include optimizing unique policies for each agent in the team, optimizing heterogeneous teams of agents with different policies, or changing policies for different scenarios/objectives. Yet another direction would be comparing other evolutionary computation strategies for training neural networks, including methods which use a more diverse population [24], or more genetic algorithm-type heuristics [25].