problems rely on hand-designed and hand-tuned algorithms (see [13] for a review). One such example, distributed Model Predictive Control, relies on independent MPC controllers on each agent with some level of coordination between them [14], [15]. These controllers require hand-designing dynamics models, cost functions, feedback gains, etc., and require expert domain knowledge. Additionally, scaling these methods up to more complex problems continues to be an issue. Evolutionary algorithms have also been tried as a solution to multi-agent problems, usually with smaller, simpler environments and policies with low complexity [16], [17]. Recently, a hybrid approach combining MPC and the use of genetic algorithms to evolve the cost function for a hand-tuned MPC controller has been demonstrated for a UAV swarm combat scenario [18].

In this work we demonstrate the effectiveness of our approach on two complex multi-agent UAV swarm combat scenarios: one where a team of fixed wing aircraft must attack a well-defended base, and one where two teams of agents go head to head to defeat each other. Such scenarios have been previously considered in simulated environments with less fidelity and complexity [19], [18]. We leverage the computational efficiency and flexibility of the recently developed SCRIMMAGE multi-agent simulator for our experiments (Figure 1) [20]. We compare the performance of ES against the Cross Entropy Method. We also show, for the competitive scenario, how the policy learns over time to coordinate a strategy in response to an enemy learning to do the same. We make our code freely available for use (https://github.com/ddfan/swarm_evolve).

II. PROBLEM FORMULATION

We can pose our problem as the non-differentiable, non-convex optimization problem

    θ* = arg max_{θ ∈ Θ} J(θ)    (1)

where Θ ⊂ R^n, a nonempty compact set, is the space of solutions, and J(θ) is a non-differentiable, non-convex real-valued objective function J : Θ → R. θ could be any combination of decision variables of our problem, including neural network weights, PID gains, hardware design parameters, etc., which affect the outcome of the returns J. For reinforcement learning problems θ usually represents the parameters of the policy and J is an implicit function of the sequential application of the policy to the environment. We first review how this problem can be solved using Gradient-Based Adaptive Stochastic Search methods and then show how the ES algorithm is a special case of these methods.

A. Gradient-Based Adaptive Stochastic Search

The goal of model-based stochastic search methods is to cast the non-differentiable optimization problem (1) as a differentiable one by specifying a probabilistic model (hence "model-based") from which to sample [12]. Let this model be p(θ|ω) = f(θ; ω), ω ∈ Ω, where ω is a parameter which defines the probability distribution (e.g. for Gaussian distributions, the distribution is fully parameterized by the mean and variance, ω = [μ, σ²]). Then the expectation of J(θ) over the distribution f(θ; ω) will always be less than or equal to the optimal value of J, i.e.

    ∫_Θ J(θ) f(θ; ω) dθ ≤ J(θ*)    (2)

The idea of Gradient-based Adaptive Stochastic Search (GASS) is that one can perform a search in the space of parameters of the distribution Ω rather than Θ, for a distribution which maximizes the expectation in (2):

    ω* = arg max_{ω ∈ Ω} ∫_Θ J(θ) f(θ; ω) dθ    (3)

Maximizing this expectation corresponds to finding a distribution whose mass is concentrated around the optimal θ. However, unlike maximizing (1), this objective function can now be made continuous and differentiable with respect to ω. With some assumptions on the form of the distribution, the gradient with respect to ω can be pushed inside the expectation.

The GASS algorithm presented by [12] is applicable to the exponential family of probability densities:

    f(θ; ω) = exp{ω⊤ T(θ) − φ(ω)}    (4)

where φ(ω) = ln ∫_Θ exp(ω⊤ T(θ)) dθ and T(θ) is the vector of sufficient statistics. Since we are concerned with showing the connection with ES, which uses parameter perturbations sampled with Gaussian noise, we assume that f(θ; ω) is Gaussian. Furthermore, since we are concerned with learning a large number of parameters (i.e. weights in a neural network), we assume an independent Gaussian distribution over each parameter. Then T(θ) = [θ, θ²] ∈ R^{2n} and ω = [μ/σ², −1/(2σ²)] ∈ R^{2n}, where μ and σ are vectors of the mean and standard deviation corresponding to the distribution of each parameter, respectively.

Algorithm 1 Gradient-Based Adaptive Stochastic Search
Require: Learning rate α_k, sample size N_k, initial policy parameters ω_0 = [μ_0, σ_0²], smoothing function S(·), small constant γ > 0.
1: for k = 0, 1, ... do
2:   Sample θ_k^i ∼ f(θ; ω_k) i.i.d., i = 1, 2, ..., N_k.
3:   Compute returns w_k^i = S(J(θ_k^i)) for i = 1, ..., N_k.
4:   Compute variance terms V_k = V̂_k + γI, eqs. (5), (6).
5:   Calculate the normalizer η = Σ_{i=1}^{N_k} w_k^i.
6:   Update ω_{k+1}:
7:     ω_{k+1} ← ω_k + α_k (1/η) V_k^{-1} Σ_{i=1}^{N_k} w_k^i ( [θ_k^i, (θ_k^i)²]⊤ − [μ, σ² + μ²]⊤ )
8: end for

We present the GASS algorithm for this specific set of probability models (Algorithm 1), although the analysis for convergence holds for the more general exponential family of distributions. For each iteration k, the GASS algorithm involves drawing N_k samples of parameters θ_k^i ∼ f(θ; ω_k) i.i.d., i = 1, 2, ..., N_k. These parameters are then used to sample the return function J(θ_k^i). The returns are fed through a shaping function S(·).
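To make the update in Algorithm 1 concrete for the independent-Gaussian case above, the following sketch implements one iteration with a rank-transformation shaping for S(·) (the choice noted later in the text). Since eqs. (5)-(6) defining V̂_k are not reproduced in this excerpt, a diagonal empirical-variance surrogate is used in their place; all names and constants here are our own illustration, not the released implementation.

```python
import numpy as np

def rank_shape(returns):
    """Rank-transformation shaping S(.): replace raw returns by their
    normalized ranks so the update is insensitive to the scale of J."""
    ranks = np.empty(len(returns))
    ranks[np.argsort(returns)] = np.arange(len(returns))
    return ranks / (len(returns) - 1)          # in [0, 1], larger = better return

def gass_iteration(mu, sigma, J, N_k=100, alpha=0.01, gamma=1e-8):
    """One iteration of Algorithm 1 for an independent Gaussian N(mu, sigma^2)
    over each of the n parameters, i.e. omega = [mu/sigma^2, -1/(2 sigma^2)]."""
    thetas = mu + sigma * np.random.randn(N_k, mu.size)       # step 2: sample theta_k^i
    w = rank_shape(np.array([J(t) for t in thetas]))          # step 3: shaped returns
    eta = w.sum()                                             # step 5: normalizer

    # Deviations of the sufficient statistics T(theta) = [theta, theta^2]
    # from their means under the current distribution, [mu, sigma^2 + mu^2].
    d1 = thetas - mu
    d2 = thetas ** 2 - (sigma ** 2 + mu ** 2)

    # Step 4 stand-in: diagonal surrogate for V_k = Vhat_k + gamma * I
    # (eqs. (5)-(6) are not shown in this excerpt).
    v1 = d1.var(axis=0) + gamma
    v2 = d2.var(axis=0) + gamma

    # Step 7: gradient step on the natural parameters omega.
    omega1 = mu / sigma ** 2 + (alpha / eta) * (w @ d1) / v1
    omega2 = -1.0 / (2 * sigma ** 2) + (alpha / eta) * (w @ d2) / v2
    omega2 = np.minimum(omega2, -1e-8)          # keep the precision negative

    sigma_new = np.sqrt(-1.0 / (2 * omega2))    # recover (mu, sigma) from omega
    mu_new = omega1 * sigma_new ** 2
    return mu_new, sigma_new
```

Freezing σ and replacing V_k with the identity leaves only a weighted sum of the perturbations applied to μ, which is the sense in which the ES update is treated as a special case of these methods.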
[Fig. 2 block diagram: Sensing → o(t) → Neural Network f(o(t); θ) → Safety Logic → PID → u(t), x(t)]
Fig. 2: Diagram of the structured policy used by each agent to navigate and execute team behaviors. Nearby ally states
(which are fully observable) and enemy states (which are partially observable), base locations, and the agent’s own state
are fed into a neural network which produces a reference target in relative xyz coordinates. The reference target is fed into
the safety logic block which checks for invalid targets which would cause collisions with neighbors or the ground. It then
updates this reference target to produce a “safe” reference target which is fed to the PID controller. The PID controller
tracks the reference target and provides low-level controls for the agent (thrust, aileron, elevator, rudder).
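To make the data flow in Figure 2 concrete, the sketch below chains the blocks for a single control step; the three callables stand in for the neural network, safety logic, and PID blocks and are placeholders of our own, not interfaces from the released code.

```python
def agent_step(obs, theta, neural_network, safety_logic, pid_controller):
    """One control step of the structured policy in Fig. 2: the network proposes a
    reference target in relative xyz coordinates, the safety block replaces targets
    that would collide with neighbors or the ground, and the PID controller turns
    the safe target into low-level commands (thrust, aileron, elevator, rudder)."""
    ref_target = neural_network(obs, theta)   # [x_ref, y_ref, z_ref]
    safe_target = safety_logic(ref_target)    # override invalid targets
    return pid_controller(safe_target)        # low-level controls u(t)
```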
[...] rates we found little improvement over adapting the variance of the sampling distribution. We hypothesize that a first order method with adaptive learning rates is sufficient for achieving good performance when optimizing neural networks. For other types of policy parameterizations, however, the full second-order treatment of GASS may be more useful. It is also possible to mix and match which parameters require a full variance update and which can be updated with a first-order approximate method. We use the rank transformation function for S(·) and keep N_k constant.

C. Learning Structured Policies for Multi-Agent Problems

Now that we are more confident about the convergence of the ES/GASS method, we show how ES can be used to optimize a complex policy in a large-scale multi-agent environment. We use the SCRIMMAGE multi-agent simulation environment [20] as it allows us to quickly simulate complex multi-agent scenarios in parallel. We populate our simulation with 6DoF fixed-wing aircraft and quadcopters with dynamics models having 10 and 12 states, respectively. These dynamics models allow for full ranges of motion within realistic operating regimes. Stochastic disturbances in the form of wind and control noise are modeled as additive Gaussian noise. Ground and mid-air collisions can occur, which result in the aircraft being destroyed. We also incorporate a weapons module which allows for targeting and firing at an enemy within a fixed cone projecting from the aircraft's nose. The probability of a hit depends on the distance to the target and the total area presented by the target to the attacker. This area is based on the wireframe model of the aircraft and its relative pose. For more details, see our code and the SCRIMMAGE simulator documentation.

We consider the case where each agent uses its own policy to compute its own controls, but where the parameters of the policies are the same for all agents. This allows each agent to control itself in a decentralized manner, while allowing for beneficial group behaviors to emerge. Furthermore, we assume that friendly agents can communicate to share states with each other (see Figure 2). Because we have a large number of agents (up to 50 per team), to keep communication costs lower we only allow agents to share information locally, i.e. agents close to each other have access to each other's states. In our experiments we allow each agent to sense the states of the closest 5 friendly agents, for a total of 5 × 10 = 50 incoming state dimensions.

Additionally, each agent is equipped with sensors to detect enemy agents. Full state observability is not available here; instead, we assume that sensors are capable of sensing an enemy's relative position and velocity. In our experiments we assume that each agent is able to sense the nearest 5 enemies, for a total of 5 × 7 = 35 dimensions of enemy data (7 states = [relative xyz position, distance, and relative xyz velocities]). The sensors also provide information about home and enemy base relative headings and distances (an additional 8 states). With the addition of the agent's own state (9 states), the policy's observation input o(t) has a dimension of 102. These input states are fed into the agent's policy: a neural network f(o(t); θ) with 3 fully connected layers of sizes 200, 200, and 50, which outputs 3 numbers representing a desired relative heading [x_ref, y_ref, z_ref]. Each agent's neural network has more than 70,000 parameters. Each agent uses the same neural network parameters as its teammates, but since each agent encounters a different observation at each timestep, the output of each agent's neural network policy will be unique. It may also be possible to learn unique policies for each agent; we leave this for future work.
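To make the observation layout and network size above concrete, the following sketch assembles the 102-dimensional input (5 allies × 10 states + 5 enemies × 7 states + 8 base-relative states + 9 own states) and a 200-200-50 fully connected policy with a 3-dimensional output. The tanh activations, initialization, and feature ordering are our assumptions rather than details from the released code.

```python
import numpy as np

# Dimensions described in the text (the ordering inside the vector is assumed).
N_ALLIES, ALLY_DIM = 5, 10     # nearest friendly agents, full state each
N_ENEMIES, ENEMY_DIM = 5, 7    # rel. xyz position, distance, rel. xyz velocity
BASE_DIM, OWN_DIM = 8, 9       # base headings/distances, agent's own state
OBS_DIM = N_ALLIES * ALLY_DIM + N_ENEMIES * ENEMY_DIM + BASE_DIM + OWN_DIM  # = 102

def init_policy(rng, sizes=(OBS_DIM, 200, 200, 50, 3)):
    """Fully connected policy f(o(t); theta) with hidden layers 200-200-50 and a
    3-dimensional output [x_ref, y_ref, z_ref]; 71,003 parameters with these sizes,
    consistent with the 'more than 70,000' quoted in the text."""
    theta = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        theta.append(rng.randn(n_in, n_out) * np.sqrt(2.0 / n_in))  # weight matrix
        theta.append(np.zeros(n_out))                               # bias vector
    return theta

def policy(obs, theta):
    """Map one agent's observation o(t) to a desired relative heading."""
    h = obs
    for W, b in zip(theta[0::2], theta[1::2]):
        h = h @ W + b
        if W.shape[1] != 3:        # hidden layers only (activation is assumed)
            h = np.tanh(h)
    return h                       # [x_ref, y_ref, z_ref]

rng = np.random.RandomState(0)
theta = init_policy(rng)
assert sum(p.size for p in theta) > 70_000
ref_target = policy(rng.randn(OBS_DIM), theta)
```

Since every teammate shares θ, the search distribution in Algorithm 1 only has to cover these roughly 71k parameters once, regardless of team size.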
With safety being a large concern in UAV flight, we design the policy to take into account safety and control considerations. The relative heading output from the neural network policy is intended to be used by a PID controller.
When the reference target produced by the network would cause a collision with the ground or allies, etc., we override the neural network heading with a "safe" reference target before it is passed to the PID controller (see Figure 2).

III. EXPERIMENTS

We consider two scenarios: a base attack scenario where a team of 50 fixed wing aircraft must attack an enemy base defended by 20 quadcopters, and a team competitive task where two teams concurrently learn to defeat each other. In both tasks we use the following reward function:

    J = 10 × (#kills) + 50 × (#collisions with enemy base)
        − 1e-5 × (distance from enemy base at end of episode)    (7)

The reward function encourages air-to-air combat, as well as suicide attacks against the enemy base (e.g. a swarm of cheap, disposable drones carrying payloads). The last term encourages the aircraft to move towards the enemy during the initial phases of learning.
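As a sanity check, the reward in (7) maps directly to a per-episode score; the field names below are hypothetical, since the episode bookkeeping is handled inside the simulator.

```python
def episode_score(n_kills, n_base_collisions, dist_to_enemy_base):
    """Reward (7): air-to-air kills, suicide attacks on the enemy base, and a
    small shaping term that pulls agents toward the enemy base early in training.
    `dist_to_enemy_base` is measured at the end of the episode."""
    return 10.0 * n_kills + 50.0 * n_base_collisions - 1e-5 * dist_to_enemy_base
```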
(b) Scores from updated parameter policies (testing score)

Fig. 4: Scores per iteration for the base attack task. Top is scores earned by the randomly perturbed policies during the course of training (with parameters θ_k + γε_i), while bottom is scores earned by the newly updated policy parameters θ_k. Scores in the top are in general lower because they result from policies which are parameterized by randomly perturbed values. Red indicates scores earned by the Evolution Strategies algorithm, blue indicates the Cross Entropy Method. Bold line is the median, shaded areas are 25/75 quartile bounds. While improvement in scores plateaus after about 1500 iterations in both the ES and CEM methods, ES outperforms CEM by a healthy margin.

A. Base Attack Task

[...] the full second-order GASS algorithm is unrealistic due to the large number of parameters in the neural network and [...]
Fig. 5: Overhead view of a two team competitive match in progress. The goal of both teams is the same: to defeat all enemy planes and attack the enemy base while suffering minimum losses. Red lines show successful weapons hits.

The second scenario we consider is one where two teams, each equipped with their own unique policies for their agents, learn concurrently to defeat their opponent (Figure 5). At each iteration, N_k = 300 simulations are spawned, each with a different random perturbation, and with each team having a different perturbation. The updates for each policy are calculated based on the scores received from playing against the opponent's perturbed policies. The result is that each team learns to defeat a wide range of opponent behaviors at each iteration. We observed that the behavior of the two teams quickly approached a Nash equilibrium where both sides try to defeat the maximum number of opponent aircraft in order to prevent higher-scoring suicide attacks (see supplementary video). The end result is a stalemate with both teams annihilating each other, ending with tied scores (Figure 6). We hypothesize that more varied behavior could be learned by having each team compete against some past enemy team behaviors or by building a library of policies from which to select, as frequently discussed by the evolutionary computation community [23].
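A minimal sketch of this concurrent training scheme is given below, using an ES-style first-order step on each team's parameters and a centered-rank shaping of the returns; `simulate`, the noise scale, and the step size are illustrative placeholders, not values from the paper.

```python
import numpy as np

def centered_ranks(x):
    """Map raw scores to ranks scaled to [-0.5, 0.5], a common ES shaping choice."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r / (len(x) - 1) - 0.5

def competitive_iteration(theta1, theta2, simulate, sigma=0.02, alpha=0.01, N_k=300):
    """One concurrent update: each of the N_k rollouts pairs a perturbed copy of
    team 1's policy with a perturbed copy of team 2's policy, and each team is
    updated from the scores it earned against the opponent's perturbed policies.
    `simulate(theta_a, theta_b)` is a hypothetical call returning (score_a, score_b)."""
    eps1 = np.random.randn(N_k, theta1.size)
    eps2 = np.random.randn(N_k, theta2.size)
    scores = np.array([simulate(theta1 + sigma * eps1[i], theta2 + sigma * eps2[i])
                       for i in range(N_k)])                 # shape (N_k, 2)
    w1 = centered_ranks(scores[:, 0])                        # shaping of team 1 returns
    w2 = centered_ranks(scores[:, 1])                        # shaping of team 2 returns
    # First-order ES-style step on the policy means (variance update omitted).
    theta1 = theta1 + alpha / (N_k * sigma) * (w1 @ eps1)
    theta2 = theta2 + alpha / (N_k * sigma) * (w2 @ eps2)
    return theta1, theta2
```

Because every rollout pits freshly perturbed policies against each other, the shaped scores for each team already average over a spread of opponent behaviors, which is the robustness effect described above.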
(b) Scores from updated parameter policies (testing score)

Fig. 6: Scores per iteration for the team competition task. Top is scores earned by the randomly perturbed policies during the course of training (with parameters θ_k + γε_i), while bottom is scores earned by the newly updated policy parameters θ_k. Scores in the top are in general lower because they result from policies which are parameterized by randomly perturbed values. Red and blue curves show scores for team 1 and 2 respectively. Bold line is the median, shaded areas are 25/75 quartile bounds. The scores of both teams improve equally quickly and reach a Nash equilibrium where both teams annihilate one another.

IV. CONCLUSION

We have shown that Evolution Strategies are applicable for learning policies with many tens or hundreds of thousands of parameters for a wide range of complex tasks in both the competitive and cooperative multi-agent setting. By showing the connection between ES and more well-understood model-based stochastic search methods, we are able to gain insight into algorithm design. When learning parameters for neural networks, first-order approximations of the covariance updates can be used effectively. Other types of parameters such as PID gains may benefit from the full covariance update, since the behavior of the system may be more sensitive to their perturbations. Future work will include experiments with optimizing these mixed parameterizations, e.g. optimizing both neural network weights and PID gains. Other directions of investigation could include optimizing unique policies for each agent in the team, optimizing heterogeneous teams of agents with different policies, or changing policies for different scenarios/objectives. Yet another direction would be comparing other evolutionary computation strategies for training neural networks, including methods which use a more diverse population [24], or more genetic algorithm-type heuristics [25].