Evolving Image Classifiers at Scale

This paper discusses the use of evolutionary algorithms to automatically discover neural network architectures for image classification, specifically on the CIFAR-10 and CIFAR-100 datasets. The authors demonstrate that their approach can achieve competitive accuracies of 94.6% and 77.0%, respectively, without human intervention, by employing novel mutation operators and scaling computation to unprecedented levels. The study emphasizes the simplicity and repeatability of their method, highlighting its potential to advance automated model discovery in deep learning.


Large-Scale Evolution of Image Classifiers

Esteban Real¹, Sherry Moore¹, Andrew Selle¹, Saurabh Saxena¹,
Yutaka Leon Suematsu², Jie Tan¹, Quoc V. Le¹, Alexey Kurakin¹

Abstract

Neural networks have proven effective at solving difficult problems but designing their architectures can be challenging, even for image classification problems alone. Our goal is to minimize human participation, so we employ evolutionary algorithms to discover such networks automatically. Despite significant computational requirements, we show that it is now possible to evolve models with accuracies within the range of those published in the last year. Specifically, we employ simple evolutionary techniques at unprecedented scales to discover models for the CIFAR-10 and CIFAR-100 datasets, starting from trivial initial conditions and reaching accuracies of 94.6% (95.6% for ensemble) and 77.0%, respectively. To do this, we use novel and intuitive mutation operators that navigate large search spaces; we stress that no human participation is required once evolution starts and that the output is a fully-trained model. Throughout this work, we place special emphasis on the repeatability of results, the variability in the outcomes and the computational requirements.

¹ Google Brain, Mountain View, California, USA. ² Google Research, Mountain View, California, USA. Correspondence to: Esteban Real <ereal@[Link]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

1. Introduction

Neural networks can successfully perform difficult tasks where large amounts of training data are available (He et al., 2015; Weyand et al., 2016; Silver et al., 2016; Wu et al., 2016). Discovering neural network architectures, however, remains a laborious task. Even within the specific problem of image classification, the state of the art was attained through many years of focused investigation by hundreds of researchers (Krizhevsky et al. (2012); Simonyan & Zisserman (2014); Szegedy et al. (2015); He et al. (2016); Huang et al. (2016a), among many others).

It is therefore not surprising that in recent years, techniques to automatically discover these architectures have been gaining popularity (Bergstra & Bengio, 2012; Snoek et al., 2012; Han et al., 2015; Baker et al., 2016; Zoph & Le, 2016). One of the earliest such "neuro-discovery" methods was neuro-evolution (Miller et al., 1989; Stanley & Miikkulainen, 2002; Stanley, 2007; Bayer et al., 2009; Stanley et al., 2009; Breuel & Shafait, 2010; Pugh & Stanley, 2013; Kim & Rigazio, 2015; Zaremba, 2015; Fernando et al., 2016; Morse & Stanley, 2016). Despite the promising results, the deep learning community generally perceives evolutionary algorithms to be incapable of matching the accuracies of hand-designed models (Verbancsics & Harguess, 2013; Baker et al., 2016; Zoph & Le, 2016). In this paper, we show that it is possible to evolve such competitive models today, given enough computational power.

We used slightly-modified known evolutionary algorithms and scaled up the computation to unprecedented levels, as far as we know. This, together with a set of novel and intuitive mutation operators, allowed us to reach competitive accuracies on the CIFAR-10 dataset. This dataset was chosen because it requires large networks to reach high accuracies, thus presenting a computational challenge. We also took a small first step toward generalization and evolved networks on the CIFAR-100 dataset. In transitioning from CIFAR-10 to CIFAR-100, we did not modify any aspect or parameter of our algorithm. Our typical neuro-evolution outcome on CIFAR-10 had a test accuracy with µ = 94.1%, σ = 0.4% @ 9×10^19 FLOPs, and our top model (by validation accuracy) had a test accuracy of 94.6% @ 4×10^20 FLOPs. Ensembling the validation-top 2 models from each population reaches a test accuracy of 95.6%, at no additional training cost. On CIFAR-100, our single experiment resulted in a test accuracy of 77.0% @ 2×10^20 FLOPs. As far as we know, these are the most accurate results obtained on these datasets by automated discovery methods that start from trivial initial conditions.

Throughout this study, we placed special emphasis on the simplicity of the algorithm. In particular, it is a "one-shot" technique, producing a fully trained neural network requiring no post-processing. It also has few impactful meta-parameters (i.e. parameters not optimized by the algorithm). Starting out with poor-performing models with

Table 1. Comparison with single-model hand-designed architectures. The "C10+" and "C100+" columns indicate the test accuracy on the data-augmented CIFAR-10 and CIFAR-100 datasets, respectively. The "Reachable?" column denotes whether the given hand-designed model lies within our search space. An entry of "–" indicates that no value was reported. The † indicates a result reported by Huang et al. (2016b) instead of the original author. Much of this table was based on that presented in Huang et al. (2016a).

STUDY                                              PARAMS.   C10+    C100+   REACHABLE?
Maxout (Goodfellow et al., 2013)                   –         90.7%   61.4%   No
Network in Network (Lin et al., 2013)              –         91.2%   –       No
All-CNN (Springenberg et al., 2014)                1.3 M     92.8%   66.3%   Yes
Deeply Supervised (Lee et al., 2015)               –         92.0%   65.4%   No
Highway (Srivastava et al., 2015)                  2.3 M     92.3%   67.6%   No
ResNet (He et al., 2016)                           1.7 M     93.4%   72.8%†  Yes
Evolution (ours)                                   5.4 M     94.6%   –       N/A
Evolution (ours)                                   40.4 M    –       77.0%   N/A
Wide ResNet 28-10 (Zagoruyko & Komodakis, 2016)    36.5 M    96.0%   80.0%   Yes
Wide ResNet 40-10+d/o (Zagoruyko & Komodakis, 2016) 50.7 M   96.2%   81.7%   No
DenseNet (Huang et al., 2016a)                     25.6 M    96.7%   82.8%   No

no convolutions, the algorithm must evolve complex convolutional neural networks while navigating a fairly unrestricted search space: no fixed depth, arbitrary skip connections, and numerical parameters that have few restrictions on the values they can take. We also paid close attention to result reporting. Namely, we present the variability in our results in addition to the top value, we account for researcher degrees of freedom (Simmons et al., 2011), we study the dependence on the meta-parameters, and we disclose the amount of computation necessary to reach the main results. We are hopeful that our explicit discussion of computation cost could spark more study of efficient model search and training. Studying model performance normalized by computational investment allows consideration of economic concepts like opportunity cost.

2. Related Work

Neuro-evolution dates back many years (Miller et al., 1989), originally being used only to evolve the weights of a fixed architecture. Stanley & Miikkulainen (2002) showed that it was advantageous to simultaneously evolve the architecture using the NEAT algorithm. NEAT has three kinds of mutations: (i) modify a weight, (ii) add a connection between existing nodes, or (iii) insert a node while splitting an existing connection. It also has a mechanism for recombining two models into one and a strategy to promote diversity known as fitness sharing (Goldberg et al., 1987). Evolutionary algorithms represent the models using an encoding that is convenient for their purpose—analogous to nature's DNA. NEAT uses a direct encoding: every node and every connection is stored in the DNA. The alternative paradigm, indirect encoding, has been the subject of much neuro-evolution research (Gruau, 1993; Stanley et al., 2009; Pugh & Stanley, 2013; Kim & Rigazio, 2015; Fernando et al., 2016). For example, the CPPN (Stanley, 2007; Stanley et al., 2009) allows for the evolution of repeating features at different scales. Also, Kim & Rigazio (2015) use an indirect encoding to improve the convolution filters in an initially highly-optimized fixed architecture.

Research on weight evolution is still ongoing (Morse & Stanley, 2016) but the broader machine learning community defaults to back-propagation for optimizing neural network weights (Rumelhart et al., 1988). Back-propagation and evolution can be combined as in Stanley et al. (2009), where only the structure is evolved. Their algorithm follows an alternation of architectural mutations and weight back-propagation. Similarly, Breuel & Shafait (2010) use this approach for hyper-parameter search. Fernando et al. (2016) also use back-propagation, allowing the trained weights to be inherited through the structural modifications.

The above studies create neural networks that are small in comparison to the typical modern architectures used for image classification (He et al., 2016; Huang et al., 2016a). Their focus is on the encoding or the efficiency of the evolutionary process, but not on the scale. When it comes to images, some neuro-evolution results reach the computational scale required to succeed on the MNIST dataset (LeCun et al., 1998). Yet, modern classifiers are often tested on realistic images, such as those in the CIFAR datasets (Krizhevsky & Hinton, 2009), which are much more challenging. These datasets require large models to achieve high accuracy.

Non-evolutionary neuro-discovery methods have been more successful at tackling realistic image data. Snoek et al. (2012) used Bayesian optimization to tune 9 hyper-parameters for a fixed-depth architecture, reach-

Table 2. Comparison with automatically discovered architectures. The "C10+" and "C100+" columns contain the test accuracy on the data-augmented CIFAR-10 and CIFAR-100 datasets, respectively. An entry of "–" indicates that the information was not reported or is not known to us. For Zoph & Le (2016), we quote the result with the most similar search space to ours, as well as their best result. Please refer to Table 1 for hand-designed results, including the state of the art. "Discrete params." means that the parameters can be picked from a handful of values only (e.g. strides ∈ {1, 2, 4}).

Bayesian (Snoek et al., 2012): starting point 3 layers; constraints: fixed architecture, no skips; post-processing: none; params.: –; C10+: 90.5%; C100+: –.

Q-learning (Baker et al., 2016): starting point –; constraints: discrete params., max. num. layers, no skips; post-processing: tune, retrain; params.: 11.2 M; C10+: 93.1%; C100+: 72.9%.

RL (Zoph & Le, 2016): starting point 20 layers, 50% skips; constraints: discrete params., exactly 20 layers; post-processing: small grid search, retrain; params.: 2.5 M; C10+: 94.0%; C100+: –.

RL (Zoph & Le, 2016): starting point 39 layers, 2 pool layers at 13 and 26, 50% skips; constraints: discrete params., exactly 39 layers, 2 pool layers at 13 and 26; post-processing: add more filters, small grid search, retrain; params.: 37.0 M; C10+: 96.4%; C100+: –.

Evolution (ours): starting point single layer, zero convs.; constraints: power-of-2 strides; post-processing: none; params.: 5.4 M (C10+) and 40.4 M (C100+); C10+: 94.6% (ensemb. 95.6%); C100+: 77.0%.

ing a new state of the art at the time. Zoph & Le (2016) used reinforcement learning on a deeper fixed-length architecture. In their approach, a neural network—the "discoverer"—constructs a convolutional neural network—the "discovered"—one layer at a time. In addition to tuning layer parameters, they add and remove skip connections. This, together with some manual post-processing, gets them very close to the (current) state of the art. (Additionally, they surpassed the state of the art on a sequence-to-sequence problem.) Baker et al. (2016) use Q-learning to also discover a network one layer at a time, but in their approach, the number of layers is decided by the discoverer. This is a desirable feature, as it would allow a system to construct shallow or deep solutions, as may be the requirements of the dataset at hand. Different datasets would not require specially tuning the algorithm. Comparisons among these methods are difficult because they explore very different search spaces and have very different initial conditions (Table 2).

Tangentially, there has also been neuro-evolution work on LSTM structure (Bayer et al., 2009; Zaremba, 2015), but this is beyond the scope of this paper. Also related to this work is that of Saxena & Verbeek (2016), who embed convolutions with different parameters into a species of "super-network" with many parallel paths. Their algorithm then selects and ensembles paths in the super-network. Finally, canonical approaches to hyper-parameter search are grid search (used in Zagoruyko & Komodakis (2016), for example) and random search, the latter being the better of the two (Bergstra & Bengio, 2012).

Our approach builds on previous work, with some important differences. We explore large model-architecture search spaces starting with basic initial conditions to avoid priming the system with information about known good strategies for the specific dataset at hand. Our encoding is different from the neuro-evolution methods mentioned above: we use a simplified graph as our DNA, which is transformed to a full neural network graph for training and evaluation (Section 3). Some of the mutations acting on this DNA are reminiscent of NEAT. However, instead of single nodes, one mutation can insert whole layers—i.e. tens to hundreds of nodes at a time. We also allow for these layers to be removed, so that the evolutionary process can simplify an architecture in addition to complexifying it. Layer parameters are also mutable, but we do not prescribe a small set of possible values to choose from, to allow for a larger search space. We do not use fitness sharing. We report additional results using recombination, but for the most part, we used mutation only. On the other hand, we do use back-propagation to optimize the weights, which can be inherited across mutations. Together with a learning rate mutation, this allows the exploration of the space of learning rate schedules, yielding fully trained models at the end of the evolutionary process (Section 3). Tables 1 and 2 compare our approach with hand-designed architectures and with other neuro-discovery techniques, respectively.

3. Methods

3.1. Evolutionary Algorithm

To automatically search for high-performing neural network architectures, we evolve a population of models. Each model—or individual—is a trained architecture. The model's accuracy on a separate validation dataset is a measure of the individual's quality or fitness. During each evolutionary step, a computer—a worker—chooses two individuals at random from this population and compares their fitnesses. The worst of the pair is immediately removed from the population—it is killed. The best of the pair is selected to be a parent, that is, to undergo reproduction. By this we mean that the worker creates a copy of the parent and modifies this copy by applying a mutation, as described below. We will refer to this modified copy as the child. After the worker creates the child, it trains this child, evaluates it on the validation set, and puts it back into the population. The child then becomes alive—i.e. free to act as a parent. Our scheme, therefore, uses repeated pairwise competitions of random individuals, which makes it an example of tournament selection (Goldberg & Deb, 1991). Using pairwise comparisons instead of whole-population operations prevents workers from idling when they finish early. Code and more detail about the methods described below can be found in Supplementary Section S1.

Using this strategy to search large spaces of complex image models requires considerable computation. To achieve scale, we developed a massively-parallel, lock-free infrastructure. Many workers operate asynchronously on different computers. They do not communicate directly with each other. Instead, they use a shared file-system, where the population is stored. The file-system contains directories that represent the individuals. Operations on these individuals, such as the killing of one, are represented as atomic renames on the directory². Occasionally, a worker may concurrently modify the individual another worker is operating on. In this case, the affected worker simply gives up and tries again. The population size is 1000 individuals, unless otherwise stated. The number of workers is always 1/4 of the population size. To allow for long run-times with a limited amount of space, dead individuals' directories are frequently garbage-collected.

3.2. Encoding and Mutations

Individual architectures are encoded as a graph that we refer to as the DNA. In this graph, the vertices represent rank-3 tensors or activations. As is standard for a convolutional network, two of the dimensions of the tensor represent the spatial coordinates of the image and the third is a number of channels. Activation functions are applied at the vertices and can be either (i) batch-normalization (Ioffe & Szegedy, 2015) with rectified linear units (ReLUs) or (ii) plain linear units. The graph's edges represent identity connections or convolutions and contain the mutable numerical parameters defining the convolution's properties. When multiple edges are incident on a vertex, their spatial scales or numbers of channels may not coincide. However, the vertex must have a single size and number of channels for its activations. The inconsistent inputs must be resolved. Resolution is done by choosing one of the incoming edges as the primary one. We pick this primary edge to be the one that is not a skip connection. The activations coming from the non-primary edges are reshaped through zeroth-order interpolation in the case of the size and through truncation/padding in the case of the number of channels, as in He et al. (2016). In addition to the graph, the learning-rate value is also stored in the DNA.

A child is similar but not identical to the parent because of the action of a mutation. In each reproduction event, the worker picks a mutation at random from a predetermined set. The set contains the following mutations:

• ALTER-LEARNING-RATE (sampling details below).
• IDENTITY (effectively means "keep training").
• RESET-WEIGHTS (sampled as in He et al. (2015), for example).
• INSERT-CONVOLUTION (inserts a convolution at a random location in the "convolutional backbone", as in Figure 1. The inserted convolution has 3×3 filters, strides of 1 or 2 at random, number of channels same as input. May apply batch-normalization and ReLU activation or none at random).
• REMOVE-CONVOLUTION.
• ALTER-STRIDE (only powers of 2 are allowed).
• ALTER-NUMBER-OF-CHANNELS (of random conv.).
• FILTER-SIZE (horizontal or vertical at random, on random convolution, odd values only).
• INSERT-ONE-TO-ONE (inserts a one-to-one/identity connection, analogous to insert-convolution mutation).
• ADD-SKIP (identity between random layers).
• REMOVE-SKIP (removes random skip).

These specific mutations were chosen for their similarity to the actions that a human designer may take when improving an architecture. This may clear the way for hybrid evolutionary–hand-design methods in the future. The probabilities for the mutations were not tuned in any way.

A mutation that acts on a numerical parameter chooses the new value at random around the existing value. All sampling is from uniform distributions. For example, a mutation acting on a convolution with 10 output channels will

² The use of the file-name string to contain key information about the individual was inspired by Breuel & Shafait (2010), and it speeds up disk access enormously. In our case, the file name contains the state of the individual (alive, dead, training, etc.).

result in a convolution having between 5 and 20 output channels (that is, half to twice the original value). All values within the range are possible. As a result, the models are not constrained to a number of filters that is known to work well. The same is true for all other parameters, yielding a "dense" search space. In the case of the strides, this applies to the log-base-2 of the value, to allow for activation shapes to match more easily³. In principle, there is also no upper limit to any of the parameters. All model depths are attainable, for example. Up to hardware constraints, the search space is unbounded. The dense and unbounded nature of the parameters results in the exploration of a truly large set of possible architectures.

3.3. Initial Conditions

Every evolution experiment begins with a population of simple individuals, all with a learning rate of 0.1. They are all very bad performers. Each initial individual constitutes just a single-layer model with no convolutions. This conscious choice of poor initial conditions forces evolution to make the discoveries by itself. The experimenter contributes mostly through the choice of mutations that demarcate a search space. Altogether, the use of poor initial conditions and a large search space limits the experimenter's impact. In other words, it prevents the experimenter from "rigging" the experiment to succeed.

3.4. Training and Validation

Training and validation is done on the CIFAR-10 dataset. This dataset consists of 50,000 training examples and 10,000 test examples, all of which are 32 x 32 color images labeled with 1 of 10 common object classes (Krizhevsky & Hinton, 2009). 5,000 of the training examples are held out in a validation set. The remaining 45,000 examples constitute our actual training set. The training set is augmented as in He et al. (2016). The CIFAR-100 dataset has the same number of dimensions, colors and examples as CIFAR-10, but uses 100 classes, making it much more challenging.

Training is done with TensorFlow (Abadi et al., 2016), using SGD with a momentum of 0.9 (Sutskever et al., 2013), a batch size of 50, and a weight decay of 0.0001. Each training runs for 25,600 steps, a value chosen to be brief enough so that each individual could be trained in a few seconds to a few hours, depending on model size. The loss function is the cross-entropy. Once training is complete, a single evaluation on the validation set provides the accuracy to use as the individual's fitness. Ensembling was done by majority voting during the testing evaluation. The models used in the ensemble were selected by validation accuracy.

3.5. Computation Cost

To estimate computation costs, we identified the basic TensorFlow (TF) operations used by our model training and validation, like convolutions, generic matrix multiplications, etc. For each of these TF operations, we estimated the theoretical number of floating-point operations (FLOPs) required. This resulted in a map from TF operation to FLOPs, which is valid for all our experiments.

For each individual within an evolution experiment, we compute the total FLOPs incurred by the TF operations in its architecture over one batch of examples, both during its training (Ft FLOPs) and during its validation (Fv FLOPs). Then we assign to the individual the cost Ft·Nt + Fv·Nv, where Nt and Nv are the number of training and validation batches, respectively. The cost of the experiment is then the sum of the costs of all its individuals.

We intend our FLOPs measurement as a coarse estimate only. We do not take into account input/output, data preprocessing, TF graph building or memory-copying operations. Some of these unaccounted operations take place once per training run or once per step and some have a component that is constant in the model size (such as disk-access latency or input data cropping). We therefore expect the estimate to be more useful for large architectures (for example, those with many convolutions).

3.6. Weight Inheritance

We need architectures that are trained to completion within an evolution experiment. If this does not happen, we are forced to retrain the best model at the end, possibly having to explore its hyper-parameters. Such extra exploration tends to depend on the details of the model being retrained. On the other hand, 25,600 steps are not enough to fully train each individual. Training a large model to completion is prohibitively slow for evolution. To resolve this dilemma, we allow the children to inherit the parents' weights whenever possible. Namely, if a layer has matching shapes, the weights are preserved. Consequently, some mutations preserve all the weights (like the identity or learning-rate mutations), some preserve none (the weight-resetting mutation), and most preserve some but not all. An example of the latter is the filter-size mutation: only the filters of the convolution being mutated will be discarded.

3.7. Reporting Methodology

To avoid over-fitting, neither the evolutionary algorithm nor the neural network training ever see the testing set. Each time we refer to "the best model", we mean the model with the highest validation accuracy. However, we always report the test accuracy. This applies not only to the choice of the best individual within an experiment, but also to the choice

³ For integer DNA parameters, we actually store and mutate a floating-point value. This allows multiple small mutations to have a cumulative effect in spite of integer round-off.
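The numerical-mutation scheme above (a new value drawn uniformly between half and twice the current one, strides handled through their log-base-2, and integer parameters backed by floats as in the footnote) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; all names here are our own.

```python
import random

def mutate_numeric(value: float) -> float:
    # New value drawn uniformly between half and twice the current one;
    # every value in that range is possible ("dense" search space).
    return random.uniform(0.5 * value, 2.0 * value)

# The DNA stores floats even for integer parameters, so several small
# mutations can accumulate despite integer round-off (footnote 3).
channels = 10.0
channels = mutate_numeric(channels)          # anywhere in [5.0, 20.0]
realized_channels = max(1, round(channels))  # value used to build the net

# Strides mutate through their log-base-2, so realized strides stay
# powers of 2 and activation shapes match more easily.
log2_stride = 1.0                            # stride of 2
log2_stride = mutate_numeric(log2_stride)    # anywhere in [0.5, 2.0]
realized_stride = 2 ** max(0, round(log2_stride))
```

Under this scheme, channel counts and strides drift multiplicatively, which is what lets evolution reach both very narrow and very wide layers without a predefined grid of allowed values.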

of the best experiment. Moreover, we only include ex- We also ran a partial control where the weight-inheritance
periments that we managed to reproduce, unless explicitly mechanism is disabled. This run also results in a lower
noted. Any statistical analysis was fully decided upon be- accuracy (92.2 %) in the same amount of time (Figure 2),
fore seeing the results of the experiment reported, to avoid using 9×1019 FLOPs. This shows that weight inheritance
tailoring our analysis to our experimental data (Simmons is important in the process.
et al., 2011).
Finally, we applied our neuro-evolution algorithm, with-
out any changes and with the same meta-parameters, to
4. Experiments and Results CIFAR-100. Our only experiment reached an accuracy
of 77.0 %, using 2 × 1020 FLOPs. We did not attempt
We want to answer the following questions:
other datasets. Table 1 shows that both the CIFAR-10
• Can a simple one-shot evolutionary process start from and CIFAR-100 results are competitive with modern hand-
trivial initial conditions and yield fully trained models designed networks.
that rival hand-designed architectures?
• What are the variability in outcomes, the parallelizabil- 5. Analysis
ity, and the computation cost of the method?
• Can an algorithm designed iterating on CIFAR-10 be ap- Meta-parameters. We observe that populations evolve
plied, without any changes at all, to CIFAR-100 and still until they plateau at some local optimum (Figure 2). The
produce competitive models? fitness (i.e. validation accuracy) value at this optimum
varies between experiments (Figure 2, inset). Since not all
We used the algorithm in Section 3 to perform several ex-
experiments reach the highest possible value, some popu-
periments. Each experiment evolves a population in a few
lations are getting “trapped” at inferior local optima. This
days, typified by the example in Figure 1. The figure also
entrapment is affected by two important meta-parameters
contains examples of the architectures discovered, which
(i.e. parameters that are not optimized by the algorithm).
turn out to be surprisingly simple. Evolution attempts skip
These are the population size and the number of training
connections but frequently rejects them.
steps per individual. Below we discuss them and consider
To get a sense of the variability in outcomes, we repeated their relationship to local optima.
the experiment 5 times. Across all 5 experiment runs, the
Effect of population size. Larger populations explore the
best model by validation accuracy has a testing accuracy of
space of models more thoroughly, and this helps reach bet-
94.6 %. Not all experiments reach the same accuracy, but
ter optima (Figure 3, left). Note, in particular, that a pop-
they get close (µ = 94.1%, σ = 0.4). Fine differences in the
ulation of size 2 can get trapped at very low fitness values.
experiment outcome may be somewhat distinguishable by
Some intuition about this can be gained by considering the
validation accuracy (correlation coefficient = 0.894). The
fate of a super-fit individual, i.e. an individual such that any
total amount of computation across all 5 experiments was 4×10²⁰ FLOPs (or 9×10¹⁹ FLOPs on average per experiment). Each experiment was distributed over 250 parallel workers (Section 3.1). Figure 2 shows the progress of the experiments in detail.

As a control, we disabled the selection mechanism, thereby reproducing and killing random individuals. This is the form of random search that is most compatible with our infrastructure. The probability distributions for the parameters are implicitly determined by the mutations. This control only achieves an accuracy of 87.3% in the same amount of run time on the same hardware (Figure 2). The total amount of computation was 2×10¹⁷ FLOPs. The low FLOP count is a consequence of random search generating many small, inadequate models that train quickly but consume roughly constant amounts of setup time (not included in the FLOP count). We attempted to minimize this overhead by avoiding unnecessary disk access operations, to no avail: too much overhead remains spent on a combination of neural network setup, data augmentation, and training step initialization.

one architectural mutation reduces its fitness (even though a sequence of many mutations may improve it). In the case of a population of size 2, if the super-fit individual wins once, it will win every time. After the first win, it will produce a child that is one mutation away. By definition of super-fit, therefore, this child is inferior⁴. Consequently, in the next round of tournament selection, the super-fit individual competes against its child and wins again. This cycle repeats forever and the population is trapped. Even if a sequence of two mutations would allow for an "escape" from the local optimum, such a sequence can never take place. This is only a rough argument to heuristically suggest why a population of size 2 is easily trapped. More generally, Figure 3 (left) empirically demonstrates a benefit from an increase in population size. Theoretical analyses of this dependence are quite complex and assume very specific models of population dynamics; often larger populations are better at handling local optima, at least beyond a size threshold (Weinreich & Chao (2005) and references therein).

⁴ Except after identity or learning rate mutations, but these produce a child with the same architecture as the parent.
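The pairwise tournament dynamics discussed here, and the selection-disabled random-search control, can be sketched in a few lines. This is a toy illustration on a scalar fitness landscape, not the paper's distributed implementation; the function names and the example fitness function are hypothetical.

```python
import random

def evolve(fitness, mutate, init, steps, pop_size, use_selection=True, seed=0):
    """Toy evolutionary loop with pairwise tournaments: at each step, pick two
    individuals at random; the fitter one reproduces, and a mutated copy of it
    replaces the less fit one. With use_selection=False, winner and loser are
    chosen at random, reducing the loop to the random-search control."""
    rng = random.Random(seed)
    pop = [init(rng) for _ in range(pop_size)]
    for _ in range(steps):
        a, b = rng.sample(range(pop_size), 2)
        if use_selection and fitness(pop[b]) > fitness(pop[a]):
            a, b = b, a                    # a is now the tournament winner
        pop[b] = mutate(pop[a], rng)       # loser replaced by winner's child
    return max(pop, key=fitness)

# Hypothetical 1-D fitness landscape: maximize -x^2 (optimum at x = 0).
fitness = lambda x: -x * x
mutate = lambda x, rng: x + rng.gauss(0.0, 0.1)
init = lambda rng: rng.uniform(-10.0, 10.0)

best = evolve(fitness, mutate, init, steps=2000, pop_size=50)
```

Setting `pop_size=2` makes the trapping argument concrete: once one individual dominates, every tournament pairs it against its own latest child, so a multi-mutation escape sequence can never accumulate.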
Large-Scale Evolution

[Figure 1 graphic: scatter of test accuracy (%) vs. wall time (0.9 to 256.2 hours), with four discovered-architecture diagrams (chains of Input, convolution "C", batch-normalization + ReLU "BN+R", Global Pool, Output) reaching 22.6%, 85.3%, 91.8%, and 94.6% test accuracy.]

Figure 1. Progress of an evolution experiment. Each dot represents an individual in the population. Blue dots (darker, top-right) are alive. The rest have been killed. The four diagrams show examples of discovered architectures. These correspond to the best individual (rightmost) and three of its ancestors. The best individual was selected by its validation accuracy. Evolution sometimes stacks convolutions without any nonlinearity in between ("C", white background), which are mathematically equivalent to a single linear operation. Unlike typical hand-designed architectures, some convolutions are followed by more than one nonlinear function ("C+BN+R+BN+R+...", orange background).
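The caption's remark that convolutions stacked without an intervening nonlinearity collapse to a single linear operation is easy to check numerically. The sketch below stands in for 1×1 convolutions with plain matrix products on flattened features; the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))    # batch of 8 input feature vectors
W1 = rng.standard_normal((16, 32))  # first linear layer (a 1x1 convolution)
W2 = rng.standard_normal((32, 4))   # second layer, no nonlinearity in between

stacked = (x @ W1) @ W2             # two "C" layers applied in sequence
collapsed = x @ (W1 @ W2)           # the single equivalent linear operation
assert np.allclose(stacked, collapsed)
```

Inserting a ReLU between the two products breaks this equivalence, which is why the "C+BN+R" chains in Figure 1 are not redundant in the way the bare "C" stacks are.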

Effect of number of training steps. The other meta-parameter is the number T of training steps for each individual. Accuracy increases with T (Figure 3, right). Larger T means an individual needs to undergo fewer identity mutations to reach a given level of training.

Escaping local optima. While we might increase the population size or the number of steps to prevent a trapped population from forming, we can also free an already trapped population. For example, increasing the mutation rate or resetting all the weights of a population (Figure 4) work well but are quite costly (more details in Supplementary Section S3).

Recombination. None of the results presented so far used recombination. However, we explored three forms of recombination in additional experiments. Following Tuson & Ross (1998), we attempted to evolve the mutation probability distribution too. On top of this, we employed a recombination strategy by which a child could inherit structure from one parent and mutation probabilities from another. The goal was to allow individuals that progressed well due to good mutation choices to quickly propagate such choices to others. In a separate experiment, we attempted recombining the trained weights from two parents in the hope that each parent may have learned different concepts from the training data. In a third experiment, we recombined structures so that the child fused the architectures of both parents side-by-side, generating wide models fast. While none of these approaches improved our recombination-free results, further study seems warranted.

[Figure 2 graphic: test accuracy (%) vs. wall-clock time (0 to 250 hours) for three conditions (evolution, evolution without weight inheritance, random search), with an inset magnifying the 92–94.6% region.]

Figure 2. Repeatability of results and controls. In this plot, the vertical axis at wall-time t is defined as the test accuracy of the individual with the highest validation accuracy that became alive at or before t. The inset magnifies a portion of the main graph. The curves show the progress of various experiments, as follows. The top line (solid, blue) shows the mean test accuracy across 5 large-scale evolution experiments. The shaded area around this top line has a width of ±2σ (clearer in inset). The next line down (dashed, orange, main graph and inset) represents a single experiment in which weight inheritance was disabled, so every individual has to train from random weights. The lowest curve (dotted-dashed) is a random-search control. All experiments occupied the same amount and type of hardware. A small amount of noise in the generalization from the validation to the test set explains why the lines are not monotonically increasing. Note the narrow width of the ±2σ area (main graph and inset), which shows that the high accuracies obtained in evolution experiments are repeatable.

[Figure 3 graphic: two panels of test accuracy (%), left vs. population size (2 to 1000), right vs. training steps (256 to 25600).]

Figure 3. Dependence on meta-parameters. In both graphs, each circle represents the result of a full evolution experiment. Both vertical axes show the test accuracy for the individual with the highest validation accuracy at the end of the experiment. All populations evolved for the same total wall-clock time. There are 5 data points at each horizontal axis value. LEFT: effect of population size. To economize resources, in these experiments the number of individual training steps is only 2560. Note how the accuracy increases with population size. RIGHT: effect of number of training steps per individual. Note how the accuracy increases with more steps.

[Figure 4 graphic: two panels showing accuracy over time for initially trapped populations.]

Figure 4. Escaping local optima in two experiments. We used smaller populations and fewer training steps per individual (2560) to make it more likely for a population to get trapped and to reduce resource usage. Each dot represents an individual. The vertical axis is the accuracy. TOP: example of a population of size 100 escaping a local optimum by using a period of increased mutation rate in the middle (Section 5). BOTTOM: example of a population of size 50 escaping a local optimum by means of three consecutive weight-resetting events (Section 5). Details in Supplementary Section S3.

6. Conclusion

In this paper we have shown that (i) neuro-evolution is capable of constructing large, accurate networks for two challenging and popular image classification benchmarks; (ii) neuro-evolution can do this starting from trivial initial conditions while searching a very large space; (iii) the process, once started, needs no experimenter participation; and (iv) the process yields fully trained models. Completely training models required weight inheritance (Section 3.6). In contrast to reinforcement learning, evolution provides a natural framework for weight inheritance: mutations can be constructed to guarantee a large degree of similarity between the original and mutated models, as we did. Evolution also has fewer tunable meta-parameters with a fairly predictable effect on the variance of the results, which can be made small.

While we did not focus on reducing computation costs, we hope that future algorithmic and hardware improvements will allow more economical implementation. In that case, evolution would become an appealing approach to neuro-discovery for reasons beyond the scope of this paper. For example, it "hits the ground running", improving on arbitrary initial models as soon as the experiment begins. The mutations used can implement recent advances in the field and can be introduced without having to restart an experiment. Furthermore, recombination can merge improvements developed by different individuals, even if they come from other populations. Moreover, it may be possible to combine neuro-evolution with other automatic architecture-discovery methods.
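The weight-inheritance idea can be sketched with two toy mutation operators over a dictionary-encoded network. The encoding and field names below are illustrative assumptions, not the paper's actual DNA representation; the point is only that a mutated child shares almost all of its parent's trained state.

```python
import copy
import random

def mutate_learning_rate(parent, rng):
    """Hyperparameter mutation: the child inherits every trained weight
    unchanged, so it is maximally similar to its parent."""
    child = copy.deepcopy(parent)
    child["learning_rate"] *= rng.choice([0.5, 2.0])
    return child

def mutate_insert_layer(parent, rng):
    """Architectural mutation: the child keeps all of the parent's trained
    layers and adds one new, untrained layer at a random position."""
    child = copy.deepcopy(parent)
    pos = rng.randrange(len(child["layers"]) + 1)
    child["layers"].insert(pos, {"filters": rng.choice([32, 64, 128]),
                                 "weights": None})  # None = train from scratch
    return child

rng = random.Random(0)
parent = {"learning_rate": 0.1,
          "layers": [{"filters": 32, "weights": [0.5, -0.2]}]}
child = mutate_insert_layer(parent, rng)
```

Because most weights carry over, each child resumes training from near its parent's performance rather than from scratch, which is what lets fully trained models emerge from many short per-individual training runs.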

Acknowledgements

We wish to thank Vincent Vanhoucke, Megan Kacholia, Rajat Monga, and especially Jeff Dean for their support and valuable input; Geoffrey Hinton, Samy Bengio, Thomas Breuel, Mark DePristo, Vishy Tirumalashetty, Martin Abadi, Noam Shazeer, Yoram Singer, Dumitru Erhan, Pierre Sermanet, Xiaoqiang Zheng, Shan Carter and Vijay Vasudevan for helpful discussions; Thomas Breuel, Xin Pan and Andy Davis for coding contributions; and the larger Google Brain team for help with TensorFlow and training vision models.

References

Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Baker, Bowen, Gupta, Otkrist, Naik, Nikhil, and Raskar, Ramesh. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.

Bayer, Justin, Wierstra, Daan, Togelius, Julian, and Schmidhuber, Jürgen. Evolving memory cell structures for sequence learning. In International Conference on Artificial Neural Networks, pp. 755–764. Springer, 2009.

Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

Breuel, Thomas and Shafait, Faisal. Automlp: Simple, effective, fully automated learning rate and size adjustment. In The Learning Workshop. Utah, 2010.

Fernando, Chrisantha, Banarse, Dylan, Reynolds, Malcolm, Besse, Frederic, Pfau, David, Jaderberg, Max, Lanctot, Marc, and Wierstra, Daan. Convolution by evolution: Differentiable pattern producing networks. In Proceedings of the 2016 Genetic and Evolutionary Computation Conference, pp. 109–116. ACM, 2016.

Goldberg, David E and Deb, Kalyanmoy. A comparative analysis of selection schemes used in genetic algorithms. Foundations of Genetic Algorithms, 1:69–93, 1991.

Goldberg, David E, Richardson, Jon, et al. Genetic algorithms with sharing for multimodal function optimization. In Genetic Algorithms and Their Applications: Proceedings of the Second International Conference on Genetic Algorithms, pp. 41–49. Hillsdale, NJ: Lawrence Erlbaum, 1987.

Goodfellow, Ian J, Warde-Farley, David, Mirza, Mehdi, Courville, Aaron C, and Bengio, Yoshua. Maxout networks. International Conference on Machine Learning, 28:1319–1327, 2013.

Gruau, Frederic. Genetic synthesis of modular neural networks. In Proceedings of the 5th International Conference on Genetic Algorithms, pp. 318–325. Morgan Kaufmann Publishers Inc., 1993.

Han, Song, Pool, Jeff, Tran, John, and Dally, William. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Huang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016a.

Huang, Gao, Sun, Yu, Liu, Zhuang, Sedra, Daniel, and Weinberger, Kilian Q. Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646–661. Springer, 2016b.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Kim, Minyoung and Rigazio, Luca. Deep clustered convolutional kernels. arXiv preprint arXiv:1503.01824, 2015.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Yann, Cortes, Corinna, and Burges, Christopher JC. The mnist database of handwritten digits, 1998.

Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick W, Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. In AISTATS, volume 2, pp. 5, 2015.
Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Miller, Geoffrey F, Todd, Peter M, and Hegde, Shailesh U. Designing neural networks using genetic algorithms. In Proceedings of the Third International Conference on Genetic Algorithms, pp. 379–384. Morgan Kaufmann Publishers Inc., 1989.

Morse, Gregory and Stanley, Kenneth O. Simple evolutionary optimization can rival stochastic gradient descent in neural networks. In Proceedings of the 2016 Genetic and Evolutionary Computation Conference, pp. 477–484. ACM, 2016.

Pugh, Justin K and Stanley, Kenneth O. Evolving multimodal controllers with hyperneat. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pp. 735–742. ACM, 2013.

Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

Saxena, Shreyas and Verbeek, Jakob. Convolutional neural fabrics. In Advances in Neural Information Processing Systems, pp. 4053–4061, 2016.

Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Simmons, Joseph P, Nelson, Leif D, and Simonsohn, Uri. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366, 2011.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.

Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

Stanley, Kenneth O. Compositional pattern producing networks: A novel abstraction of development. Genetic Programming and Evolvable Machines, 8(2):131–162, 2007.

Stanley, Kenneth O and Miikkulainen, Risto. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.

Stanley, Kenneth O, D'Ambrosio, David B, and Gauci, Jason. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, 2009.

Sutskever, Ilya, Martens, James, Dahl, George E, and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. ICML (3), 28:1139–1147, 2013.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Tuson, Andrew and Ross, Peter. Adapting operator settings in genetic algorithms. Evolutionary Computation, 6(2):161–184, 1998.

Verbancsics, Phillip and Harguess, Josh. Generative neuroevolution for deep learning. arXiv preprint arXiv:1312.5355, 2013.

Weinreich, Daniel M and Chao, Lin. Rapid evolutionary escape by large populations from local fitness peaks is likely in nature. Evolution, 59(6):1175–1182, 2005.

Weyand, Tobias, Kostrikov, Ilya, and Philbin, James. Planet-photo geolocation with convolutional neural networks. In European Conference on Computer Vision, pp. 37–55. Springer, 2016.

Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V., Norouzi, Mohammad, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Zagoruyko, Sergey and Komodakis, Nikos. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zaremba, Wojciech. An empirical exploration of recurrent network architectures. 2015.

Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
