Evolving Image Classifiers at Scale
Table 1. Comparison with single-model hand-designed architectures. The “C10+” and “C100+” columns indicate the test accuracy on
the data-augmented CIFAR-10 and CIFAR-100 datasets, respectively. The “Reachable?” column denotes whether the given hand-
designed model lies within our search space. An entry of “–” indicates that no value was reported. The † indicates a result reported by
Huang et al. (2016b) instead of the original author. Much of this table was based on that presented in Huang et al. (2016a).
no convolutions, the algorithm must evolve complex convolutional neural networks while navigating a fairly unrestricted search space: no fixed depth, arbitrary skip connections, and numerical parameters that have few restrictions on the values they can take. We also paid close attention to result reporting. Namely, we present the variability in our results in addition to the top value, we account for researcher degrees of freedom (Simmons et al., 2011), we study the dependence on the meta-parameters, and we disclose the amount of computation necessary to reach the main results. We are hopeful that our explicit discussion of computation cost could spark more study of efficient model search and training. Studying model performance normalized by computational investment allows consideration of economic concepts like opportunity cost.

2. Related Work

Neuro-evolution dates back many years (Miller et al., 1989), originally being used only to evolve the weights of a fixed architecture. Stanley & Miikkulainen (2002) showed that it was advantageous to simultaneously evolve the architecture using the NEAT algorithm. NEAT has three kinds of mutations: (i) modify a weight, (ii) add a connection between existing nodes, or (iii) insert a node while splitting an existing connection. It also has a mechanism for recombining two models into one and a strategy to promote diversity known as fitness sharing (Goldberg et al., 1987). Evolutionary algorithms represent the models using an encoding that is convenient for their purpose—analogous to nature’s DNA. NEAT uses a direct encoding: every node and every connection is stored in the DNA. The alternative paradigm, indirect encoding, has been the subject of much neuro-evolution research (Gruau, 1993; Stanley et al., 2009; Pugh & Stanley, 2013; Kim & Rigazio, 2015; Fernando et al., 2016). For example, the CPPN (Stanley, 2007; Stanley et al., 2009) allows for the evolution of repeating features at different scales. Also, Kim & Rigazio (2015) use an indirect encoding to improve the convolution filters in an initially highly-optimized fixed architecture.

Research on weight evolution is still ongoing (Morse & Stanley, 2016), but the broader machine learning community defaults to back-propagation for optimizing neural network weights (Rumelhart et al., 1988). Back-propagation and evolution can be combined as in Stanley et al. (2009), where only the structure is evolved. Their algorithm follows an alternation of architectural mutations and weight back-propagation. Similarly, Breuel & Shafait (2010) use this approach for hyper-parameter search. Fernando et al. (2016) also use back-propagation, allowing the trained weights to be inherited through the structural modifications.

The above studies create neural networks that are small in comparison to the typical modern architectures used for image classification (He et al., 2016; Huang et al., 2016a). Their focus is on the encoding or the efficiency of the evolutionary process, but not on the scale. When it comes to images, some neuro-evolution results reach the computational scale required to succeed on the MNIST dataset (LeCun et al., 1998). Yet, modern classifiers are often tested on realistic images, such as those in the CIFAR datasets (Krizhevsky & Hinton, 2009), which are much more challenging. These datasets require large models to achieve high accuracy.

Non-evolutionary neuro-discovery methods have been more successful at tackling realistic image data. Snoek et al. (2012) used Bayesian optimization to tune 9 hyper-parameters for a fixed-depth architecture, reaching a new state of the art at the time. Zoph & Le (2016) used reinforcement learning on a deeper fixed-length architecture. In their approach, a neural network—the “discoverer”—constructs a convolutional neural network—the “discovered”—one layer at a time. In addition to tuning layer parameters, they add and remove skip connections. This, together with some manual post-processing, gets them very close to the (current) state of the art. (Additionally, they surpassed the state of the art on a sequence-to-sequence problem.) Baker et al. (2016) use Q-learning to also discover a network one layer at a time, but in their approach, the number of layers is decided by the discoverer. This is a desirable feature, as it would allow a system to construct shallow or deep solutions, as may be the requirements of the dataset at hand. Different datasets would not require specially tuning the algorithm. Comparisons among these methods are difficult because they explore very different search spaces and have very different initial conditions (Table 2).

Tangentially, there has also been neuro-evolution work on LSTM structure (Bayer et al., 2009; Zaremba, 2015), but this is beyond the scope of this paper. Also related to this work is that of Saxena & Verbeek (2016), who embed convolutions with different parameters into a species of “super-network” with many parallel paths. Their algorithm then selects and ensembles paths in the super-network. Finally, canonical approaches to hyper-parameter search are grid search (used in Zagoruyko & Komodakis (2016), for example) and random search, the latter being the better of the two (Bergstra & Bengio, 2012).

Our approach builds on previous work, with some important differences. We explore large model-architecture search spaces starting with basic initial conditions to avoid priming the system with information about known good strategies for the specific dataset at hand. Our encoding is different from the neuro-evolution methods mentioned above: we use a simplified graph as our DNA, which is transformed to a full neural network graph for training and evaluation (Section 3). Some of the mutations acting on this DNA are reminiscent of NEAT. However, instead of single nodes, one mutation can insert whole layers—i.e. tens to hundreds of nodes at a time. We also allow for these layers to be removed, so that the evolutionary process can simplify an architecture in addition to complexifying it. Layer parameters are also mutable, but we do not prescribe a small set of possible values to choose from, to allow for a larger search space. We do not use fitness sharing. We report additional results using recombination, but for the most part, we used mutation only. On the other hand, we do use back-propagation to optimize the weights, which can be inherited across mutations. Together with a learning rate mutation, this allows the exploration of the space of learning rate schedules, yielding fully trained models at the end of the evolutionary process (Section 3). Tables 1 and 2 compare our approach with hand-designed architectures and with other neuro-discovery techniques, respectively.

Large-Scale Evolution

Table 2. Comparison with automatically discovered architectures. The “C10+” and “C100+” columns contain the test accuracy on the data-augmented CIFAR-10 and CIFAR-100 datasets, respectively. An entry of “–” indicates that the information was not reported or is not known to us. For Zoph & Le (2016), we quote the result with the most similar search space to ours, as well as their best result. Please refer to Table 1 for hand-designed results, including the state of the art. “Discrete params.” means that the parameters can be picked from a handful of values only (e.g. strides ∈ {1, 2, 4}).

| Method | Initial conditions | Search-space restrictions | Post-processing | Params. | C10+ | C100+ |
|---|---|---|---|---|---|---|
| Q-learning (Baker et al., 2016) | – | discrete params., max. num. layers, no skips | tune, retrain | 11.2 M | 93.1% | 72.9% |
| RL (Zoph & Le, 2016) | 20 layers, 50% skips | discrete params., exactly 20 layers | small grid search, retrain | 2.5 M | 94.0% | – |
| RL (Zoph & Le, 2016) | 39 layers, 2 pool layers at 13 and 26, 50% skips | discrete params., exactly 39 layers, 2 pool layers at 13 and 26 | add more filters, small grid search, retrain | 37.0 M | 96.4% | – |
| Evolution (ours) | single layer, zero convs. | power-of-2 strides | none | 5.4 M | 94.6% (ensemb. 95.6%) | – |
| Evolution (ours) | single layer, zero convs. | power-of-2 strides | none | 40.4 M | – | 77.0% |
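The layer-level DNA encoding and mutations described above can be sketched in miniature. This is an illustrative sketch only, not the paper's implementation: the DNA is reduced to a linear chain of layer descriptors, and the mutation set, parameter ranges, and helper names are assumptions.

```python
import copy
import random

# A DNA is a simplified graph: here, a linear chain of layer descriptors.
# Each descriptor would expand to many nodes in the full neural network graph.
dna = [
    {"type": "conv", "channels": 10, "stride": 1},
    {"type": "global_pool"},
]

def insert_layer(dna, rng):
    """Insert a whole convolutional layer (tens to hundreds of nodes at once)."""
    child = copy.deepcopy(dna)  # the parent is left untouched
    pos = rng.randrange(len(child))  # any position before the final pool
    child.insert(pos, {"type": "conv",
                       "channels": rng.randint(1, 256),    # no small menu of values
                       "stride": 2 ** rng.randint(0, 2)})  # power-of-2 strides
    return child

def remove_layer(dna, rng):
    """Remove a layer, letting evolution simplify as well as complexify."""
    child = copy.deepcopy(dna)
    convs = [i for i, layer in enumerate(child) if layer["type"] == "conv"]
    if len(convs) > 1:  # keep at least one layer
        del child[rng.choice(convs)]
    return child

rng = random.Random(0)
child = insert_layer(dna, rng)
assert len(child) == len(dna) + 1
```

Because each mutation acts on a copy, an unhelpful insertion can later be undone by `remove_layer` in a descendant, which is what allows the process to simplify architectures.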
3.2. Encoding and Mutations

Individual architectures are encoded as a graph that we refer to as the DNA. In this graph, the vertices represent rank-3 tensors or activations. As is standard for a convo-

These specific mutations were chosen for their similarity to the actions that a human designer may take when improving an architecture. This may clear the way for hybrid evolutionary–hand-design methods in the future. The probabilities for the mutations were not tuned in any way.

A mutation that acts on a numerical parameter chooses the new value at random around the existing value. All sampling is from uniform distributions. For example, a mutation acting on a convolution with 10 output channels will

²The use of the file-name string to contain key information about the individual was inspired by Breuel & Shafait (2010), and it speeds up disk access enormously. In our case, the file name contains the state of the individual (alive, dead, training, etc.).
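The excerpt cuts off mid-example, but the sampling rule it states can be sketched. The half-to-double window below is an assumption for illustration; this copy of the text is cut before giving the actual range.

```python
import random

def mutate_numerical(value, low_factor=0.5, high_factor=2.0):
    """Pick a new value uniformly at random around the existing one.

    The [value/2, 2*value] window is an assumption for illustration;
    the excerpt is cut off before stating the exact range used.
    """
    new = random.uniform(low_factor * value, high_factor * value)
    return max(1, round(new))  # keep integer parameters (e.g. channels) positive

random.seed(0)
channels = mutate_numerical(10)  # mutate a convolution with 10 output channels
assert 5 <= channels <= 20
```

Sampling around the current value, rather than from a fixed global menu, is what keeps the numerical search space effectively unbounded while still making small, local moves.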
of the best experiment. Moreover, we only include experiments that we managed to reproduce, unless explicitly noted. Any statistical analysis was fully decided upon before seeing the results of the experiment reported, to avoid tailoring our analysis to our experimental data (Simmons et al., 2011).

4. Experiments and Results

We want to answer the following questions:

• Can a simple one-shot evolutionary process start from trivial initial conditions and yield fully trained models that rival hand-designed architectures?

• What are the variability in outcomes, the parallelizability, and the computation cost of the method?

• Can an algorithm designed iterating on CIFAR-10 be applied, without any changes at all, to CIFAR-100 and still produce competitive models?

We used the algorithm in Section 3 to perform several experiments. Each experiment evolves a population in a few days, typified by the example in Figure 1. The figure also contains examples of the architectures discovered, which turn out to be surprisingly simple. Evolution attempts skip connections but frequently rejects them.

To get a sense of the variability in outcomes, we repeated the experiment 5 times. Across all 5 experiment runs, the best model by validation accuracy has a testing accuracy of 94.6%. Not all experiments reach the same accuracy, but they get close (µ = 94.1%, σ = 0.4). Fine differences in the experiment outcome may be somewhat distinguishable by validation accuracy (correlation coefficient = 0.894). The total amount of computation across all 5 experiments was 4×10^20 FLOPs (or 9×10^19 FLOPs on average per experiment). Each experiment was distributed over 250 parallel workers (Section 3.1). Figure 2 shows the progress of the experiments in detail.

As a control, we disabled the selection mechanism, thereby reproducing and killing random individuals. This is the form of random search that is most compatible with our infrastructure. The probability distributions for the parameters are implicitly determined by the mutations. This control only achieves an accuracy of 87.3% in the same amount of run time on the same hardware (Figure 2). The total amount of computation was 2×10^17 FLOPs. The low FLOP count is a consequence of random search generating many small, inadequate models that train quickly but consume roughly constant amounts of setup time (not included in the FLOP count). We attempted to minimize this overhead by avoiding unnecessary disk access operations, to no avail: too much overhead remains spent on a combination of neural network setup, data augmentation, and training step initialization.

We also ran a partial control where the weight-inheritance mechanism is disabled. This run also results in a lower accuracy (92.2%) in the same amount of time (Figure 2), using 9×10^19 FLOPs. This shows that weight inheritance is important in the process.

Finally, we applied our neuro-evolution algorithm, without any changes and with the same meta-parameters, to CIFAR-100. Our only experiment reached an accuracy of 77.0%, using 2×10^20 FLOPs. We did not attempt other datasets. Table 1 shows that both the CIFAR-10 and CIFAR-100 results are competitive with modern hand-designed networks.

5. Analysis

Meta-parameters. We observe that populations evolve until they plateau at some local optimum (Figure 2). The fitness (i.e. validation accuracy) value at this optimum varies between experiments (Figure 2, inset). Since not all experiments reach the highest possible value, some populations are getting “trapped” at inferior local optima. This entrapment is affected by two important meta-parameters (i.e. parameters that are not optimized by the algorithm). These are the population size and the number of training steps per individual. Below we discuss them and consider their relationship to local optima.

Effect of population size. Larger populations explore the space of models more thoroughly, and this helps reach better optima (Figure 3, left). Note, in particular, that a population of size 2 can get trapped at very low fitness values. Some intuition about this can be gained by considering the fate of a super-fit individual, i.e. an individual such that any one architectural mutation reduces its fitness (even though a sequence of many mutations may improve it). In the case of a population of size 2, if the super-fit individual wins once, it will win every time. After the first win, it will produce a child that is one mutation away. By definition of super-fit, therefore, this child is inferior⁴. Consequently, in the next round of tournament selection, the super-fit individual competes against its child and wins again. This cycle repeats forever and the population is trapped. Even if a sequence of two mutations would allow for an “escape” from the local optimum, such a sequence can never take place. This is only a rough argument to heuristically suggest why a population of size 2 is easily trapped. More generally, Figure 3 (left) empirically demonstrates a benefit from an increase in population size. Theoretical analyses of this dependence are quite complex and assume very specific models of population dynamics; often larger populations are better at handling local optima, at least beyond a size threshold (Weinreich & Chao (2005) and references

⁴Except after identity or learning rate mutations, but these produce a child with the same architecture as the parent.
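The size-2 trapping argument above can be checked with a toy simulation. The fitness landscape and mutation model below are invented purely for illustration: fitness is an integer, any single mutation from the "super-fit" peak loses one point (escape would need two mutations in a row), and pairwise tournament selection kills the loser and replaces it with a mutated copy of the winner.

```python
import random

def tournament_step(population, mutate, rng):
    """One round: pick two individuals, kill the worse, copy-and-mutate the better."""
    i, j = rng.sample(range(len(population)), 2)
    winner, loser = (i, j) if population[i] >= population[j] else (j, i)
    population[loser] = mutate(population[winner])
    return population

# Toy landscape: the super-fit individual has fitness 10 and any one
# mutation drops fitness by 1.
mutate = lambda fitness: fitness - 1

rng = random.Random(0)
population = [10, 9]  # size 2: the super-fit individual plus one child
for _ in range(1000):
    tournament_step(population, mutate, rng)

# The super-fit individual wins every tournament, so the population stays trapped.
assert population == [10, 9]
```

In a larger population the super-fit individual is not selected for every tournament, so other lineages can accumulate the multi-mutation sequences needed to escape, which matches the empirical benefit of population size in Figure 3 (left).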
[Figure 1 appears here: a plot of test accuracy (%) over time, each dot an individual (lineage accuracies shown: 22.6, 85.3, 91.8, and 94.6), alongside four example architecture diagrams built from Input, C, C+BN+R, Global Pool, and Output vertices.]
Figure 1. Progress of an evolution experiment. Each dot represents an individual in the population. Blue dots (darker, top-right) are alive.
The rest have been killed. The four diagrams show examples of discovered architectures. These correspond to the best individual (right-
most) and three of its ancestors. The best individual was selected by its validation accuracy. Evolution sometimes stacks convolutions
without any nonlinearity in between (“C”, white background), which are mathematically equivalent to a single linear operation. Unlike
typical hand-designed architectures, some convolutions are followed by more than one nonlinear function (“C+BN+R+BN+R+...”,
orange background).
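The caption's point that stacked convolutions with no nonlinearity in between collapse into a single linear operation can be verified numerically. The sketch below uses a 1-D convolution as a stand-in for the rank-3 tensors in the actual networks: composing two kernels equals convolving once with their combined kernel.

```python
import numpy as np

# Two stacked convolutions with no nonlinearity in between are equivalent to
# a single convolution whose kernel is the convolution of the two kernels.
rng = np.random.default_rng(0)
x = rng.standard_normal(32)   # a 1-D signal stands in for an activation tensor
k1 = rng.standard_normal(3)
k2 = rng.standard_normal(5)

stacked = np.convolve(np.convolve(x, k1, mode="full"), k2, mode="full")
merged = np.convolve(x, np.convolve(k1, k2, mode="full"), mode="full")

assert np.allclose(stacked, merged)
```

This is why such stacks add parameters without adding expressive power; only an interleaved nonlinearity (the BN+R blocks) breaks the equivalence.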
[Figure 2 plot: test accuracy (%) from 20.0 to 100.0 against wall-clock time (hours) from 0 to 250, with curves for Evolution, Evolution w/o weight inheritance, and Random search, and an inset around 94.6%. Figure 3 plots: test accuracy against population size (left) and training steps per individual (right).]

Figure 2. Repeatability of results and controls. In this plot, the vertical axis at wall-time t is defined as the test accuracy of the individual with the highest validation accuracy that became alive at or before t. The inset magnifies a portion of the main graph. The curves show the progress of various experiments, as follows. The top line (solid, blue) shows the mean test accuracy across 5 large-scale evolution experiments. The shaded area around this top line has a width of ±2σ (clearer in inset). The next line down (dashed, orange, main graph and inset) represents a single experiment in which weight-inheritance was disabled, so every individual has to train from random weights. The lowest curve (dotted-dashed) is a random-search control. All experiments occupied the same amount and type of hardware. A small amount of noise in the generalization from the validation to the test set explains why the lines are not monotonically increasing. Note the narrow width of the ±2σ area (main graph and inset), which shows that the high accuracies obtained in evolution experiments are repeatable.

Figure 3. Dependence on meta-parameters. In both graphs, each circle represents the result of a full evolution experiment. Both vertical axes show the test accuracy for the individual with the highest validation accuracy at the end of the experiment. All populations evolved for the same total wall-clock time. There are 5 data points at each horizontal axis value. LEFT: effect of population size. To economize resources, in these experiments the number of individual training steps is only 2560. Note how the accuracy increases with population size. RIGHT: effect of number of training steps per individual. Note how the accuracy increases with more steps.
Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Miller, Geoffrey F, Todd, Peter M, and Hegde, Shailesh U. Designing neural networks using genetic algorithms. In Proceedings of the third international conference on Genetic algorithms, pp. 379–384. Morgan Kaufmann Publishers Inc., 1989.

Morse, Gregory and Stanley, Kenneth O. Simple evolutionary optimization can rival stochastic gradient descent in neural networks. In Proceedings of the 2016 on Genetic and Evolutionary Computation Conference, pp. 477–484. ACM, 2016.

Pugh, Justin K and Stanley, Kenneth O. Evolving multimodal controllers with hyperneat. In Proceedings of the 15th annual conference on Genetic and evolutionary computation, pp. 735–742. ACM, 2013.

Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

Saxena, Shreyas and Verbeek, Jakob. Convolutional neural fabrics. In Advances In Neural Information Processing Systems, pp. 4053–4061, 2016.

Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Simmons, Joseph P, Nelson, Leif D, and Simonsohn, Uri. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366, 2011.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.

Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

Stanley, Kenneth O. Compositional pattern producing networks: A novel abstraction of development. Genetic programming and evolvable machines, 8(2):131–162, 2007.

Stanley, Kenneth O and Miikkulainen, Risto. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.

Stanley, Kenneth O, D’Ambrosio, David B, and Gauci, Jason. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, 2009.

Sutskever, Ilya, Martens, James, Dahl, George E, and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. ICML (3), 28:1139–1147, 2013.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Tuson, Andrew and Ross, Peter. Adapting operator settings in genetic algorithms. Evolutionary computation, 6(2):161–184, 1998.

Verbancsics, Phillip and Harguess, Josh. Generative neuroevolution for deep learning. arXiv preprint arXiv:1312.5355, 2013.

Weinreich, Daniel M and Chao, Lin. Rapid evolutionary escape by large populations from local fitness peaks is likely in nature. Evolution, 59(6):1175–1182, 2005.

Weyand, Tobias, Kostrikov, Ilya, and Philbin, James. Planet-photo geolocation with convolutional neural networks. In European Conference on Computer Vision, pp. 37–55. Springer, 2016.

Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V., Norouzi, Mohammad, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Zagoruyko, Sergey and Komodakis, Nikos. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zaremba, Wojciech. An empirical exploration of recurrent network architectures. 2015.

Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.