
Regularized Evolution For Image Classifier Architecture Search

This document summarizes a research paper that proposes using an evolutionary algorithm called "aging evolution" or "regularized evolution" to automatically discover high-quality neural network architectures for image classification. The algorithm is applied to search the NASNet search space and discovers an architecture called AmoebaNet-A that achieves state-of-the-art accuracy on the ImageNet dataset, outperforming both human-designed architectures and architectures found by reinforcement learning. Compared to reinforcement learning, the evolutionary algorithm is able to find similar or better results using less computation, especially in the early stages of the search when resources are limited.


Regularized Evolution for Image Classifier Architecture Search

Esteban Real∗† and Alok Aggarwal† and Yanping Huang† and Quoc V. Le
Google Brain, Mountain View, California, USA

† Equal contribution. ∗ Correspondence: [email protected]
arXiv:1802.01548v7 [cs.NE] 16 Feb 2019

Abstract

The effort devoted to hand-crafting neural network image classifiers has motivated the use of architecture search to discover them automatically. Although evolutionary algorithms have been repeatedly applied to neural network topologies, the image classifiers thus discovered have remained inferior to human-crafted ones. Here, we evolve an image classifier, AmoebaNet-A, that surpasses hand-designs for the first time. To do this, we modify the tournament selection evolutionary algorithm by introducing an age property to favor the younger genotypes. Matching size, AmoebaNet-A has comparable accuracy to current state-of-the-art ImageNet models discovered with more complex architecture-search methods. Scaled to larger size, AmoebaNet-A sets a new state-of-the-art 83.9% top-1 / 96.6% top-5 ImageNet accuracy. In a controlled comparison against a well known reinforcement learning algorithm, we give evidence that evolution can obtain results faster with the same hardware, especially at the earlier stages of the search. This is relevant when fewer compute resources are available. Evolution is, thus, a simple method to effectively discover high-quality architectures.

Introduction

Until recently, most state-of-the-art image classifier architectures have been manually designed by human experts [20, 23, 24, 28, 45]. To speed up the process, researchers have looked into automated methods [2, 29, 32, 35, 36, 43, 47, 53]. These methods are now collectively known as architecture-search algorithms. A traditional approach is neuro-evolution of topologies [1, 33, 42]. Improved hardware now allows scaling up evolution to produce high-quality image classifiers [30, 36, 47]. Yet, the architectures produced by evolutionary algorithms / genetic programming have not reached the accuracy of those directly designed by human experts. Here we evolve image classifiers that surpass hand-designs.

To do this, we make two additions to the standard evolutionary process. First, we propose a change to the well-established tournament selection evolutionary algorithm [19] that we refer to as aging evolution or regularized evolution. Whereas in tournament selection, the best genotypes (architectures) are kept, we propose to associate each genotype with an age, and bias the tournament selection to choose the younger genotypes. We will show that this change turns out to make a difference. The connection to regularization will be clarified in the Discussion section. Second, we implement the simplest set of mutations that would allow evolving in the NASNet search space [54]. This search space associates convolutional neural network architectures with small directed graphs in which vertices represent hidden states and labeled edges represent common network operations (such as convolutions or pooling layers). Our mutation rules only alter architectures by randomly reconnecting the origin of edges to different vertices and by randomly relabeling the edges, covering the full search space.

Searching in the NASNet space allows a controlled comparison between evolution and the original method for which it was designed, reinforcement learning (RL). Thus, this paper presents the first comparative case study of architecture-search algorithms for the image classification task. Within this case study, we will demonstrate that evolution can attain similar results with a simpler method, as will be shown in the Discussion section. In particular, we will highlight that in all our experiments evolution searched faster than RL and random search, especially at the earlier stages, which is important when experiments cannot be run for long times due to compute resource limitations.

Despite its simplicity, our approach works well in our benchmark against RL. It also evolved a high-quality model, which we name AmoebaNet-A. This model is competitive with the best image classifiers obtained by any other algorithm today at similar sizes (82.8% top-1 / 96.1% top-5 ImageNet accuracy). When scaled up, it sets a new state-of-the-art accuracy (83.9% top-1 / 96.6% top-5 ImageNet accuracy)¹.

¹ After our submission, a recent preprint has further scaled up and retrained AmoebaNet-A to reach 84.3% top-1 / 97.0% top-5 ImageNet accuracy [25].

Accepted for publication at AAAI 2019, the Thirty-Third AAAI Conference on Artificial Intelligence. A brief talk from Nov 2018 summarizes this paper at https://www.youtube.com/watch?v=MqYHo7BVzoE

Related Work

Review papers provide informative surveys of earlier [18, 49] and more recent [15] literature on image classifier architecture search, including successful RL studies [2, 6, 29, 52–54] and evolutionary studies like those mentioned in
the Introduction. Other methods have also been applied: cascade-correlation [16], boosting [10], hill-climbing [14], MCTS [34], SMBO [29, 31], random search [4], and grid search [50]. Some methods even forewent the idea of independent architectures [38]. There is much architecture-search work beyond image classification too, but that is outside our scope.

Even though some methods stand out due to their efficiency [35, 43], many approaches use large amounts of resources. Several recent papers reduced the compute cost through progressive-complexity search stages [29], hypernets [5], accuracy prediction [3, 13, 26], warm-starting and ensembling [17], parallelization, reward shaping and early stopping [52] or Net2Net transformations [6]. Most of these methods could in principle be applied to evolution too, but this is beyond the scope of this paper.

A popular approach to evolution has been through generational algorithms, e.g. NEAT [42]. All models in the population must finish training before the next generation is computed. Generational evolution becomes inefficient in a distributed environment where a different machine is used to train each model: machines that train faster models finish earlier and must wait idle until all machines are ready. Real-time algorithms address this issue, e.g. rtNEAT [41] and tournament selection [19]. Unlike the generational algorithms, however, these discard models according to their performance or do not discard them at all, resulting in models that remain alive in the population for a long time, even for the whole experiment. We will present evidence that the finite lifetimes of aging evolution can give better results than direct tournament selection, while retaining its efficiency.

An existing paper [22] uses a concept of age but in a very different way than we do. In that paper, age is assigned to genes to divide a constant-size population into groups called age-layers. Each layer contains individuals with genes of similar ages. Only after the genes have survived a certain age-gap can they make it to the next layer. The goal is to restrict competition (the newly introduced genes cannot be immediately out-competed by highly-selected older ones). Their algorithm requires the introduction of two additional meta-parameters (size of the age-gap and number of age-layers). In contrast, in our algorithm, an age is assigned to the individuals (not the genes) and is only used to track which is the oldest individual in the population. This permits removing the oldest individual at each cycle (keeping a constant population size). Our approach, therefore, is in line with our goal of keeping the method as simple as possible. In particular, our method remains similar to nature (where the young are less likely to die than the very old) and it requires no additional meta-parameters.

Methods

This section contains a readable description of the methods. The Methods Details section gives additional information.

Search Space

All experiments use the NASNet search space [54]. This is a space of image classifiers, all of which have the fixed outer structure indicated in Figure 1 (left): a feed-forward stack of Inception-like modules called cells. Each cell receives a direct input from the previous cell (as depicted) and a skip input from the cell before it (Figure 1, middle). The cells in the stack are of two types: the normal cell and the reduction cell. All normal cells are constrained to have the same architecture, as are reduction cells, but the architecture of the normal cells is independent of that of the reduction cells. Other than this, the only difference between them is that every application of the reduction cell is followed by a stride of 2 that reduces the image size, whereas normal cells preserve the image size. As can be seen in the figure, normal cells are arranged in three stacks of N cells. The goal of the architecture-search process is to discover the architectures of the normal and reduction cells.

Figure 1: NASNet Search Space [54]. LEFT: the full outer structure (omitting skip inputs for clarity). MIDDLE: detailed view with the skip inputs. RIGHT: cell example. Dotted line demarcates a pairwise combination.

As depicted in Figure 1 (middle and right), each cell has two input activation tensors and one output. The very first cell takes two copies of the input image. After that, the inputs are the outputs of the previous two cells.

Both normal and reduction cells must conform to the following construction. The two cell input tensors are considered hidden states “0” and “1”. More hidden states are then constructed through pairwise combinations. A pairwise combination is depicted in Figure 1 (right, inside dashed circle). It consists in applying an operation (or op) to an existing hidden state, applying another op to another existing hidden state, and adding the results to produce a new hidden state. Ops belong to a fixed set of common convnet operations such as convolutions and pooling layers. Repeating hidden states or operations within a combination is permitted. In the cell example of Figure 1 (right), the first pairwise combination applies a 3x3 average pool op to hidden state 0 and a 3x3 max pool op to hidden state 1, in order to produce hidden state 2. The next pairwise combination can now choose from hidden states 0, 1, and 2 to produce hidden state 3 (chose 0 and 1 in Figure 1), and so on. After exactly five pairwise combinations, any hidden states that remain unused (hidden states 5 and 6 in Figure 1) are concatenated to form the output of the cell (hidden state 7).

A given architecture is fully specified by the five pairwise
combinations that make up the normal cell and the five that make up the reduction cell. Once the architecture is specified, the model still has two free parameters that can be used to alter its size (and its accuracy): the number of normal cells per stack (N) and the number of output filters of the convolution ops (F). N and F are determined manually.

Evolutionary Algorithm

The evolutionary method we used is summarized in Algorithm 1. It keeps a population of P trained models throughout the experiment. The population is initialized with models with random architectures (“while |population|” in Algorithm 1). All architectures that conform to the search space described are possible and equally likely.

Algorithm 1 Aging Evolution
  population ← empty queue                          ▷ The population.
  history ← ∅                                       ▷ Will contain all models.
  while |population| < P do                         ▷ Initialize population.
    model.arch ← RANDOMARCHITECTURE()
    model.accuracy ← TRAINANDEVAL(model.arch)
    add model to right of population
    add model to history
  end while
  while |history| < C do                            ▷ Evolve for C cycles.
    sample ← ∅                                      ▷ Parent candidates.
    while |sample| < S do
      candidate ← random element from population
                                                    ▷ The element stays in the population.
      add candidate to sample
    end while
    parent ← highest-accuracy model in sample
    child.arch ← MUTATE(parent.arch)
    child.accuracy ← TRAINANDEVAL(child.arch)
    add child to right of population
    add child to history
    remove dead from left of population             ▷ Oldest.
    discard dead
  end while
  return highest-accuracy model in history

After this, evolution improves the initial population in cycles (“while |history|” in Algorithm 1). At each cycle, it samples S random models from the population, each drawn uniformly at random with replacement. The model with the highest validation fitness within this sample is selected as the parent. A new architecture, called the child, is constructed from the parent by the application of a transformation called a mutation. A mutation causes a simple and random modification of the architecture and is described in detail below. Once the child architecture is constructed, it is then trained, evaluated, and added to the population. This process is called tournament selection [19].

It is common in tournament selection to keep the population size fixed at the initial value P. This is often accomplished with an additional step within each cycle: discarding (or killing) the worst model in the random S-sample. We will refer to this approach as non-aging evolution. In contrast, in this paper we prefer a novel approach: killing the oldest model in the population, that is, removing from the population the model that was trained the earliest (“remove dead from left of pop” in Algorithm 1). This favors the newer models in the population. We will refer to this approach as aging evolution. In the context of architecture search, aging evolution allows us to explore the search space more, instead of zooming in on good models too early, as non-aging evolution would (see Discussion section for details).

In practice, this algorithm is parallelized by distributing the “while |history|” loop in Algorithm 1 over multiple workers. A full implementation can be found online.² Intuitively, the mutations can be thought of as providing exploration, while the parent selection provides exploitation. The parameter S controls the aggressiveness of the exploitation: S = 1 reduces to a type of random search and 2 ≤ S ≤ P leads to evolution of varying greediness.

² https://colab.research.google.com/github/google-research/google-research/blob/master/evolution/regularized_evolution_algorithm/regularized_evolution.ipynb

New models are constructed by applying a mutation to existing models, transforming their architectures in random ways. To navigate the NASNet search space described above, we use two main mutations that we call the hidden state mutation and the op mutation. A third mutation, the identity, is also possible. Only one of these mutations is applied in each cycle, choosing between them at random.

Figure 2: Illustration of the two mutation types. [Panels: hidden state mutation (top), op mutation (bottom).]

The hidden state mutation consists of first making a random choice of whether to modify the normal cell or the reduction cell. Once a cell is chosen, the mutation picks one of the five pairwise combinations uniformly at random. Once the pairwise combination is picked, one of the two elements of the pair is chosen uniformly at random. The chosen element has one hidden state. This hidden state is now replaced with another hidden state from within the cell, subject to the constraint that no loops are formed (to keep the feed-forward nature of the convnet). Figure 2 (top) shows an example.

The op mutation behaves like the hidden state mutation as far as choosing one of the two cells, one of the five pairwise combinations, and one of the two elements of the pair.
Then it differs in that it modifies the op instead of the hidden state. It does this by replacing the existing op with a random choice from a fixed list of ops (see Methods Details). Figure 2 (bottom) shows an example.

Baseline Algorithms

Our main baseline is the application of RL to the same search space. RL was implemented using the algorithm and code in the baseline study [54]. An LSTM controller outputs the architectures, constructing the pairwise combinations one at a time, and then gets a reward for each architecture by training and evaluating it. More detail can be found in the baseline study. We also compared against random search (RS). In our RS implementation, each model is constructed randomly so that all models in the search space are equally likely, as in the initial population in the evolutionary algorithm. In other words, the models in RS experiments are not constructed by mutating existing models, so as to make new models independent from previous ones.

Experimental Setup

We ran controlled comparisons at scale, ensuring identical conditions for evolution, RL and random search (RS). In particular, all methods used the same computer code for network construction, training and evaluation. Experiments always searched on the CIFAR-10 dataset [27].

As in the baseline study, we first performed architecture search over small models (i.e. small N and F) until 20k models were evaluated. After that, we used the model augmentation trick [54]: we took architectures discovered by the search (e.g. the output of an evolutionary experiment) and turned them into full-size, accurate models. To accomplish this, we enlarged the models by increasing N and F so the resulting model sizes would match the baselines, and we trained the enlarged models for a longer time on the CIFAR-10 or the ImageNet classification datasets [12, 27]. For ImageNet, a stem was added at the input of the model to reduce the image size, as shown in Figure 5 (left). This is the same procedure as in the baseline study. To produce the largest model (see last paragraph of Results section; not included in tables), we increased N and F until we ran out of memory. Actual values of N and F for all models are listed in the Methods Details section.

Methods Details

This section complements the Methods section with the details necessary to reproduce our experiments. Possible ops: none (identity); 3x3, 5x5 and 7x7 separable (sep.) convolutions (convs.); 3x3 average (avg.) pool; 3x3 max pool; 3x3 dilated (dil.) sep. conv.; 1x7 then 7x1 conv. Evolved with P=100, S=25. CIFAR-10 dataset [27] with 5k withheld examples for validation. Standard ImageNet dataset [12], 1.2M 331x331 images and 1k classes; 50k examples withheld for validation; standard validation set used for testing. During the search phase, each model trained for 25 epochs; N=3/F=24, 1 GPU. Each experiment ran on 450 K40 GPUs for 20k models (approx. 7 days). To optimize evolution, we tried 5 configurations with P/S of: 100/2, 100/50, 20/20, 100/25, 64/16; best was 100/25. The probability of the identity mutation was fixed at the small, arbitrary value of 0.05 and was not tuned. Other mutation probabilities were uniform, as described in the Methods. To optimize RL, we started with parameters already tuned in the baseline study and further optimized learning rate in 8 configurations: 0.00003, 0.00006, 0.00012, 0.0002, 0.0004, 0.0008, 0.0016, 0.0032; best was 0.0008. To avoid selection bias, plots do not include optimization runs, as was decided a priori. Best few (20) models were selected from each experiment and augmented to N=6/F=32, as in baseline study; batch 128, SGD with momentum rate 0.9, L2 weight decay 5 × 10⁻⁴, initial lr 0.024 with cosine decay, 600 epochs, ScheduledDropPath to 0.7 prob; auxiliary softmax with half-weight of main softmax. For Table 1, we used N/F of 6/32 and 6/36. For ImageNet table, N/F were 6/190 and 6/448 and standard training methods [44]: distributed sync SGD with 100 P100 GPUs; RMSProp optimizer with 0.9 decay and ε=0.1, 4 × 10⁻⁵ weight decay, 0.1 label smoothing, auxiliary softmax weighted by 0.4; dropout probability 0.5; ScheduledDropPath to 0.7 probability (as in baseline; note that this trick only contributes 0.3% top-1 ImageNet acc.); 0.001 initial lr, decaying every 2 epochs by 0.97. Largest model used N=6/F=448. F always refers to the number of filters of convolutions in the first stack; after each reduction cell, this number is doubled. Wherever applicable, we used the same conditions as the baseline study.

Results

Comparison With RL and RS Baselines

Currently, reinforcement learning (RL) is the predominant method for architecture search. In fact, today’s state-of-the-art image classifiers have been obtained by architecture search with RL [29, 54]. Here we seek to compare our evolutionary approach against their RL algorithm. We performed large-scale side-by-side architecture-search experiments on CIFAR-10. We first optimized the hyper-parameters of the two approaches independently (details in Methods Details section). Then we ran 5 repeats of each of the two algorithms, and also of random search (RS).

Figure 3 shows the model accuracy as the experiments progress, highlighting that evolution yielded more accurate models at the earlier stages, which could become important in a resource-constrained regime where the experiments may have to be stopped early (for example, when 450 GPUs for 7 days is too much). At the later stages, if we allow the search to run for the full 20k models (as in the baseline study), evolution produced models with similar accuracy. Both evolution and RL compared favorably against RS. It is important to note that the vertical axis of Figure 3 does not present the compute cost of the models, only their accuracy. Next, we will consider their compute cost as well.

As in the baseline study, the architecture-search experiments above were performed over small models, to be able to train them quicker. We then used the model augmentation trick [54] by which we take an architecture discovered by the search (e.g. the output of an evolutionary experiment) and turn it into a full-size, accurate model, as described in
the Methods. Figure 4 compares the augmented top models from the three sets of experiments. It shows test accuracy and model compute cost. The latter is measured in FLOPs, by which we mean the total count of operations in the forward pass, so lower is better. Evolved architectures had higher accuracy (and similar FLOPs) than those obtained with RS, and lower FLOPs (and similar accuracy) than those obtained with RL. Number of parameters showed similar behavior to FLOPs. Therefore, evolution occupied the ideal relative position in this graph within the scope of our case study.

[Plot: top testing accuracy (0.89 to 0.92) vs. experiment time (0 to 200 hours); curves for evolution, RL, and RS.]
Figure 3: Time-course of 5 identical large-scale experiments for each algorithm (evolution, RL, and RS), showing accuracy before augmentation on CIFAR-10. All experiments were stopped when 20k models were evaluated, as done in the baseline study. Note this plot does not show the compute cost of models, which was higher for the RL ones.

[Plot: final testing accuracy (0.957 to 0.967) vs. model cost (0.75 to 1.35 GigaFLOPs); markers for evolution, RL, and RS.]
Figure 4: Final augmented models from 5 identical architecture-search experiments for each algorithm, on CIFAR-10. Each marker corresponds to the top models from one experiment.

So far we have been comparing evolution with our reproduction of the experiments in the baseline study, but it is also informative to compare directly against the results reported by the baseline study. We select our evolved architecture with highest validation accuracy and call it AmoebaNet-A (Figure 5). Table 1 compares its test accuracy with the top model of the baseline study, NASNet-A. Such a comparison is not entirely controlled, as we have no way of ensuring the network training code was identical and that the same number of experiments were done to obtain the final model. The table summarizes the results of training AmoebaNet-A at sizes comparable to a NASNet-A version, showing that AmoebaNet-A is slightly more accurate (when matching model size) or considerably smaller (when matching accuracy). We did not train our model at larger sizes on CIFAR-10. Instead, we moved to ImageNet to do further comparisons in the next section.

Table 1: CIFAR-10 testing set results for AmoebaNet-A, compared to top model reported in the baseline study.

Model                        # Params    Test Error (%)
NASNet-A (baseline)          3.3 M       3.41
AmoebaNet-A (N=6, F=32)      2.6 M       3.40 ± 0.08
AmoebaNet-A (N=6, F=36)      3.2 M       3.34 ± 0.06

Figure 5: AmoebaNet-A architecture. The overall model [54] (LEFT) and the AmoebaNet-A normal cell (MIDDLE) and reduction cell (RIGHT).
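
For reference, the aging evolution loop of Algorithm 1 can be sketched in Python as follows. This is a minimal, single-worker illustration and not the implementation used for the experiments: the names aging_evolution, toy_random_arch, toy_mutate and toy_train_and_eval are hypothetical, and the toy bit-string problem is an assumption that merely stands in for the NASNet search space and for real training. Only P, S, C and the queue-based removal of the oldest model come from Algorithm 1.

```python
import collections
import random


def aging_evolution(random_arch, mutate, train_and_eval, P, S, C, rng):
    """Run the loop of Algorithm 1; return (best_model, history).

    A model is an (arch, accuracy) tuple. C counts all evaluated models,
    including the P initial ones, matching "while |history| < C".
    """
    population = collections.deque()  # oldest model sits at the left
    history = []                      # every model ever evaluated

    # Initialize the population with P random architectures.
    while len(population) < P:
        arch = random_arch(rng)
        model = (arch, train_and_eval(arch))
        population.append(model)
        history.append(model)

    while len(history) < C:
        # Tournament: draw S candidates uniformly, with replacement.
        sample = [rng.choice(population) for _ in range(S)]
        parent = max(sample, key=lambda m: m[1])
        child_arch = mutate(parent[0], rng)
        child = (child_arch, train_and_eval(child_arch))
        population.append(child)
        history.append(child)
        population.popleft()  # aging: drop the oldest model, not the worst

    return max(history, key=lambda m: m[1]), history


# Hypothetical toy problem (an assumption, not from the paper):
# an "architecture" is a tuple of bits and its "accuracy" is the
# fraction of ones, so no actual training takes place.
def toy_random_arch(rng, n_bits=16):
    return tuple(rng.randint(0, 1) for _ in range(n_bits))


def toy_mutate(arch, rng):
    i = rng.randrange(len(arch))  # flip one randomly chosen bit
    return arch[:i] + (1 - arch[i],) + arch[i + 1:]


def toy_train_and_eval(arch):
    return sum(arch) / len(arch)


if __name__ == "__main__":
    best, history = aging_evolution(
        toy_random_arch, toy_mutate, toy_train_and_eval,
        P=20, S=5, C=300, rng=random.Random(0))
    print(len(history), best)
```

Replacing population.popleft() with removal of the lowest-accuracy member of the sample would give the non-aging variant described in the Methods, which makes the two easy to compare on a toy problem under the same assumptions.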
Table 2: ImageNet classification results for AmoebaNet-A compared to hand-designs (top rows) and other automated methods (middle rows). The evolved AmoebaNet-A architecture (bottom rows) reaches the current state of the art (SOTA) at similar model sizes and sets a new SOTA at a larger size. All evolution-based approaches are marked with a ∗. We omitted Squeeze-and-Excite-Net because it was not benchmarked on the same ImageNet dataset version.

Model                          # Parameters   # Multiply-Adds   Top-1 / Top-5 Accuracy (%)

Incep-ResNet V2 [44]           55.8M          13.2B             80.4 / 95.3
ResNeXt-101 [48]               83.6M          31.5B             80.9 / 95.6
PolyNet [51]                   92.0M          34.7B             81.3 / 95.8
Dual-Path-Net-131 [7]          79.5M          32.0B             81.5 / 95.8

GeNet-2 [47]∗                  156M           –                 72.1 / 90.4
Block-QNN-B [52]∗              –              –                 75.7 / 92.6
Hierarchical [30]∗             64M            –                 79.7 / 94.8
NASNet-A [54]                  88.9M          23.8B             82.7 / 96.2
PNASNet-5 [29]                 86.1M          25.0B             82.9 / 96.2

AmoebaNet-A (N=6, F=190)∗      86.7M          23.1B             82.8 / 96.1
AmoebaNet-A (N=6, F=448)∗      469M           104B              83.9 / 96.6

ImageNet Results

Following the accepted standard, we compare our top model’s classification accuracy on the popular ImageNet dataset against other top models from the literature. Again, we use AmoebaNet-A, the model with the highest validation accuracy on CIFAR-10 among our evolution experiments. We highlight that the model was evolved on CIFAR-10 and then transferred to ImageNet, so the evolved architecture cannot have overfit the ImageNet dataset. When re-trained on ImageNet, AmoebaNet-A performs comparably to the baseline for the same number of parameters (Table 2, model with F=190).

Finally, we focused on AmoebaNet-A exclusively and enlarged it, setting a new state-of-the-art accuracy on ImageNet of 83.9%/96.6% top-1/5 accuracy with 469M parameters (Table 2, model with F=448). Such high parameter counts may be beneficial in training other models too but we have not managed to do this yet.

Discussion

This section will suggest directions for future work, which we will motivate by speculating about the evolutionary process and by summarizing additional minor results. The details of these minor results have been relegated to the supplements, as they are not necessary to understand or reproduce our main results above.

Scope of results. Some of our findings may be restricted to the search spaces and datasets we used. A natural direction for future work is to extend the controlled comparison to more search spaces, datasets, and tasks, to verify generality, or to more algorithms. Supplement A presents preliminary results, performing evolutionary and RL searches over three search spaces (SP-I: same as in the Results section; SP-II: like SP-I but with more possible ops; SP-III: like SP-II but with more pairwise combinations) and three datasets (gray-scale CIFAR-10, MNIST, and gray-scale ImageNet), at a small-compute scale (on CPU, F=8, N=1). Evolution reached equal or better accuracy in all cases (Figure 6, top).

[Plot panels. TOP: G-CIFAR, MNIST, and G-ImageNet test accuracy for evolution vs. RL. BOTTOM: top testing accuracy vs. number of models searched (up to 20k) for evolution and RL.]
Figure 6: TOP: Comparison of the final model accuracy in five different contexts, from left to right: G-CIFAR/SP-I, G-CIFAR/SP-II, G-CIFAR/SP-III, MNIST/SP-I and G-ImageNet/SP-I. Each circle marks the top test accuracy at the end of one experiment. BOTTOM: Search progress of the experiments in the case of G-CIFAR/SP-II (LEFT, best for RL) and G-CIFAR/SP-III (RIGHT, best for evolution).

Algorithm speed. In our comparison study, Figure 3 suggested that both RL and evolution are approaching a common accuracy asymptote. That raises the question of which algorithm gets there faster. The plots indicate that evolution reaches half-maximum accuracy in roughly half the time. We abstain, nevertheless, from further quantifying this effect since it depends strongly on how speed is measured (the number of models necessary to reach accuracy a depends on
a; the natural choice of a = a_max/2 may be too low to be informative; etc.). Algorithm speed may be more important when exploring larger spaces, where reaching the optimum can require more compute than is available. We saw an example of this in the SP-III space, where evolution stood out (Figure 6, bottom-right). Therefore, future work could explore evolving on even larger spaces.

Model speed. The speed of individual models produced is also relevant. Figure 4 demonstrated that evolved models are faster (lower FLOPs). We speculate that asynchronous evolution may be reducing the FLOPs because it is indirectly optimizing for speed even when training for a fixed number of epochs: fast models may do well because they “reproduce” quickly even if they initially lack the higher accuracy of their slower peers. Verifying this speculation could be the subject of future work. As mentioned in the Related Work section, in this work we only considered asynchronous algorithms (as opposed to generational evolutionary methods) to ensure high resource utilization. Future work may explore how asynchronous and generational algorithms compare with regard to model accuracy.

Benefits of aging evolution. Aging evolution seemed advantageous in additional small-compute-scale experiments, shown in Figure 7 and presented in more detail in Supplement B. These were carried out on CPU instead of GPU, and

ing the algorithm to focus on it, reducing exploration. Under aging evolution (AE), on the other hand, all models have a short lifespan, so the population is wholly renewed frequently, leading to more diversity and more exploration. In addition, another effect may be in play, which we describe next. In AE, because models die quickly, the only way an architecture can remain in the population for a long time is by being passed down from parent to child through the generations. Each time an architecture is inherited it must be re-trained. If it produces an inaccurate model when re-trained, that model is not selected by evolution and the architecture disappears from the population. The only way for an architecture to remain in the population for a long time is to re-train well repeatedly. In other words, AE can only improve a population through the inheritance of architectures that re-train well. (In contrast, NAE can improve a population by accumulating architectures/models that were lucky when they trained the first time). That is, AE is forced to pay attention to architectures rather than models. In other words, the addition of aging involves introducing additional information to the evolutionary process: architectures should re-train well. This additional information prevents overfitting to the training noise, which makes it a form of regularization in the broader mathematical sense³. Regardless of the exact mechanism, in Supplement C we perform experiments
used a gray-scale version of CIFAR-10, to reduce compute to verify the plausibility of the conjecture that aging helps
requirements. In the supplement, we also show that these navigate noise. There we construct a toy search space where
results tend to hold when varying the dataset or the search the only difficulty is a noisy evaluation. If our conjecture is
space. true, AE should be better in that toy space too. We found this
to be the case. We leave further verification of the conjecture
0.772 to future work, noting that theoretical results may prove use-
ful here.
Simplicity of aging evolution. A desirable feature of evo-
lutionary algorithms is their simplicity. By design, the appli-
aging test accuracy

cation of a mutation causes a random change. The process


of constructing new architectures, therefore, is entirely ran-
dom. What makes evolution different from random search is
P=64,S=16 that only the good models are selected to be mutated. This
P=256,S=64 selection tends to improve the population over time. In this
P=256,S=4 sense, evolution is simply “random search plus selection”. In
P,S=16 outline, the process can be described briefly: “keep a popula-
P,S=64
Other tion of N models and proceed in cycles: at each cycle, copy-
0.751 mutate the best of S random models and kill the oldest in
0.751 non-aging test accuracy 0.772 the population”. Implementation-wise, we believe the meth-
ods of this paper are sufficient for a reader to understand
Figure 7: Small-compute-scale comparison between our ag- evolution. The sophisticated nature of the RL alternative in-
ing tournament selection variant and the non-aging variant, troduces complexity in its implementation: it requires back-
for different population sizes (P) and sample sizes (S), show- propagation and poses challenges to parallelization [37].
ing that aging tends to be beneficial (most markers are above Even different implementations of the same algorithm have
the y = x line). been shown to produce different results [21]. Finally, evolu-
tion is also simple in that it has few meta-parameters, most
Understanding aging evolution and regularization. We of which do not need tuning [36]. In our study, we only ad-
can speculate that aging may help navigate the training justed 2 meta-parameters and only through a handful of at-
noise in evolutionary experiments, as follows. Noisy training tempts (see Methods Details section). In contrast, note that
means that models may sometimes reach high accuracy just the RL baseline requires training an agent/controller which
by luck. In non-aging evolution (NAE, i.e. standard tourna- is often itself a neural network with many weights (such as
ment selection), such lucky models may remain in the popu-
lation for a long time—even for the whole experiment. One 3
https://en.wikipedia.org/wiki/
lucky model, therefore, can produce many children, caus- Regularization_(mathematics)
an LSTM), and its optimization has more meta-parameters tions that permit the application of evolution to the popu-
to adjust: learning rate schedule, greediness, batching, re- lar NASNet search space.
play buffer, etc. (These meta-parameters are all in addition • We presented the first controlled comparison of algo-
to the weights and training parameters of the image classi- rithms for image classifier architecture search in a case
fiers being searched, which are present in both approaches.) study of evolution, RL and random search. We showed
It is possible that through careful tuning, RL could be made that evolution had somewhat faster search speed and stood
to produce even better models than evolution, but such tun- out in the regime of scarcer resources / early stopping.
ing would likely involve running many experiments, mak- Evolution also matched RL in final model quality, em-
ing it more costly. Evolution did not require much tuning, as ploying a simpler method.
described. It is also possible that random search would pro-
duce equally good models if run for a very long time, which • We evolved AmoebaNet-A (Figure 5), a competitive im-
would be very costly. age classifier. On ImageNet, it is the first evolved model
Interpreting architecture search. Another important di- to surpass hand-designs. Matching size, AmoebaNet-A
rection for future work is that of analyzing architecture- has comparable accuracy to top image-classifiers discov-
search experiments (regardless of the algorithm used) to try ered with other architecture-search methods. At large size,
to discover new neural network design patterns. Anecdo- it sets a new state-of-the-art accuracy. We open-sourced
tally, for example, we found that architectures with high out- code and checkpoint.5 .
put vertex fan-in (number of edges into the output vertex)
tend to be favored in all our experiments. In fact, the mod- Acknowledgments
els in the final evolved populations have a mean fan-in value We wish to thank Megan Kacholia, Vincent Vanhoucke, Xi-
that is 3 standard deviations above what would be expected aoqiang Zheng and especially Jeff Dean for their support and
from randomly generated models. We verified this pattern valuable input; Chris Ying for his work helping tune Amoe-
by training various models with different fan-in values and baNet models and for his help with specialized hardware,
the results confirm that accuracy increases with fan-in, as Barret Zoph and Vijay Vasudevan for help with the code
had been found in ResNeXt [48]. Discovering broader pat- and experiments used in their paper [54], as well as Jiquan
terns may require designing search spaces specifically for Ngiam, Jacques Pienaar, Arno Eigenwillig, Jianwei Xie,
this purpose. Derek Murray, Gabriel Bender, Golnaz Ghiasi, Saurabh Sax-
Additional AmoebaNets. Using variants of the evo- ena and Jie Tan for other coding contributions; Jacques Pien-
lutionary process described, we obtained three additional aar, Luke Metz, Chris Ying and Andrew Selle for manuscript
models, which we named AmoebaNet-B, AmoebaNet-C, and comments, all the above and Patrick Nguyen, Samy Ben-
AmoebaNet-D. We describe these models and the process gio, Geoffrey Hinton, Risto Miikkulainen, Jeff Clune, Ken-
that led to them in detail in Supplement D, but we sum- neth Stanley, Yifeng Lu, David Dohan, David So, David Ha,
marize here. AmoebaNet-B was obtained through through Vishy Tirumalashetty, Yoram Singer, and Ruoming Pang for
platform-aware architecture search over a larger version of helpful discussions; and the larger Google Brain team.
the NASNet space. AmoebaNet-C is simply a model that
showed promise early on in the above experiments by reach- References
ing high accuracy with relatively few parameters; we men-
[1] P. J. Angeline, G. M. Saunders, and J. B. Pollack. An
tion it here for completeness, as it has been referenced in
evolutionary algorithm that constructs recurrent neu-
other work [11]. AmoebaNet-D was obtained by manually
ral networks. IEEE transactions on Neural Networks,
extrapolating the evolutionary process and optimizing the
1994.
resulting architecture for training speed. It is very efficient:
AmoebaNet-D won the Stanford DAWNBench competition [2] B. Baker, O. Gupta, N. Naik, and R. Raskar. Design-
for lowest training cost on ImageNet [9]. ing neural network architectures using reinforcement
learning. In ICLR, 2017.
Conclusion [3] B. Baker, O. Gupta, R. Raskar, and N. Naik. Accelerat-
ing neural architecture search using performance pre-
This paper used an evolutionary algorithm to discover image
diction. ICLR Workshop, 2017.
classifier architectures. Our contributions are the following:
[4] J. Bergstra and Y. Bengio. Random search for hyper-
• We proposed aging evolution, a variant of tournament se- parameter optimization. JMLR, 2012.
lection by which genotypes die according to their age, fa-
voring the young. This improved upon standard tourna- [5] A. Brock, T. Lim, J. M. Ritchie, and N. Weston.
ment selection while still allowing for efficiency at scale Smash: one-shot model architecture search through hy-
through asynchronous population updating. We open- pernetworks. In ICLR, 2018.
sourced the code.4 We also implemented simple muta- [6] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Ef-
ficient architecture search by network transformation.
4
https://colab.research.google.com/github/ In AAAI, 2018.
google-research/google-research/blob/master/
5
evolution/regularized_evolution_algorithm/ https://tfhub.dev/google/imagenet/
regularized_evolution.ipynb amoebanet_a_n18_f448/classification/1
[7] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In NIPS, 2017.
[8] D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, 2012.
[9] C. Coleman, D. Kang, D. Narayanan, L. Nardi, T. Zhao, J. Zhang, P. Bailis, K. Olukotun, C. Re, and M. Zaharia. Analysis of dawnbench, a time-to-accuracy machine learning performance benchmark. arXiv preprint arXiv:1806.01427, 2018.
[10] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang. Adanet: Adaptive structural learning of artificial neural networks. In ICML, 2017.
[11] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. arXiv, 2018.
[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[13] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, 2017.
[14] T. Elsken, J.-H. Metzen, and F. Hutter. Simple and efficient architecture search for convolutional neural networks. ICLR Workshop, 2017.
[15] T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. arXiv, 2018.
[16] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In NIPS, 1990.
[17] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In NIPS, 2015.
[18] D. Floreano, P. Dürr, and C. Mattiussi. Neuroevolution: from architectures to learning. Evolutionary Intelligence, 2008.
[19] D. E. Goldberg and K. Deb. A comparative analysis of selection schemes used in genetic algorithms. FOGA, 1991.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. AAAI, 2018.
[22] G. S. Hornby. Alps: the age-layered population structure for reducing the problem of premature convergence. In GECCO, 2006.
[23] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. CVPR, 2018.
[24] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
[25] Y. Huang, Y. Cheng, D. Chen, H. Lee, J. Ngiam, Q. V. Le, and Z. Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.
[26] A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter. Learning curve prediction with bayesian neural networks. ICLR, 2017.
[27] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, Dept. of Computer Science, U. of Toronto, 2009.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[29] C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. ECCV, 2018.
[30] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. In ICLR, 2018.
[31] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, and F. Hutter. Towards automatically-tuned neural networks. In Workshop on Automatic Machine Learning, 2016.
[32] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, A. Navruzyan, N. Duffy, and B. Hodjat. Evolving deep neural networks. arXiv, 2017.
[33] G. F. Miller, P. M. Todd, and S. U. Hegde. Designing neural networks using genetic algorithms. In ICGA, 1989.
[34] R. Negrinho and G. Gordon. Deeparchitect: Automatically designing and training deep architectures. arXiv, 2017.
[35] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Faster discovery of neural architectures by searching for paths in a large model. ICLR Workshop, 2018.
[36] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Large-scale evolution of image classifiers. In ICML, 2017.
[37] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv, 2017.
[38] S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, 2016.
[39] J. P. Simmons, L. D. Nelson, and U. Simonsohn. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 2011.
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
[41] K. O. Stanley, B. D. Bryant, and R. Miikkulainen. Real-time neuroevolution in the nero video game. TEVC, 2005.
[42] K. O. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies. Evol. Comput., 2002.
[43] M. Suganuma, S. Shirakawa, and T. Nagao. A genetic programming approach to designing convolutional neural network architectures. In GECCO, 2017.
[44] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, 2017.
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[46] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus. Regularization of neural networks using dropconnect. In ICML, 2013.
[47] L. Xie and A. Yuille. Genetic CNN. In ICCV, 2017.
[48] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[49] X. Yao. Evolving artificial neural networks. IEEE, 1999.
[50] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
[51] X. Zhang, Z. Li, C. C. Loy, and D. Lin. Polynet: A pursuit of structural diversity in very deep networks. In CVPR, 2017.
[52] Z. Zhong, J. Yan, and C.-L. Liu. Practical network blocks design with q-learning. In AAAI, 2018.
[53] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2016.
[54] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
Supplement A: Evolution and Reinforcement Learning

Motivation

In this supplement, we will extend the comparison between evolution and reinforcement learning (RL) from the Results Section. Evolutionary algorithms and RL have been applied recently to the field of architecture search. Yet, comparison is difficult because studies tend to use novel search spaces, preventing direct attribution of the results to the algorithm. For example, the search space may be small instead of the algorithm being fast. The picture is blurred further by the use of different training techniques that affect model accuracy [8, 40, 46], different definitions of FLOPs that affect model compute cost,⁶ and different hardware platforms that affect algorithm run-time.⁷ Accounting for all these factors, we will compare the two approaches in a variety of image classification contexts. To achieve statistical confidence, we will present repeated experiments without sampling bias.

Setup

All evolution and RL experiments used the NASNet search space design [54]. Within this design, we define three concrete search spaces that differ in the number of pairwise combinations (C) and in the number of ops allowed (see Methods Section). In order of increasing size, we will refer to them as SP-I (e.g. Figure A-1f), SP-II, and SP-III (e.g. Figure A-1g). SP-I is the exact variant used in the main text and in the study that we use as our baseline [54]. SP-II increases the allowed ops from 8 to 19 (identity; 1x1 and 3x3 convs.; 3x3, 5x5 and 7x7 sep. convs.; 2x2 and 3x3 avg. pools; 2x2 min pool.; 2x2 and 3x3 max pools; 3x3, 5x5 and 7x7 dil. sep. convs.; 1x3 then 3x1 conv.; 1x7 then 7x1 conv.; 3x3 dil. conv. with rates 2, 4 and 6). SP-III allows for larger tree structures within the cells (C=15, same 19 ops).

The evolutionary algorithm is the same as that in the main text. The RL algorithm is the one used in the baseline study. We chose this baseline because, when we began, it had obtained the most accurate results on CIFAR-10, a popular dataset for image classifier architecture search.

We ran evolution and RL experiments for comparison purposes at different compute scales, always ensuring both approaches used identical conditions. In particular, evolution and RL used the same code for network construction, training and evaluation. The experiments in this supplement were performed at a smaller compute scale than in the main text, to reduce resource usage: we used gray-scale versions of popular datasets (e.g. “G-Imagenet” instead of ImageNet), we ran on CPU instead of GPU and trained relatively small models (F=8, see Methods Details in main text) for only 4 epochs. Where unstated, the experiments ran on SP-I and G-CIFAR.

Findings

We first optimized the meta-parameters for evolution and for RL by running experiments with each algorithm, repeatedly, under each condition (Figure A-1a). We then compared the algorithms in 5 different contexts by swapping the dataset or the search space (Figure A-1b). Evolution was either better than or equal to RL, with statistical significance. The best contexts for RL and for evolution are shown in more detail in Figures A-1c and A-1d, respectively. They show the progress of 5 repeats of each algorithm. The initial speed of evolution is noticeable, especially in the largest search space (SP-III). Figures A-1f and A-1g illustrate the top architectures from SP-I and SP-III, respectively. Regardless of context, Figure A-1e indicates that accuracy under evolution increases significantly faster than RL at the initial stage. This stage was not accelerated by higher RL learning rates.

Outcome

The main text provides a comparison between algorithms for image classifier architecture search in the context of the SP-I search space on CIFAR-10, at scale. This supplement extends those results, varying the dataset and the search space by running many small experiments, confirming the conclusions of the main text.

⁶ For example, see https://stackoverflow.com/questions/329174/what-is-flop-s-and-is-it-a-good-measure-of-performance.
⁷ A Tesla P100 can be twice as fast as a K40, for example.
[Figure A-1, panels (a) to (g). (a) G-CIFAR MVA for RL learning rates from lr=0.000025 to lr=0.0128 and for evolution population/sample sizes from P=16,S=4 to P=1024,S=256. (b) MTA at 20k models and (e) MTA at 5k models, Evol. vs. RL, with vertical axes G-CIFAR MTA, MNIST MTA and G-ImageNet MTA. (c), (d) G-CIFAR MTA against the number of models m, from 0 to 20000. (f), (g) Cell diagrams. Op legend: 0 = sep. 3x3, 1 = sep. 5x5, 2 = sep. 7x7, 3 = none, 4 = avg. 3x3, 5 = max 3x3, 6 = dil. 3x3, 7 = 1x7+7x1.]

Figure A-1: Evolution and RL in different contexts. Plots show repeated evolution (orange) and RL (blue) experiments side-by-
side. (a) Summary of hyper-parameter optimization experiments on G-CIFAR. We swept the learning rate (lr) for RL (left) and
the population size (P) and sample size (S) for evolution (right). We ran 5 experiments (circles) for each scenario. The vertical
axis measures the mean validation accuracy (MVA) of the top 100 models in an experiment. Superposed on the raw data are
± 2 SEM error bars. From these results, we selected best meta-parameters to use in the remainder of this figure. (b) We assessed
robustness by running the same experiments in 5 different contexts, spanning different datasets and search spaces: G-CIFAR/SP-
I, G-CIFAR/SP-II, G-CIFAR/SP-III, MNIST/SP-I and G-ImageNet/SP-I, shown from left to right. These experiments ran to 20k
models. The vertical axis measures the mean testing accuracy (MTA) of the top 100 models (selected by validation accuracy).
(c) and (d) show a detailed view of the progress of the experiments in the G-CIFAR/SP-II and G-CIFAR/SP-III contexts,
respectively. The horizontal axes indicate the number of models (m) produced as the experiment progresses. (e) Resource-
constrained settings may require stopping experiments early. At 5k models, evolution performs better than RL in all 5 contexts.
(f) and (g) show a stack of normal cells of the best model found for G-CIFAR in the SP-I and SP-III search spaces, respectively.
The “h” labels some of the hidden states. The ops (“avg 3x3”, etc.) are listed in full form in the text. Data flows from left
to right. See the baseline study for a detailed description of these diagrams. In (f), N=3 (see Methods section), so the cell is
replicated three times; i.e. the left two-thirds of the diagram (grayed out) are constrained to mirror the right third. This is in
contrast with the vastly larger SP-III search space of (g), where a bigger, unconstrained construct without replication (N=1) is
explored.
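The MTA metric used in panels (b) and (e) can be illustrated with a short Python sketch. It assumes each model is summarized as a hypothetical `(validation_accuracy, test_accuracy)` pair; the function name and data layout are illustrative choices, not taken from the released code.

```python
from statistics import mean


def mean_testing_accuracy(models, top_k=100):
    """Mean test accuracy of the top_k models, ranked by validation accuracy.

    `models` is an iterable of (validation_accuracy, test_accuracy) pairs.
    This layout is an assumption made for the sketch.
    """
    # Rank on validation accuracy only, so test data never drives selection.
    ranked = sorted(models, key=lambda pair: pair[0], reverse=True)
    top = ranked[:top_k]
    return mean(test_acc for _, test_acc in top)
```

Selecting on validation accuracy and reporting test accuracy keeps the reported number free of selection bias on the test set, which is presumably why the caption specifies the split used for ranking.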
Supplement B: Aging and Non-Aging Evolution

Motivation

In this supplement, we will extend the comparison between aging evolution (AE) and standard tournament selection / non-aging evolution (NAE). As was described in the Methods Section, the evolutionary algorithm used in this paper keeps the population size constant by always removing the oldest model whenever a new one is added; we will refer to this algorithm as AE. A recent paper used a similar method but kept the population size constant by removing the worst model in each tournament [36]; we will refer to that algorithm as NAE. This supplement will show how these two algorithms compare in a variety of contexts.

Setup

The search spaces and datasets were the same as in Supplement A.

Findings

We performed experiments in 5 different search space–dataset contexts. In each context, we ran several repeats of evolutionary search using NAE and AE (Figure B-1). Under 4 of the 5 contexts, AE resulted in statistically significantly higher accuracy at the end of the runs, on average. The exception was the G-ImageNet context, where the experiments were extremely short due to the compute demands of training on so much data using only CPUs. Interestingly, in the two contexts where the search space was bigger (SP-II and SP-III), all AE runs did better than all NAE runs.

Additionally, we performed three experiments comparing AE and NAE at scale, under the same conditions as in the main text. The results, which can be seen in Figure B-2, provide some verification that observations from smaller CPU experiments in the previous paragraph generalize to the large-compute regime.

[Figure B-1: final MTA of NAE and AE experiments in adjacent columns. Vertical axes: Grayscale-CIFAR MTA (0.7500 to 0.7730), MNIST MTA (0.9944 to 0.9951) and Grayscale-ImageNet MTA (0.0320 to 0.0422).]

Figure B-1: A comparison of NAE and AE under 5 different contexts, spanning different datasets and search spaces: G-CIFAR/SP-I, G-CIFAR/SP-II, G-CIFAR/SP-III, MNIST/SP-I and G-ImageNet/SP-I, shown from left to right. For each context, we show the final MTA of a few NAE and a few AE experiments (circles) in adjacent columns. We superpose ± 2 SEM error bars, where SEM denotes the standard error of the mean. The first context contains many repeats with identical meta-parameters and their MTA values seem normally distributed (Shapiro–Wilk test). Under this normality assumption, the error bars represent 95% confidence intervals.

[Figure B-2: MTA (vertical axis, 0.896 to 0.916) against the number of models m (horizontal axis, 0 to 20000). Legend, top to bottom: P=64, S=16 (AE); P=256, S=4 (NAE); P=1000, S=2 (NAE).]

Figure B-2: A comparison of AE and NAE at scale. These experiments use the same conditions as the main text (including dataset, search space, resources and duration). From top to bottom: an AE experiment with good AE meta-parameters from Supplement A, an analogous NAE experiment, and an NAE experiment with the meta-parameters used in a recent study [36]. These accuracy values are not meaningful in absolute terms, as the models need to be augmented to reach their maximum accuracy, as described in the Methods Section.

Outcome

The Discussion Section in the main text suggested that AE tends to perform better than NAE across various parameters for one fixed search space–dataset context. Such robustness is desirable for computationally demanding architecture search experiments, where we cannot always afford many runs to optimize the meta-parameters. This supplement extends those results to show that the conclusion holds across various contexts.
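The ± 2 SEM error bars used in Figure B-1 can be computed from repeated runs as in the following Python sketch. The function name and the example values are hypothetical; the sketch uses the sample standard deviation, which is an assumption, since the text does not say which estimator was used.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_two_sem(values):
    """Return (mean, half-width) for a mean +/- 2 SEM error bar.

    SEM is the standard error of the mean: sample standard deviation
    divided by sqrt(n). Using the sample (n - 1) estimator is an assumption.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two repeats to estimate the SEM")
    sem = stdev(values) / sqrt(n)
    return mean(values), 2.0 * sem


# Hypothetical final MTA values from five repeats of one configuration.
center, half_width = mean_and_two_sem([0.761, 0.764, 0.759, 0.766, 0.763])
interval = (center - half_width, center + half_width)
```

For normally distributed run outcomes, an interval of about 1.96 SEM either side of the mean covers 95%, so ± 2 SEM is a close approximation; with only a handful of repeats, a Student-t interval would be somewhat wider.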
Supplement C: Aging Evolution in Toy Search Space

Motivation

As indicated in the Discussion Section, we suspect that aging may help navigate the noisy evaluation in an evolution experiment. We leave verification of this suspicion to future work, but for motivation we provide here a sanity check for it. We construct a toy search space in which the only difficulty is a noisy evaluation. Within this toy search space, we will see that aging evolution outperforms non-aging evolution.

Setup

The toy search space we use here does not involve any neural networks. The goal is to evolve solutions to a very simple, single-optimum, D-dimensional, noisy optimization problem with a signal-to-noise ratio matching that of our neuro-evolution experiments.

The search space used is the set of vertices of a D-dimensional unit cube. A specific vertex is “analogous” to a neural network architecture in a real experiment. A vertex can be represented as a sequence of its coordinates (0s and 1s), that is, a bit-string. In other words, this bit-string constitutes a simulated architecture. In a real experiment, training and evaluating an architecture yields a noisy accuracy. Likewise, in this toy search space, we assign a noisy simulated accuracy (SA) to each cube vertex. The SA is the fraction of coordinates that are zero, plus a small amount of Gaussian noise (µ = 0, σ = 0.01, matching the observed noise for neural networks). Thus, the goal is to get close to the optimum, the origin. The sample complexity used was 10k. This space is helpful because an experiment completes in milliseconds.

This optimization problem can be seen as a simplification of the evolutionary search for the minimum of a multi-dimensional integer-valued paraboloid with bounded support, where the mutations treat the values along each coordinate categorically. If we restrict the domain along each direction to the set {0, 1}, we reduce the problem to the unit cube described above. The paraboloid’s value at a cube corner is just the number of coordinates that are not zero. We mention this connection because searching for the minimum of a paraboloid seems like a more natural choice for a trivial problem (“trivial” compared to architecture search). The simpler unit cube version, however, was chosen because it permits faster computation.

We stress that these simulations are not intended to truly mimic architecture search experiments over the space of neural networks. We used them only as a testing ground for techniques that evolve solutions in the presence of noisy evaluations.

Findings

We found that optimized NAE and AE perform similarly in low-dimensional problems, which are easier. As the dimensionality (D) increases, AE becomes relatively better than NAE (Figure C-1).

[Figure C-1: simulated accuracy SA (vertical axis, 0.65 to 1.00) against log2 D (horizontal axis, 6 to 9) for AE and NAE.]

Figure C-1: Results in the toy search space. The graph summarizes thousands of evolutionary search simulations. The vertical axis measures the simulated accuracy (SA) and the horizontal axis the dimensionality (D) of the problem, a measure of its difficulty. For each D, we optimized the meta-parameters for NAE and AE independently. To do this, we carried out 100 simulations for each meta-parameter combination and averaged the outcomes. We plot here the optima found, together with ± 2 SEM error bars. The graph shows that in this toy search space, AE is never worse and is significantly better for larger D (note the broad range of the vertical axis).

Outcome

The findings provide circumstantial evidence in favor of our suspicion that aging may help navigate noise (Discussion Section), suggesting that attempting to verify this with more generality may be an interesting direction for future work.
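The toy setup of this supplement is simple enough to sketch in Python. The sketch below is not the authors' simulation code, and several details are assumptions made for illustration: a mutation flips one uniformly chosen bit, the tournament sample is drawn without replacement, the defaults P=64 and S=16 are arbitrary, and the reported outcome is the noise-free quality of the individual with the best noisy SA. All function and parameter names are hypothetical.

```python
import collections
import random


def simulated_accuracy(bits, rng, sigma=0.01):
    """Noisy SA: fraction of zero coordinates plus Gaussian noise."""
    return bits.count(0) / len(bits) + rng.gauss(0.0, sigma)


def mutate(bits, rng):
    """Flip one random coordinate (an assumed mutation for this sketch)."""
    child = list(bits)
    index = rng.randrange(len(child))
    child[index] = 1 - child[index]
    return tuple(child)


def evolve(dim, aging, population_size=64, sample_size=16,
           cycles=10_000, seed=0):
    """Run AE (aging=True) or NAE (aging=False) on the D-dimensional cube.

    `cycles` is the total number of individuals evaluated. Returns the
    noise-free fraction of zeros of the individual with the best noisy SA.
    """
    rng = random.Random(seed)
    population = collections.deque()  # oldest individual sits on the left
    history = []
    while len(population) < population_size:
        bits = tuple(rng.randint(0, 1) for _ in range(dim))
        individual = (simulated_accuracy(bits, rng), bits)
        population.append(individual)
        history.append(individual)
    while len(history) < cycles:
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=lambda ind: ind[0])
        child_bits = mutate(parent[1], rng)
        child = (simulated_accuracy(child_bits, rng), child_bits)
        population.append(child)
        history.append(child)
        if aging:
            population.popleft()  # AE: remove the oldest
        else:
            worst = min(sample, key=lambda ind: ind[0])
            population.remove(worst)  # NAE: remove the tournament's worst
    best = max(history, key=lambda ind: ind[0])
    return best[1].count(0) / dim
```

Calling `evolve(dim=256, aging=True)` and `evolve(dim=256, aging=False)` over many seeds, and averaging, gives a rough analogue of one point in Figure C-1. A single run is too noisy to rank the two variants, and the meta-parameters would need to be tuned separately for each variant, as the caption describes.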
Supplement D: Additional AmoebaNets

Motivation lection of the top model was as follows. We picked from the
In the Discussion Section, we briefly mentioned three experiment K=100 models. To do this, we binned the mod-
additional models, AmoebaNet-B, AmoebaNet-C, and els by their number of parameters to cover the range, using
AmoebaNet-D. While all three used the aging evolution B bins. From each bin, we took the top K/B models by validation
algorithm presented in the main text, there were some accuracy. We then augmented all models to N=6 and
differences in the experimental setups: AmoebaNet-B F=32 and selected the one with the top validation accuracy.
was obtained through platform-aware architecture search; AmoebaNet-C was discovered in the experiments de-
AmoebaNet-C was selected with a pareto-optimal criterion; scribed in the main text (see Methods and Methods Details
and AmoebaNet-D involved multi-stage search, including sections). Instead of selecting the highest validation accu-
manual extrapolation of the evolutionary process. Below we racy at the end of the experiments (as done in the main
describe each of these models and the methods that produced text), we picked a promising model while the experiments
them. were still ongoing. This was done entirely for expediency,
to be able to study a model while we waited for the search
Setup to complete. AmoebaNet-C was promising in that it stood
AmoebaNet-B was evolved by running experiments directly out in a pareto-optimal sense: it was a high-accuracy outlier
on Google TPUv2 hardware, since this was the target plat- for its relatively small number of parameters. As opposed to
form for its final evaluation. In the main text, the architecture all other architectures, AmoebaNet-C was selected based on
had been discovered on GPU but the largest model was eval- CIFAR-10 test accuracy, because it was intended to only be
uated on TPUs. In contrast, here we perform the full process benchmarked on ImageNet. This process was less methodi-
on TPUs. This architecture-aware approach allows the evo- cal than the one used in the main text but because the model
lutionary search to optimize even hardware-dependent as- has been cited in the literature, we include it here for com-
pects of the final accuracy, such as optimizations carried out pleteness.
by the compiler. The search setup was as in the main text, ex- AmoebaNet-D was obtained by manually modifying
cept that it used the larger SP-II space of Supplement A and AmoebaNet-B by extrapolating evolution. To do this, we
trained larger models (F=32) for longer (50 epochs). The se- studied the progress of an experiment and identified which

7 7 7
Softmax
+ + +
+ 1x7 1x7
avg 1x1 avg
1x1 none 1x1
3x3 + + + 7x1 7x1 3x3
avg sep sep
Normal Cell xN sep
+ 3x3 3x3 3x3
none
sep
3x3
avg
3x3
none +
3x3
none 1x1
+
max
none
3x3
+ + 1
Reduction Cell max
3x3
1x1
1x1
+
sep
avg
3x3
sep
3x3 none
+
sep
3x3 3x3
1 +
max
1x1
3x3
1 ...
Normal Cell xN ...
...

0 0 0
Reduction Cell
7 7 7

+ + + +
+ + 1x7 max avg
Normal Cell xN avg
3x3
1x1 +
sep
sep
7x7
sep
3x3
max
3x3
sep
3x3
sep
5x5
sep
5x5
7x1 3x3 3x3
1x1
none
+
max
none 2x2
3x3

+ +
+ + max sep
dil max none 3x3
Reduction Cell x2 5x5 3x3 none 3x3 3x3 7x7

1 +
max max
2x2 3x3
1
3x3 conv, stride 2 max
+
max
max
+
max ... 1
2x2 3x3
3x3 3x3
...
...

Input Image 0 0 0

Figure D-1: Architectures of overall model and cells. From left to right: outline of the overall model [54] and diagrams for the
cell architectures discovered by evolution: AmoebaNet-B, AmoebaNet-C, and AmoebaNet-D. The three normal cells are on
the top row and the three reduction cells are on the bottom row. The labeled activations or hidden states correspond to the cell
inputs (“0” and “1”) and the cell output (“7”).
mutations were still causing improvements in fitness at the later stages of the process. By inspection, we found these mutations to be: replacing a 3x3 separable (sep.) convolution (conv.) with a 1x7 followed by 7x1 conv. in the normal cell, replacing a 5x5 sep. conv. with a 1x7 followed by 7x1 conv. in the reduction cell, and replacing a 3x3 sep. conv. with a 3x3 avg. pool in the reduction cell. Additionally, we reduced the numeric precision from 32-bit to 16-bit floats, and set a learning rate schedule of step-wise decay, reducing by a factor of 0.88 every epoch. We trained for 35 epochs in total. To submit to Stanford DAWNBench (see Outcome section), we used N=6 and F=256.
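
The binned selection used to pick AmoebaNet-B candidates (K models spread over B parameter-count bins, top K/B per bin by validation accuracy) can be sketched as follows. This is a minimal illustration, not the procedure's actual implementation: the text does not say how bin edges were chosen, so equal-width bins over the observed parameter range are an assumption, and the `Model` record and the function name `select_top_per_bin` are hypothetical.

```python
from collections import namedtuple

# Hypothetical record for a model found during the search.
Model = namedtuple("Model", ["name", "num_params", "val_accuracy"])


def select_top_per_bin(models, k, num_bins):
    """Pick up to k models: the top k / num_bins by validation accuracy per bin.

    Bins are equal-width intervals over the observed range of parameter
    counts (an assumption; the bin edges are not specified in the text).
    """
    if k % num_bins != 0:
        raise ValueError("k must be divisible by num_bins")
    per_bin = k // num_bins

    low = min(m.num_params for m in models)
    high = max(m.num_params for m in models)
    width = (high - low) / num_bins

    bins = [[] for _ in range(num_bins)]
    for m in models:
        if width == 0:
            index = 0  # all models have the same size
        else:
            # Clamp so the largest model falls in the last bin.
            index = min(int((m.num_params - low) / width), num_bins - 1)
        bins[index].append(m)

    selected = []
    for members in bins:
        ranked = sorted(members, key=lambda m: m.val_accuracy, reverse=True)
        selected.extend(ranked[:per_bin])
    return selected


if __name__ == "__main__":
    candidates = [
        Model("a", 10, 0.90),
        Model("b", 12, 0.95),
        Model("c", 90, 0.80),
        Model("d", 100, 0.85),
        Model("e", 11, 0.70),
    ]
    chosen = select_top_per_bin(candidates, k=2, num_bins=2)
    print([m.name for m in chosen])
```

In the procedure described above, the selected models would then be scaled up to N=6 and F=32, retrained, and compared by validation accuracy; that retraining step is outside this sketch.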

Findings
Figure D-1 presents all three model architectures. We refrain from benchmarking these here. Instead, in the Outcome Section below, we will refer the reader to results presented elsewhere.

Outcome
In this supplement we have described additional evolutionary experiments that led to three new models. Such experiments were intended mainly to search for better models. Due to the resource-intensive nature of these methods, we forewent ablations and baselines in this supplement. For a more empirically rigorous approach, please refer to the process that produced AmoebaNet-A in the main text.
AmoebaNet-B had set a new state of the art on CIFAR-10 (2.13% test error) in a previous preprint of this paper (see footnote 8) after being trained with cutout, but has since been superseded.
AmoebaNet-C had set the previous state-of-the-art top-1 accuracy on ImageNet after being trained with advanced data augmentation techniques in [11].
AmoebaNet-D won the Stanford DAWNBench competition for lowest training cost on ImageNet. The goal of this competition category was to minimize the monetary cost of training a model to 93% top-5 accuracy. AmoebaNet-D costs $49.30 to train. This was 16% better than the second-best model, which was ResNet and which trained on the same hardware. The results were published in [9].

8 Version 1 with same title on arXiv: https://arxiv.org/pdf/1802.01548v1.pdf
