Regularized Evolution For Image Classifier Architecture Search
Esteban Real∗† and Alok Aggarwal† and Yanping Huang† and Quoc V. Le
Google Brain, Mountain View, California, USA
† Equal contribution. ∗ Correspondence: [email protected]
… to train each model: machines that train faster models finish earlier and must wait idle until all machines are ready. Real-time algorithms address this issue, e.g. rtNEAT [41] and tournament selection [19]. Unlike the generational algorithms, however, these discard models according to their …
Figure 5: AmoebaNet-A architecture. The overall model [54] (LEFT) and the AmoebaNet-A normal cell (MIDDLE) and
reduction cell (RIGHT).
Table 2: ImageNet classification results for AmoebaNet-A compared to hand-designs (top rows) and other automated methods
(middle rows). The evolved AmoebaNet-A architecture (bottom rows) reaches the current state of the art (SOTA) at similar
model sizes and sets a new SOTA at a larger size. All evolution-based approaches are marked with a ∗ . We omitted Squeeze-
and-Excite-Net because it was not benchmarked on the same ImageNet dataset version.
… then transferred to ImageNet, so the evolved architecture cannot have overfit the ImageNet dataset. When re-trained on ImageNet, AmoebaNet-A performs comparably to the baseline for the same number of parameters (Table 2, model with F=190).
Supplement A: Evolution and Reinforcement Learning

Motivation
In this supplement, we will extend the comparison between evolution and reinforcement learning (RL) from the Results Section. Evolutionary algorithms and RL have been applied recently to the field of architecture search. Yet, comparison is difficult because studies tend to use novel search spaces, preventing direct attribution of the results to the algorithm. For example, the search space may be small instead of the algorithm being fast. The picture is blurred further by the use of different training techniques that affect model accuracy [8, 40, 46], different definitions of FLOPs that affect model compute cost6, and different hardware platforms that affect algorithm run-time7. Accounting for all these factors, we will compare the two approaches in a variety of image classification contexts. To achieve statistical confidence, we will present repeated experiments without sampling bias.

Setup
All evolution and RL experiments used the NASNet search space design [54]. Within this design, we define three concrete search spaces that differ in the number of pairwise combinations (C) and in the number of ops allowed (see Methods Section). In order of increasing size, we will refer to them as SP-I (e.g. Figure A-1f), SP-II, and SP-III (e.g. Figure A-1g). SP-I is the exact variant used in the main text and in the study that we use as our baseline [54]. SP-II increases the allowed ops from 8 to 19 (identity; 1x1 and 3x3 convs.; 3x3, 5x5 and 7x7 sep. convs.; 2x2 and 3x3 avg. pools; 2x2 min pool; 2x2 and 3x3 max pools; 3x3, 5x5 and 7x7 dil. sep. convs.; 1x3 then 3x1 conv.; 1x7 then 7x1 conv.; 3x3 dil. conv. with rates 2, 4 and 6). SP-III allows for larger tree structures within the cells (C=15, same 19 ops).

The evolutionary algorithm is the same as that in the main text. The RL algorithm is the one used in the baseline study. We chose this baseline because, when we began, it had obtained the most accurate results on CIFAR-10, a popular dataset for image classifier architecture search.

We ran evolution and RL experiments for comparison purposes at different compute scales, always ensuring both approaches used identical conditions. In particular, evolution and RL used the same code for network construction, training and evaluation. The experiments in this supplement were performed at a smaller compute scale than in the main text, to reduce resource usage: we used gray-scale versions of popular datasets (e.g. "G-ImageNet" instead of ImageNet), we ran on CPU instead of GPU, and we trained relatively small models (F=8, see Methods Details in main text) for only 4 epochs. Where unstated, the experiments ran on SP-I and G-CIFAR.

Findings
We first optimized the meta-parameters for evolution and for RL by running experiments with each algorithm, repeatedly, under each condition (Figure A-1a). We then compared the algorithms in 5 different contexts by swapping the dataset or the search space (Figure A-1b). Evolution was either better than or equal to RL, with statistical significance. The best contexts for evolution and for RL are shown in more detail in Figures A-1c and A-1d, respectively. They show the progress of 5 repeats of each algorithm. The initial speed of evolution is noticeable, especially in the largest search space (SP-III). Figures A-1f and A-1g illustrate the top architectures from SP-I and SP-III, respectively. Regardless of context, Figure A-1e indicates that accuracy under evolution increases significantly faster than under RL at the initial stage. This stage was not accelerated by higher RL learning rates.

Outcome
The main text provides a comparison between algorithms for image classifier architecture search in the context of the SP-I search space on CIFAR-10, at scale. This supplement extends those results, varying the dataset and the search space by running many small experiments, confirming the conclusions of the main text.

6 For example, see https://stackoverflow.com/questions/329174/what-is-flop-s-and-is-it-a-good-measure-of-performance.
7 A Tesla P100 can be twice as fast as a K40, for example.
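To make the three search spaces concrete, the sketch below encodes them as plain configuration data. The SP-I op set is read off the Figure A-1 legend, the SP-II/SP-III op set is the 19-op list above, and the op names are informal shorthand; C=5 for SP-I and SP-II is an assumption carried over from the main text's setting (this supplement only states C=15 for SP-III).

```python
# Sketch only: the three search spaces as configuration data (not the
# authors' code). Op names are shorthand for the ops listed in the Setup.
SP_I_OPS = [
    "sep_3x3", "sep_5x5", "sep_7x7", "none",
    "avg_3x3", "max_3x3", "dil_3x3", "conv_1x7_7x1",
]  # the 8 ops, per the Figure A-1 legend

SP_II_OPS = [
    "identity", "conv_1x1", "conv_3x3",
    "sep_3x3", "sep_5x5", "sep_7x7",
    "avg_2x2", "avg_3x3", "min_2x2", "max_2x2", "max_3x3",
    "dil_sep_3x3", "dil_sep_5x5", "dil_sep_7x7",
    "conv_1x3_3x1", "conv_1x7_7x1",
    "dil_3x3_rate_2", "dil_3x3_rate_4", "dil_3x3_rate_6",
]  # the 19 ops enumerated in the Setup
assert len(SP_II_OPS) == 19

SP_I = {"C": 5, "ops": SP_I_OPS}       # C=5 assumed from the main text
SP_II = {"C": 5, "ops": SP_II_OPS}     # same C, more ops
SP_III = {"C": 15, "ops": SP_II_OPS}   # larger cells, same 19 ops
```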
[Figure A-1, panels (a)–(g): plots of the evolution (orange) and RL (blue) experiments; panel (e) reports MTA at 5k models on G-CIFAR, MNIST, and G-ImageNet. Op legend for panels (f) and (g): 0 = sep. 3x3; 1 = sep. 5x5; 2 = sep. 7x7; 3 = none; 4 = avg. 3x3; 5 = max 3x3; 6 = dil. 3x3; 7 = 1x7+7x1.]
Figure A-1: Evolution and RL in different contexts. Plots show repeated evolution (orange) and RL (blue) experiments side-by-
side. (a) Summary of hyper-parameter optimization experiments on G-CIFAR. We swept the learning rate (lr) for RL (left) and
the population size (P) and sample size (S) for evolution (right). We ran 5 experiments (circles) for each scenario. The vertical
axis measures the mean validation accuracy (MVA) of the top 100 models in an experiment. Superposed on the raw data are
± 2 SEM error bars. From these results, we selected the best meta-parameters to use in the remainder of this figure. (b) We assessed
robustness by running the same experiments in 5 different contexts, spanning different datasets and search spaces: G-CIFAR/SP-
I, G-CIFAR/SP-II, G-CIFAR/SP-III, MNIST/SP-I and G-ImageNet/SP-I, shown from left to right. These experiments ran to 20k
models. The vertical axis measures the mean testing accuracy (MTA) of the top 100 models (selected by validation accuracy).
(c) and (d) show a detailed view of the progress of the experiments in the G-CIFAR/SP-II and G-CIFAR/SP-III contexts,
respectively. The horizontal axes indicate the number of models (m) produced as the experiment progresses. (e) Resource-
constrained settings may require stopping experiments early. At 5k models, evolution performs better than RL in all 5 contexts.
(f) and (g) show a stack of normal cells of the best model found for G-CIFAR in the SP-I and SP-III search spaces, respectively.
The “h” labels some of the hidden states. The ops (“avg 3x3”, etc.) are listed in full form in the text. Data flows from left
to right. See the baseline study for a detailed description of these diagrams. In (f), N=3 (see Methods section), so the cell is
replicated three times; i.e. the left two-thirds of the diagram (grayed out) are constrained to mirror the right third. This is in
contrast with the vastly larger SP-III search space of (g), where a bigger, unconstrained construct without replication (N=1) is
explored.
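Since the MVA and MTA metrics appear throughout Figure A-1, here is a minimal sketch of how they could be computed from an experiment's model list; it assumes each model is recorded as a (validation accuracy, testing accuracy) pair and is not the authors' evaluation code.

```python
def mean_top_accuracies(models, k=100):
    """MVA and MTA as defined in the Figure A-1 caption: mean validation and
    mean testing accuracy over the top-k models, ranked by validation accuracy.
    `models` is a list of (validation_accuracy, testing_accuracy) pairs."""
    top = sorted(models, key=lambda m: m[0], reverse=True)[:k]
    mva = sum(v for v, _ in top) / len(top)
    mta = sum(t for _, t in top) / len(top)
    return mva, mta
```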
Supplement B: Aging and Non-Aging Evolution
Setup
The search spaces and datasets were the same as in Supplement A.

Findings
Figure B-2: A comparison of AE and NAE at scale. These experiments use the same conditions as the main text (in- …). Legend: P=64, S=16 (AE); P=256, S=4 (NAE); P=1000, S=2 (NAE); vertical axis: MTA; horizontal axis: number of models (m), up to 20000.
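For reference, the sketch below illustrates the single step in which aging evolution (AE) and non-aging evolution (NAE) differ; it is a minimal illustration under the definitions used in this paper (AE discards the oldest individual, NAE the lowest-accuracy one), not the authors' implementation, and the defaults mirror the AE setting of Figure B-2.

```python
import collections
import random

def evolve(random_arch, mutate, evaluate, aging=True, P=64, S=16, cycles=20000):
    """Tournament-selection evolution. aging=True is AE; aging=False is NAE.
    random_arch, mutate and evaluate are problem-specific placeholders."""
    population = collections.deque()
    history = []
    while len(population) < P:                       # initial random population
        arch = random_arch()
        population.append((arch, evaluate(arch)))
    history.extend(population)
    while len(history) < cycles:
        sample = random.sample(list(population), S)  # tournament sample of size S
        parent = max(sample, key=lambda x: x[1])     # best of the sample
        child = mutate(parent[0])
        child_entry = (child, evaluate(child))
        population.append(child_entry)
        history.append(child_entry)
        if aging:
            population.popleft()                     # AE: remove the oldest
        else:                                        # NAE: remove the worst
            population.remove(min(population, key=lambda x: x[1]))
    return max(history, key=lambda x: x[1])          # best model ever evaluated
```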
Supplement C: Toy Search Space

Motivation
… will see that aging evolution outperforms non-aging evolution.

Setup
The toy search space we use here does not involve any neural networks. The goal is to evolve solutions to a very simple, single-optimum, D-dimensional, noisy optimization problem with a signal-to-noise ratio matching that of our neuro-evolution experiments.

The search space used is the set of vertices of a D-dimensional unit cube. A specific vertex is "analogous" to a neural network architecture in a real experiment. A vertex can be represented as a sequence of its coordinates (0s and 1s)—a bit-string. In other words, this bit-string constitutes a simulated architecture. In a real experiment, training and evaluating an architecture yields a noisy accuracy. Likewise, in this toy search space, we assign a noisy simulated accuracy (SA) to each cube vertex. The SA is the fraction of coordinates that are zero, plus a small amount of Gaussian noise (µ = 0, σ = 0.01, matching the observed noise for neural networks). Thus, the goal is to get close to the optimum, the origin. The sample complexity used was 10k. This space is helpful because an experiment completes in milliseconds.

This optimization problem can be seen as a simplification of the evolutionary search for the minimum of a multi-dimensional integer-valued paraboloid with bounded support, where the mutations treat the values along each coordinate categorically. If we restrict the domain along each direction to the set {0, 1}, we reduce the problem to the unit cube described above. The paraboloid's value at a cube corner is just the number of coordinates that are not zero. We mention this connection because searching for the minimum of a paraboloid seems like a more natural choice for a trivial problem ("trivial" compared to architecture search). The simpler unit cube version, however, was chosen because it permits faster computation.

We stress that these simulations are not intended to truly mimic architecture search experiments over the space of neural networks. We used them only as a testing ground for techniques that evolve solutions in the presence of noisy evaluations.
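As a concrete reading of this setup, the following sketch (not the authors' code) implements the simulated accuracy; the mutation shown, flipping one randomly chosen coordinate, is an assumption consistent with treating each coordinate categorically. Plugged into the AE/NAE loop sketched in Supplement B, it reproduces the flavor of these simulations.

```python
import random

def simulated_accuracy(vertex, sigma=0.01):
    """SA of a cube vertex: the fraction of coordinates that are zero,
    plus Gaussian noise with mean 0 and standard deviation 0.01."""
    zeros = sum(1 for coordinate in vertex if coordinate == 0)
    return zeros / len(vertex) + random.gauss(0.0, sigma)

def flip_one_coordinate(vertex):
    """Assumed mutation: flip a single randomly chosen coordinate."""
    i = random.randrange(len(vertex))
    return vertex[:i] + (1 - vertex[i],) + vertex[i + 1:]

def random_vertex(D=16):
    """A random simulated architecture in the D-dimensional unit cube."""
    return tuple(random.randint(0, 1) for _ in range(D))
```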
Findings
We found that optimized NAE and AE perform similarly in low-dimensional problems, which are easier. As the dimensionality (D) increases, AE becomes relatively better than NAE (Figure C-1).

Figure C-1: Results in the toy search space. The graph summarizes thousands of evolutionary search simulations. The vertical axis measures the simulated accuracy (SA) and the horizontal axis the dimensionality (D) of the problem, a measure of its difficulty. For each D, we optimized the meta-parameters for NAE and AE independently. To do this, we carried out 100 simulations for each meta-parameter combination and averaged the outcomes. We plot here the optima found, together with ± 2 SEM error bars. The graph shows that in this toy search space, AE is never worse and is significantly better for larger D (note the broad range of the vertical axis).

Outcome
The findings provide circumstantial evidence in favor of our suspicion that aging may help navigate noise (Discussion Section), suggesting that attempting to verify this with more generality may be an interesting direction for future work.
Supplement D: Additional AmoebaNets
Motivation
In the Discussion Section, we briefly mentioned three additional models, AmoebaNet-B, AmoebaNet-C, and AmoebaNet-D. While all three used the aging evolution algorithm presented in the main text, there were some differences in the experimental setups: AmoebaNet-B was obtained through platform-aware architecture search; AmoebaNet-C was selected with a Pareto-optimal criterion; and AmoebaNet-D involved multi-stage search, including manual extrapolation of the evolutionary process. Below we describe each of these models and the methods that produced them.

Setup
AmoebaNet-B was evolved by running experiments directly on Google TPUv2 hardware, since this was the target platform for its final evaluation. In the main text, the architecture had been discovered on GPU but the largest model was evaluated on TPUs. In contrast, here we perform the full process on TPUs. This platform-aware approach allows the evolutionary search to optimize even hardware-dependent aspects of the final accuracy, such as optimizations carried out by the compiler. The search setup was as in the main text, except that it used the larger SP-II space of Supplement A and trained larger models (F=32) for longer (50 epochs). The selection of the top model was as follows. We picked from the experiment K=100 models. To do this, we binned the models by their number of parameters to cover the range, using B bins. From each bin, we took the top K/B models by validation accuracy. We then augmented all models to N=6 and F=32 and selected the one with the top validation accuracy.

AmoebaNet-C was discovered in the experiments described in the main text (see Methods and Methods Details sections). Instead of selecting the highest validation accuracy at the end of the experiments (as done in the main text), we picked a promising model while the experiments were still ongoing. This was done entirely for expediency, to be able to study a model while we waited for the search to complete. AmoebaNet-C was promising in that it stood out in a Pareto-optimal sense: it was a high-accuracy outlier for its relatively small number of parameters. As opposed to all other architectures, AmoebaNet-C was selected based on CIFAR-10 test accuracy, because it was intended to be benchmarked only on ImageNet. This process was less methodical than the one used in the main text, but because the model has been cited in the literature, we include it here for completeness.
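The parameter-binned selection described for AmoebaNet-B can be summarized as a short sketch; the field names, the value of B, and the post-augmentation score are placeholders, since the text fixes only K=100.

```python
def select_top_model(models, K=100, B=10):
    """Shortlist K models spread over the parameter-count range using B bins,
    taking the top K/B of each bin by validation accuracy; the final pick is
    the shortlisted model with the best accuracy after augmentation to
    N=6, F=32 (represented here only by a pre-computed placeholder field)."""
    lo = min(m["params"] for m in models)
    hi = max(m["params"] for m in models)
    width = (hi - lo) / B or 1.0
    bins = [[] for _ in range(B)]
    for m in models:
        i = min(int((m["params"] - lo) / width), B - 1)
        bins[i].append(m)
    shortlist = []
    for b in bins:
        b.sort(key=lambda m: m["valid_acc"], reverse=True)
        shortlist.extend(b[: K // B])
    return max(shortlist, key=lambda m: m["valid_acc_after_augmentation"])
```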
AmoebaNet-D was obtained by manually modifying AmoebaNet-B by extrapolating evolution. To do this, we studied the progress of an experiment and identified which mutations were still causing improvements in fitness at the later stages of the process. By inspection, we found these mutations to be: replacing a 3x3 separable (sep.) convolution (conv.) with a 1x7 followed by 7x1 conv. in the normal cell, replacing a 5x5 sep. conv. by a 1x7 followed by 7x1 conv. in the reduction cell, and replacing a 3x3 sep. conv. with 3x3 avg. pool in the reduction cell. Additionally, we reduced the numeric precision from 32-bit to 16-bit floats, and set a learning rate schedule of step-wise decay, reducing by a factor of 0.88 every epoch. We trained for 35 epochs in total. To submit to Stanford DAWNBench (see Outcome section), we used N=2 and F=256.

Figure D-1: Architectures of overall model and cells. From left to right: outline of the overall model [54] and diagrams for the cell architectures discovered by evolution: AmoebaNet-B, AmoebaNet-C, and AmoebaNet-D. The three normal cells are on the top row and the three reduction cells are on the bottom row. The labeled activations or hidden states correspond to the cell inputs ("0" and "1") and the cell output ("7").
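For reference, the AmoebaNet-D training schedule described above (step-wise decay by a factor of 0.88 every epoch, for 35 epochs) amounts to the following sketch; the base learning rate is an assumption, as it is not stated in this supplement.

```python
def amoebanet_d_lr(epoch, base_lr=0.1, decay=0.88):
    """Step-wise decay: the rate is multiplied by 0.88 at each epoch."""
    return base_lr * decay ** epoch

schedule = [amoebanet_d_lr(e) for e in range(35)]  # 35 training epochs
```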
Findings
Figure D-1 presents all three model architectures. We refrain from benchmarking these here. Instead, in the Outcome Section below, we will refer the reader to results presented elsewhere.
Outcome
In this supplement we have described additional evolutionary experiments that led to three new models. Such experiments were intended mainly to search for better models. Due to the resource-intensive nature of these methods, we forwent ablations and baselines in this supplement. For a more empirically rigorous approach, please refer to the process that produced AmoebaNet-A in the main text.

AmoebaNet-B had set a new state of the art on CIFAR-10 (2.13% test error) in a previous preprint of this paper8 after being trained with cutout, but has since been superseded.

AmoebaNet-C had set the previous state-of-the-art top-1 accuracy on ImageNet after being trained with advanced data augmentation techniques in [11].

AmoebaNet-D won the Stanford DAWNBench competition for lowest training cost on ImageNet. The goal of this competition category was to minimize the monetary cost of training a model to 93% top-5 accuracy. AmoebaNet-D cost $49.30 to train. This was 16% better than the second-best model, which was a ResNet trained on the same hardware. The results were published in [9].
8 Version 1 with same title on arXiv: https://arxiv.org/pdf/1802.01548v1.pdf