EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Google Research, Brain Team, Mountain View, CA. Correspondence to: Mingxing Tan <[email protected]>.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy when more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. Going further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets; in particular, EfficientNet-B7 surpasses the best existing GPipe accuracy on ImageNet while being 8.4x smaller and 6.1x faster at inference, and EfficientNets transfer well, achieving state-of-the-art accuracy on 5 out of 8 widely used transfer learning datasets.

[Figure 1: ImageNet top-1 accuracy (%) vs. number of parameters for EfficientNet-B0 through B7 and representative ConvNets (ResNet-34/50/152, DenseNet-201, Inception-v2, ResNeXt-101, SENet, NASNet-A, AmoebaNet-A/C, GPipe). Selected points (†: not plotted):]

Model                            Top-1 Acc.   #Params
ResNet-152 (He et al., 2016)     77.8%        60M
EfficientNet-B1                  78.8%        7.8M
ResNeXt-101 (Xie et al., 2017)   80.9%        84M
EfficientNet-B3                  81.1%        12M
SENet (Hu et al., 2018)          82.7%        146M
NASNet-A (Zoph et al., 2018)     82.7%        89M
EfficientNet-B4                  82.6%        19M
GPipe (Huang et al., 2018) †     84.3%        556M
EfficientNet-B7                  84.4%        66M
1. Introduction

Scaling up ConvNets is widely used to achieve better accuracy. For example, ResNet (He et al., 2016) can be scaled up from ResNet-18 to ResNet-200 by using more layers; recently, GPipe (Huang et al., 2018) achieved 84.3% ImageNet top-1 accuracy by scaling up a baseline model four times larger. However, the process of scaling up ConvNets has never been well understood, and there are currently many ways to do it.

In this paper, we want to study and rethink the process of scaling up ConvNets. In particular, we investigate the central question: is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency? Our empirical study shows that it is critical to balance all dimensions of network width/depth/resolution, and surprisingly such balance can be achieved by simply scaling each of them with a constant ratio. Based on this observation, we propose a simple yet effective compound scaling method. Unlike conventional practice that arbitrarily scales these factors, our method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients.
Figure 2. Model Scaling. (a) is a baseline network example; (b)-(d) are conventional scaling that only increases one dimension of network
width, depth, or resolution. (e) is our proposed compound scaling method that uniformly scales all three dimensions with a fixed ratio.
For example, if we want to use 2^N times more computational resources, then we can simply increase the network depth by α^N, width by β^N, and image size by γ^N, where α, β, γ are constant coefficients determined by a small grid search on the original small model. Figure 2 illustrates the difference between our scaling method and conventional methods.

Intuitively, the compound scaling method makes sense because if the input image is bigger, then the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns on the bigger image. In fact, previous theoretical (Raghu et al., 2017; Lu et al., 2018) and empirical results (Zagoruyko & Komodakis, 2016) both show that there exists a certain relationship between network width and depth, but to our best knowledge, we are the first to empirically quantify the relationship among all three dimensions of network width, depth, and resolution.

We demonstrate that our scaling method works well on existing MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Notably, the effectiveness of model scaling heavily depends on the baseline network; to go even further, we use neural architecture search (Zoph & Le, 2017; Tan et al., 2019) to develop a new baseline network, and scale it up to obtain a family of models, called EfficientNets. Figure 1 summarizes the ImageNet performance, where our EfficientNets significantly outperform other ConvNets. In particular, our EfficientNet-B7 surpasses the best existing GPipe accuracy (Huang et al., 2018), while using 8.4x fewer parameters and running 6.1x faster at inference. Compared to the widely used ResNet (He et al., 2016), our EfficientNet-B4 improves the top-1 accuracy from 76.3% for ResNet-50 to 82.6% with similar FLOPS. Besides ImageNet, EfficientNets also transfer well and achieve state-of-the-art accuracy on 5 out of 8 widely used datasets, while reducing parameters by up to 21x compared to existing ConvNets.

2. Related Work

ConvNet Accuracy: Since AlexNet (Krizhevsky et al., 2012) won the 2012 ImageNet competition, ConvNets have become increasingly more accurate by going bigger: while the 2014 ImageNet winner GoogleNet (Szegedy et al., 2015) achieves 74.8% top-1 accuracy with about 6.8M parameters, the 2017 ImageNet winner SENet (Hu et al., 2018) achieves 82.7% top-1 accuracy with 145M parameters. Recently, GPipe (Huang et al., 2018) further pushed the state-of-the-art ImageNet top-1 validation accuracy to 84.3% using 557M parameters: it is so big that it can only be trained with a specialized pipeline parallelism library, by partitioning the network and spreading each part to a different accelerator. While these models are mainly designed for ImageNet, recent studies have shown that better ImageNet models also perform better across a variety of transfer learning datasets (Kornblith et al., 2019) and other computer vision tasks such as object detection (He et al., 2016; Tan et al., 2019). Although higher accuracy is critical for many applications, we have already hit the hardware memory limit, and thus further accuracy gains require better efficiency.

ConvNet Efficiency: Deep ConvNets are often over-parameterized. Model compression (Han et al., 2016; He et al., 2018; Yang et al., 2018) is a common way to reduce model size by trading accuracy for efficiency. As mobile phones become ubiquitous, it is also common to hand-craft efficient mobile-size ConvNets, such as SqueezeNets (Iandola et al., 2016; Gholami et al., 2018), MobileNets (Howard et al., 2017; Sandler et al., 2018), and ShuffleNets (Zhang et al., 2018; Ma et al., 2018).
Recently, neural architecture search has become increasingly popular in designing efficient mobile-size ConvNets (Tan et al., 2019; Cai et al., 2019), and achieves even better efficiency than hand-crafted mobile ConvNets by extensively tuning the network width, depth, and convolution kernel types and sizes. However, it is unclear how to apply these techniques to larger models, which have a much larger design space and much more expensive tuning cost. In this paper, we aim to study model efficiency for super large ConvNets that surpass state-of-the-art accuracy. To achieve this goal, we resort to model scaling.

Model Scaling: There are many ways to scale a ConvNet for different resource constraints: ResNet (He et al., 2016) can be scaled down (e.g., ResNet-18) or up (e.g., ResNet-200) by adjusting network depth (#layers), while WideResNet (Zagoruyko & Komodakis, 2016) and MobileNets (Howard et al., 2017) can be scaled by network width (#channels). It is also well recognized that bigger input image size will help accuracy with the overhead of more FLOPS. Although prior studies (Raghu et al., 2017; Lin & Jegelka, 2018; Sharir & Shashua, 2018; Lu et al., 2018) have shown that network depth and width are both important for ConvNets' expressive power, it still remains an open question how to effectively scale a ConvNet to achieve better efficiency and accuracy. Our work systematically and empirically studies ConvNet scaling across all three dimensions of network width, depth, and resolution.

3. Compound Model Scaling

In this section, we will formulate the scaling problem, study different approaches, and propose our new scaling method.

3.1. Problem Formulation

A ConvNet layer i can be defined as a function: Y_i = F_i(X_i), where F_i is the operator, Y_i is the output tensor, and X_i is the input tensor with shape ⟨H_i, W_i, C_i⟩ (for simplicity, we omit the batch dimension), where H_i and W_i are the spatial dimensions and C_i is the channel dimension. A ConvNet N can be represented by a list of composed layers: N = F_k ⊙ ... ⊙ F_1(X_1) = ⨀_{j=1...k} F_j(X_1). In practice, ConvNet layers are often partitioned into multiple stages and all layers in each stage share the same architecture: for example, ResNet (He et al., 2016) has five stages, and all layers in each stage have the same convolutional type except that the first layer performs down-sampling. Therefore, we can define a ConvNet as:

$$\mathcal{N} \;=\; \bigodot_{i=1\ldots s} F_i^{L_i}\!\left(X_{\langle H_i, W_i, C_i\rangle}\right) \tag{1}$$

where F_i^{L_i} denotes that layer F_i is repeated L_i times in stage i, and ⟨H_i, W_i, C_i⟩ denotes the shape of the input tensor X of layer i. Figure 2(a) illustrates a representative ConvNet, where the spatial dimension is gradually shrunk but the channel dimension is expanded over layers, for example, from initial input shape ⟨224, 224, 3⟩ to final output shape ⟨7, 7, 512⟩.

Unlike regular ConvNet designs that mostly focus on finding the best layer architecture F_i, model scaling tries to expand the network length (L_i), width (C_i), and/or resolution (H_i, W_i) without changing F_i predefined in the baseline network. By fixing F_i, model scaling simplifies the design problem for new resource constraints, but it still leaves a large design space to explore different L_i, C_i, H_i, W_i for each layer. In order to further reduce the design space, we restrict all layers to be scaled uniformly with a constant ratio. Our target is to maximize the model accuracy for any given resource constraint, which can be formulated as an optimization problem:

$$\begin{aligned}
\max_{d, w, r}\quad & \mathrm{Accuracy}\!\left(\mathcal{N}(d, w, r)\right)\\
\text{s.t.}\quad & \mathcal{N}(d, w, r) = \bigodot_{i=1\ldots s} \hat{F}_i^{\,d\cdot\hat{L}_i}\!\left(X_{\langle r\cdot\hat{H}_i,\; r\cdot\hat{W}_i,\; w\cdot\hat{C}_i\rangle}\right)\\
& \mathrm{Memory}(\mathcal{N}) \le \text{target\_memory}\\
& \mathrm{FLOPS}(\mathcal{N}) \le \text{target\_flops}
\end{aligned} \tag{2}$$

where w, d, r are coefficients for scaling network width, depth, and resolution, and F̂_i, L̂_i, Ĥ_i, Ŵ_i, Ĉ_i are predefined parameters in the baseline network (see Table 1 for an example).

3.2. Scaling Dimensions

The main difficulty of problem 2 is that the optimal d, w, r depend on each other and the values change under different resource constraints. Due to this difficulty, conventional methods mostly scale ConvNets in one of these dimensions:

Depth (d): Scaling network depth is the most common way used by many ConvNets (He et al., 2016; Huang et al., 2017; Szegedy et al., 2015; 2016). The intuition is that a deeper ConvNet can capture richer and more complex features, and generalize well on new tasks. However, deeper networks are also more difficult to train due to the vanishing gradient problem (Zagoruyko & Komodakis, 2016). Although several techniques, such as skip connections (He et al., 2016) and batch normalization (Ioffe & Szegedy, 2015), alleviate the training problem, the accuracy gain of a very deep network diminishes: for example, ResNet-1000 has similar accuracy to ResNet-101 even though it has many more layers. Figure 3 (middle) shows our empirical study on scaling a baseline model with different depth coefficients d, further suggesting diminishing accuracy returns for very deep ConvNets.

Width (w): Scaling network width is commonly used for small-size models (Howard et al., 2017; Sandler et al., 2018).
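To make the formulation in Section 3.1 concrete, the following minimal Python sketch (our own illustration, not the paper's code) represents a baseline network as a list of per-stage parameters and applies the (d, w, r) coefficients of Equation 2 uniformly to every stage, rounding to integers:

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class Stage:
    """Per-stage parameters of the baseline network (Equation 2)."""
    operator: str   # F_i, kept fixed during scaling
    layers: int     # L_i
    height: int     # H_i
    width: int      # W_i
    channels: int   # C_i

def scale_network(stages: List[Stage], d: float, w: float, r: float) -> List[Stage]:
    """Scale depth, width, and resolution uniformly across all stages."""
    return [
        replace(
            s,
            layers=max(1, round(d * s.layers)),
            height=round(r * s.height),
            width=round(r * s.width),
            channels=max(1, round(w * s.channels)),
        )
        for s in stages
    ]

# A toy two-stage baseline (values are illustrative, not EfficientNet-B0).
baseline = [
    Stage("Conv3x3", layers=1, height=224, width=224, channels=32),
    Stage("MBConv6, k3x3", layers=2, height=112, width=112, channels=24),
]
scaled = scale_network(baseline, d=1.2, w=1.1, r=1.15)
```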
[Figure 3 plots: ImageNet top-1 accuracy (%) vs. FLOPS (billions) when scaling a baseline model along a single dimension, with width coefficients w = 1.4 to 5.0 (left), depth coefficients d = 2.0 to 8.0 (middle), and resolution coefficients r = 1.3 to 2.5 (right).]
Figure 3. Scaling Up a Baseline Model with Different Network Width (w), Depth (d), and Resolution (r) Coefficients. Bigger networks with larger width, depth, or resolution tend to achieve higher accuracy, but the accuracy gain quickly saturates after reaching 80%, demonstrating the limitation of single-dimension scaling. The baseline network is described in Table 1.
3.3. Compound Scaling

In fact, a few prior works (Zoph et al., 2018; Real et al., 2019) have already tried to arbitrarily balance network width and depth, but they all require tedious manual tuning.

In this paper, we propose a new compound scaling method, which uses a compound coefficient φ to uniformly scale network width, depth, and resolution in a principled way:

$$\begin{aligned}
\text{depth: } & d = \alpha^{\phi}\\
\text{width: } & w = \beta^{\phi}\\
\text{resolution: } & r = \gamma^{\phi}\\
\text{s.t. } & \alpha\cdot\beta^{2}\cdot\gamma^{2} \approx 2,\qquad \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1
\end{aligned} \tag{3}$$

where α, β, γ are constants that can be determined by a small grid search. Intuitively, φ is a user-specified coefficient that controls how many more resources are available for model scaling, while α, β, γ specify how to assign these extra resources to network width, depth, and resolution, respectively. Notably, the FLOPS of a regular convolution op is proportional to d, w², and r², i.e., doubling network depth will double FLOPS, but doubling network width or resolution will increase FLOPS by four times. Since convolution ops usually dominate the computation cost in ConvNets, scaling a ConvNet with Equation 3 will approximately increase total FLOPS by (α · β² · γ²)^φ. In this paper, we constrain α · β² · γ² ≈ 2 such that for any new φ, the total FLOPS will approximately increase by 2^φ (the actual FLOPS may differ from the theoretical value due to rounding).
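To make Equation 3 concrete, here is a small Python sketch (our illustration, not the authors' code; function names are ours) that computes d, w, r for a given φ and checks that the resulting FLOPS multiplier stays close to 2^φ when α · β² · γ² ≈ 2, using the EfficientNet-B0 coefficients α = 1.2, β = 1.1, γ = 1.15 found in Step 1 below:

```python
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Equation 3: derive per-dimension coefficients from phi.

    alpha, beta, gamma are the grid-searched constants (defaults are the
    EfficientNet-B0 values from Step 1); phi is the user-specified
    compound coefficient.
    """
    d = alpha ** phi   # depth multiplier
    w = beta ** phi    # width (channel) multiplier
    r = gamma ** phi   # resolution multiplier
    return d, w, r


def flops_multiplier(d, w, r):
    # FLOPS of a regular convolution scale as d * w^2 * r^2.
    return d * w ** 2 * r ** 2


if __name__ == "__main__":
    for phi in range(1, 8):
        d, w, r = compound_scaling(phi)
        # With alpha * beta^2 * gamma^2 ~= 2, this stays close to 2^phi.
        print(f"phi={phi}: d={d:.2f} w={w:.2f} r={r:.2f} "
              f"FLOPS x{flops_multiplier(d, w, r):.1f} (~2^{phi}={2**phi})")
```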
4. EfficientNet Architecture

Since model scaling does not change the layer operators F̂_i in the baseline network, having a good baseline network is also critical. We will evaluate our scaling method using existing ConvNets, but in order to better demonstrate the effectiveness of our scaling method, we have also developed a new mobile-size baseline, called EfficientNet.

Inspired by (Tan et al., 2019), we develop our baseline network by leveraging a multi-objective neural architecture search that optimizes both accuracy and FLOPS. Specifically, we use the same search space as (Tan et al., 2019), and use ACC(m) × [FLOPS(m)/T]^w as the optimization goal, where ACC(m) and FLOPS(m) denote the accuracy and FLOPS of model m, T is the target FLOPS, and w = -0.07 is a hyperparameter that controls the trade-off between accuracy and FLOPS. Unlike (Tan et al., 2019; Cai et al., 2019), here we optimize FLOPS rather than latency since we are not targeting any specific hardware device. Our search produces an efficient network, which we name EfficientNet-B0. Since we use the same search space as (Tan et al., 2019), the architecture is similar to MnasNet, except our EfficientNet-B0 is slightly bigger due to the larger FLOPS target (our FLOPS target is 400M). Table 1 shows the architecture of EfficientNet-B0. Its main building block is the mobile inverted bottleneck MBConv (Sandler et al., 2018; Tan et al., 2019), to which we also add squeeze-and-excitation optimization (Hu et al., 2018).

Table 1. EfficientNet-B0 baseline network – Each row describes a stage i with L̂_i layers, input resolution ⟨Ĥ_i, Ŵ_i⟩, and output channels Ĉ_i. Notations are adopted from Equation 2.

Stage i | Operator F̂_i           | Resolution Ĥ_i × Ŵ_i | #Channels Ĉ_i | #Layers L̂_i
1       | Conv3x3                 | 224 × 224            | 32            | 1
2       | MBConv1, k3x3           | 112 × 112            | 16            | 1
3       | MBConv6, k3x3           | 112 × 112            | 24            | 2
4       | MBConv6, k5x5           | 56 × 56              | 40            | 2
5       | MBConv6, k3x3           | 28 × 28              | 80            | 3
6       | MBConv6, k5x5           | 28 × 28              | 112           | 3
7       | MBConv6, k5x5           | 14 × 14              | 192           | 4
8       | MBConv6, k3x3           | 7 × 7                | 320           | 1
9       | Conv1x1 & Pooling & FC  | 7 × 7                | 1280          | 1

Starting from the baseline EfficientNet-B0, we apply our compound scaling method to scale it up in two steps:

• STEP 1: we first fix φ = 1, assuming twice more resources are available, and do a small grid search of α, β, γ based on Equations 2 and 3. In particular, we find the best values for EfficientNet-B0 are α = 1.2, β = 1.1, γ = 1.15, under the constraint α · β² · γ² ≈ 2.

• STEP 2: we then fix α, β, γ as constants and scale up the baseline network with different φ using Equation 3, to obtain EfficientNet-B1 to B7 (details in Table 2).

Notably, it is possible to achieve even better performance by searching for α, β, γ directly around a large model, but the search cost becomes prohibitively more expensive on larger models. Our method solves this issue by doing the search only once on the small baseline network (step 1), and then using the same scaling coefficients for all other models (step 2).

5. Experiments

In this section, we first evaluate our scaling method on existing ConvNets and then on the newly proposed EfficientNets.

5.1. Scaling Up MobileNets and ResNets

As a proof of concept, we first apply our scaling method to the widely used MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Table 3 shows the ImageNet results of scaling them in different ways. Compared to other single-dimension scaling methods, our compound scaling method improves the accuracy on all these models, suggesting the effectiveness of our proposed scaling method for general existing ConvNets.
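As a rough illustration of how such a comparison can be set up (a hypothetical sketch with an illustrative 4x FLOPS budget, not the exact settings used for Table 3), the snippet below derives single-dimension coefficients that match a given FLOPS budget, alongside the corresponding compound-scaling coefficients:

```python
import math

def single_dimension_coeffs(flops_budget):
    """Coefficients that each alone match a given FLOPS multiplier.

    Depth scales FLOPS roughly linearly; width and resolution scale it
    roughly quadratically (see Section 3.3).
    """
    return flops_budget, math.sqrt(flops_budget), math.sqrt(flops_budget)

budget = 4.0  # illustrative: compare methods at ~4x baseline FLOPS
d_only, w_only, r_only = single_dimension_coeffs(budget)

# Compound scaling at the same budget: since alpha * beta^2 * gamma^2 ~= 2,
# phi = log2(budget) gives roughly the same total FLOPS.
phi = math.log2(budget)
d, w, r = 1.2 ** phi, 1.1 ** phi, 1.15 ** phi

print(f"depth-only d={d_only:.2f} | width-only w={w_only:.2f} | "
      f"resolution-only r={r_only:.2f}")
print(f"compound (phi={phi:.1f}): d={d:.2f}, w={w:.2f}, r={r:.2f}")
```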
Table 2. EfficientNet Performance Results on ImageNet (Russakovsky et al., 2015). All EfficientNet models are scaled from our baseline EfficientNet-B0 using different compound coefficients φ in Equation 3. ConvNets with similar top-1/top-5 accuracy are grouped together for efficiency comparison. Our scaled EfficientNet models consistently reduce parameters and FLOPS by an order of magnitude (up to 8.4x parameter reduction and up to 16x FLOPS reduction) compared to existing ConvNets.
Table 4. Inference Latency Comparison – Latency is measured with batch size 1 on a single core of an Intel Xeon CPU E5-2690.

Model            Acc. @ Latency    | Model            Acc. @ Latency
ResNet-152       77.8% @ 0.554s    | GPipe            84.3% @ 19.0s
EfficientNet-B1  78.8% @ 0.098s    | EfficientNet-B7  84.4% @ 3.1s
Speedup          5.7x              | Speedup          6.1x

5.2. ImageNet Results for EfficientNet

We train our EfficientNet models on ImageNet using similar settings as (Tan et al., 2019): RMSProp optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99; weight decay 1e-5; and an initial learning rate of 0.256 that decays by 0.97 every 2.4 epochs.
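For illustration, these hyperparameters could be wired up as follows; this is a minimal sketch assuming the standard tf.keras optimizer and schedule APIs and a placeholder steps_per_epoch, not the authors' training script:

```python
import tensorflow as tf

# Assumption: steps_per_epoch depends on dataset size and batch size;
# the value here is only a placeholder for illustration.
steps_per_epoch = 1000

# Initial learning rate 0.256, decayed by 0.97 every 2.4 epochs.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.256,
    decay_steps=int(2.4 * steps_per_epoch),
    decay_rate=0.97,
    staircase=True,
)

# RMSProp with decay (rho) 0.9 and momentum 0.9. Batch-norm momentum 0.99
# and weight decay 1e-5 are properties of the model/layers, not shown here.
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=lr_schedule, rho=0.9, momentum=0.9)
```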
Table 5. EfficientNet Performance Results on Transfer Learning Datasets. Our scaled EfficientNet models achieve new state-of-the-art accuracy for 5 out of 8 datasets, with 9.6x fewer parameters on average.
[Figure 6 panels: transfer accuracy (%) vs. number of parameters (millions, log scale) on Flowers, FGVC Aircraft, Oxford-IIIT Pets, Food-101, and the other transfer datasets; compared models: Inception-v1/v3/v4, Inception-ResNet-v2, ResNet-50/101/152, DenseNet-169/201, NASNet-A, GPipe, and EfficientNet.]
Figure 6. Model Parameters vs. Transfer Learning Accuracy – All models are pretrained on ImageNet and finetuned on new datasets.
We also use swish activation (Ramachandran et al., 2018; Elfwing et al., 2018), a fixed AutoAugment policy (Cubuk et al., 2019), and stochastic depth (Huang et al., 2016) with drop connect ratio 0.3. Since bigger models need more regularization, we linearly increase the dropout (Srivastava et al., 2014) ratio from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7.

Table 2 shows the performance of all EfficientNet models that are scaled from the same baseline EfficientNet-B0. Our EfficientNet models generally use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy. In particular, our EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy with 66M parameters and 37B FLOPS, being more accurate but 8.4x smaller than the previous best GPipe (Huang et al., 2018).

Figure 1 and Figure 5 illustrate the parameters-accuracy and FLOPS-accuracy curves for representative ConvNets, where our scaled EfficientNet models achieve better accuracy with far fewer parameters and FLOPS than other ConvNets. Notably, our EfficientNet models are not only small, but also computationally cheaper. For example, our EfficientNet-B3 achieves higher accuracy than ResNeXt-101 (Xie et al., 2017) using 18x fewer FLOPS.

To validate the computational cost, we have also measured the inference latency on a real CPU, as shown in Table 4, where we report the average latency over 20 runs. Our EfficientNet-B1 runs 5.7x faster than the widely used ResNet-152 (He et al., 2016), while EfficientNet-B7 runs about 6.1x faster than GPipe (Huang et al., 2018), suggesting our EfficientNets are indeed fast on real hardware.

5.3. Transfer Learning Results for EfficientNet

We have also evaluated our EfficientNet models on a list of commonly used transfer learning datasets, as shown in Table 6. We borrow the same training settings from (Kornblith et al., 2019) and (Huang et al., 2018), which take ImageNet pretrained checkpoints and finetune on new datasets.
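As a rough sketch of this finetuning recipe (assuming the tf.keras.applications EfficientNet implementation and an illustrative optimizer setting, not the checkpoints or pipeline used in the paper), transfer to a new dataset might look like:

```python
import tensorflow as tf

NUM_CLASSES = 102  # hypothetical target dataset, e.g. a flowers benchmark

# ImageNet-pretrained backbone without its classification head.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", pooling="avg")

# Attach a new head for the target dataset and finetune end to end.
inputs = tf.keras.Input(shape=(224, 224, 3))
features = backbone(inputs)
features = tf.keras.layers.Dropout(0.2)(features)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(features)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, epochs=...)  # dataset-specific
```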
[Figure 7 image grid: columns show the original image and class activation maps for the baseline model, deeper (d=4), wider (w=2), higher resolution (r=2), and compound scaling; example classes: bakeshop, maze.]
Figure 7. Class Activation Map (CAM) (Zhou et al., 2016) for models with different scaling methods – Our compound scaling method allows the scaled model (last column) to focus on more relevant regions with more object details.
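For reference, the class activation maps in Figure 7 can be computed in essence as follows; this is a generic NumPy sketch of CAM (Zhou et al., 2016), not the exact procedure used to render the figure:

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Compute a class activation map (Zhou et al., 2016).

    feature_maps: (H, W, K) activations of the last convolutional layer.
    fc_weights:   (K, num_classes) weights of the classifier that follows
                  global average pooling.
    Returns an (H, W) map highlighting class-discriminative regions.
    """
    cam = feature_maps @ fc_weights[:, class_idx]   # weighted sum over K maps
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()                            # normalize to [0, 1]
    return cam

# Toy example with random activations and weights.
cam = class_activation_map(
    np.random.rand(7, 7, 320), np.random.rand(320, 1000), class_idx=3)
```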
Table 6. Transfer Learning Datasets.
References

Chollet, F. Xception: Deep learning with depthwise separable convolutions. CVPR, 2017.

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation policies from data. CVPR, 2019.

Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.

Gholami, A., Kwon, K., Wu, B., Tai, Z., Yue, X., Jin, P., Zhao, S., and Keutzer, K. Squeezenext: Hardware-aware neural network design. ECV Workshop at CVPR'18, 2018.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CVPR, pp. 770–778, 2016.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. ICCV, pp. 2980–2988, 2017.

He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. ECCV, 2018.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, pp. 448–456, 2015.

Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet models transfer better? CVPR, 2019.

Krause, J., Deng, J., Stark, M., and Fei-Fei, L. Collecting a large-scale dataset of fine-grained cars. Second Workshop on Fine-Grained Visual Categorization, 2013.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. NIPS, pp. 1097–1105, 2012.

Lin, H. and Jegelka, S. Resnet with one-neuron hidden layers is a universal approximator. NeurIPS, pp. 6172–6181, 2018.

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection. CVPR, 2017.

Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L., Yuille, A., Huang, J., and Murphy, K. Progressive neural architecture search. ECCV, 2018.

Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. NeurIPS, 2018.

Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. ECCV, 2018.
Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.

Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q. V., and Pang, R. Domain adaptive transfer learning with specialist models. arXiv preprint arXiv:1811.07056, 2018.

Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. ICVGIP, pp. 722–729, 2008.

Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. Cats and dogs. CVPR, pp. 3498–3505, 2012.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. ICML, 2017.

Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2018.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. AAAI, 2019.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CVPR, pp. 1–9, 2015.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. CVPR, pp. 2818–2826, 2016.

Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4, inception-resnet and the impact of residual connections on learning. AAAI, 4:12, 2017.

Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. CVPR, 2019.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. CVPR, pp. 5987–5995, 2017.

Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sze, V., and Adam, H. Netadapt: Platform-aware neural network adaptation for mobile applications. ECCV, 2018.

Zagoruyko, S. and Komodakis, N. Wide residual networks. BMVC, 2016.

Zhang, X., Li, Z., Loy, C. C., and Lin, D. Polynet: A pursuit of structural diversity in very deep networks. CVPR, pp. 3900–3908, 2017.

Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CVPR, 2018.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. CVPR, pp. 2921–2929, 2016.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. CVPR, 2018.