A CNN Accelerator on FPGA Using Depthwise Separable Convolution
Abstract—Convolutional neural networks (CNNs) have been widely deployed in the fields of computer vision and pattern recognition because of their high accuracy. However, large convolution operations are computing-intensive and often require a powerful computing platform such as a Graphics Processing Unit (GPU). This makes it difficult to apply CNNs to portable devices. State-of-the-art CNNs, such as MobileNetV2 and Xception, adopt depthwise separable convolution to replace the standard convolution for embedded platforms, which significantly reduces operations and parameters with only limited loss in accuracy. This highly structured model is very suitable for Field-Programmable Gate Array (FPGA) implementation. In this paper, a scalable, high-performance CNN accelerator optimized for depthwise separable convolution is proposed. The accelerator can fit into FPGAs of different sizes by trading off hardware resources against processing speed. As an example, MobileNetV2 is implemented on an Arria 10 SoC FPGA, and the results show this accelerator can classify each picture from ImageNet in 3.75 ms, which is about 266.6 frames per second. This achieves a 20x speedup compared to a CPU.

Index Terms—convolutional neural network, FPGA, hardware accelerator, MobileNetV2.

I. INTRODUCTION

Almost all existing FPGA-based CNN implementations have focused on the limitations of memory bandwidth and computing parallelism. To overcome the memory bandwidth limitation, [2] and [3] stored the parameters in on-chip memory. However, as CNNs go deeper, the parameters required by convolution increase sharply, which makes the on-chip memory solution inefficient. Other works, like [4] [5] [6], alleviated the pressure on off-chip memory by limiting the parameter precision of the neural networks, since lower numerical precision was proved to be sufficient for CNNs [16] [17]. In [7] [8], the computing engine was optimized for high parallelism in computation. [6] proposed a pipeline-based solution for CNNs for high throughput. [9] made a comprehensive evaluation and comparison of Altera and Xilinx OpenCL frameworks for CNNs. [10] explored sparsity-based optimizations, which could achieve up to 3x higher core energy efficiency and raise the device-level energy efficiency by around 70% through data compression. Both [11] and [12] implemented depthwise separable convolution, taking MobileNetV1 as the example, and achieved processing speeds of 7.85 ms per image and 231.7 frames per second (fps), respectively.
The key contributions of this work are:
$$W_{DSC} = K \times K \times N + N \times P \qquad (3)$$

$$O_{DSC} = M \times M \times K \times K \times N + M \times M \times N \times P \qquad (4)$$

$$F_O = \frac{O_{DSC}}{O_{SC}} = \frac{1}{P} + \frac{1}{K^2} \qquad (6)$$
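For a concrete sense of the saving, the sketch below evaluates these counts, assuming the standard-convolution operation count O_SC = M × M × K × K × N × P implied by (4) and (6); the layer shape is illustrative.

```python
# Operation counts for standard vs. depthwise separable convolution,
# following Eqs. (3), (4) and (6). M: output feature map size,
# K: kernel size, N: input channels, P: output channels.
def ops_standard(M, K, N, P):
    return M * M * K * K * N * P               # O_SC (assumed from Eq. (6))

def ops_separable(M, K, N, P):
    return M * M * K * K * N + M * M * N * P   # O_DSC, Eq. (4)

M, K, N, P = 56, 3, 64, 128                    # illustrative layer shape
ratio = ops_separable(M, K, N, P) / ops_standard(M, K, N, P)
print(ratio)                                   # 0.1189...
print(1 / P + 1 / K ** 2)                      # same value, Eq. (6): ~8.4x fewer ops
```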
5) Pointwise Convolution: Pointwise convolution is actually standard convolution with kernel size 1 × 1 (Fig. 9). To take full advantage of all the multipliers in the MME, the input feature map is divided into several M × M × 32 sub-matrices, and these sub-matrices are shifted into line buffers one after another. This idea comes from the divide-and-conquer algorithm for large matrix multiplication illustrated in Fig. 10, which consists in dividing a large matrix into several small matrices and summing the results up after the small matrix multiplications. One MME is able to perform an (M² × 32) by (32 × 9) multiplication at once. The adder tree sums up the 32 products in each cell, as shown in Fig. 9. Thus the output channel number is 9.

Fig. 10. Divide-and-conquer for large matrix multiplication
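The following NumPy sketch mirrors this tiling on an illustrative layer: each inner step multiplies an (M² × 32) input tile by a (32 × 9) weight tile, as one MME pass would, and the partial products are accumulated.

```python
import numpy as np

# Pointwise (1x1) convolution as tiled matrix multiplication.
# One "MME" pass multiplies an (M*M x 32) input tile by a (32 x 9)
# weight tile; partial products over the channel dimension are summed.
M, N, P = 8, 64, 18                  # illustrative sizes; N % 32 == 0, P % 9 == 0
x = np.random.randn(M * M, N)        # flattened M x M x N feature map
w = np.random.randn(N, P)            # 1x1 convolution weights

out = np.zeros((M * M, P))
for p0 in range(0, P, 9):            # 9 output channels per pass
    for n0 in range(0, N, 32):       # 32 input channels per pass
        out[:, p0:p0 + 9] += x[:, n0:n0 + 32] @ w[n0:n0 + 32, p0:p0 + 9]

assert np.allclose(out, x @ w)       # same result as one large multiplication
```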
6) Normalization: After training, the parameters of batch normalization are fixed [19]. Thus the complex normalization is downgraded into a multiply and add operation.
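Concretely, with frozen parameters γ, β, μ, and σ², the normalization y = γ(x − μ)/√(σ² + ε) + β collapses into y = s·x + b; a minimal sketch with made-up values:

```python
import numpy as np

# Fold frozen batch-norm parameters into one multiply-add per channel:
# y = gamma * (x - mean) / sqrt(var + eps) + beta  ==  y = scale * x + bias
gamma, beta, mean, var, eps = 1.2, 0.1, 0.5, 4.0, 1e-5
scale = gamma / np.sqrt(var + eps)
bias = beta - mean * scale

x = np.random.randn(8)
assert np.allclose(scale * x + bias,
                   gamma * (x - mean) / np.sqrt(var + eps) + beta)
```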
7) Pooling: Average pooling and max pooling are treated differently. As the pixels of a feature map channel are output one by one, average pooling can easily be calculated by adding one more multiply-accumulate stage with a factor of 1/S, where S is the average pooling size. Max pooling, on the other hand, needs one more comparison stage.
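A behavioral sketch of the two streaming pooling paths, assuming S counts the pixels in the pooling window:

```python
# Streaming pooling over one feature-map channel, pixels arriving one by one.
def avg_pool_stream(pixels, S):
    acc = 0.0
    for p in pixels:
        acc += p * (1.0 / S)         # one extra multiply-accumulate stage
    return acc

def max_pool_stream(pixels):
    best = float("-inf")
    for p in pixels:
        best = max(best, p)          # one extra comparison stage
    return best
```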
8) ReLU: As with the pooling layer, a ReLU stage is added after the normalization stage. Three options are selectable: no ReLU, standard ReLU, and ReLU6.
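Sketched elementwise, the three selectable options are:

```python
def activation(x, mode):
    # Three selectable options after normalization: bypass, ReLU, ReLU6.
    if mode == "none":
        return x
    if mode == "relu":
        return max(0.0, x)
    if mode == "relu6":
        return min(max(0.0, x), 6.0)
    raise ValueError(mode)
```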
C. Memory Organization

To have an efficient memory organization, one has to balance on-chip memory resources against external memory bandwidth. On-chip memory is limited on an FPGA but supplies very high bandwidth. Conversely, external memory has the capability to store a large amount of data but with the penalty of limited bandwidth. Therefore, this proposed accelerator adopts a hierarchical memory methodology. The weight buffer loads the needed parameters from external memory before each convolution starts. This, on one hand, reduces the latency caused by parameter loading and, on the other hand, avoids the latency brought by the limited bandwidth of external memory. Besides, the weight buffer is built as a ping-pong buffer (Fig. 11): while weight buffer 1 outputs data for the current convolution, weight buffer 2 loads the data for the next one from external memory, and vice versa.
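A behavioral sketch of the ping-pong scheme follows; fetch_weights and convolve are placeholders, and in hardware the refill runs concurrently with the convolution, which this sequential model only hints at.

```python
# Behavioral model of the ping-pong weight buffer (Fig. 11): while one
# buffer feeds the convolution, the other is refilled from external memory.
def run_layers(layers, fetch_weights, convolve):
    buffers = [fetch_weights(layers[0]), None]      # preload buffer 0
    active = 0
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            # in hardware this load overlaps with the convolution below
            buffers[1 - active] = fetch_weights(layers[i + 1])
        convolve(layer, buffers[active])
        active = 1 - active                         # swap roles for the next layer
```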
IV. RESULTS

The proposed accelerator architecture (Fig. 12) is demonstrated by implementing the MobileNetV2 network on the Arria 10 SoC Development Kit (10AS066N3F40E2SG), which contains 251680 ALMs, 2131 M20K blocks, and 1687 DSP blocks. The design considerations are described below, followed by the implementation results with resource utilization.

A. Implementation Considerations

As mentioned in Section I, lower numerical precision is sufficient for CNNs, so a 16-bit quantization strategy is chosen, as it has been widely adopted by previous works [2] [3] [20] [6]. Based on the description in Section III, a 4-MME array is instantiated in this design after carefully balancing resource usage against processing time. The weight buffer is sized at 36 Kb as a ping-pong buffer, since the weights need updating only every M × M clock cycles when performing depthwise separable convolution. The size of the intermediate feature map buffer is 24.5 Mb.
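As an illustration of such a 16-bit strategy, here is a fixed-point sketch assuming a Q1.15 format (1 sign bit, 15 fractional bits); the paper does not specify its exact scaling.

```python
import numpy as np

# Illustrative 16-bit fixed-point quantization (Q1.15). The rounding
# error is at most half an LSB, except at the clipped endpoint +1.0.
def quantize_q15(x):
    q = np.clip(np.round(x * (1 << 15)), -32768, 32767)
    return q.astype(np.int16)

def dequantize_q15(q):
    return q.astype(np.float32) / (1 << 15)

w = np.random.uniform(-1, 1, 8).astype(np.float32)
print(np.max(np.abs(w - dequantize_q15(quantize_q15(w)))))  # tiny residual
```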
B. Implementation Results

Fig. 12 presents the system architecture on the Arria 10 SoC. Since the HPS is not used in this design, only the FPGA part is shown. The DDR4 memory shown is the one connected to the FPGA part. The CNN accelerator runs at 133 MHz, a frequency limited by its adder tree. A Nios II softcore microprocessor is implemented for loading the weights and input images from flash memory to the DDR4 external memory. An external memory interface IP combined with a Modular Scatter-Gather Direct Memory Access (mSG-DMA) IP is used to bridge the buffers in the CNN accelerator and the FPGA memory, whose maximum bandwidth is 8.5 GB/s.
This structure avoids the host's intervention during multiple transfers back and forth with the DDR4 memory and makes non-continuous data movement more efficient. The customized mSG-DMA controller makes it possible to drive the mSG-DMA to read/write different sizes of data from/to specific addresses, in order to fit convolutions of various sizes.

Fig. 12. System architecture on FPGA design
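A hypothetical sketch of how such variable-sized transfers could be queued as a descriptor chain; the field names and helper below are illustrative, not the actual mSG-DMA register interface.

```python
from dataclasses import dataclass

# Hypothetical descriptor-list view of scatter-gather DMA: each entry
# moves one block of arbitrary size between DDR4 and an accelerator
# buffer, so differently shaped convolutions need no host intervention
# between transfers.
@dataclass
class Descriptor:
    src_addr: int   # where the block starts in DDR4
    dst_addr: int   # target buffer address in the accelerator
    length: int     # transfer size in bytes

def descriptor_chain(blocks):
    """blocks: iterable of (src_addr, dst_addr, length) tuples."""
    return [Descriptor(s, d, n) for s, d, n in blocks]

# e.g. two weight tiles of different sizes queued back to back
chain = descriptor_chain([(0x0000, 0x100, 4096), (0x2000, 0x100, 9216)])
```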
The implementation result is listed in Table II.
TABLE II
RESOURCE USAGE OF MOBILENETV2

Name                 ALM             DSP            RAM (M20K)
MME                  66127 (26.3%)   1278 (75.9%)   51 (2.4%)
Weight Buffer        9317 (3.7%)     0 (0%)         0 (0%)
Feature Map Buffer   1 (0%)          0 (0%)         1779 (83.4%)
Others               6308 (2.5%)     0 (0%)         14 (0.6%)
Total                81753 (32.5%)   1278 (75.9%)   1844 (86.5%)
Table III provides a comparison between the solution proposed in this paper and other similar ones. Note that MobileNetV2 has a more complex structure and higher accuracy on benchmarks.
TABLE III
COMPARISON TO OTHER IMPLEMENTATIONS

                [11]               [12]           This paper
Network         RR-MobileNet       MobileNetV1    MobileNetV2
Platform        Zynq UltraScale+   Stratix-V      Arria 10 SoC
Speed           127.4 fps          231.7 fps      266.2 fps
V. CONCLUSION

In this paper, a high-performance, scalable CNN accelerator is proposed. This structure is optimized for depthwise separable convolution, which results in remarkably fewer operations and parameters, making it possible to run CNNs on portable devices. By choosing a different number of MMEs and variable on-chip memory sizes, this accelerator can fit into a large or small FPGA. As an example, the latest MobileNetV2 is implemented on an Arria 10 SoC FPGA, which achieves 266.6 fps and 170.6 GOPS.

REFERENCES

[1] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[2] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., "DaDianNao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014, pp. 609–622.
[3] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in ACM SIGARCH Computer Architecture News, vol. 43, no. 3, 2015, pp. 92–104.
[4] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," in Design Automation Conference (DAC), 2017 54th ACM/EDAC/IEEE, 2017, pp. 1–6.
[5] S. I. Venieris and C.-S. Bouganis, "fpgaConvNet: Automated mapping of convolutional neural networks on FPGAs," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017, pp. 291–292.
[6] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Field Programmable Logic and Applications (FPL), 2016 26th International Conference on, 2016, pp. 1–9.
[7] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017, pp. 45–54.
[8] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2016, pp. 26–35.
[9] R. Tapiador, A. Rios-Navarro, A. Linares-Barranco, M. Kim, D. Kadetotad, and J.-s. Seo, "Comprehensive evaluation of OpenCL-based convolutional neural network accelerators in Xilinx and Altera FPGAs," arXiv preprint arXiv:1609.09296, 2016.
[10] A. Aimar, H. Mostafa, E. Calabrese, A. Rios-Navarro, R. Tapiador-Morales, I.-A. Lungu, M. B. Milde, F. Corradi, A. Linares-Barranco, S.-C. Liu, and T. Delbruck, "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," arXiv preprint arXiv:1706.01406, 2017.
[11] J. Su, J. Faraone, J. Liu, Y. Zhao, D. B. Thomas, P. H. W. Leong, and P. Y. K. Cheung, "Redundancy-reduced MobileNet acceleration on reconfigurable logic for ImageNet classification," in Applied Reconfigurable Computing. Architectures, Tools, and Applications, 2018, pp. 16–28.
[12] R. Zhao, X. Niu, and W. Luk, "Automatic optimising CNN with depthwise separable convolution on FPGA: (abstract only)," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2018, pp. 285–285.
[13] L. Sifre and S. Mallat, "Rigid-motion scattering for texture classification," arXiv preprint arXiv:1403.1687, 2014.
[14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[15] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," arXiv preprint arXiv:1801.04381, 2018.
[16] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[17] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in International Conference on Machine Learning (ICML), 2015, pp. 1737–1746.
[18] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv preprint arXiv:1610.02357, 2016.
[19] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[20] Y. Ma, N. Suda, Y. Cao, J.-s. Seo, and S. Vrudhula, "Scalable and modularized RTL compilation of convolutional neural networks onto FPGA," in Field Programmable Logic and Applications (FPL), 2016 26th International Conference on, 2016, pp. 1–8.