

A CNN Accelerator on FPGA Using Depthwise Separable Convolution
Lin Bai, Student Member, IEEE, Yiming Zhao, and Xinming Huang, Senior Member, IEEE

Abstract—Convolutional neural networks (CNNs) have been widely deployed in the fields of computer vision and pattern recognition because of their high accuracy. However, large convolution operations are computing-intensive and often require a powerful computing platform such as a Graphics Processing Unit (GPU). This makes it difficult to apply CNNs to portable devices. The state-of-the-art CNNs, such as MobileNetV2 and Xception, adopt depthwise separable convolution to replace the standard convolution for embedded platforms, which significantly reduces operations and parameters with only limited loss in accuracy. This highly structured model is very suitable for Field-Programmable Gate Array (FPGA) implementation. In this paper, a scalable, high-performance CNN accelerator optimized for depthwise separable convolution is proposed. The accelerator can be fit into FPGAs of different sizes by balancing hardware resources against processing speed. As an example, MobileNetV2 is implemented on an Arria 10 SoC FPGA, and the results show that this accelerator can classify each ImageNet picture in 3.75 ms, which is about 266.6 frames per second. This is a 20x speedup compared to a CPU.

Index Terms—convolutional neural network, FPGA, hardware accelerator, MobileNetV2.
I. INTRODUCTION

Nowadays, convolutional neural networks (CNNs) have become the center of interest due to their superior performance in tasks ranging from image classification and semantic segmentation to object detection and tracking. This technique has also been widely used in industry, such as autonomous driving, video surveillance, and speech recognition.

CNN is a computing-intensive model. It consumes huge amounts of computing power during training and deployment. In practice, Graphics Processing Units (GPUs) are often selected as the platform. However, the GPU's inherently high power consumption limits its application in embedded scenarios such as portable devices and wearable systems. Therefore, Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs) are adopted as replacements for GPUs in neural network applications [1]-[12]. More specifically, increasing research attention is focused on FPGA-based CNN accelerators due to the possibility of trading off power consumption against reconfigurability.

To further lighten the computing burden of standard convolution, depthwise separable convolution was proposed in [13]. It has been applied in MobileNetV1 [14] and later MobileNetV2 [15], achieving comparable accuracy with far fewer multiply-accumulate operations and parameters.

Almost all existing FPGA-based CNN implementations aim to overcome the limitations of memory bandwidth and computing parallelism. To overcome the memory bandwidth limitation, [2] and [3] stored the parameters in on-chip memory. However, as CNNs go deeper, the parameters required by convolution increase sharply, which makes the on-chip memory solution inefficient. Other works such as [4] [5] [6] alleviated the pressure on off-chip memory by limiting the parameter precision of the neural networks, since lower numerical precision has been proved sufficient for CNNs [16] [17]. In [7] [8], the computing engine was optimized for high parallelism in computation. [6] proposed a pipeline-based CNN solution for high throughput. [9] made a comprehensive evaluation and comparison of the Altera and Xilinx OpenCL frameworks for CNNs. [10] explored sparsity-based optimizations, which achieve up to 3x higher core energy efficiency and raise device-level energy efficiency by around 70% through data compression. Both [11] and [12] implemented depthwise separable convolution using MobileNetV1 as an example, achieving processing speeds of 7.85 ms per image and 231.7 frames per second (fps), respectively.

The key contributions of this work are:
(1) A high-performance CNN hardware accelerator framework is proposed in which all layers are processed in a computing unit named the matrix multiplication engine.
(2) The use of a hierarchical memory structure and a ping-pong on-chip buffer reduces the bandwidth limitation of off-chip memory.
(3) A methodology for scalable design is proposed, so that this framework can be implemented on various FPGAs by balancing on-chip resources against performance.
(4) By applying the proposed framework and methods, the state-of-the-art CNN MobileNetV2 [15] is implemented on an Intel Arria 10 SoC FPGA for the first time. The results show 266.6 frames per second and 170.6 Giga Operations Per Second (GOPS) at a system clock frequency of 133 MHz. This represents a 20x speedup compared to a CPU [15].

This paper is organized as follows. Section II provides fundamental knowledge of depthwise separable convolution, followed by one of its applications, MobileNetV2. Section III describes the architecture of the accelerator, including the matrix multiplication engine and the on-chip buffer organization. The system implementation and its results are discussed in Section IV. The conclusion is given in Section V.

The authors are with the Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, MA 01609, USA. The corresponding author is X. Huang (e-mail: [email protected]).

II. DEPTHWISE SEPARABLE CONVOLUTION


Depthwise separable convolution was first introduced in [18]. As one kind of factorized convolution, it factorizes the standard convolution into a depthwise convolution plus a pointwise convolution. Fig. 1 shows how standard convolution (SC), depthwise convolution (DWC) and pointwise convolution (PWC) work. In standard convolution, each input channel is convolved with one specific kernel, and the result is the sum of the convolution results from all channels. In the depthwise separable case, depthwise convolution is the first step, performing the convolution for each input channel individually. The next step is pointwise convolution, which is actually a standard convolution with kernel size 1 × 1. Compared to standard convolution, depthwise separable convolution considerably reduces the number of mathematical operations and the number of parameters.

Fig. 1. Comparison of different convolution types: (a) standard convolution, (b) depthwise convolution, (c) pointwise convolution.

As shown in Fig. 1, for an input feature map of size M × M × N and a kernel of size K × K × N × P, with a stride of 1, the number of weights needed for standard convolution is [14]

    W_SC = K × K × N × P                              (1)

and the corresponding number of operations is

    O_SC = M × M × K × K × N × P                      (2)

In the case of depthwise separable convolution, the total number of weights is

    W_DSC = K × K × N + N × P                         (3)

and the total number of operations is

    O_DSC = M × M × K × K × N + M × M × N × P         (4)

Thus, the reduction factors on weights and operations are given in (5)-(6):

    F_W = W_DSC / W_SC = 1/P + 1/K²                   (5)

    F_O = O_DSC / O_SC = 1/P + 1/K²                   (6)
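As a quick check of (1)-(6), the short Python sketch below (not from the paper; the layer dimensions are chosen only for illustration) counts weights and operations for a standard versus a depthwise separable 3 × 3 convolution layer.

```python
# Weight/operation counts for standard vs. depthwise separable convolution,
# following (1)-(6). M: feature map size, N: input channels, P: output channels,
# K: kernel size. The example layer below is illustrative only.

def conv_costs(M, N, P, K):
    w_sc = K * K * N * P                       # (1) weights, standard conv
    o_sc = M * M * K * K * N * P               # (2) operations, standard conv
    w_dsc = K * K * N + N * P                  # (3) weights, depthwise separable
    o_dsc = M * M * K * K * N + M * M * N * P  # (4) operations, depthwise separable
    f_w = w_dsc / w_sc                         # (5) equals 1/P + 1/K^2
    f_o = o_dsc / o_sc                         # (6) equals 1/P + 1/K^2
    return w_sc, o_sc, w_dsc, o_dsc, f_w, f_o

if __name__ == "__main__":
    # A MobileNet-like layer: 14x14 feature map, 96 input and 96 output channels, 3x3 kernel
    w_sc, o_sc, w_dsc, o_dsc, f_w, f_o = conv_costs(M=14, N=96, P=96, K=3)
    print(f"weights:    {w_sc} -> {w_dsc}  (factor {f_w:.3f})")
    print(f"operations: {o_sc} -> {o_dsc}  (factor {f_o:.3f})")
    # both factors equal 1/96 + 1/9, roughly 0.122, i.e. about an 8x reduction
```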

One typical application of depthwise separable convolution is MobileNetV2, the successor of MobileNetV1 [14]. Compared to its first version, MobileNetV2 further decreases the number of weights by shrinking the output channels in some layers. It also improves performance by inserting one more pointwise convolution layer before the depthwise separable convolution. The new operation is called a bottleneck (Fig. 2).

Fig. 2. Bottleneck operations with different strides: (a) stride = 1, (b) stride = 2.

The network structure of MobileNetV2 is illustrated in Table I.

TABLE I
MobileNetV2 structure [15], where each line represents a sequence of 1 or more identical (except stride) layers. All depthwise convolutions use 3x3 kernels.

    input         operator         expansion  output    repeat  stride
                                   factor     channels
    224x224x3     standard conv.   -          32        1       2
    112x112x32    bottleneck       1          16        1       1
    112x112x16    bottleneck       6          24        2       2
    56x56x24      bottleneck       6          32        3       2
    28x28x32      bottleneck       6          64        4       2
    14x14x64      bottleneck       6          96        3       1
    14x14x96      bottleneck       6          160       3       2
    7x7x160       bottleneck       6          320       1       1
    7x7x320       pointwise conv.  -          1280      1       1
    7x7x1280      avgpool 7x7      -          -         1       -
    1x1x1280      pointwise conv.  -          1000      1       -
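To make the bottleneck operation concrete, the following NumPy sketch (an illustrative reconstruction, not code from the paper; normalization and biases are omitted) chains the expansion pointwise convolution, the 3 × 3 depthwise convolution, and the projection pointwise convolution, with the residual shortcut MobileNetV2 uses when the stride is 1 and the input and output channel counts match.

```python
import numpy as np

def pointwise_conv(x, w):
    # x: (H, W, N), w: (N, P) -> (H, W, P); a 1x1 convolution is a per-pixel matrix product
    return np.tensordot(x, w, axes=([2], [0]))

def depthwise_conv3x3(x, w, stride=1):
    # x: (H, W, N), w: (3, 3, N); each channel is convolved with its own 3x3 kernel
    H, W, N = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))          # 'same' padding
    out = np.zeros(((H + stride - 1) // stride, (W + stride - 1) // stride, N))
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H:stride, j:j + W:stride, :] * w[i, j, :]
    return out

def bottleneck(x, w_expand, w_dw, w_project, stride=1):
    # Expansion (1x1), depthwise (3x3), linear projection (1x1); ReLU6 after the first two.
    h = np.minimum(np.maximum(pointwise_conv(x, w_expand), 0), 6)
    h = np.minimum(np.maximum(depthwise_conv3x3(h, w_dw, stride), 0), 6)
    h = pointwise_conv(h, w_project)
    if stride == 1 and x.shape[2] == h.shape[2]:
        h = h + x                                      # residual shortcut (Fig. 2a)
    return h

# Example: a repeated stride-1 block with 96 input/output channels and expansion factor 6,
# as in the 14x14 stages of Table I.
x = np.random.randn(14, 14, 96)
y = bottleneck(x,
               w_expand=np.random.randn(96, 576),     # 96 -> 96*6 channels
               w_dw=np.random.randn(3, 3, 576),
               w_project=np.random.randn(576, 96),
               stride=1)
print(y.shape)  # (14, 14, 96)
```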


III. SYSTEM DESIGN

A. Architecture Overview

The block diagram in Fig. 3 gives an overview of the accelerator. The proposed matrix multiplication engine (MME) array is responsible for all the CNN operations, including convolution, normalization, ReLU and pooling. All the parameters and input images are stored in off-chip memory. A ping-pong weight buffer is placed between the MME array and the memory to maximize bandwidth. Biases are loaded into registers in the MME array. The feature map buffer stores all the intermediate feature maps to avoid the latency of off-chip memory reads and writes. The accelerator is controlled by a general finite state machine (FSM).

Fig. 3. Block diagram of the accelerator system.

B. Matrix Multiplication Engine

In this paper, each MME consists of 32 line buffer slices, 32 slices of 3 × 3 multiplier arrays, 1 adder tree, 1 normalization (Norm) block, 1 ReLU block and 1 pooling block (Fig. 4). In each convolution, the MME loads the feature maps and weights into the line buffers. After multiplication in the multiplier array, the adder tree sums the products according to the selected convolution type. The following operations are optional normalization, ReLU and pooling.

Fig. 4. Block diagram of an MME.
1) Line Buffer: The working length of the line buffer can be selected by the control FSM to fit different input sizes, as illustrated in Fig. 5. The implemented length is (K − 1) × M + K.

Fig. 5. Line buffer in MME.

2) Adder Tree: The adder tree is configurable to perform the summation for either depthwise or pointwise convolution (Fig. 6). In Fig. 7, black lines and blocks are shared by both types of convolution, the blue part is used for depthwise convolution, and the red part is active when pointwise convolution is selected. All biases are added in this stage.

Fig. 6. Adder tree modes for different convolutions: (a) depthwise sum, (b) pointwise sum.

Fig. 7. Block diagram of the adder tree.

3) Standard Convolution: To avoid losing too much information, standard convolution is adopted for the first convolution layer. Therefore, this accelerator is adapted to perform standard convolution with a 3-channel input feature map. For vision applications, the channel number of the input feature map is always 3.

4) Depthwise Convolution: Depthwise convolution performs the convolution for each feature map separately. As shown in Fig. 8, the adder tree is configured to sum up the products from each slice of the multiplier array in parallel. For one MME, the output channel number is 32.

Fig. 8. Depthwise convolution in MME.
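The two adder-tree modes of Fig. 6 can be modeled as two different reductions over the same 32 × 3 × 3 array of products; the NumPy sketch below is an illustrative software model (not the RTL) of how one MME combines its products in depthwise versus pointwise mode.

```python
import numpy as np

# One MME holds 32 slices of 3x3 multipliers, i.e. a (32, 3, 3) array of products.
products = np.random.randn(32, 3, 3)

# Depthwise mode: each slice is summed on its own, giving 32 output channels (Fig. 6a).
depthwise_out = products.reshape(32, 9).sum(axis=1)   # shape (32,)

# Pointwise mode: the 32 products in each of the 9 kernel positions ("cells") are summed
# across slices, giving 9 output channels (Fig. 6b).
pointwise_out = products.reshape(32, 9).sum(axis=0)   # shape (9,)

print(depthwise_out.shape, pointwise_out.shape)       # (32,) (9,)
```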


5) Pointwise Convolution: Pointwise convolution is actually standard convolution with kernel size 1 × 1 (Fig. 9). To fully utilize all the multipliers in an MME, the input feature map is divided into several M × M × 32 sub-matrices, which are shifted into the line buffers one after another. This idea comes from the divide-and-conquer approach to large matrix multiplication illustrated in Fig. 10, which divides a large matrix into several small matrices and sums the partial results after the small matrix multiplications. One MME can perform an M² × 32 by 32 × 9 matrix multiplication at once. The adder tree sums up the 32 products in each cell, as shown in Fig. 9, so the output channel number is 9.

Fig. 9. Pointwise convolution in MME.

Fig. 10. Divide-and-conquer for large matrix multiplication.
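The following sketch (illustrative, with hypothetical dimensions) shows the divide-and-conquer idea behind Fig. 10: a pointwise convolution over N input channels is computed as a sum of M² × 32 by 32 × P block products, which is what each MME pass contributes.

```python
import numpy as np

# Pointwise (1x1) convolution as a matrix product: X is (M*M, N) flattened pixels,
# W is (N, P) kernels. N is split into chunks of 32 channels, the width one MME
# consumes per pass; the partial products are accumulated, as in Fig. 10.
M, N, P = 14, 96, 9            # example sizes; P = 9 matches one MME's pointwise output
X = np.random.randn(M * M, N)
W = np.random.randn(N, P)

acc = np.zeros((M * M, P))
for c in range(0, N, 32):
    acc += X[:, c:c + 32] @ W[c:c + 32, :]   # one (M^2 x 32) by (32 x P) block product

assert np.allclose(acc, X @ W)               # same result as the full multiplication
```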

6) Normalization: After training, the parameters of batch normalization are fixed [19]. Thus the complex normalization is reduced to a multiply and add operation.

7) Pooling: Average pooling and max pooling are treated differently. As the pixels of a feature map channel are output one by one, average pooling can easily be calculated by adding one more multiply-accumulate stage with a factor of 1/S, where S is the average pooling size. Max pooling, on the other hand, needs one more comparison stage.

8) ReLU: As with the pooling layer, a ReLU stage is added after the normalization stage. Three options are selectable: no ReLU, standard ReLU and ReLU6.
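A minimal sketch of this post-processing path, assuming the usual batch-normalization folding (the paper does not spell out the exact fixed-point form): the trained normalization parameters collapse into one multiply and one add per channel, followed by an optional ReLU variant and pooling.

```python
import numpy as np

def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    # With statistics fixed after training [19], y = gamma*(x - mean)/sqrt(var + eps) + beta
    # collapses to y = a*x + b: one multiply and one add per channel.
    a = gamma / np.sqrt(var + eps)
    return a, beta - a * mean

def relu_stage(x, mode="relu6"):
    # The three selectable options: no ReLU, standard ReLU, ReLU6.
    if mode == "none":
        return x
    y = np.maximum(x, 0.0)
    return np.minimum(y, 6.0) if mode == "relu6" else y

def average_pool(x, window):
    # Average pooling as one extra multiply-accumulate: accumulate the window and
    # scale by 1/S, where S is the number of pooled pixels.
    H, W, N = x.shape
    S = window * window
    return x.reshape(H // window, window, W // window, window, N).sum(axis=(1, 3)) / S

# Example: a 7x7x1280 feature map through folded normalization, ReLU6, 7x7 average pooling
x = np.random.randn(7, 7, 1280)
a, b = fold_batchnorm(np.ones(1280), np.zeros(1280), np.zeros(1280), np.ones(1280))
y = average_pool(relu_stage(a * x + b, "relu6"), window=7)
print(y.shape)  # (1, 1, 1280)
```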

C. Memory Organization

For an efficient memory organization, one has to balance on-chip memory resources against external memory bandwidth. On-chip memory is limited on an FPGA but supplies very high bandwidth. Conversely, external memory can store a large amount of data but with the penalty of limited bandwidth. Therefore, this accelerator adopts a hierarchical memory methodology. The weight buffer loads the needed parameters from external memory before each convolution starts. This, on the one hand, reduces the latency caused by parameter loading and, on the other hand, avoids the latency brought by the limited bandwidth of external memory. Besides, the weight buffer is built as a ping-pong buffer (Fig. 11): while weight buffer 1 outputs data for the current convolution, weight buffer 2 loads the data for the next one from external memory, and vice versa.

Fig. 11. Weight buffer in ping-pong structure: (a) buffer 1 outputs, buffer 2 loads; (b) buffer 2 outputs, buffer 1 loads.

Keeping the intermediate feature maps on chip is another choice made during system design to reduce processing time. The size of this buffer depends on the number of MMEs instantiated and the size of the feature maps.
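The ping-pong scheme in Fig. 11 is classic double buffering; the sketch below is a schematic illustration (with hypothetical helper names, not the actual control logic) of how the two weight buffers alternate between feeding the MME array and being refilled from external memory.

```python
# Double-buffering sketch for the weight buffer of Fig. 11. load_weights() and
# run_convolution() are hypothetical stand-ins for the DMA transfer and one MME pass;
# in hardware the refill of the idle buffer overlaps with the running convolution.

def run_layers(layers, load_weights, run_convolution):
    buffers = [None, None]                       # the ping-pong pair
    active = 0
    buffers[active] = load_weights(layers[0])    # prefetch weights for the first layer
    for i, layer in enumerate(layers):
        idle = 1 - active
        if i + 1 < len(layers):
            buffers[idle] = load_weights(layers[i + 1])   # refill the idle buffer
        run_convolution(layer, buffers[active])           # compute with the active buffer
        active = idle                                     # swap roles for the next layer
```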
IV. RESULTS

The proposed accelerator architecture (Fig. 12) is demonstrated by implementing the MobileNetV2 network on the Arria 10 SoC Development Kit (10AS066N3F40E2SG), which contains 251680 ALMs, 2131 M20K blocks, and 1687 DSP blocks. The design considerations are described below, followed by the implementation results and resource utilization.

A. Implementation Consideration

As mentioned in Section I, lower numerical precision is sufficient for CNNs, so a 16-bit quantization strategy is chosen, as it has been widely used in previous works [2] [3] [20] [6].

Based on the description in Section III, a 4-MME array is instantiated in this design after carefully balancing resource usage against processing time. The weight buffer size is 36 Kb as a ping-pong buffer, since the weights need to be updated only every M × M clock cycles when performing depthwise separable convolution. The size of the intermediate feature map buffer is 24.5 Mb.
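The paper does not specify the exact 16-bit format; as one plausible illustration, the sketch below applies a simple symmetric fixed-point quantization of weights to 16 bits, the kind of scheme the cited low-precision works rely on to trade accuracy for memory bandwidth.

```python
import numpy as np

def quantize_int16(w, frac_bits=12):
    # Signed fixed-point: round to a multiple of 2**-frac_bits and saturate to int16.
    # frac_bits = 12 is an illustrative choice, not a value given in the paper.
    q = np.round(w * (1 << frac_bits))
    return np.clip(q, -32768, 32767).astype(np.int16)

def dequantize_int16(q, frac_bits=12):
    return q.astype(np.float32) / (1 << frac_bits)

w = np.random.uniform(-1.0, 1.0, size=(3, 3, 96)).astype(np.float32)
q = quantize_int16(w)
print(np.max(np.abs(dequantize_int16(q) - w)))   # at most half an LSB, 2**-13, about 1.2e-4
```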


B. Implementation Results

Fig. 12 presents the system architecture on the Arria 10 SoC. Since the HPS is not used in this design, only the FPGA part is shown. The DDR4 memory is the one connected to the FPGA part. The CNN accelerator runs at a frequency of 133 MHz, which is limited by its adder tree. A Nios II softcore microprocessor is implemented for loading weights and input images from flash memory to the DDR4 external memory. An external memory interface IP combined with a Modular Scatter-Gather Direct Memory Access (mSG-DMA) IP is used to bridge the buffers in the CNN accelerator and the FPGA memory, whose maximum bandwidth is 8.5 GB/s. This structure avoids the host's intervention during multiple transfers back and forth with the DDR4 memory and makes non-continuous data movement more efficient. The customized mSG-DMA controller makes it possible to read/write different sizes of data from/to specific addresses, in order to fit convolutions of various sizes.

Fig. 12. System architecture of the FPGA design.

The implementation results are listed in Table II.

TABLE II
Resource Usage of MobileNetV2

    Name                 ALM              DSP             RAM
    MME                  66127 (26.3%)    1278 (75.9%)    51 (2.4%)
    Weight Buffer        9317 (3.7%)      0 (0%)          0 (0%)
    Feature Map Buffer   1 (0%)           0 (0%)          1779 (83.4%)
    Others               6308 (2.5%)      0 (0%)          14 (0.6%)
    Total                81753 (32.5%)    1278 (75.9%)    1844 (86.5%)

Table III provides a comparison between the solution proposed in this paper and other similar ones. Note that MobileNetV2 has a more complex structure and higher accuracy on benchmarks.

TABLE III
Comparison to Other Implementations

                 [11]                [12]           this paper
    Network      RR-MobileNet        MobileNetV1    MobileNetV2
    Platform     Zynq UltraScale+    Stratix-V      Arria 10 SoC
    Speed        127.4 fps           231.7 fps      266.2 fps

V. CONCLUSION

In this paper, a high-performance, scalable CNN accelerator is proposed. The structure is optimized for depthwise separable convolution, which results in remarkably fewer operations and parameters. This makes it possible to run CNNs on portable devices. By choosing a different number of MMEs and variable on-chip memories, this accelerator can fit into a large or small FPGA. As an example, the latest MobileNetV2 is implemented on an Arria 10 SoC FPGA, achieving 266.6 fps and 170.6 GOPS.

REFERENCES

[1] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[2] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., "DaDianNao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014, pp. 609-622.
[3] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in ACM SIGARCH Computer Architecture News, vol. 43, no. 3, 2015, pp. 92-104.
[4] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," in Design Automation Conference (DAC), 2017 54th ACM/EDAC/IEEE, 2017, pp. 1-6.
[5] S. I. Venieris and C.-S. Bouganis, "fpgaConvNet: Automated mapping of convolutional neural networks on FPGAs," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017, pp. 291-292.
[6] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Field Programmable Logic and Applications (FPL), 2016 26th International Conference on, 2016, pp. 1-9.
[7] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017, pp. 45-54.
[8] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2016, pp. 26-35.
[9] R. Tapiador, A. Rios-Navarro, A. Linares-Barranco, M. Kim, D. Kadetotad, and J.-s. Seo, "Comprehensive evaluation of OpenCL-based convolutional neural network accelerators in Xilinx and Altera FPGAs," arXiv preprint arXiv:1609.09296, 2016.
[10] A. Aimar, H. Mostafa, E. Calabrese, A. Rios-Navarro, R. Tapiador-Morales, I.-A. Lungu, M. B. Milde, F. Corradi, A. Linares-Barranco, S.-C. Liu, and T. Delbruck, "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," arXiv preprint arXiv:1706.01406, 2017.
[11] J. Su, J. Faraone, J. Liu, Y. Zhao, D. B. Thomas, P. H. W. Leong, and P. Y. K. Cheung, "Redundancy-reduced MobileNet acceleration on reconfigurable logic for ImageNet classification," in Applied Reconfigurable Computing. Architectures, Tools, and Applications, 2018, pp. 16-28.
[12] R. Zhao, X. Niu, and W. Luk, "Automatic optimising CNN with depthwise separable convolution on FPGA: (abstract only)," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2018, pp. 285-285.
[13] L. Sifre and S. Mallat, "Rigid-motion scattering for texture classification," arXiv preprint arXiv:1403.1687, 2014.
[14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[15] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," arXiv preprint arXiv:1801.04381, 2018.
[16] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[17] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in International Conference on Machine Learning (ICML), 2015, pp. 1737-1746.
[18] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv preprint arXiv:1610.02357, 2016.
[19] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[20] Y. Ma, N. Suda, Y. Cao, J.-s. Seo, and S. Vrudhula, "Scalable and modularized RTL compilation of convolutional neural networks onto FPGA," in Field Programmable Logic and Applications (FPL), 2016 26th International Conference on, 2016, pp. 1-8.
