
2021 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, June 28-30, 2021

Research on FPGA Based Convolutional Neural Network Acceleration Method
978-1-6654-1867-6/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICAICA52286.2021.9498022

Tan Xiao
Department of Information Technology, Water Conservancy of Shandong Technician College, Zibo, China
[email protected]

Man Tao
School of Computer Science and Technology, Shandong University of Technology, Zibo, China
[email protected]

Abstract—In recent years, with continuous breakthroughs in algorithms, the computational complexity of target detection algorithms has kept increasing. In the forward inference stage, many practical applications impose low-latency and strict power consumption constraints, so how to realize a low-power, low-cost, and high-performance target detection platform has gradually attracted attention. Given the mobile scenario's requirements for high performance and low power consumption, a hardware acceleration architecture suitable for different CNNs is designed by combining the working principle of CNNs with the computing characteristics of FPGAs. The CNN's basic operation units, including the convolution unit, the pooling unit, and the activation function unit, are realized through high-level synthesis technology. Optimization strategies such as pipelining, dynamic fixed-point quantization, and ping-pong caching are adopted to reduce on-chip and off-chip memory access and storage resource usage. Finally, two convolutional neural networks with different structures, the LeNet-5 classification network and the YOLOv2 detection network, are selected for functional verification and performance analysis. The experimental results show that the CNN FPGA accelerator designed in this paper provides better performance with fewer resources and lower power consumption and can efficiently use the hardware resources on the FPGA.

Keywords—CNN, FPGA, Hardware accelerating, HLS

I. INTRODUCTION

The Convolutional Neural Network (CNN) has become one of the most successful algorithms in computer vision and has been widely used in many fields. However, as the CNN model's ability to deal with problems keeps improving, the inference stage often requires millions or even hundreds of millions of multiply-accumulate operations, which places high demands on the hardware running the calculations and greatly limits the use of CNNs in low-power small and medium-sized devices. This has drawn attention to using hardware with different architectures to accelerate CNN computation[1-3]. An FPGA executes the CNN through software-hardware cooperation, and its configuration can be reprogrammed to trade off speed, power consumption, and flexibility, greatly improving computing efficiency while reducing power consumption.

Current research mainly targets CNN model compression and the design and optimization of terminal inference computation. For model compression, Choi et al. quantized weights and activations separately and formed the Quantized Neural Network (QNN) technique[4]. Ding et al. proposed replacing weights with a finite combination of powers of 2, thereby replacing multiply-accumulate operations with shift operations and achieving higher accuracy and faster speed[5]. In terms of FPGA design and optimization, Zhang et al. designed an acceleration method based on block expansion of feature maps and proposed a Roofline model to optimize resource utilization[6]. Li et al. designed Single Instruction Multiple Data (SIMD) convolutions and a pipeline structure for CNN acceleration[7].

Aiming at the high computational complexity, large model footprint, and slow operation speed caused by floating-point convolution calculations, this paper designs an FPGA accelerator suitable for different CNNs. First, the basic principles of convolutional neural networks are analyzed and a general FPGA acceleration framework for CNNs is designed. Second, arithmetic units such as convolution and pooling are designed, and optimization strategies such as data quantization and ping-pong caching are adopted. Finally, the LeNet-5 image classification network and the YOLOv2 target detection network are used to verify the scheme's functionality.

II. CONVOLUTIONAL NEURAL NETWORK

The convolutional neural network is one of the basic models of neural networks. It has local connectivity and weight-sharing characteristics and can simulate the visual and auditory perception of animals. A convolutional neural network is generally a feedforward neural network composed of convolutional layers, pooling layers, and fully connected layers in a certain order. The convolutional and fully connected layers have weight parameters set by the training process; these parameters determine each layer's function and the content that the network can finally identify and classify.

Fig. 1. Network structure diagram of LeNet-5 (input → conv1 → pool1 → conv2 → pool2 → fc1 → fc2 → softmax)

LeNet-5 is a very successful and typical convolutional neural network model[8], and its network structure is shown in Figure 1.
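To make the basic operation concrete, the sketch below shows the plain nested-loop 2D convolution (with bias and ReLU) that a convolutional layer computes and that the accelerator described later must reproduce in hardware. It is a minimal C++ reference implementation; the array layout and the function name are illustrative assumptions rather than code from the paper.

```cpp
#include <vector>
#include <algorithm>

// Reference convolution: OC output maps, IC input maps, K×K kernel, stride 1,
// "valid" padding. Layouts are [channel][row][col] flattened into 1-D vectors.
// All names and sizes here are illustrative, not taken from the paper.
void conv_layer_ref(const std::vector<float>& in,  int IC, int H, int W,
                    const std::vector<float>& wgt,                 // OC*IC*K*K weights
                    const std::vector<float>& bias,                // OC biases
                    std::vector<float>&       out, int OC, int K) {
    const int OH = H - K + 1, OW = W - K + 1;
    out.assign(static_cast<size_t>(OC) * OH * OW, 0.0f);
    for (int oc = 0; oc < OC; ++oc)
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox) {
                float acc = bias[oc];
                for (int ic = 0; ic < IC; ++ic)          // accumulate over input channels
                    for (int ky = 0; ky < K; ++ky)
                        for (int kx = 0; kx < K; ++kx)
                            acc += in[(ic * H + oy + ky) * W + ox + kx] *
                                   wgt[((oc * IC + ic) * K + ky) * K + kx];
                out[(oc * OH + oy) * OW + ox] = std::max(acc, 0.0f);  // ReLU activation
            }
}
```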

YOLOv2 supports input images of any resolution. After input, the image is scaled to 416×416, and the vacant areas are filled with the value 0.5. The network consists of 32 layers in total, which involve the following layer types:

• Convolutional layer
• BN (Batch Normalization) layer
• Maximum pooling layer
• Routing layer
• Reordering layer and the final detection layer

The convolution kernel sizes are 3×3 and 1×1. A BN layer between the convolutional layer and the activation function plays a regularizing role. In the forward inference stage, the BN layer and the convolutional layer can be fused to reduce operations. The fusion is shown in Formula 1:

y_{conv-bn} = \hat{\omega} \cdot x + \hat{b}    (1)

Among them, \hat{\omega} and \hat{b} are the weight and bias parameters after fusion, which are constants and are computed as follows:

\hat{\omega} = \frac{\gamma}{\sqrt{\sigma^{2} + \varepsilon}} \cdot \omega    (2)

\hat{b} = \frac{\gamma}{\sqrt{\sigma^{2} + \varepsilon}} \cdot (b - u) + \beta    (3)

Among them, ω and b are the convolution kernel weight and bias, γ is the scale parameter, β is the offset parameter, u is the sample mean, σ² is the sample variance, and ε is a small constant that avoids division by zero; all of these are constants during inference.

The routing layer only involves the transfer of data and does not involve convolution calculations. The reordering layer is used to sample and rearrange features.
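As a concrete illustration of Formulas (1)-(3), the sketch below folds the BN parameters into the convolution weights and bias offline, so that inference only performs the fused multiply-add of Formula (1). It is a minimal C++ sketch assuming per-output-channel BN parameters; the struct and function names are placeholders, not code from the paper.

```cpp
#include <cmath>
#include <vector>

// Per-output-channel parameters of a convolution followed by batch normalization.
struct ConvBnParams {
    std::vector<float> w;      // convolution weights, one block per output channel
    std::vector<float> b;      // convolution bias, one per output channel
    std::vector<float> gamma;  // BN scale
    std::vector<float> beta;   // BN offset
    std::vector<float> mean;   // BN running mean u
    std::vector<float> var;    // BN running variance sigma^2
};

// Fold BN into the convolution (Formulas (2) and (3)):
//   w_hat = gamma / sqrt(var + eps) * w
//   b_hat = gamma / sqrt(var + eps) * (b - mean) + beta
// Done once, offline, per layer; afterwards the BN layer disappears from inference.
void fuse_conv_bn(ConvBnParams& p, int out_channels, int weights_per_channel,
                  float eps = 1e-5f) {
    for (int oc = 0; oc < out_channels; ++oc) {
        const float scale = p.gamma[oc] / std::sqrt(p.var[oc] + eps);
        for (int k = 0; k < weights_per_channel; ++k)
            p.w[oc * weights_per_channel + k] *= scale;
        p.b[oc] = scale * (p.b[oc] - p.mean[oc]) + p.beta[oc];
    }
}
```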
III. FPGA ACCELERATOR DESIGN

A. Calculation Framework

The overall ARM+FPGA architecture is used to design the forward inference acceleration of the convolutional neural network, including the ARM processor part (Processing System, PS) and the FPGA logic resource part (Programmable Logic, PL). In the software/hardware partition, the FPGA mainly implements the operations of the network, including the convolution part and the pooling part; the ARM processor is mainly responsible for preprocessing and distributing the input data and parameters, controlling the FPGA in the initial stage, and outputting the image results. The structural framework of the CNN accelerator is shown in Figure 2.

Fig. 2. CNN accelerator framework (input, weight, and output buffers; Conv, ReLU, Pool, and Reorg modules; parameter configuration; controller; ARM; DDR)

B. Convolution Unit Design

In this paper, a convolution calculation unit is designed according to the characteristics of convolution calculation. Taking four input channels and one output channel as an example, as shown in Figure 3 and Figure 4, it includes a storage module, shift registers, a multiplier array, an adder tree, and an accumulator. A multiply-add tree calculation method is adopted that does not change the sliding-window calculation process and uses fewer DSP resources. The convolutional layer takes multiple feature maps and their corresponding weights as input and obtains a set of convolution results after multiplying and accumulating; after adding this set of results, the output of one convolution calculation is obtained, and the result is then cached.

Fig. 3. Process element (storage module, shift register, multiplier array, addition tree, accumulator)

Fig. 4. Convolution computing unit (four input-channel process elements whose outputs are summed into one output channel)

Due to BRAM resource limitations, it is impossible to store all data in BRAM for calculation. The design proposes storing the input feature map in the LUT-based storage module, while the weights are stored in the BRAM-based storage module.
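The following HLS-style C++ sketch illustrates one way such a process element could be written: a multiplier array feeding an adder-tree reduction that sums a 3×3 window across four input channels. It is only a sketch of the multiply-add-tree idea under assumed sizes (four channels, 3×3 kernels, 16-bit fixed-point data) and requires the Vivado HLS ap_fixed header; the types, pragmas, and names are not taken from the paper's source code.

```cpp
#include <ap_fixed.h>   // Vivado HLS arbitrary-precision fixed-point types

typedef ap_fixed<16, 6>  data_t;   // assumed 16-bit fixed-point format
typedef ap_fixed<32, 12> acc_t;    // wider accumulator to absorb bit growth

const int CH = 4;   // input channels processed in parallel (example from Fig. 4)
const int K  = 3;   // kernel size

// One output pixel: multiply a K×K window by its weights for CH channels,
// then reduce the products; unrolling lets HLS build the multiplier array
// and balance the additions into an adder tree.
acc_t conv_pe(const data_t win[CH][K][K], const data_t wgt[CH][K][K]) {
#pragma HLS INLINE
    acc_t prod[CH * K * K];
#pragma HLS ARRAY_PARTITION variable=prod complete
    for (int c = 0; c < CH; ++c)              // multiplier array
        for (int i = 0; i < K; ++i)
            for (int j = 0; j < K; ++j) {
#pragma HLS UNROLL
                prod[(c * K + i) * K + j] = win[c][i][j] * wgt[c][i][j];
            }
    acc_t sum = 0;                            // reduction of the 36 products
    for (int n = 0; n < CH * K * K; ++n) {
#pragma HLS UNROLL
        sum += prod[n];
    }
    return sum;
}
```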
C. Pooling Unit Design

The pooling module designed in this paper is shown in Figure 5. It traverses the input channels, the height and width of the input feature map, and the pooling kernel size through five for loops, and one of two pooling methods can be selected according to the model.

Fig. 5. Pooling computing unit (comparator-based maximum pooling with a selector and register; average pooling built from a data buffer, a constant-coefficient multiplier, an adder, and a register)

Average pooling is realized by multiplying the input data by a constant coefficient and accumulating, while maximum pooling is realized by comparators. Pooling uses a 2×2 sliding window with a stride of 2.
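A possible C++ rendering of that five-loop structure is sketched below for the 2×2, stride-2 case, selecting between maximum and average pooling at run time. The loop bounds, the 0.25 averaging coefficient, and the function name are assumptions made for illustration.

```cpp
#include <vector>
#include <algorithm>

enum class PoolMode { kMax, kAverage };

// 2x2, stride-2 pooling over a [channels][H][W] feature map stored flat.
// Five nested loops: channel, output row, output column, window row, window column.
void pool_unit(const std::vector<float>& in, int channels, int H, int W,
               std::vector<float>& out, PoolMode mode) {
    const int OH = H / 2, OW = W / 2, K = 2;
    out.assign(static_cast<size_t>(channels) * OH * OW, 0.0f);
    for (int c = 0; c < channels; ++c)
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox) {
                float max_v = in[(c * H + oy * 2) * W + ox * 2];
                float sum_v = 0.0f;
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx) {
                        const float v = in[(c * H + oy * 2 + ky) * W + ox * 2 + kx];
                        max_v = std::max(max_v, v);   // comparator path (max pooling)
                        sum_v += v;                   // accumulate for averaging
                    }
                out[(c * OH + oy) * OW + ox] =
                    (mode == PoolMode::kMax) ? max_v : sum_v * 0.25f;  // constant coefficient
            }
}
```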
D. Optimization Strategy

1) Data quantization

To reduce the FPGA resource consumption of the calculation, data quantization is generally used to reduce the data bit width in the network model's computations, under the premise of preserving the network's accuracy[9]. According to experiment, the resource consumption of multipliers and adders under different precisions is shown in Table I.

TABLE I. RESOURCE CONSUMPTION COMPARISON

Operation (data precision) | DSP | LUT
Adder (Float32)            | 2   | 231
Multiplier (Float32)       | 3   | 144
Adder (Fixed16)            | 0   | 52
Multiplier (Fixed16)       | 1   | 99

Table I shows that the 16-bit fixed-point adder and multiplier consume fewer resources. Therefore, this paper draws on the quantization ideas of literature [10-12] and uses 16-bit dynamic fixed-point numbers to quantize the parameters and feature maps of the convolutional neural network. In the quantization, the order code (exponent) indicates the number of bits allocated to the fractional part of each layer's parameters and feature maps. The order code within a layer is fixed, while the order code between layers is dynamically adjusted according to the data distribution range.

Fig. 6. Dynamic fixed-point quantization format (sign bit s, order code f, mantissa bits x_i split into integer and fractional parts)

The fixed-point number x_fixed in dynamic fixed-point quantization is expressed by the following formula:

x_{fixed} = (-1)^{s} \times \sum_{i=0}^{B-2} x_i \cdot 2^{-f} \cdot 2^{i}    (4)

In the formula, s represents the sign bit, f represents the order code, B represents the bit width of the quantization, and x_i represents the mantissa bits. The highest bit of the fixed-point number x_fixed is the sign bit. Before quantization, the optimal order code f for each layer needs to be found.

TABLE II. EXPONENT OF CNN LAYERS

Layer | Order codes
conv1 | parameter: 15, output: 14
pool1 | output: 14
conv2 | parameter: 15, output: 13
pool2 | output: 13
fc1   | parameter: 15, output: 11
fc2   | parameter: 15, output: 10

Taking the LeNet-5 model as an example, its order codes are shown in Table II.
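The sketch below shows one way Formula (4) can be applied in software: pick a per-layer order code f from the data range, then convert each float to a 16-bit value by scaling with 2^f and rounding. It is a minimal C++ illustration assuming B = 16; the function names and the range-based choice of f are assumptions, not the paper's exact procedure, and a two's-complement int16_t is used for convenience where Formula (4) describes a sign-magnitude layout.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

const int B = 16;  // assumed quantization bit width (1 sign bit + 15 mantissa bits)

// Choose the largest fractional bit count f such that the largest magnitude
// in the layer still fits into the 15 mantissa bits (dynamic, per layer).
int choose_order_code(const std::vector<float>& data) {
    float max_abs = 0.0f;
    for (float v : data) max_abs = std::max(max_abs, std::fabs(v));
    int f = B - 2;                                              // start with all bits fractional
    while (f > -B && max_abs * std::ldexp(1.0f, f) > ((1 << (B - 1)) - 1))
        --f;                                                    // shrink until the value fits
    return f;
}

// Formula (4) in reverse: store round(x * 2^f) in a signed 16-bit word.
int16_t to_fixed(float x, int f) {
    long q = std::lround(std::ldexp(x, f));                     // x * 2^f, rounded
    q = std::min(std::max(q, -32768L), 32767L);                 // saturate to the int16 range
    return static_cast<int16_t>(q);
}

// And back: x ≈ q * 2^-f, matching x_fixed = (-1)^s * sum_i x_i * 2^-f * 2^i.
float to_float(int16_t q, int f) { return std::ldexp(static_cast<float>(q), -f); }
```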
2) Ping-pong buffer

The essence of the ping-pong cache is to use low-speed processing modules to achieve high-speed data stream transmission. By overlapping read time and write time, the data stream is transmitted continuously. If the ping-pong buffer operation is regarded as a whole, the data streams at both ends are continuous, which makes it very suitable for pipelined processing of the data stream. Therefore, ping-pong operations are often used in pipelined algorithms to achieve gapless buffering and data processing. The schematic diagram of the ping-pong cache is shown in Figure 7.

Fig. 7. Ping-pong buffer (input data stream selection MUX, cache A and cache B, output data stream selection MUX, data operation unit)
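In HLS-style C++, the same idea can be expressed by alternating two on-chip buffers so that one is filled while the other is consumed, as sketched below. This is a generic illustration of the ping-pong pattern with invented names and a made-up tile size; it is not the accelerator's actual code.

```cpp
const int TILE = 1024;           // assumed tile size in data words
typedef short data_t;            // assumed 16-bit word

// Placeholder data mover and operation unit; the real design would move data
// from DDR and run the convolution/pooling pipeline instead.
void load(const data_t* src, data_t buf[TILE], int tile_idx) {
    for (int j = 0; j < TILE; ++j) buf[j] = src[tile_idx * TILE + j];
}
void compute(const data_t buf[TILE], data_t* dst, int tile_idx) {
    for (int j = 0; j < TILE; ++j) dst[tile_idx * TILE + j] = buf[j];  // identity stand-in
}

// Ping-pong schedule: in every iteration one buffer is being filled while the
// other is being drained; because the two calls touch different buffers, an
// HLS scheduler (or a DMA plus the datapath) can execute them concurrently,
// keeping the data stream gapless.
void ping_pong_top(const data_t* src, data_t* dst, int num_tiles) {
    data_t buf_a[TILE], buf_b[TILE];
    load(src, buf_a, 0);                                      // prologue: fill buffer A
    for (int i = 0; i < num_tiles; ++i) {
        if (i % 2 == 0) {
            if (i + 1 < num_tiles) load(src, buf_b, i + 1);   // fill B ...
            compute(buf_a, dst, i);                           // ... while draining A
        } else {
            if (i + 1 < num_tiles) load(src, buf_a, i + 1);   // fill A ...
            compute(buf_b, dst, i);                           // ... while draining B
        }
    }
}
```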

IV. EXPERIMENTAL RESULT

A. Experimental Environment

CNN models: LeNet-5 and YOLOv2. LeNet-5 uses the MNIST and CIFAR-10 data sets, and YOLOv2 uses the COCO data set.

FPGA platform: Xilinx PYNQ-Z2. Its resources include 140 18Kb BRAMs, 220 DSP48E slices, 106,400 FFs, and 53,200 LUTs.

In the experiment, Vivado HLS 2019.1 is used for accelerator design, and Vivado 2019.1 is used for synthesis and implementation.

B. Resource Assessment

After the design is synthesized, placed, and routed, the resource usage reported by Vivado 2019.1 is shown in Table III. The table lists the FPGA resource usage of the LeNet-5 model on the MNIST and CIFAR-10 data sets and of the YOLOv2 model on the COCO data set. Because the method used in this article places all values of the feature map in on-chip BRAM and stores the weight parameters in the LUT-based storage module, the utilization of BRAM and LUT resources is very high.

TABLE III. FPGA RESOURCE CONSUMPTION

Model              | LUT (53200) | FF (106400) | BRAM (140) | DSP (220)
LeNet-5 (MNIST)    | 34180       | 19163       | 140        | 128
LeNet-5 (CIFAR-10) | 42014       | 21525       | 122.5      | 158
YOLOv2 (COCO)      | 35948       | 32452       | 87.5       | 149

C. Performance Evaluation

For the LeNet-5 network accelerator, the system clock frequency used in this article is 100 MHz, the time to identify each picture is 2.067 ms on the MNIST data set and 4.98 ms on the CIFAR-10 data set, and the power consumption is only about 1.6 W. Through board-level tests of this design, the performance and other data of the system are compared with other works, as shown in Table IV below. Both [13] and [14] studied acceleration schemes for small networks on FPGA. By comparison, literature [13] uses lower precision, while this article is better in terms of single-frame time, power consumption, and performance. Although literature [14] is similar to this article in terms of power consumption, this article achieves 15.7 times the performance and 15.5 times the energy efficiency ratio.

TABLE IV. LENET-5 RESULTS COMPARISON

                    | [13]    | Ours     | [14]     | Ours
Dataset             | MNIST   | MNIST    | CIFAR-10 | CIFAR-10
Precision (bit)     | 8-fixed | 16-fixed | 16-fixed | 16-fixed
Clock (MHz)         | 100     | 100      | 100      | 100
Power (W)           | 1.73    | 1.65     | 1.68     | 1.69
Performance (GOP/s) | 0.27    | 3.42     | 0.11     | 1.73

For the YOLOv2 network accelerator, the system clock frequency used in this article is 150 MHz, the time to recognize each picture is 1.12 s, the performance is 23.08 GOP/s, and the power consumption is 2.387 W. The comparison with other works is shown in Table V below. Although the performance of literature [15] is better than our design, it uses 2240 DSPs, about 15 times as many as ours, and its performance per DSP is not as good as ours.

TABLE V. YOLOV2 RESULTS COMPARISON

                    | [15]            | Ours
FPGA                | Virtex-7 VX485T | Zynq XC7Z020
Clock (MHz)         | 100             | 150
Power (W)           | N/A             | 2.39
Precision (bit)     | 32-float        | 16-fixed
DSP                 | 2240            | 149
Performance (GOP/s) | 61.6            | 26.23
GOP/s/DSP           | 0.028           | 0.18

In summary, the FPGA acceleration scheme designed in this article achieves better overall performance and effectively improves efficiency.

V. CONCLUSION

In order to solve the problems of limited resources and high power consumption when running convolutional neural networks on hardware devices, an accelerator architecture suitable for different convolutional neural networks is presented. The convolution calculation strategy designed for the structure of the CNN addresses the large amount of FPGA resources that convolutional neural network computation requires. This design implements LeNet-5 and YOLOv2 on the XC7Z020. The results show that, compared with implementations on CPU, GPU, and other FPGA platforms, the method presented in this paper maintains accuracy, saves resources, and reduces power consumption, which is of great significance to the acceleration of convolutional neural networks on platforms with limited resources and power budgets.

ACKNOWLEDGMENT

I want to thank Dr. Qu, my supervisor, for providing me an opportunity to participate in this emerging field of research and for all of the guidance and support he has provided throughout my research and thesis.

Finally, I want to express my sincere gratitude to my parents for all of their encouragement and support throughout this process. I would also like to thank my friends and family members for their wishes and hopes.

REFERENCES

[1] COATES A, HUVAL B, WANG T, et al. Deep learning with COTS HPC systems[C]//Proceedings of the 30th International Conference on Machine Learning (ICML), 2013: 2374-2382.
[2] KAYAER K, TAVSANOGLU V. A new approach to emulate CNN on FPGAs for real time video processing[C]//International Workshop on Cellular Neural Networks and Their Applications. IEEE, 2008: 23-28.
[3] WU Y X, LIANG K, LIU Y, et al. The progress and trends of FPGA-based accelerators in deep learning[J]. Chinese Journal of Computers, 2019, 42(11): 2461-2480.
[4] CHOI J, CHUANG P, WANG Z, et al. Bridging the accuracy gap for 2-bit quantized neural networks (QNN)[J]. arXiv preprint arXiv:1807.06964, 2018.
[5] DING R Z, LIU Z Y, CHIN T W, et al. FLightNNs: Lightweight quantized deep neural networks for fast and accurate inference[C]//2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, 2019: 1-6.
[6] ZHANG C, LI P, SUN G Y, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks[C]//Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. New York, NY, USA: Association for Computing Machinery, 2015: 161-170.
[7] LI H M, FAN X T, LI J, et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks[C]//2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, 2016: 1-9.
[8] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[9] HUBARA I, COURBARIAUX M, SOUDRY D, et al. Quantized neural networks: Training neural networks with low precision weights and activations[J]. Journal of Machine Learning Research, 2016, 18.
[10] QIU J T, WANG J, YAO S, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]//Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. New York, NY, USA: Association for Computing Machinery, 2016: 26-35.
[11] SHAN L, ZHANG M X, DENG L, et al. A dynamic multi-precision fixed-point data quantization strategy for convolutional neural network[C]//Proceedings of the 20th CCF Conference on Computer Engineering and Technology, Xi'an, Aug 10-12, 2016. Berlin, Heidelberg: Springer, 2016: 102-111.
[12] GYSEL P, MOTAMEDI M, GHIASI S. Hardware-oriented approximation of convolutional neural networks[J]. arXiv preprint arXiv:1604.03168, 2016.
[13] ZHENG W K, YANG J M. Implementation and optimization of accelerating convolution neural network on FPGA[J]. Journal of Shandong Normal University (Natural Science), 2019, 34(2): 186-192.
[14] SUN L, XIAO J Q, XIA Y, et al. Improved convolutional neural network recognition model based on embedded SOC[J]. 2020, 37(3): 258-260.
[15] ZHANG C, LI P, SUN G. Optimizing FPGA-based accelerator design for deep convolutional neural networks[C]//ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2015: 161-170.
