A Reconfigurable CNN-Based Accelerator Design For Fast and Energy-Efficient Object Detection System On Mobile FPGA
ABSTRACT In limited-resource edge computing environments such as mobile devices, IoT devices, and electric vehicles, energy-efficient, optimized convolutional neural network (CNN) accelerators implemented on mobile Field Programmable Gate Arrays (FPGAs) are becoming more attractive due to their high accuracy and scalability. Mobile FPGAs such as the Xilinx PYNQ-Z1/Z2 and Ultra96 have a clear advantage in scalability and flexibility for implementing deep-learning-based object detection applications. They are also suitable for battery-powered systems, especially drones and electric vehicles, because of their energy efficiency in terms of power consumption and size. However, they offer low and limited performance for real-time processing. In this article, we introduce an accelerator design flow optimized at the register-transfer level (RTL) that achieves fast processing speed by applying low-power techniques to the FPGA accelerator implementation. In general, most accelerator optimization techniques are applied at the system level on the FPGA; here, we propose a reconfigurable accelerator design for a CNN-based object detection system at the register-transfer level on a mobile FPGA. Furthermore, we present RTL optimization techniques, such as various types of clock gating, to eliminate residual signals and to deactivate unnecessarily active blocks. Based on an analysis of the CNN-based object detection architecture, we classify the common computing components of the convolutional neural network, such as multipliers and adders, implement the multiplier/adder pair as a universal computing unit, and modularize it to fit the hierarchical structure of the RTL code. The proposed design was tested with ResNet-20, which has 23 layers and was trained on CIFAR-10, a dataset that provides a test set of 10,000 images in several formats; the weight data used for this experiment were provided by Tensil. Experimental results show that the proposed design process reduces power consumption by 16% and improves hardware utilization and throughput by up to 58% and 15%, respectively.
INDEX TERMS FPGA accelerator, CNN accelerator, RT level design techniques, low power techniques,
reconfigurable accelerator, CNN-based object detection, low power consumption, high performance, mobile
FPGA.
I. INTRODUCTION
Convolutional Neural Network (CNN)-based object detection applications have been applied in various systems, including Field Programmable Gate Array (FPGA) devices ranging from personal mobile devices to industrial machines such as healthcare devices, smart surveillance systems, Advanced Driver Assistance Systems (ADAS), drones, and logistics robots [1], [2], [3], [4], [5], [6]. To achieve high recognition accuracy, CNNs have become an essential feature of
FIGURE 14. Pseudocode for the proposed MAC operation.
FIGURE 16. HW resource report comparison of the Tensil sample simulation and our work tested on PYNQ-Z1.
bandwidth requirements; however, to achieve high performance, the optimal bandwidth should be determined by analyzing the network architecture. Once the data transmission size is fixed, memory splitting and merging should be applied. Our CNN accelerator uses the 16-bit fixed-point data width given by the reference design [24]. We modularize the RTL code based on a thorough analysis of the architecture, which makes it easy to modify the implementation of the accelerator design.
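To illustrate this modular structure, the following is a minimal Verilog sketch of a 16-bit fixed-point multiply-accumulate (MAC) unit wrapped as one reusable module; the module name, port names, and accumulator width are illustrative assumptions for this example, not the exact RTL of the proposed accelerator.

```verilog
// Illustrative 16-bit fixed-point MAC unit (not the authors' exact RTL).
// The multiplier/adder pair is wrapped in one module so it can be reused
// as a "universal computing unit" across different layer types.
module mac16 #(
    parameter DATA_W = 16,              // fixed-point word width
    parameter ACC_W  = 40               // accumulator width with headroom
) (
    input  wire                     clk,
    input  wire                     rst_n,
    input  wire                     en,   // update only when the unit is active
    input  wire signed [DATA_W-1:0] a,    // activation
    input  wire signed [DATA_W-1:0] w,    // weight
    output reg  signed [ACC_W-1:0]  acc   // running partial sum
);
    wire signed [2*DATA_W-1:0] product = a * w;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            acc <= {ACC_W{1'b0}};
        else if (en)
            acc <= acc + {{(ACC_W-2*DATA_W){product[2*DATA_W-1]}}, product};
    end
endmodule
```

Wrapping the multiplier and adder in a single parameterized module lets the same unit be instantiated across convolutional and fully connected layers and gated or swapped without touching the surrounding control logic, which is the intent of the modular RTL hierarchy described above.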
V. EXPERIMENT AND RESULTS WITH DISCUSSION
A. EXPERIMENT ENVIRONMENT
For the basic hardware platform, we chose the PYNQ-Z1 board instead of a regular ZYNQ-7020 board; PYNQ is an open-source project from AMD [36]. The board embeds a Xilinx ZYNQ-7020 and provides a Jupyter-based framework with Python APIs. The PYNQ-Z1 is an FPGA-SoC platform composed of programmable logic (PL) and a processing system (PS). The basic software development tool is Jupyter Notebook, a web-based software programming platform; it also supports the Python and C/C++ programming languages and open-source libraries such as OpenCV. Our experiment environment is shown in Figure 15. The imported CNN architecture is ResNet-20; it has 23 layers and was trained on CIFAR-10, which provides a test set of 10,000 images in several formats. We used the provided weight file and converted the ResNet model to the ONNX format [40]. ONNX, a machine learning (ML) model converter, provides the converted ML model code in ONNX format. The Tensil compiler then generates three artifacts: a .tmodel, a .tdata, and a .tprog file. Once the .tmodel manifest is loaded into the driver, it tells the driver where to locate the binary files, program data, and weight data. These files are not open data, and we used them without any modification, so the accuracy was not changed.
B. FPGA IMPLEMENTATION RESULTS
Compared with Tensil's optimization result, we verified that more register buffers are activated in our proposed structure. Once the functionality and the performance results are checked, the structure can be modified through RTL code changes; we can then improve specific hardware resources and the power consumption of the design. Analyzing the results leads to improved CNN processing performance. Figure 16 shows the power consumption reduction of the processing system unit. We were able to achieve 43.9 GOPs/W in power efficiency, which is a 1.37-times increase compared to other FPGA board implementations, and the hardware resource utilization in DSPs increased 2.2 times over the result of [24].
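As a quick sanity check using only the figures reported above (an illustrative calculation, not an additional measurement), the 1.37-times gain implies that the compared implementations reach roughly

43.9 GOPs/W / 1.37 ≈ 32 GOPs/W.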
C. POWER CONSUMPTION RESULTS
Our optimization decreases the dynamic power consumption by 16%, and the total on-chip power is reduced by 20% of the total power consumption. Once the global buffer is activated, the unused global clock buffer and the second global clock resource help to improve the performance of the design. Moreover, this can be a solution for some high fan-out signals, making the device fully functional. In the pipeline logic, inserting an intermediate flip-flop (FF) can improve the operating speed of the device; however, too many flip-flops increase the computational complexity. Our low-power techniques show better performance than simply inserting FFs.
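As an illustration of the kind of RT-level clock gating referred to here, the following is a minimal Verilog sketch of a latch-based clock-gating cell driving a register bank. It is a simplified, generic example under our own naming, not the exact advanced clock-gating, register-Z, or OR-based MAC structures proposed in this article.

```verilog
// Illustrative clock-gating cell and gated register (simplified example).
// A level-sensitive latch holds the enable while the clock is high so the
// gated clock cannot glitch; the register bank then toggles only when new
// data actually needs to be captured.
module clock_gate (
    input  wire clk,
    input  wire en,        // high when the downstream block has work to do
    output wire gclk       // gated clock
);
    reg en_latched;
    always @(clk or en)
        if (!clk)          // latch is transparent only while clk is low
            en_latched <= en;
    assign gclk = clk & en_latched;
endmodule

module gated_reg #(
    parameter W = 16
) (
    input  wire         clk,
    input  wire         en,
    input  wire [W-1:0] d,
    output reg  [W-1:0] q
);
    wire gclk;
    clock_gate u_cg (.clk(clk), .en(en), .gclk(gclk));

    always @(posedge gclk)
        q <= d;            // no clock toggling, hence no dynamic power, when en = 0
endmodule
```

When en stays low, the register bank receives no clock edges, so its dynamic power approaches zero; the techniques proposed in this article apply this idea at a finer granularity inside the MAC units.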
VI. CONCLUSION
In this article, the proposed highly reconfigurable FPGA hardware accelerator showed improved processing speed and power consumption during inference of various CNNs. The hardware optimization was conducted mainly for two purposes: to improve the throughput and to reduce the power consumption. To improve performance, a minimized data-transfer strategy was applied by assigning the maximum amount of buffers during the computations and by applying a controlled pipeline design for minimized data access. To achieve energy-efficient CNN object detection, we applied not only data-access control for minimized memory access but also RT-level low-power techniques in the reconfigured MAC units, such as an adder with advanced clock gating, register Z with a bus-specific clock, and an OR-based MAC architecture, to the RTL code of the proposed accelerator. The proposed hardware accelerator for ResNet-20 was implemented on a mobile FPGA-SoC, the PYNQ-Z1, and the power consumption was measured during inference. As a result, the throughput showed a 15% improvement compared with the baseline RTL code of the accelerator, the power consumption was reduced by 16%, and the hardware utilization was increased by 58%. The object detection processing speed was 9.17 FPS, which shows that real-time processing is feasible on a mobile FPGA.

ACKNOWLEDGMENT
The authors would like to thank their colleagues from KETI and KEIT who provided insight and expertise that greatly assisted the research and greatly improved the manuscript.
REFERENCES
[1] A. K. Jameil and H. Al-Raweshidy, "Efficient CNN architecture on FPGA using high level module for healthcare devices," IEEE Access, vol. 10, pp. 60486–60495, 2022.
[2] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B. Choi, and T. Faughnan, "Smart surveillance as an edge network service: From Harr-Cascade, SVM to a lightweight CNN," in Proc. IEEE 4th Int. Conf. Collaboration Internet Comput. (CIC), Oct. 2018, pp. 256–265.
[3] K. Haeublein, W. Brueckner, S. Vaas, S. Rachuj, M. Reichenbach, and D. Fey, "Utilizing PYNQ for accelerating image processing functions in ADAS applications," in Proc. 32nd Int. Conf. Archit. Comput. Syst., May 2019, pp. 1–8.
[4] Z. Zhang, M. A. P. Mahmud, and A. Z. Kouzani, "FitNN: A low-resource FPGA-based CNN accelerator for drones," IEEE Internet Things J., vol. 9, no. 21, pp. 21357–21369, Nov. 2022.
[5] C. Fu and Y. Yu, "FPGA-based power efficient face detection for mobile robots," in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2019, pp. 467–473.
[6] X. Li, X. Gong, D. Wang, J. Zhang, T. Baker, J. Zhou, and T. Lu, "ABM-SpConv-SIMD: Accelerating convolutional neural network inference for industrial IoT applications on edge devices," IEEE Trans. Netw. Sci. Eng., early access, Feb. 25, 2022, doi: 10.1109/TNSE.2022.3154412.
[7] S. Tamimi, Z. Ebrahimi, B. Khaleghi, and H. Asadi, "An efficient SRAM-based reconfigurable architecture for embedded processors," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 3, pp. 466–479, Mar. 2019.
[8] A. J. A. El-Maksoud, M. Ebbed, A. H. Khalil, and H. Mostafa, "Power efficient design of high-performance convolutional neural networks hardware accelerator on FPGA: A case study with GoogLeNet," IEEE Access, vol. 9, pp. 151897–151911, 2021.
[9] S. Lee, D. Kim, D. Nguyen, and J. Lee, "Double MAC on a DSP: Boosting the performance of convolutional neural networks on FPGAs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 5, pp. 888–897, May 2019.
[10] S. Ullah, S. Rehman, M. Shafique, and A. Kumar, "High-performance accurate and approximate multipliers for FPGA-based hardware accelerators," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 2, pp. 211–224, Feb. 2022.
[11] X. Wu, Y. Ma, M. Wang, and Z. Wang, "A flexible and efficient FPGA accelerator for various large-scale and lightweight CNNs," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 3, pp. 1185–1198, Mar. 2022.
[12] W. Liu, J. Lin, and Z. Wang, "A precision-scalable energy-efficient convolutional neural network accelerator," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 10, pp. 3484–3497, Oct. 2020.
[13] H. Irmak, D. Ziener, and N. Alachiotis, "Increasing flexibility of FPGA-based CNN accelerators with dynamic partial reconfiguration," in Proc. 31st Int. Conf. Field-Programmable Log. Appl. (FPL), Aug. 2021, pp. 306–311.
[14] W. Chen, D. Wang, H. Chen, S. Wei, A. He, and Z. Wang, "An asynchronous and reconfigurable CNN accelerator," in Proc. IEEE Int. Conf. Electron Devices Solid State Circuits (EDSSC), Jun. 2018, pp. 1–2.
[15] C. Yang, Y. Wang, H. Zhang, X. Wang, and L. Geng, "A reconfigurable CNN accelerator using tile-by-tile computing and dynamic adaptive data truncation," in Proc. IEEE Int. Conf. Integr. Circuits, Technol. Appl. (ICTA), Nov. 2019, pp. 73–74.
[16] S. Zeng, K. Guo, S. Fang, J. Kang, D. Xie, Y. Shan, Y. Wang, and H. Yang, "An efficient reconfigurable framework for general purpose CNN-RNN models on FPGAs," in Proc. IEEE 23rd Int. Conf. Digit. Signal Process. (DSP), Nov. 2018, pp. 1–5.
[17] L. Gong, C. Wang, X. Li, H. Chen, and X. Zhou, "MALOC: A fully pipelined FPGA accelerator for convolutional neural networks with all layers mapped on chip," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 11, pp. 2601–2612, Nov. 2018.
[18] L. Bai, Y. Zhao, and X. Huang, "A CNN accelerator on FPGA using depthwise separable convolution," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 10, pp. 1415–1419, Oct. 2018.
[19] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, New York, NY, USA: Association for Computing Machinery, Feb. 2016, pp. 26–35, doi: 10.1145/2847263.2847265.
[20] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Patel, and M. Herbordt, "A framework for acceleration of CNN training on deeply-pipelined FPGA clusters with work and weight load balancing," in Proc. 28th Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2018, p. 394.
[21] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong, "FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates," in Proc. IEEE 25th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), Apr. 2017, pp. 152–159.
[22] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, "Optimizing the convolution operation to accelerate deep neural networks on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 7, pp. 1354–1367, Jul. 2018.
[23] S. Li, Y. Luo, K. Sun, N. Yadav, and K. K. Choi, "A novel FPGA accelerator design for real-time and ultra-low power deep convolutional neural networks compared with Titan X GPU," IEEE Access, vol. 8, pp. 105455–105471, 2020.
[24] Learn Tensil With ResNet and PYNQ Z1. Accessed: Dec. 15, 2022. [Online]. Available: https://www.tensil.ai/docs/tutorials/resnet20-pynqz1/
[25] X. Zhang, Y. Ma, J. Xiong, W. W. Hwu, V. Kindratenko, and D. Chen, "Exploring HW/SW co-design for video analysis on CPU-FPGA heterogeneous systems," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 6, pp. 1606–1619, Jun. 2022.
[26] E. Antolak and A. Pulka, "Energy-efficient task scheduling in design of multithread time predictable real-time systems," IEEE Access, vol. 9, pp. 121111–121127, 2021.
[27] W. Huang, H. Wu, Q. Chen, C. Luo, S. Zeng, T. Li, and Y. Huang, "FPGA-based high-throughput CNN hardware accelerator with high computing resource utilization ratio," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 8, pp. 4069–4083, Aug. 2022.
[28] G. Lakshminarayanan and B. Venkataramani, "Optimization techniques for FPGA-based wave-pipelined DSP blocks," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 7, pp. 783–793, Jul. 2005.
[29] D. Wang, K. Xu, J. Guo, and S. Ghiasi, "DSP-efficient hardware acceleration of convolutional neural network inference on FPGAs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 12, pp. 4867–4880, Dec. 2020.
[30] A. Prihozhy, E. Bezati, A. A. A. Rahman, and M. Mattavelli, "Synthesis and optimization of pipelines for HW implementations of dataflow programs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 34, no. 10, pp. 1613–1626, Oct. 2015.
[31] W. Lou, L. Gong, C. Wang, Z. Du, and X. Zhou, "OctCNN: A high throughput FPGA accelerator for CNNs using octave convolution algorithm," IEEE Trans. Comput., vol. 71, no. 8, pp. 1847–1859, Aug. 2022.
[32] H. Kim and K. Choi, "Low power FPGA-SoC design techniques for CNN-based object detection accelerator," in Proc. IEEE 10th Annu. Ubiquitous Comput., Electron. Mobile Commun. Conf. (UEMCON), Oct. 2019, pp. 1130–1134.
[33] Y. Kim, H. Kim, N. Yadav, S. Li, and K. K. Choi, "Low-power RTL code generation for advanced CNN algorithms toward object detection in autonomous vehicles," Electronics, vol. 9, no. 3, p. 478, Mar. 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/3/478
[34] H. Kim and K. Choi, "The implementation of a power efficient BCNN-based object detection acceleration on a Xilinx FPGA-SoC," in Proc. Int. Conf. Internet Things (iThings), IEEE Green Comput. Commun. (GreenCom), IEEE Cyber, Phys. Social Comput. (CPSCom), IEEE Smart Data (SmartData), Jul. 2019, pp. 240–243.
[35] Y. Kim, Q. Tong, K. Choi, E. Lee, S. Jang, and B. Choi, "System level power reduction for YOLO2 sub-modules for object detection of future autonomous vehicles," in Proc. Int. SoC Design Conf. (ISOCC), Nov. 2018, pp. 151–155.
[36] PYNQ: Python Productivity. Accessed: Feb. 15, 2023. [Online]. Available: http://www.pynq.io/
[37] L. Li, K. Choi, S. Park, and M. Chung, "Selective clock gating by using wasting toggle rate," in Proc. IEEE Int. Conf. Electro/Information Technol., Jun. 2009, pp. 399–404.
[38] W. Wang, Y.-C. Tsao, K. Choi, S. Park, and M.-K. Chung, "Pipeline power reduction through single comparator-based clock gating," in Proc. Int. SoC Design Conf. (ISOCC), Nov. 2009, pp. 480–483.
[39] Y. Zhang, Q. Tong, L. Li, W. Wang, K. Choi, J. Jang, H. Jung, and S.-Y. Ahn, "Automatic register transfer level CAD tool design for advanced clock gating and low power schemes," in Proc. Int. SoC Design Conf. (ISOCC), Nov. 2012, pp. 21–24.
[40] Compile an ML Model. Accessed: Feb. 15, 2023. [Online]. Available: https://www.tensil.ai/docs/howto/compile/

VICTORIA HEEKYUNG KIM (Graduate Student Member, IEEE) received the B.S. degree in electronic and electrical engineering from Hongik University, Seoul, South Korea, in 2012, and the M.S. degree in electrical and computer engineering from the Illinois Institute of Technology, Chicago, in 2015. She is currently pursuing the Ph.D. degree in computer engineering with the Illinois Institute of Technology, Chicago. Her current research interests include low-power and high-performance HW/SW optimization for CNN accelerator design for FPGA and ASIC, wireless sensor networks design and sensor data analysis, and embedded HW/SW system design for robotics, drones, and the IoTs.

KYUWON KEN CHOI (Senior Member, IEEE) received the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, USA, in 2002. He is currently a Full Professor with the Department of Electrical and Computer Engineering, Illinois Institute of Technology. He was a Postdoctoral Researcher with the Prof. Takayasu Sakurai Laboratory, Institute of Industrial Science, The University of Tokyo, Japan, working on leakage-power-reduction circuit techniques. He has proposed and conducted several projects supported by the National Aeronautics and Space Administration (NASA), Defense Advanced Research Projects Agency (DARPA), U.S. National Science Foundation (NSF), Scientific Research Corporation (SRC), and Korea Electronics Technology Institute (KETI) regarding power-aware computing/communication (PACC). He was a Senior CAD Engineer and a Technical Consultant for low-power systems-on-chip (SoC) design with Samsung Semiconductor, Broadcom, and Sequence Design, prior to joining IIT. In the past, he had eight years of industry experience in the area of ultralow power VLSI chip design, from compiler level to circuit level. In the last few years, by using his novel low-power techniques, several processor and mobile chips were successfully fabricated in deep-submicrometer technology, and more than 120 peer-reviewed journal and conference papers have been published. His current research interests include ultra-low-power digital circuits and artificial intelligence (AI)-related IC designs. He is also the Director of the VLSI Design and Automation Laboratory (DA-Lab), IIT, the Editor-in-Chief of the Journal of Pervasive Technologies, a guest editor of Springer and Wiley journals, and a TPC member for several IEEE circuit design conferences.