
Received 10 April 2023, accepted 9 May 2023, date of publication 12 June 2023, date of current version 19 June 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3285279

A Reconfigurable CNN-Based Accelerator Design for Fast and Energy-Efficient Object Detection System on Mobile FPGA

VICTORIA HEEKYUNG KIM, (Graduate Student Member, IEEE), AND KYUWON KEN CHOI, (Senior Member, IEEE)
Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA
Corresponding author: Victoria Heekyung Kim ([email protected])
This work was supported by the Technology Innovation Program of the Ministry of Trade, Industry & Energy (MOTIE), South Korea,
through the Korea Electronics Technology Institute (KETI), South Korea (Software and Hardware Development of Cooperative
Autonomous Driving Control Platform for Commercial Special and Work-Assist Vehicles) under Grant 1415181272.

ABSTRACT In resource-limited edge computing environments such as mobile devices, IoT devices, and electric vehicles, energy-efficient optimized convolutional neural network (CNN) accelerators implemented on mobile Field Programmable Gate Arrays (FPGAs) are becoming more attractive due to their high accuracy and scalability. Mobile FPGAs such as the Xilinx PYNQ-Z1/Z2 and Ultra96 have a definite advantage in scalability and flexibility for implementing deep-learning-based object detection applications. They are also suitable for battery-powered systems, especially drones and electric vehicles, for achieving energy efficiency in terms of both power consumption and size. However, they have low and limited performance for real-time processing. In this article, we introduce an accelerator design flow optimized at the register-transfer level (RTL) that achieves fast processing speed by applying low-power techniques to the FPGA accelerator implementation. In general, most accelerator optimization techniques are applied at the system level on the FPGA. In this article, we propose a reconfigurable accelerator design for a CNN-based object detection system at the register-transfer level on a mobile FPGA. Furthermore, we present RTL optimization design techniques, such as various types of clock gating, to eliminate residual signals and to deactivate unnecessarily active blocks. Based on an analysis of the CNN-based object detection architecture, we analyze and classify the common computing components of the convolutional neural network, such as multipliers and adders. We implement the multiplier/adder unit as a universal computing unit and modularize it to suit a hierarchical RTL code structure. The proposed system design was tested with ResNet-20, which has 23 layers and was trained with the CIFAR-10 dataset, which provides a test set of 10,000 images in several formats; the weight data used for this experiment were provided by Tensil. Experimental results show that the proposed design process improves power consumption, hardware utilization, and throughput by 16%, up to 58%, and 15%, respectively.

INDEX TERMS FPGA accelerator, CNN accelerator, RT level design techniques, low power techniques,
reconfigurable accelerator, CNN-based object detection, low power consumption, high performance, mobile
FPGA.
I. INTRODUCTION
Convolutional Neural Network (CNN)-based object detection applications have been applied in various systems on Field Programmable Gate Array (FPGA) devices, from personal mobile devices to industrial machines such as healthcare devices, smart surveillance systems, Advanced Driver Assistance Systems (ADAS), drones, and logistics robots [1], [2], [3], [4], [5], [6]. To achieve high recognition accuracy, CNNs have become an essential feature of many diverse object detection devices, whether cloud-based or edge devices.

The associate editor coordinating the review of this manuscript and approving it for publication was Tao Zhou.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ VOLUME 11, 2023

V. H. Kim, K. K. Choi: Reconfigurable CNN-Based Accelerator Design

The primary implementation issue of CNN applications is that their computing complexity is above average, and a huge amount of power is consumed to achieve fast processing speed and high accuracy at the same time. High computing complexity also involves a large number of operation units and massive memory accesses. Dynamic power consumption occurs during the data transfer process and in the time-delay process of the computing operation. Real-time inference of CNN-based object detection seems impossible on mobile FPGA devices, which have limited hardware resources such as small memory and lower processor performance. In these power- and hardware-resource-limited circumstances, to improve performance and reduce power consumption, many researchers have proposed CNN accelerators at various design levels, including the system, application, architecture, and transistor levels [7], [8], [9]. Recent studies have proposed a flexible CNN accelerator design for FPGA implementation at the system level and a flexible FPGA accelerator for various CNN architectures, from lightweight to large-scale CNNs [11], [12], [13], [14], [15], [16].

Since CNN-based object detection applications have become a more common technology for unmanned drones, autonomous vehicles, in-vehicle ADAS, and industrial automation systems, researchers have been conducting CNN object detection research on the following topics: implementation on mobile FPGA-SoC boards for real-time processing, accelerator design for the mobile FPGA System-on-Chip (SoC), and hardware optimization techniques. To overcome the lack of hardware resources on mobile FPGAs such as the Xilinx Ultra96 and Xilinx PYNQ-Z1, popular FPGA-SoC devices used in drones and IoT devices, many papers have been published on achieving high performance, low power consumption, and real-time processing speed [17], [18], [19], [20], [21], [22], [23], [24]. The main focus of the implementation techniques proposed in those papers is reducing the size of the CNN architecture, pre-processing the input feature map, tightening the pipelining design, adjusting the sizes of the input and output feature maps, and code optimization [9], [25], [26], [27], [22], [28], [29], [30], [31]. Moreover, in our previous research, we verified that RT-level optimization is able not only to reduce processing time but also to save dynamic power [32], [33], [34].

Therefore, in this work, we applied low-power techniques to the baseline RTL code of the CNN accelerator generated by Tensil and applied the hardware-optimized techniques to the proposed reconfigurable FPGA hardware accelerator design through the proposed automated optimization tool for RTL code. The rest of this paper is organized as follows: Section II introduces the RT-level low-power techniques for energy-efficiently accelerating CNN computing operations and overviews the basic RT-level optimization hardware design flow based on a baseline CNN accelerator RTL code generated by Tensil. Section III describes the architecture of the proposed accelerator and the design details of the data flow and processing modules in two parts, optimization & modularization and low-power techniques. Section V discusses the implementation and simulation results in comparison with previous works. Finally, the conclusions are given in Section VI.

FIGURE 1. Vivado HLS design flow.

II. BACKGROUND
To design a CNN accelerator on an FPGA-SoC board, the use of CAD tools and platforms is required. Each manufacturer provides CAD tools and development platforms for the implementation process and for the reconfigurable components and parts on the FPGA (e.g., Vivado from Xilinx, Quartus Prime from Intel, and PYNQ). However, due to the closed-platform nature of Xilinx FPGA products, in the High-Level Synthesis (HLS) design flow shown in Figure 1, the Vivado HLS system can verify the functionality of the C/C++/SystemC code and convert the code to register-transfer level (RTL) code for FPGA hardware operation and optimization [7], [23], [35]; however, once the RTL code is generated by the Vivado HLS tool, the code is no longer readable or modifiable. On the other hand, the platform-based design flow can import VHDL/Verilog code as customized IP blocks, so that we can easily modify the hardware design at the RT level or gate level and intuitively configure the data flow for the Processing System (PS) and Programmable Logic (PL) through the Vivado IP Integrator.

A. PLATFORM-BASED DESIGN FLOW WITH RTL CODE
The platform-based design flow was introduced by Xilinx Vivado, an integrated design environment tool, as shown in Figure 2. The RTL code can be imported into an IP block, and it can be assembled with other peripheral IP blocks and PS IP blocks to generate the hardware design for the bitstream. Jupyter Notebook is the web-based primary computing environment of PYNQ, which is linked to Xilinx platforms [36]. PYNQ runs based on Python on the Jupyter Notebook with a Linux kernel on the FPGA.

FIGURE 2. FPGA SoC platform design architecture.
FIGURE 3. Conventional register.
FIGURE 4. Local explicit clock enable (LECE).
FIGURE 5. Local explicit clock gating (LECG).
FIGURE 6. Bus-specific clock gating (BSCG).

However, the Python library is not fully supported in PYNQ.
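To make the PS/PL interaction of the platform-based flow concrete, the following is a minimal behavioral sketch of how host-side software typically programs an accelerator IP block through memory-mapped registers. The `MockMMIO` class and every register offset here are hypothetical illustrations for this sketch, not the PYNQ or Vivado driver API.

```python
# Behavioral sketch of PS software driving a PL IP block via memory-mapped
# registers. MockMMIO and the register map below are illustrative assumptions.

class MockMMIO:
    """Stands in for a memory-mapped register window into the PL."""
    def __init__(self):
        self.regs = {}

    def write(self, offset, value):
        self.regs[offset] = value

    def read(self, offset):
        return self.regs.get(offset, 0)

# Hypothetical register map for an accelerator IP block.
CTRL, SRC_ADDR, DST_ADDR, STATUS = 0x00, 0x10, 0x18, 0x20

def launch_inference(mmio, src, dst):
    """Program DMA source/destination, then start the accelerator."""
    mmio.write(SRC_ADDR, src)
    mmio.write(DST_ADDR, dst)
    mmio.write(CTRL, 1)      # set the start bit
    mmio.write(STATUS, 1)    # a real device would raise 'done' itself
    return mmio.read(STATUS) == 1

mmio = MockMMIO()
assert launch_inference(mmio, 0x1000_0000, 0x2000_0000)
```

On a real PYNQ system the same write-control/poll-status pattern is expressed through the overlay's MMIO interface; only the sequencing, not the mock class, is the point here.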
B. BASELINE RTL CODE GENERATION FOR CNN ACCELERATOR: TENSIL
Tensil is a set of tools for designing accelerators, including an RTL generator, a model compiler, and a set of drivers [24]. The basic processing flow is that, using a selected machine learning accelerator architecture for the limited FPGA-SoC device, it generates the RTL code through the model compiler. The primary advantage of Tensil is that it is able to create an accelerator without quantization or other degradation. Tensil applies only a few optimization techniques for the selected FPGAs, so its optimization performance is not effective enough. We previously applied our low-power techniques to CNN accelerator RTL code and verified the performance [33]. In Section III, we apply the techniques to the Tensil RTL code and evaluate their effectiveness on the FPGA-SoC board, the PYNQ-Z1.

C. LOW POWER TECHNIQUES AT RT-LEVEL
1) LOW POWER CLOCK TECHNIQUES
Clock Gating (CG) is a basic low-power technique that enhances performance and efficiency by disabling unnecessary clock cycles, as shown in Figure 3. Standby states occur in many parts of the CNN computing process, which leads to a significant amount of power consumption; CG eliminates these unnecessary clock cycles. Local Explicit Clock Enable (LECE) [37], [38], [39] is a method that uses an ENABLE signal on a 2:1 multiplexer or multiplexed D flip-flop to update the output on the rising edge of the clock only when the ENABLE signal is high, as shown in Figure 4. The more bits are used as input, the more ENABLE signals occur. Local Explicit Clock Gating (LECG) [37], [38], [39] is fundamentally equivalent to LECE, as shown in Figure 5; however, LECG has the advantage of reducing power consumption for multi-bit outputs by updating the entire output at once when the update completes.
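As a behavioral illustration of the enable-gated update that LECE/LECG implement in hardware, the following Python sketch (an assumption-level model, not RTL) counts output bit toggles as a proxy for dynamic power: when ENABLE is low, a clock edge causes no update and therefore no toggles.

```python
# Behavioral model of an enable-gated register in the spirit of LECE/LECG.
# Toggle count stands in for dynamic power; it is an illustrative proxy only.

class GatedRegister:
    def __init__(self, width=16):
        self.q = 0                # stored output value
        self.width = width
        self.toggles = 0          # number of output bits that flipped

    def clock(self, d, enable):
        """One rising clock edge; the output updates only when enabled."""
        if enable:
            mask = (1 << self.width) - 1
            self.toggles += bin((self.q ^ d) & mask).count("1")
            self.q = d

reg = GatedRegister()
# Data held stable for two idle cycles: the gated register spends no toggles.
for d, en in [(0xAAAA, 1), (0xAAAA, 0), (0xAAAA, 0), (0x5555, 1)]:
    reg.clock(d, en)
# Only the two enabled edges contributed toggles (8 + 16 = 24 bit flips).
```

An ungated register would pay clock-tree and flip-flop switching on all four edges; the gated one pays only on the two enabled edges, which is the saving LECE/LECG target.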


FIGURE 7. Enhanced clock gating (ECG).
FIGURE 8. Proposed processing system and programmable logic unit.
FIGURE 9. IP block design for CNN object detection.

Bus-Specific Clock Gating (BSCG) [37], [38], [39] utilizes the clock gating technique and adjusts the EN signal based on a comparison of the I/O signals, as shown in Figure 6. In terms of power consumption, XOR gates consume significantly less power than AND/OR gates in gate-level power analysis. Enhanced Clock Gating (ECG) [37], [39] consists of XOR gates that control the input clock signals and enable signals for multi-bit I/O data, as shown in Figure 7. The efficiency of the power reduction is maximized for larger pipelines and I/O bit widths.
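The BSCG idea of deriving the enable from an XOR comparison of the bus input against the stored output can be sketched behaviorally as follows. This is a Python model for illustration only; in hardware the comparison and gating are the XOR/AND-latch network of Figure 6.

```python
# Behavioral model of bus-specific clock gating (BSCG): the enable is not
# supplied externally but derived by XOR-comparing the incoming bus value
# with the stored one, so the clock reaches the register only on real changes.

class BSCGRegister:
    def __init__(self):
        self.q = 0
        self.clock_pulses = 0     # clock edges that actually reach the register

    def clock(self, d):
        if d ^ self.q:            # XOR-based comparison generates EN
            self.clock_pulses += 1
            self.q = d

reg = BSCGRegister()
for d in [7, 7, 7, 9, 9, 3]:      # six bus cycles, only three value changes
    reg.clock(d)
# Only 3 of the 6 clock edges reach the register; the rest are gated off.
```

The saving grows with the fraction of cycles in which the bus repeats its previous value, which is common for weight and feature-map buses in convolution.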
III. ARCHITECTURE DESIGN OF ACCELERATOR
A. ARCHITECTURE OVERVIEW
The block diagram in Figure 8 shows the data flow of the proposed processing unit design for an FPGA-based CNN object detection accelerator. For the programmable logic, each type of block is specifically defined and modularized to enhance the implementation efficiency of various CNN models. The proposed architecture can be mainly divided into computing processing logic and a memory system, detailed as follows. In the memory system, there are three main functional components for on-chip and off-chip data transfer to prepare data for computation. First, the buffers are responsible for storing data. All the weights and intermediate feature maps are arranged in a layer-by-layer format and stored in external DRAM. When a tile of data is loaded into the on-chip input/weight/output ping-pong buffers, it is arranged in a unique format according to the requirements of the computation mode. Second, a dispatching module employs a Direct Memory Access (DMA) engine, through DMA descriptors generated by the DMA control module, to fetch required data from DRAM or save the results back to DRAM. Third, the on-chip data scheduling modules, consisting of scatter and gather modules, realize the serial-to-parallel or parallel-to-serial conversions that manipulate the data flow for the following computation or transmission.

B. THE PROPOSED RECONFIGURABLE ACCELERATOR HARDWARE ARCHITECTURE
As shown in Figure 9, the IP block design for the CNN object detection accelerator consists of referenced IP blocks and a customized IP block (top_pynqz1_0). In the top_pynqz1_0 block, there are hierarchically defined multiply-accumulate units (MACs), POOLs, memory bandwidth and memory access schedulers, and CONV computing modules. The original Tensil RTL code does not have a hierarchical architecture; in that case, however, analysis of the RTL code would take a long time.

The primary feature of FPGA devices is reconfigurability. Therefore, to maximize the flexibility of the FPGA-SoC design, the proposed RTL code of the CNN accelerator was designed with hierarchical and modularized main modules, including MACs, Conv, Multiplier, Adder, MUX, and ALU, as shown in Figure 10. This figure shows that the proposed flexible accelerator design has the scalability to support different CNN architectures such as the YOLO series and ResNet20. After the modularization of the MAC unit, we applied our low-power techniques such as clock gating, XOR gates, and OR gates for MUXs. This design is able to accommodate add-on detectors, such as Single-Shot Detectors (SSD) and Multibox detectors. For the memory access modules such as InnerDualPortMem1, DualPortMem1, MemSplitter, and MemBoundarySplitter, memory partitioning techniques are applied. To accelerate the CPU computing operation, a memory reassignment technique has been applied so that the memory size and flow are changed once a pre-assigned computation is detected. For example, if our target CNN accelerator architecture uses fixed 16-bit data, then we can pre-assign the memory size for the input data or the weights. This is helpful for sequential computation operations such as the convolution operation.
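The ping-pong buffering described in the architecture overview can be sketched behaviorally: while the compute logic consumes the tile held in one buffer, the DMA fills the other, and the two swap roles every tile. The tile size and the stand-in "compute" reduction below are illustrative assumptions, not the accelerator's actual CONV kernel.

```python
# Behavioral sketch of ping-pong (double) buffering: one on-chip buffer is
# filled by the DMA while the other is consumed by the compute unit, and the
# two buffers swap roles on every tile.

def process_tiles(dram, tile_size):
    buffers = [[], []]            # the two on-chip buffers
    fill, work = 0, 1             # indices: buffer being filled vs. computed
    results = []
    tiles = [dram[i:i + tile_size] for i in range(0, len(dram), tile_size)]
    buffers[fill] = tiles[0]      # prime the first buffer
    for nxt in tiles[1:] + [None]:
        fill, work = work, fill   # swap: the filled buffer becomes the work buffer
        if nxt is not None:
            buffers[fill] = nxt   # DMA load overlaps the compute below
        results.append(sum(buffers[work]))   # stand-in for the CONV/MAC work
    return results

assert process_tiles(list(range(8)), 4) == [6, 22]
```

In hardware the load and the compute in each iteration proceed concurrently, which is what hides the DRAM transfer latency behind the convolution.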


FIGURE 10. Flexible accelerator design overview.

IV. PROPOSED HARDWARE IMPLEMENTATION
A. LOW POWER HW DESIGN TECHNIQUES AT RTL
As shown in Figure 9, in this experiment, our optimization for low power targets Tensil's processing flow. Step 1: Based on the neural architecture file, .tarch, Tensil helps generate the TCU RTL code for the basic hardware resource design, as shown in Figure 11. Step 2: After the RTL code is generated, we apply our low-power techniques, including LECG, split memory, BSC, and ECG. Step 3: Using Vivado, we design the hardware IP block; from the IP block design, we obtain the bitstream file. Step 4: Based on the customized bitstream, we implement the hardware accelerator for the CNN object detection algorithm. Step 5: We simulate it on the FPGA board and evaluate the power consumption of the target DNN-based object detection processing using Vivado.

FIGURE 11. Low power design flow.
FIGURE 12. Conventional MAC unit design.
FIGURE 13. Proposed MAC unit design.

The optimized multiplier design, using a power-efficient adder block based on power analysis, was implemented at the RT level, as shown in Figure 13. In convolution computation, the computational complexity of the multipliers can cause dynamic power consumption and delays. To reduce the complexity of the adder and multiplier, we first tested the full adder designs through a transistor-level design process. For the MAC module, a Bus-Specific Clock (BSC) is applied. In a conventional register, the data input is active and lasts until the end of the period; in this case, power can be wasted. When BSC is applied to register Z, the XOR can control the clock enable so that clock toggles are not wasted. Compared to the conventional design in Figure 12, an AND gate and a latch are added to safely disable the clock without allowing any glitches to reach the register clock.

B. PROPOSED MAC HARDWARE DESIGN
The detailed technique approaches are as follows. 1) The MAC unit is the major power-consuming unit of the convolution operation, in which data transmissions occur frequently; the technique is applied to remove the wasted clock toggles while the data input is deactivated. 2) The proposed adder group with BSC reduces the wasted clock toggles, and therefore the power consumption of the adder unit. 3) In stochastic multiplication, two unary bit-streams can be multiplied using an AND gate, and an OR gate can be applied instead of the MUX operation. Not only can the OR gate support parallel MAC operation in the same way as a MUX, it also results in reduced dynamic power consumption. Eventually, we utilized this parallel MAC structure using OR gates, as shown in Figure 13. Figure 14 shows the proposed MAC pseudocode, to which the BSC and OR-based MAC computing operations have been applied.

C. FLEXIBLE ACCELERATOR DESIGN FOR MULTI-ARCHITECTURE AND OPTIMIZATION TECHNIQUES
Based on the analysis result of the target CNN architecture, we customize the pipelining of the data flow and assign maximized buffer capacity in the BRAM and external memory. Fixed-point numbers are able to reduce the computation resource consumption and are also able to reduce the

bandwidth requirements; however, to obtain high performance, the optimized bandwidth size should be defined by analysis of the network architecture. Once the data transmission size is fixed, memory splitting and merging should be applied. Our CNN accelerator is based on the 16-bit fixed-point bandwidth given by the reference [24]. We modularize the RTL code based on thorough analysis, which allows easy modification when implementing the accelerator design.

FIGURE 14. Pseudocode for proposed MAC operation.
FIGURE 15. FPGA testbed (Xilinx PYNQ-Z1 FPGA).
FIGURE 16. HW resource report comparison of Tensil sample simulation and our work tested on PYNQ-Z1.

V. EXPERIMENT AND RESULTS WITH DISCUSSION
A. EXPERIMENT ENVIRONMENT
For the basic hardware platform, we chose the PYNQ-Z1 board instead of the regular ZYNQ-7020 board; PYNQ is an open-source project from AMD [36]. The board embeds the Xilinx ZYNQ-7020 and also provides a Jupyter-based framework with Python APIs. The PYNQ-Z1 board has an FPGA-SoC platform composed of PL and PS. The basic software development tool is Jupyter Notebook, a web-based software programming platform. It also supports the Python and C/C++ programming languages and other open-source libraries such as OpenCV. Our experiment environment is shown in Figure 15. The imported CNN architecture is ResNet-20; it has 23 layers and was trained with CIFAR-10, which provides a test set of 10,000 images in several formats. We used the provided weight file and converted the ResNet model to the ONNX format [40]. ONNX, a machine learning (ML) model converter, provides the converted ML model code in ONNX format. The Tensil compiler generates three import artifacts: .tmodel, .tdata, and .tprog files. Once the .tmodel manifest for the model is loaded into the driver, it tells the driver where to locate the binary files, program data, and weights data. These were not open data, and we used them without any modification, which means the accuracy was not changed.

B. FPGA IMPLEMENTATION RESULTS
Compared with Tensil's optimization result, we verified that more register buffers are activated in our proposed structure. Once we check the functionality and performance result, we can modify the structure through RTL code modification; then we can improve the specific hardware resources and power consumption of the design. Analyzing the result leads to improved performance of CNN processing. Figure 16 shows the power consumption reduction of the processing system unit. We were able to achieve 43.9 GOPs/W as a power efficiency result; compared to other FPGA board implementations, this is a 1.37-times increase. The hardware resource utilization in DSPs is increased 2.2 times over the result of [24].

C. POWER CONSUMPTION RESULTS
Our optimization decreases dynamic power consumption by 16%, and the total on-chip power is reduced by 20% of the total power consumption. Once the global buffer is activated, the unused global clock buffer and the second global clock resource will help to improve the performance of


the design. Moreover, this can be a solution for some high fan-out signals, making the device fully functional. In the pipeline logic, inserting an intermediate flip-flop (FF) can improve the working speed of the device; however, too many flip-flops add computational complexity. Our low-power techniques show better performance than FF insertion.

TABLE 1. Comparison result of FPGA implementation.

VI. CONCLUSION
In this article, the proposed highly reconfigurable FPGA hardware accelerator showed improved performance in terms of processing speed and power consumption during inference of various CNNs. The hardware optimization was conducted mainly for two purposes: to improve throughput and to reduce power consumption. To improve performance, a minimized data-transfer strategy was applied by assigning the maximum amount of buffers during the computations and by applying a controlled pipeline design for minimized data access. To achieve energy-efficient CNN object detection operation, we proposed not only data access control for minimized memory access, but also RT-level low-power reconfigured MAC units, such as an advanced clock-gating adder, register Z with a bus-specific clock, and an OR-based MAC architecture, applied to the RTL code of the proposed accelerator. The proposed hardware accelerator for ResNet-20 was implemented on a mobile FPGA-SoC, the PYNQ-Z1, and the power consumption was measured during the inference operation. As a result, the throughput showed a 15% improvement compared with the baseline RTL code of the accelerator, power consumption was reduced by 16%, and hardware utilization was increased by 58%. The object detection processing speed was 9.17 FPS, which shows that real-time processing is feasible on a mobile FPGA.

ACKNOWLEDGMENT
The authors would like to thank their colleagues from KETI and KEIT who provided insight and expertise that greatly assisted the research and greatly improved the manuscript.

REFERENCES
[1] A. K. Jameil and H. Al-Raweshidy, "Efficient CNN architecture on FPGA using high level module for healthcare devices," IEEE Access, vol. 10, pp. 60486-60495, 2022.
[2] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B. Choi, and T. Faughnan, "Smart surveillance as an edge network service: From Harr-Cascade, SVM to a lightweight CNN," in Proc. IEEE 4th Int. Conf. Collaboration Internet Comput. (CIC), Oct. 2018, pp. 256-265.
[3] K. Haeublein, W. Brueckner, S. Vaas, S. Rachuj, M. Reichenbach, and D. Fey, "Utilizing PYNQ for accelerating image processing functions in ADAS applications," in Proc. 32nd Int. Conf. Archit. Comput. Syst., May 2019, pp. 1-8.
[4] Z. Zhang, M. A. P. Mahmud, and A. Z. Kouzani, "FitNN: A low-resource FPGA-based CNN accelerator for drones," IEEE Internet Things J., vol. 9, no. 21, pp. 21357-21369, Nov. 2022.
[5] C. Fu and Y. Yu, "FPGA-based power efficient face detection for mobile robots," in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2019, pp. 467-473.
[6] X. Li, X. Gong, D. Wang, J. Zhang, T. Baker, J. Zhou, and T. Lu, "ABM-SpConv-SIMD: Accelerating convolutional neural network inference for industrial IoT applications on edge devices," IEEE Trans. Netw. Sci. Eng., early access, Feb. 25, 2022, doi: 10.1109/TNSE.2022.3154412.
[7] S. Tamimi, Z. Ebrahimi, B. Khaleghi, and H. Asadi, "An efficient SRAM-based reconfigurable architecture for embedded processors," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 3, pp. 466-479, Mar. 2019.
[8] A. J. A. El-Maksoud, M. Ebbed, A. H. Khalil, and H. Mostafa, "Power efficient design of high-performance convolutional neural networks hardware accelerator on FPGA: A case study with GoogLeNet," IEEE Access, vol. 9, pp. 151897-151911, 2021.
[9] S. Lee, D. Kim, D. Nguyen, and J. Lee, "Double MAC on a DSP: Boosting the performance of convolutional neural networks on FPGAs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 5, pp. 888-897, May 2019.
[10] S. Ullah, S. Rehman, M. Shafique, and A. Kumar, "High-performance accurate and approximate multipliers for FPGA-based hardware accelerators," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 2, pp. 211-224, Feb. 2022.
[11] X. Wu, Y. Ma, M. Wang, and Z. Wang, "A flexible and efficient FPGA accelerator for various large-scale and lightweight CNNs," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 3, pp. 1185-1198, Mar. 2022.
[12] W. Liu, J. Lin, and Z. Wang, "A precision-scalable energy-efficient convolutional neural network accelerator," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 10, pp. 3484-3497, Oct. 2020.


[13] H. Irmak, D. Ziener, and N. Alachiotis, ‘‘Increasing flexibility of FPGA- [33] Y. Kim, H. Kim, N. Yadav, S. Li, and K. K. Choi, ‘‘Low-power RTL
based CNN accelerators with dynamic partial reconfiguration,’’ in Proc. code generation for advanced CNN algorithms toward object detection
31st Int. Conf. Field-Programmable Log. Appl. (FPL), Aug. 2021, in autonomous vehicles,’’ Electronics, vol. 9, no. 3, p. 478, Mar. 2020.
pp. 306–311. [Online]. Available: https://www.mdpi.com/2079-9292/9/3/478
[14] W. Chen, D. Wang, H. Chen, S. Wei, A. He, and Z. Wang, ‘‘An asyn- [34] H. Kim and K. Choi, ‘‘The implementation of a power efficient BCNN-
chronous and reconfigurable CNN accelerator,’’ in Proc. IEEE Int. Conf. based object detection acceleration on a Xilinx FPGA-SoC,’’ in Proc. Int.
Electron Devices Solid State Circuits (EDSSC), Jun. 2018, pp. 1–2. Conf. Internet Things (iThings), IEEE Green Comput. Commun. (Green-
[15] C. Yang, Y. Wang, H. Zhang, X. Wang, and L. Geng, ‘‘A reconfigurable Com), IEEE Cyber, Phys. Social Comput. (CPSCom), IEEE Smart Data
CNN accelerator using tile-by-tile computing and dynamic adaptive data (SmartData), Jul. 2019, pp. 240–243.
truncation,’’ in Proc. IEEE Int. Conf. Integr. Circuits, Technol. Appl. [35] Y. Kim, Q. Tong, K. Choi, E. Lee, S. Jang, and B. Choi, ‘‘System
(ICTA), Nov. 2019, pp. 73–74. level power reduction for YOLO2 sub-modules for object detection of
[16] S. Zeng, K. Guo, S. Fang, J. Kang, D. Xie, Y. Shan, Y. Wang, and H. Yang, future autonomous vehicles,’’ in Proc. Int. SoC Design Conf. (ISOCC),
‘‘An efficient reconfigurable framework for general purpose CNN-RNN Nov. 2018, pp. 151–155.
models on FPGAs,’’ in Proc. IEEE 23rd Int. Conf. Digit. Signal Process. [36] PYNQ: Python Productivity. Accessed: Feb. 15, 2023. [Online]. Available:
(DSP), Nov. 2018, pp. 1–5. http://www.pynq.io/
[17] L. Gong, C. Wang, X. Li, H. Chen, and X. Zhou, ‘‘MALOC: A fully [37] L. Li, K. Choi, S. Park, and M. Chung, ‘‘Selective clock gating by using
pipelined FPGA accelerator for convolutional neural networks with all lay- wasting toggle rate,’’ in Proc. IEEE Int. Conf. Electro/Information Tech-
ers mapped on chip,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits nol., Jun. 2009, pp. 399–404.
Syst., vol. 37, no. 11, pp. 2601–2612, Nov. 2018. [38] W. Wang, Y.-C. Tsao, K. Choi, S. Park, and M.-K. Chung, ‘‘Pipeline power
[18] L. Bai, Y. Zhao, and X. Huang, ‘‘A CNN accelerator on FPGA using reduction through single comparator-based clock gating,’’ in Proc. Int. SoC
depthwise separable convolution,’’ IEEE Trans. Circuits Syst. II, Exp. Design Conf. (ISOCC), Nov. 2009, pp. 480–483.
Briefs, vol. 65, no. 10, pp. 1415–1419, Oct. 2018. [39] Y. Zhang, Q. Tong, L. Li, W. Wang, K. Choi, J. Jang, H. Jung,
[19] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, and S.-Y. Ahn, ‘‘Automatic register transfer level CAD tool design for
N. Xu, S. Song, Y. Wang, and H. Yang, ‘‘Going deeper with embedded advanced clock gating and low power schemes,’’ in Proc. Int. SoC Design
FPGA platform for convolutional neural network,’’ in Proc. ACM/SIGDA Conf. (ISOCC), Nov. 2012, pp. 21–24.
Int. Symp. Field-Programmable Gate Arrays. New York, NY, USA: [40] Compile an ML Model. Accessed: Feb. 15, 2023. [Online]. Available:
Association for Computing Machinery, Feb. 2016, pp. 26–35, doi: https://www.tensil.ai/docs/howto/compile/
10.1145/2847263.2847265.
VICTORIA HEEKYUNG KIM (Graduate Student Member, IEEE) received the B.S. degree in electronic and electrical engineering from Hongik University, Seoul, South Korea, in 2012, and the M.S. degree in electrical and computer engineering from the Illinois Institute of Technology, Chicago, in 2015. She is currently pursuing the Ph.D. degree in computer engineering with the Illinois Institute of Technology, Chicago. Her current research interests include low-power and high-performance HW/SW optimization for CNN accelerator design for FPGA and ASIC, wireless sensor network design and sensor data analysis, and embedded HW/SW system design for robotics, drones, and the IoT.
KYUWON KEN CHOI (Senior Member, IEEE) received the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, USA, in 2002. He is currently a Full Professor with the Department of Electrical and Computer Engineering, Illinois Institute of Technology. He was a Postdoctoral Researcher with the Prof. Takayasu Sakurai Laboratory, Institute of Industrial Science, The University of Tokyo, Japan, working on leakage-power-reduction circuit techniques. He has proposed and conducted several projects supported by the National Aeronautics and Space Administration (NASA), the Defense Advanced Research Projects Agency (DARPA), the U.S. National Science Foundation (NSF), the Scientific Research Corporation (SRC), and the Korea Electronics Technology Institute (KETI) regarding power-aware computing/communication (PACC). Prior to joining IIT, he was a Senior CAD Engineer and a Technical Consultant for low-power system-on-chip (SoC) design with Samsung Semiconductor, Broadcom, and Sequence Design, with eight years of industry experience in ultra-low-power VLSI chip design, from the compiler level to the circuit level. In recent years, using his novel low-power techniques, several processor and mobile chips have been successfully fabricated in deep-submicrometer technology, and he has published more than 120 peer-reviewed journal and conference papers. His current research interests include ultra-low-power digital circuits and artificial intelligence (AI)-related IC designs. He is also the Director of the VLSI Design and Automation Laboratory (DA-Lab), IIT, the Editor-in-Chief of the Journal of Pervasive Technologies, a guest editor of Springer and Wiley journals, and a TPC member for several IEEE circuit design conferences.
VOLUME 11, 2023 59445