
Received 10 April 2023, accepted 9 May 2023, date of publication 12 June 2023, date of current version 19 June 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3285279

A Reconfigurable CNN-Based Accelerator Design for Fast and Energy-Efficient Object Detection System on Mobile FPGA

VICTORIA HEEKYUNG KIM, (Graduate Student Member, IEEE), AND KYUWON KEN CHOI, (Senior Member, IEEE)
Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA
Corresponding author: Victoria Heekyung Kim ([email protected])
This work was supported by the Technology Innovation Program of the Ministry of Trade, Industry & Energy (MOTIE), South Korea,
through the Korea Electronics Technology Institute (KETI), South Korea (Software and Hardware Development of Cooperative
Autonomous Driving Control Platform for Commercial Special and Work-Assist Vehicles) under Grant 1415181272.

ABSTRACT In resource-limited edge computing environments such as mobile devices, IoT devices, and electric vehicles, energy-efficient optimized convolutional neural network (CNN) accelerators implemented on mobile Field Programmable Gate Arrays (FPGAs) are becoming more attractive due to their high accuracy and scalability. Mobile FPGAs such as the Xilinx PYNQ-Z1/Z2 and Ultra96 have a definite advantage in scalability and flexibility for implementing deep-learning-based object detection applications. They are also suitable for battery-powered systems, especially drones and electric vehicles, for achieving energy efficiency in terms of both power consumption and size. However, they have low and limited performance for real-time processing. In this article, we introduce an accelerator design flow optimized at the register-transfer level (RTL) that achieves fast processing speed by applying low-power techniques to the FPGA accelerator implementation. In general, most accelerator optimization techniques are applied at the system level on the FPGA. In this article, we propose a reconfigurable accelerator design for a CNN-based object detection system at the register-transfer level on a mobile FPGA. Furthermore, we present RTL optimization design techniques, such as various types of clock gating, to eliminate residual signals and to deactivate unnecessarily active blocks. Based on an analysis of the CNN-based object detection architecture, we analyze and classify the common computing components of the convolutional neural network, such as multipliers and adders. We implement the multiplier/adder unit as a universal computing unit and modularize it to suit a hierarchical RTL code structure. The proposed system design was tested with ResNet-20, which has 23 layers and was trained with the CIFAR-10 dataset, which provides a test set of 10,000 images in several formats; the weight data used for this experiment were provided by Tensil. Experimental results show that the proposed design process improves power consumption, hardware utilization, and throughput by 16%, up to 58%, and 15%, respectively.

INDEX TERMS FPGA accelerator, CNN accelerator, RT level design techniques, low power techniques,
reconfigurable accelerator, CNN-based object detection, low power consumption, high performance, mobile
FPGA.
I. INTRODUCTION
Convolutional Neural Network (CNN)-based object detection applications have been applied in various systems on Field Programmable Gate Array (FPGA) devices, from personal mobile devices to industrial machines such as healthcare devices, smart surveillance systems, Advanced Driver Assistance Systems (ADAS), drones, and logistics robots [1], [2], [3], [4], [5], [6]. To achieve high recognition accuracy, CNNs have become an essential feature of many diverse object detection devices, whether cloud-based or edge devices.

The associate editor coordinating the review of this manuscript and approving it for publication was Tao Zhou.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ VOLUME 11, 2023

V. H. Kim, K. K. Choi: Reconfigurable CNN-Based Accelerator Design

The primary implementation issue of CNN applications is that their computing complexity is above average, and a huge amount of power is consumed to achieve fast processing speed and high accuracy at the same time. High computing complexity also involves a large number of operation units and massive memory accesses. Dynamic power consumption occurs during the data transfer process and in the time-delay process of the computing operation. Real-time inference of CNN-based object detection seems impossible on mobile FPGA devices, which have limited hardware resources such as small memory and lower processor performance. In these power- and hardware-resource-limited circumstances, to improve performance and reduce power consumption, many researchers have proposed CNN accelerators at various design levels, including the system, application, architecture, and transistor levels [7], [8], [9]. Recent studies have proposed a flexible CNN accelerator design for FPGA implementation at the system level and a flexible FPGA accelerator for various CNN architectures, from lightweight to large-scale CNNs [11], [12], [13], [14], [15], [16].

Since CNN-based object detection applications have become a more common technology for unmanned drones, autonomous vehicles, in-vehicle ADAS, and industrial automation systems, researchers have been conducting CNN object detection research on the following topics: implementation on mobile FPGA-SoC boards for real-time processing, accelerator design for the mobile FPGA System-on-Chip (SoC), and hardware optimization techniques. To overcome the lack of hardware resources on mobile FPGAs such as the Xilinx Ultra96 and Xilinx PYNQ-Z1, popular FPGA-SoC devices used in drones and IoT devices, many papers have been published on achieving high performance, low power consumption, and real-time processing speed [17], [18], [19], [20], [21], [22], [23], [24]. The main focus of the implementation techniques proposed in those papers is reducing the size of the CNN architecture, pre-processing the input feature map, tightening the pipelining design, adjusting the sizes of the input and output feature maps, and code optimization [9], [25], [26], [27], [22], [28], [29], [30], [31]. Moreover, in our previous research, we verified that RT-level optimization is able not only to reduce processing time but also to save dynamic power [32], [33], [34].

Therefore, in this work, we applied low-power techniques to the baseline RTL code of the CNN accelerator generated by Tensil and applied the hardware-optimized techniques to the proposed reconfigurable FPGA hardware accelerator design through the proposed automated optimization tool for RTL code. The rest of this paper is organized as follows: Section II introduces the RT-level low-power techniques for energy-efficiently accelerating CNN computing operations and overviews the basic RT-level optimization hardware design flow based on a baseline CNN accelerator RTL code generated by Tensil. Section III describes the architecture of the proposed accelerator and the design details of the data flow and processing modules in two parts, optimization & modularization and low-power techniques. Section V discusses the implementation and simulation results in comparison with previous works. Finally, the conclusions are given in Section VI.

FIGURE 1. Vivado HLS design flow.

II. BACKGROUND
To design a CNN accelerator on an FPGA-SoC board, the use of CAD tools and platforms is required. Each manufacturer provides CAD tools and development platforms for the implementation process and for the reconfigurable components and parts on the FPGA (e.g., Vivado from Xilinx, Quartus Prime from Intel, and PYNQ). However, due to the closed-platform nature of Xilinx FPGA products, in the High-Level Synthesis (HLS) design flow shown in Figure 1, the Vivado HLS system can verify the functionality of the C/C++/SystemC code and convert the code to register-transfer level (RTL) code for FPGA hardware operation and optimization [7], [23], [35]; however, once the RTL code is generated by the Vivado HLS tool, the code is no longer readable or modifiable. On the other hand, the platform-based design flow can import VHDL/Verilog code as customized IP blocks, so that we can easily modify the hardware design at the RT level or gate level and intuitively configure the data flow for the Processing System (PS) and Programmable Logic (PL) through the Vivado IP Integrator.

A. PLATFORM-BASED DESIGN FLOW WITH RTL CODE
The platform-based design flow was introduced by Xilinx Vivado, an integrated design environment tool, as shown in Figure 2. The RTL code can be imported into an IP block, and it can be assembled with other peripheral IP blocks and PS IP blocks to generate the hardware design for the bitstream. Jupyter Notebook is the web-based primary computing environment of PYNQ, which is linked to Xilinx platforms [36]. PYNQ runs based on Python on the Jupyter Notebook with a Linux kernel on the FPGA.

FIGURE 2. FPGA SoC platform design architecture.
FIGURE 3. Conventional register.
FIGURE 4. Local explicit clock enable (LECE).
FIGURE 5. Local explicit clock gating (LECG).
FIGURE 6. Bus-specific clock gating (BSCG).

However, the Python library is not fully supported in PYNQ.
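To make the PS/PL interaction of the platform-based flow concrete, the following is a minimal behavioral sketch of how host-side software typically programs an accelerator IP block through memory-mapped registers. The `MockMMIO` class and every register offset here are hypothetical illustrations for this sketch, not the PYNQ or Vivado driver API.

```python
# Behavioral sketch of PS software driving a PL IP block via memory-mapped
# registers. MockMMIO and the register map below are illustrative assumptions.

class MockMMIO:
    """Stands in for a memory-mapped register window into the PL."""
    def __init__(self):
        self.regs = {}

    def write(self, offset, value):
        self.regs[offset] = value

    def read(self, offset):
        return self.regs.get(offset, 0)

# Hypothetical register map for an accelerator IP block.
CTRL, SRC_ADDR, DST_ADDR, STATUS = 0x00, 0x10, 0x18, 0x20

def launch_inference(mmio, src, dst):
    """Program DMA source/destination, then start the accelerator."""
    mmio.write(SRC_ADDR, src)
    mmio.write(DST_ADDR, dst)
    mmio.write(CTRL, 1)      # set the start bit
    mmio.write(STATUS, 1)    # a real device would raise 'done' itself
    return mmio.read(STATUS) == 1

mmio = MockMMIO()
assert launch_inference(mmio, 0x1000_0000, 0x2000_0000)
```

On a real PYNQ system the same write-control/poll-status pattern is expressed through the overlay's MMIO interface; only the sequencing, not the mock class, is the point here.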
B. BASELINE RTL CODE GENERATION FOR CNN ACCELERATOR: TENSIL
Tensil is a set of tools for designing accelerators, including an RTL generator, a model compiler, and a set of drivers [24]. The basic processing flow is that, using a selected machine learning accelerator architecture for the limited FPGA-SoC device, it generates the RTL code through the model compiler. The primary advantage of Tensil is that it is able to create an accelerator without quantization or other degradation. Tensil applies only a few optimization techniques for the selected FPGAs, so its optimization performance is not effective enough. We previously applied our low-power techniques to CNN accelerator RTL code and verified the performance [33]. In Section III, we apply the techniques to the Tensil RTL code and evaluate their effectiveness on the FPGA-SoC board, the PYNQ-Z1.

C. LOW POWER TECHNIQUES AT RT-LEVEL
1) LOW POWER CLOCK TECHNIQUES
Clock Gating (CG) is a basic low-power technique that enhances performance and efficiency by disabling unnecessary clock cycles, as shown in Figure 3. Standby states occur in many parts of the CNN computing process, which leads to a significant amount of power consumption; CG eliminates these unnecessary clock cycles. Local Explicit Clock Enable (LECE) [37], [38], [39] is a method that uses an ENABLE signal on a 2:1 multiplexer or multiplexed D flip-flop to update the output on the rising edge of the clock only when the ENABLE signal is high, as shown in Figure 4. The more bits are used as input, the more ENABLE signals occur. Local Explicit Clock Gating (LECG) [37], [38], [39] is fundamentally equivalent to LECE, as shown in Figure 5; however, LECG has the advantage of reducing power consumption for multi-bit outputs by updating the entire output at once when the update completes.
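As a behavioral illustration of the enable-gated update that LECE/LECG implement in hardware, the following Python sketch (an assumption-level model, not RTL) counts output bit toggles as a proxy for dynamic power: when ENABLE is low, a clock edge causes no update and therefore no toggles.

```python
# Behavioral model of an enable-gated register in the spirit of LECE/LECG.
# Toggle count stands in for dynamic power; it is an illustrative proxy only.

class GatedRegister:
    def __init__(self, width=16):
        self.q = 0                # stored output value
        self.width = width
        self.toggles = 0          # number of output bits that flipped

    def clock(self, d, enable):
        """One rising clock edge; the output updates only when enabled."""
        if enable:
            mask = (1 << self.width) - 1
            self.toggles += bin((self.q ^ d) & mask).count("1")
            self.q = d

reg = GatedRegister()
# Data held stable for two idle cycles: the gated register spends no toggles.
for d, en in [(0xAAAA, 1), (0xAAAA, 0), (0xAAAA, 0), (0x5555, 1)]:
    reg.clock(d, en)
# Only the two enabled edges contributed toggles (8 + 16 = 24 bit flips).
```

An ungated register would pay clock-tree and flip-flop switching on all four edges; the gated one pays only on the two enabled edges, which is the saving LECE/LECG target.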


FIGURE 7. Enhanced clock gating (ECG).
FIGURE 8. Proposed processing system and programmable logic unit.
FIGURE 9. IP block design for CNN object detection.

Bus-Specific Clock Gating (BSCG) [37], [38], [39] utilizes the clock gating technique and adjusts the EN signal based on a comparison of the I/O signals, as shown in Figure 6. In terms of power consumption, XOR gates consume significantly less power than AND/OR gates in gate-level power analysis. Enhanced Clock Gating (ECG) [37], [39] consists of XOR gates that control the input clock signals and enable signals for multi-bit I/O data, as shown in Figure 7. The efficiency of the power reduction is maximized for larger pipelines and I/O bit widths.
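The BSCG idea of deriving the enable from an XOR comparison of the bus input against the stored output can be sketched behaviorally as follows. This is a Python model for illustration only; in hardware the comparison and gating are the XOR/AND-latch network of Figure 6.

```python
# Behavioral model of bus-specific clock gating (BSCG): the enable is not
# supplied externally but derived by XOR-comparing the incoming bus value
# with the stored one, so the clock reaches the register only on real changes.

class BSCGRegister:
    def __init__(self):
        self.q = 0
        self.clock_pulses = 0     # clock edges that actually reach the register

    def clock(self, d):
        if d ^ self.q:            # XOR-based comparison generates EN
            self.clock_pulses += 1
            self.q = d

reg = BSCGRegister()
for d in [7, 7, 7, 9, 9, 3]:      # six bus cycles, only three value changes
    reg.clock(d)
# Only 3 of the 6 clock edges reach the register; the rest are gated off.
```

The saving grows with the fraction of cycles in which the bus repeats its previous value, which is common for weight and feature-map buses in convolution.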
III. ARCHITECTURE DESIGN OF ACCELERATOR
A. ARCHITECTURE OVERVIEW
The block diagram in Figure 8 shows the data flow of the proposed processing unit design for an FPGA-based CNN object detection accelerator. For the programmable logic, each type of block is specifically defined and modularized to enhance the implementation efficiency of various CNN models. The proposed architecture can be mainly divided into computing processing logic and a memory system, detailed as follows. In the memory system, there are three main functional components for on-chip and off-chip data transfer to prepare data for computation. First, the buffers are responsible for storing data. All the weights and intermediate feature maps are arranged in a layer-by-layer format and stored in external DRAM. When a tile of data is loaded into the on-chip input/weight/output ping-pong buffers, it is arranged in a unique format according to the requirements of the computation mode. Second, a dispatching module employs a Direct Memory Access (DMA) engine, through DMA descriptors generated by the DMA control module, to fetch required data from DRAM or save the results back to DRAM. Third, the on-chip data scheduling modules, consisting of scatter and gather modules, realize the serial-to-parallel or parallel-to-serial conversions that manipulate the data flow for the following computation or transmission.

B. THE PROPOSED RECONFIGURABLE ACCELERATOR HARDWARE ARCHITECTURE
As shown in Figure 9, the IP block design for the CNN object detection accelerator consists of referenced IP blocks and a customized IP block (top_pynqz1_0). In the top_pynqz1_0 block, there are hierarchically defined multiply-accumulate units (MACs), POOLs, memory bandwidth and memory access schedulers, and CONV computing modules. The original Tensil RTL code does not have a hierarchical architecture; in that case, however, analysis of the RTL code would take a long time.

The primary feature of FPGA devices is reconfigurability. Therefore, to maximize the flexibility of the FPGA-SoC design, the proposed RTL code of the CNN accelerator was designed with hierarchical and modularized main modules, including MACs, Conv, Multiplier, Adder, MUX, and ALU, as shown in Figure 10. This figure shows that the proposed flexible accelerator design has the scalability to support different CNN architectures such as the YOLO series and ResNet20. After the modularization of the MAC unit, we applied our low-power techniques such as clock gating, XOR gates, and OR gates for MUXs. This design is able to accommodate add-on detectors, such as Single-Shot Detectors (SSD) and Multibox detectors. For the memory access modules such as InnerDualPortMem1, DualPortMem1, MemSplitter, and MemBoundarySplitter, memory partitioning techniques are applied. To accelerate the CPU computing operation, a memory reassignment technique has been applied so that the memory size and flow are changed once a pre-assigned computation is detected. For example, if our target CNN accelerator architecture uses fixed 16-bit data, then we can pre-assign the memory size for the input data or the weights. This is helpful for sequential computation operations such as the convolution operation.
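The ping-pong buffering described in the architecture overview can be sketched behaviorally: while the compute logic consumes the tile held in one buffer, the DMA fills the other, and the two swap roles every tile. The tile size and the stand-in "compute" reduction below are illustrative assumptions, not the accelerator's actual CONV kernel.

```python
# Behavioral sketch of ping-pong (double) buffering: one on-chip buffer is
# filled by the DMA while the other is consumed by the compute unit, and the
# two buffers swap roles on every tile.

def process_tiles(dram, tile_size):
    buffers = [[], []]            # the two on-chip buffers
    fill, work = 0, 1             # indices: buffer being filled vs. computed
    results = []
    tiles = [dram[i:i + tile_size] for i in range(0, len(dram), tile_size)]
    buffers[fill] = tiles[0]      # prime the first buffer
    for nxt in tiles[1:] + [None]:
        fill, work = work, fill   # swap: the filled buffer becomes the work buffer
        if nxt is not None:
            buffers[fill] = nxt   # DMA load overlaps the compute below
        results.append(sum(buffers[work]))   # stand-in for the CONV/MAC work
    return results

assert process_tiles(list(range(8)), 4) == [6, 22]
```

In hardware the load and the compute in each iteration proceed concurrently, which is what hides the DRAM transfer latency behind the convolution.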


FIGURE 10. Flexible accelerator design overview.

IV. PROPOSED HARDWARE IMPLEMENTATION
A. LOW POWER HW DESIGN TECHNIQUES AT RTL
As shown in Figure 9, in this experiment, our optimization for low power targets Tensil's processing flow. Step 1: Based on the neural architecture file, .tarch, Tensil helps generate the TCU RTL code for the basic hardware resource design, as shown in Figure 11. Step 2: After the RTL code is generated, we apply our low-power techniques, including LECG, split memory, BSC, and ECG. Step 3: Using Vivado, we design the hardware IP block; from the IP block design, we obtain the bitstream file. Step 4: Based on the customized bitstream, we implement the hardware accelerator for the CNN object detection algorithm. Step 5: We simulate it on the FPGA board and evaluate the power consumption of the target DNN-based object detection processing using Vivado.

FIGURE 11. Low power design flow.
FIGURE 12. Conventional MAC unit design.
FIGURE 13. Proposed MAC unit design.

The optimized multiplier design, using a power-efficient adder block based on power analysis, was implemented at the RT level, as shown in Figure 13. In convolution computation, the computational complexity of the multipliers can cause dynamic power consumption and delays. To reduce the complexity of the adder and multiplier, we first tested the full adder designs through a transistor-level design process. For the MAC module, a Bus-Specific Clock (BSC) is applied. In a conventional register, the data input is active and lasts until the end of the period; in this case, power can be wasted. When BSC is applied to register Z, the XOR can control the clock enable so that clock toggles are not wasted. Compared to the conventional design in Figure 12, an AND gate and a latch are added to safely disable the clock without allowing any glitches to reach the register clock.

B. PROPOSED MAC HARDWARE DESIGN
The detailed technique approaches are as follows. 1) The MAC unit is the major power-consuming unit of the convolution operation, in which data transmissions occur frequently; the technique is applied to remove the wasted clock toggles while the data input is deactivated. 2) The proposed adder group with BSC reduces the wasted clock toggles, and therefore the power consumption of the adder unit. 3) In stochastic multiplication, two unary bit-streams can be multiplied using an AND gate, and an OR gate can be applied instead of the MUX operation. Not only can the OR gate support parallel MAC operation in the same way as a MUX, it also results in reduced dynamic power consumption. Eventually, we utilized this parallel MAC structure using OR gates, as shown in Figure 13. Figure 14 shows the proposed MAC pseudocode, to which the BSC and OR-based MAC computing operations have been applied.

C. FLEXIBLE ACCELERATOR DESIGN FOR MULTI-ARCHITECTURE AND OPTIMIZATION TECHNIQUES
Based on the analysis result of the target CNN architecture, we customize the pipelining of the data flow and assign maximized buffer capacity in the BRAM and external memory. Fixed-point numbers are able to reduce the computation resource consumption and are also able to reduce the

bandwidth requirements; however, to obtain high performance, the optimized bandwidth size should be defined by analysis of the network architecture. Once the data transmission size is fixed, memory splitting and merging should be applied. Our CNN accelerator is based on the 16-bit fixed-point bandwidth given by the reference [24]. We modularize the RTL code based on thorough analysis, which allows easy modification when implementing the accelerator design.

FIGURE 14. Pseudocode for proposed MAC operation.
FIGURE 15. FPGA testbed (Xilinx PYNQ-Z1 FPGA).
FIGURE 16. HW resource report comparison of Tensil sample simulation and our work tested on PYNQ-Z1.

V. EXPERIMENT AND RESULTS WITH DISCUSSION
A. EXPERIMENT ENVIRONMENT
For the basic hardware platform, we chose the PYNQ-Z1 board instead of the regular ZYNQ-7020 board; PYNQ is an open-source project from AMD [36]. The board embeds the Xilinx ZYNQ-7020 and also provides a Jupyter-based framework with Python APIs. The PYNQ-Z1 board has an FPGA-SoC platform composed of PL and PS. The basic software development tool is Jupyter Notebook, a web-based software programming platform. It also supports the Python and C/C++ programming languages and other open-source libraries such as OpenCV. Our experiment environment is shown in Figure 15. The imported CNN architecture is ResNet-20; it has 23 layers and was trained with CIFAR-10, which provides a test set of 10,000 images in several formats. We used the provided weight file and converted the ResNet model to the ONNX format [40]. ONNX, a machine learning (ML) model converter, provides the converted ML model code in ONNX format. The Tensil compiler generates three import artifacts: .tmodel, .tdata, and .tprog files. Once the .tmodel manifest for the model is loaded into the driver, it tells the driver where to locate the binary files, program data, and weights data. These were not open data, and we used them without any modification, which means the accuracy was not changed.

B. FPGA IMPLEMENTATION RESULTS
Compared with Tensil's optimization result, we verified that more register buffers are activated in our proposed structure. Once we check the functionality and performance result, we can modify the structure through RTL code modification; then we can improve the specific hardware resources and power consumption of the design. Analyzing the result leads to improved performance of CNN processing. Figure 16 shows the power consumption reduction of the processing system unit. We were able to achieve 43.9 GOPs/W as a power efficiency result; compared to other FPGA board implementations, this is a 1.37-times increase. The hardware resource utilization in DSPs is increased 2.2 times over the result of [24].

C. POWER CONSUMPTION RESULTS
Our optimization decreases dynamic power consumption by 16%, and the total on-chip power is reduced by 20% of the total power consumption. Once the global buffer is activated, the unused global clock buffer and the second global clock resource will help to improve the performance of


the design. Moreover, this can be a solution for some high fan-out signals, making the device fully functional. In the pipeline logic, inserting an intermediate flip-flop (FF) can improve the working speed of the device; however, too many flip-flops add computational complexity. Our low-power techniques show better performance than FF insertion.

TABLE 1. Comparison result of FPGA implementation.

VI. CONCLUSION
In this article, the proposed highly reconfigurable FPGA hardware accelerator showed improved performance in terms of processing speed and power consumption during inference of various CNNs. The hardware optimization was conducted mainly for two purposes: to improve throughput and to reduce power consumption. To improve performance, a minimized data-transfer strategy was applied by assigning the maximum amount of buffers during the computations and by applying a controlled pipeline design for minimized data access. To achieve energy-efficient CNN object detection operation, we proposed not only data access control for minimized memory access, but also RT-level low-power reconfigured MAC units, such as an advanced clock-gating adder, register Z with a bus-specific clock, and an OR-based MAC architecture, applied to the RTL code of the proposed accelerator. The proposed hardware accelerator for ResNet-20 was implemented on a mobile FPGA-SoC, the PYNQ-Z1, and the power consumption was measured during the inference operation. As a result, the throughput showed a 15% improvement compared with the baseline RTL code of the accelerator, power consumption was reduced by 16%, and hardware utilization was increased by 58%. The object detection processing speed was 9.17 FPS, which shows that real-time processing is feasible on a mobile FPGA.

ACKNOWLEDGMENT
The authors would like to thank their colleagues from KETI and KEIT who provided insight and expertise that greatly assisted the research and greatly improved the manuscript.

REFERENCES
[1] A. K. Jameil and H. Al-Raweshidy, "Efficient CNN architecture on FPGA using high level module for healthcare devices," IEEE Access, vol. 10, pp. 60486-60495, 2022.
[2] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B. Choi, and T. Faughnan, "Smart surveillance as an edge network service: From Harr-Cascade, SVM to a lightweight CNN," in Proc. IEEE 4th Int. Conf. Collaboration Internet Comput. (CIC), Oct. 2018, pp. 256-265.
[3] K. Haeublein, W. Brueckner, S. Vaas, S. Rachuj, M. Reichenbach, and D. Fey, "Utilizing PYNQ for accelerating image processing functions in ADAS applications," in Proc. 32nd Int. Conf. Archit. Comput. Syst., May 2019, pp. 1-8.
[4] Z. Zhang, M. A. P. Mahmud, and A. Z. Kouzani, "FitNN: A low-resource FPGA-based CNN accelerator for drones," IEEE Internet Things J., vol. 9, no. 21, pp. 21357-21369, Nov. 2022.
[5] C. Fu and Y. Yu, "FPGA-based power efficient face detection for mobile robots," in Proc. IEEE Int. Conf. Robot. Biomimetics (ROBIO), Dec. 2019, pp. 467-473.
[6] X. Li, X. Gong, D. Wang, J. Zhang, T. Baker, J. Zhou, and T. Lu, "ABM-SpConv-SIMD: Accelerating convolutional neural network inference for industrial IoT applications on edge devices," IEEE Trans. Netw. Sci. Eng., early access, Feb. 25, 2022, doi: 10.1109/TNSE.2022.3154412.
[7] S. Tamimi, Z. Ebrahimi, B. Khaleghi, and H. Asadi, "An efficient SRAM-based reconfigurable architecture for embedded processors," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 3, pp. 466-479, Mar. 2019.
[8] A. J. A. El-Maksoud, M. Ebbed, A. H. Khalil, and H. Mostafa, "Power efficient design of high-performance convolutional neural networks hardware accelerator on FPGA: A case study with GoogLeNet," IEEE Access, vol. 9, pp. 151897-151911, 2021.
[9] S. Lee, D. Kim, D. Nguyen, and J. Lee, "Double MAC on a DSP: Boosting the performance of convolutional neural networks on FPGAs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 5, pp. 888-897, May 2019.
[10] S. Ullah, S. Rehman, M. Shafique, and A. Kumar, "High-performance accurate and approximate multipliers for FPGA-based hardware accelerators," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 2, pp. 211-224, Feb. 2022.
[11] X. Wu, Y. Ma, M. Wang, and Z. Wang, "A flexible and efficient FPGA accelerator for various large-scale and lightweight CNNs," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 3, pp. 1185-1198, Mar. 2022.
[12] W. Liu, J. Lin, and Z. Wang, "A precision-scalable energy-efficient convolutional neural network accelerator," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 10, pp. 3484-3497, Oct. 2020.


[13] H. Irmak, D. Ziener, and N. Alachiotis, ‘‘Increasing flexibility of FPGA- [33] Y. Kim, H. Kim, N. Yadav, S. Li, and K. K. Choi, ‘‘Low-power RTL
based CNN accelerators with dynamic partial reconfiguration,’’ in Proc. code generation for advanced CNN algorithms toward object detection
31st Int. Conf. Field-Programmable Log. Appl. (FPL), Aug. 2021, in autonomous vehicles,’’ Electronics, vol. 9, no. 3, p. 478, Mar. 2020.
pp. 306–311. [Online]. Available: https://www.mdpi.com/2079-9292/9/3/478
[14] W. Chen, D. Wang, H. Chen, S. Wei, A. He, and Z. Wang, ‘‘An asyn- [34] H. Kim and K. Choi, ‘‘The implementation of a power efficient BCNN-
chronous and reconfigurable CNN accelerator,’’ in Proc. IEEE Int. Conf. based object detection acceleration on a Xilinx FPGA-SoC,’’ in Proc. Int.
Electron Devices Solid State Circuits (EDSSC), Jun. 2018, pp. 1–2. Conf. Internet Things (iThings), IEEE Green Comput. Commun. (Green-
[15] C. Yang, Y. Wang, H. Zhang, X. Wang, and L. Geng, ‘‘A reconfigurable Com), IEEE Cyber, Phys. Social Comput. (CPSCom), IEEE Smart Data
CNN accelerator using tile-by-tile computing and dynamic adaptive data (SmartData), Jul. 2019, pp. 240–243.
truncation,’’ in Proc. IEEE Int. Conf. Integr. Circuits, Technol. Appl. [35] Y. Kim, Q. Tong, K. Choi, E. Lee, S. Jang, and B. Choi, ‘‘System
(ICTA), Nov. 2019, pp. 73–74. level power reduction for YOLO2 sub-modules for object detection of
[16] S. Zeng, K. Guo, S. Fang, J. Kang, D. Xie, Y. Shan, Y. Wang, and H. Yang, future autonomous vehicles,’’ in Proc. Int. SoC Design Conf. (ISOCC),
‘‘An efficient reconfigurable framework for general purpose CNN-RNN Nov. 2018, pp. 151–155.
models on FPGAs,’’ in Proc. IEEE 23rd Int. Conf. Digit. Signal Process. [36] PYNQ: Python Productivity. Accessed: Feb. 15, 2023. [Online]. Available:
(DSP), Nov. 2018, pp. 1–5. http://www.pynq.io/
[17] L. Gong, C. Wang, X. Li, H. Chen, and X. Zhou, ‘‘MALOC: A fully [37] L. Li, K. Choi, S. Park, and M. Chung, ‘‘Selective clock gating by using
pipelined FPGA accelerator for convolutional neural networks with all lay- wasting toggle rate,’’ in Proc. IEEE Int. Conf. Electro/Information Tech-
ers mapped on chip,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits nol., Jun. 2009, pp. 399–404.
Syst., vol. 37, no. 11, pp. 2601–2612, Nov. 2018. [38] W. Wang, Y.-C. Tsao, K. Choi, S. Park, and M.-K. Chung, ‘‘Pipeline power
[18] L. Bai, Y. Zhao, and X. Huang, ‘‘A CNN accelerator on FPGA using reduction through single comparator-based clock gating,’’ in Proc. Int. SoC
depthwise separable convolution,’’ IEEE Trans. Circuits Syst. II, Exp. Design Conf. (ISOCC), Nov. 2009, pp. 480–483.
Briefs, vol. 65, no. 10, pp. 1415–1419, Oct. 2018. [39] Y. Zhang, Q. Tong, L. Li, W. Wang, K. Choi, J. Jang, H. Jung,
[19] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, and S.-Y. Ahn, ‘‘Automatic register transfer level CAD tool design for
N. Xu, S. Song, Y. Wang, and H. Yang, ‘‘Going deeper with embedded advanced clock gating and low power schemes,’’ in Proc. Int. SoC Design
FPGA platform for convolutional neural network,’’ in Proc. ACM/SIGDA Conf. (ISOCC), Nov. 2012, pp. 21–24.
Int. Symp. Field-Programmable Gate Arrays. New York, NY, USA: [40] Compile an ML Model. Accessed: Feb. 15, 2023. [Online]. Available:
Association for Computing Machinery, Feb. 2016, pp. 26–35, doi: https://www.tensil.ai/docs/howto/compile/
10.1145/2847263.2847265.
VICTORIA HEEKYUNG KIM (Graduate Student Member, IEEE) received the B.S. degree in electronic and electrical engineering from Hongik University, Seoul, South Korea, in 2012, and the M.S. degree in electrical and computer engineering from the Illinois Institute of Technology, Chicago, in 2015. She is currently pursuing the Ph.D. degree in computer engineering with the Illinois Institute of Technology, Chicago. Her current research interests include low-power and high-performance HW/SW optimization for CNN accelerator design for FPGA and ASIC, wireless sensor network design and sensor data analysis, and embedded HW/SW system design for robotics, drones, and the IoT.
KYUWON KEN CHOI (Senior Member, IEEE) received the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, USA, in 2002. He is currently a Full Professor with the Department of Electrical and Computer Engineering, Illinois Institute of Technology. He was a Postdoctoral Researcher with the Prof. Takayasu Sakurai Laboratory, Institute of Industrial Science, The University of Tokyo, Japan, working on leakage-power-reduction circuit techniques. He has proposed and conducted several projects supported by the National Aeronautics and Space Administration (NASA), the Defense Advanced Research Projects Agency (DARPA), the U.S. National Science Foundation (NSF), the Scientific Research Corporation (SRC), and the Korea Electronics Technology Institute (KETI) regarding power-aware computing/communication (PACC). Prior to joining IIT, he was a Senior CAD Engineer and a Technical Consultant for low-power system-on-chip (SoC) design with Samsung Semiconductor, Broadcom, and Sequence Design, with eight years of industry experience in ultra-low-power VLSI chip design, from the compiler level to the circuit level. In recent years, using his novel low-power techniques, several processor and mobile chips have been successfully fabricated in deep-submicrometer technology, and he has published more than 120 peer-reviewed journal and conference papers. His current research interests include ultra-low-power digital circuits and artificial intelligence (AI)-related IC designs. He is also the Director of the VLSI Design and Automation Laboratory (DA-Lab), IIT, the Editor-in-Chief of the Journal of Pervasive Technologies, a guest editor of Springer and Wiley journals, and a TPC member for several IEEE circuit design conferences.
VOLUME 11, 2023 59445