Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey
ABSTRACT In the modern-day era of technology, a paradigm shift has been witnessed in the areas involving
applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). Specifically,
Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications such as
computer vision, image and video processing, robotics, etc. In the context of developed digital technologies
and the availability of authentic data and data handling infrastructure, DNNs have been a credible choice
for solving more complex real-life problems. In certain situations, the performance and accuracy of a DNN can even surpass human intelligence. However, it is noteworthy that DNNs are computationally demanding and require considerable resources and time. Furthermore, general-purpose
architectures like CPUs have issues in handling such computationally intensive algorithms. Therefore, a lot
of interest and efforts have been invested by the research fraternity in specialized hardware architectures
such as Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific
Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array (CGRA) in the context of effective
implementation of computationally intensive algorithms. This paper brings forward the various research
works on the development and deployment of DNNs using the aforementioned specialized hardware
architectures and embedded AI accelerators. The review discusses the detailed description of the specialized
hardware-based accelerators used in the training and/or inference of DNNs. A comparative study of the various accelerators discussed, based on factors such as power, area, and throughput, is also presented. Finally, future
research and development directions, such as future trends in DNN implementation on specialized hardware
accelerators, are discussed. This review article is intended to guide hardware architects to accelerate and
improve the effectiveness of deep learning research.
INDEX TERMS Machine learning, field programmable gate array (FPGA), deep neural networks (DNN),
deep learning (DL), application specific integrated circuits (ASIC), artificial intelligence (AI), central
processing unit (CPU), graphics processing unit (GPU), hardware accelerators.
I. INTRODUCTION
Deep neural networks (DNNs), also known as deep learning, are a subset of the Artificial Intelligence (AI) discipline. The term AI was coined in 1956 by John McCarthy, who defined it as ‘‘the science and engineering of making intelligent machines’’. Machine learning is a broad topic of artificial intelligence that was first defined by Arthur Samuel in 1959 as the study of how computers may learn without being explicitly programmed. Machine Learning uses traditional
techniques to perform tasks like classification, regression, and clustering. Deep learning is a subfield of machine learning that uses a multi-layered algorithm structure known as a neural network, which was developed mostly between 2006 and 2010. The relationship between deep learning, machine learning, and AI is illustrated in Fig. 1.

FIGURE 1. AI vs. Machine Learning vs. Deep Learning.

Nowadays, DNNs are used in many modern AI applications, including bioinformatics [60], natural language processing [147], image restoration [185], speech recognition [34], computer vision [194], machine translation [36], healthcare [43], finance [221], robotics [94], visual art processing [193], etc. Furthermore, the recent applications of DNNs include aerospace and defence, automated driving, recommendation systems, and industrial automation [71], [86], [101], [215]. DNNs are also useful in a variety of applications, such as news aggregation and fraud detection [124], virtual assistants [61], chatbots [35], and customer relationship management systems [203]. In addition, DNNs have also been used to diagnose Covid-19 by classifying it based on different lung and chest imaging modalities [40].

DNNs contain many layers, and each layer is capable of detecting features at different levels. For instance, in pattern recognition, where the input is available in pixel form, the first layer of the DNN extracts minor details of the image, such as curves and edges. The outputs of this first layer act as inputs to the second layer. The second layer extracts the image's primary details, such as squares and semi-circles. The outputs of the second layer act as inputs to the third layer. The third layer extracts parts of objects. Furthermore, the subsequent layer uses the previous layer's output and extracts more aspects of the objects. As the number of layers increases, the DNN extracts increasingly complicated features and complete objects [73]. DNNs provide superior accuracy and performance at the cost of high computational complexity. For instance, AlexNet [130] takes 1.4 Giga Operations Per Second (GOPS) to process a single image of size 224×224 with a top-1 accuracy of 61%, while ResNet-152 [108] takes 22.6 GOPS with a top-1 accuracy of 79.3%. The DNN's superior accuracy and performance are due to its capacity to extract more complex high-level features, such as objects and facial structures, from raw input data.

DNNs are computationally expensive and need lots of computational resources and memory for training and inference. CPUs inherently support only a limited number of parallel workloads, even though they can context switch with hyper-threading; their execution is largely sequential in nature. CPUs may have more resources than their counterpart architectures (like GPUs or FPGAs). CPUs have a limited number of registers to support concurrent threads, but they may have higher cache sizes, larger branch control logic, and higher on-chip bandwidth than GPUs. However, the limited number of cores on the CPU limits its ability to process large amounts of data in parallel, which is required for DNN acceleration. Although CPUs dominate the IoT industry in DNN inference on low-power edge devices, they struggle to realize complex DNNs. Therefore, specialized hardware designs are required for the acceleration of DNNs. DNNs can be implemented using customized hardware accelerators instead of a CPU. The heterogeneous computing platforms, viz. Field Programmable Gate Array (FPGA), Application-Specific Integrated Circuits (ASIC), and Graphics Processing Units (GPU), are widely used to accelerate DNNs. The specialized hardware-based DNN accelerators can be categorized into two classes: the first class of accelerators efficiently implements the computational primitives, such as convolutional operations, fully connected operations, etc., for the DNNs [85], [175], and the second class of DNN accelerators efficiently optimizes the data movement and memory access [56], [177]. These two generations of specialized hardware-based DNN accelerators improve the speed and energy efficiency of running DNNs. There are two ways to improve the performance of DNN acceleration: the first method is optimizing the DNN algorithm, and the second is optimizing the hardware architecture. Therefore, we need to co-design the algorithm and the hardware to achieve superior performance.

Because of their high throughput and memory bandwidth, GPUs are one of the most often employed hardware accelerators for improving inference and training processes in DNNs [218]. In floating-point matrix-based calculations, GPU-based hardware accelerators are extremely efficient [205]. GPU-based hardware accelerators, on the other hand, consume a lot of power. ASIC and FPGA-based hardware accelerators have limited computational and memory resources compared to GPU-based accelerators. Nevertheless, they can achieve a moderate performance level while using less energy [153]. ASIC-based DNN accelerators provide superior performance compared to GPU and FPGA counterparts at the cost of reconfigurability. However, ASIC-based accelerators have some limitations, including the high cost of development, long time to market, inflexibility, etc. [77], [103]. FPGA-based accelerators can be used as
an alternative to ASIC-based accelerators, and they can provide superior performance at an affordable cost with reconfigurability and low power dissipation [213]. FPGA, ASIC, and GPU-based AI accelerators have been the subject of numerous research works [97], [150], [154], [155], [158], [210]. This survey, however, also looks at various embedded AI accelerators for DNN acceleration.

This survey supplements the existing work and contributes towards providing the complete background on DNN acceleration using various specialized hardware architectures. The contributions of this survey can be summarized as follows:
1) The survey discusses the various research works carried out on the development and deployment of DNNs using FPGA-based accelerators.
2) The survey covers the work done in ASIC-based AI accelerators in the last decade, from 2012 to 2022.
3) The survey describes the various GPU-based DNN accelerators.
4) The survey provides a comprehensive overview of CGRA-based accelerators for DNN implementation.
5) The survey covers the research works carried out on the implementation of DNNs on the edge using embedded AI accelerators.
6) The survey provides a comparative study of existing hardware architectures: FPGAs, GPUs, ASICs, and embedded AI accelerators.
7) The survey highlights the future research trends in DNN acceleration on specialized hardware architectures, including FPGA, ASIC, GPU, CGRA, and Edge AI accelerators.

This survey is different and unique with respect to many existing papers in this area in the following ways. Few studies [44], [97], [126], [154], [210] focused only on the developments of FPGA-based accelerators. Few other studies [55], [136], [150], [158] have presented the details of ASIC-based accelerators. Some research reviews [48], [200], [201] have explored both FPGA and ASIC-based accelerators. Very limited studies [178], [201] have dealt with the progress of GPU-based accelerators. On the other hand, studies on embedded AI and CGRA-based accelerators haven't been explored much. Many of these reviews do not mention the compiler/mapping frameworks and SDKs available for these accelerators, making it difficult for someone to choose the appropriate accelerator. This review, therefore, aims to bring a comprehensive study of all the aforementioned hardware accelerators in the context of the implementation of DNNs. Furthermore, this survey uniquely classifies the FPGA- and ASIC-based accelerators and briefly discusses the key architectural features and the available compiler or mapping frameworks. Accelerators for each category are summarized and compared. A comprehensive survey of GPU-based accelerators by Nvidia is also presented. The need for edge AI computing is emphasized, and state-of-the-art embedded AI accelerators, including Arm-based accelerators, are also discussed and compared. This survey also briefly discusses the recent developments in tinyML. Table 1 compares this survey paper with recently published review articles on DNN implementation using specialized hardware architectures. Researchers in the fields of artificial intelligence, system design, and hardware architecture are expected to benefit from this survey.
A. SCOPE OF THE SURVEY
This paper focuses on research trends in FPGA, ASIC, and GPU-based accelerators for implementing DNNs. We have also briefly discussed the current trends in Arm-based machine learning processors and embedded edge AI accelerators. The review categorizes the FPGA-based accelerators into three categories and briefly discusses the key features of the accelerators, including the frameworks available. The three categories include accelerators for a specific application, such as speech recognition, object detection, natural language processing, etc.; accelerators for a specific algorithm, such as CNN, RNN, etc.; and accelerator frameworks with hardware templates. Furthermore, ASIC-based accelerators are categorized into three types: ALU-based accelerators, dataflow-based accelerators, and sparsity-based accelerators. A comparative study of these hardware accelerators based on performance metrics like power, throughput, and area has been presented. The review also focuses on the mapping frameworks available for these accelerators and briefly discusses the implementation details. In addition, the recent research contributions in Arm-based machine learning processors, a few embedded AI hardware accelerators, and CGRA-based accelerators are discussed and compared in terms of their cores, performance, power, availability of Software Development Kits (SDKs), and supported frameworks.

B. ORGANIZATION
This paper is organized as follows: Section II provides a brief overview of neural networks and DNNs, including the basic architecture of hardware for DNN acceleration. Section III describes various architectures implemented on the FPGA platform for DNN acceleration. Section IV describes various ASIC-based accelerator architectures for DNN acceleration. Section V shows a detailed review of GPU-based accelerators for the acceleration of DNNs. Section VI discusses various CGRA-based accelerator architectures for DNN acceleration. Section VII discusses in detail the embedded edge AI accelerators for DNN acceleration. Section VIII provides the comparisons between the various hardware architectures used for DNN acceleration. Section IX provides the future research directions of various hardware architectures for DNN acceleration. Finally, the conclusion of this review is presented in Section X.

II. BACKGROUND
A. NEURAL NETWORKS
A Neural Network (NN) is a computational model inspired by biological neural networks. It is also known as an Artificial Neural Network (ANN). An ANN comprises hundreds or
thousands of interconnected artificial neurons, also called processing units. Three or more interconnected layers are formed by these neurons. The input neurons are in the first layer. The input neurons receive external signals and pass them on to the subsequent layers, which eventually provide the final output data to the final output layer. The intermediate layers in the ANN are called hidden layers. Fig. 2 depicts the architecture of a typical NN, which includes an input layer, an output layer, and two hidden layers.

FIGURE 2. An architecture of NN.

In the NN shown in Fig. 2, the input layer contains n inputs (x_1, x_2, . . . , x_n). The following layer (hidden layer) gets all n inputs from the input layer and generates the output y. These inputs are multiplied by the weight coefficients (w_1, w_2, . . . , w_n) and combined together with a bias value b for each neuron. A non-linear function σ(.), also called an activation function, is then used to calculate the neuron's output, see Eq. (1). In this scenario, the activation function causes a neuron to produce an output only if the input to it exceeds a specified threshold value. Common non-linear functions used in NNs are the Sigmoid, the Rectified Linear Unit (ReLU), and the Hyperbolic tangent. The graphical model and mathematical representation of an artificial neuron are shown in Fig. 3 and Eq. (1), respectively.

y = \sigma\left( \sum_{n=1}^{N} x[n]\, w[n] + b \right)   (1)

FIGURE 3. A single ANN neuron with its elements (inputs, weights, bias, summer, activation function, and output).

In neural networks, weights are initialized with some random values. However, during the training process, all these weights get updated iteratively to predict the correct output. The weights are updated using the cost function, which is nothing more than the mean square error. The mathematical representation of the mean square error is shown in Eq. (2). Here, MSE is the mean squared error, n represents the number of input data points, and y_i and ŷ_i are the true and predicted outputs, respectively. Once the neural network is trained, it may be used for classification problems.

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2   (2)

B. DEEP NEURAL NETWORK (DNN)
The Deep Neural Network (DNN) is a type of neural network that has more than three hidden layers and is well-suited to complicated tasks [37]. In today's DNNs, the typical number of layers used ranges from five to over a thousand. A DNN with N hidden layers is shown in Fig. 4. In DNNs, the model and its parameters are learned through an extensive training process.

In supervised learning, the network is trained with labeled data, i.e., some input data has already been matched to the correct output. Unsupervised learning is another learning technique in which the network/model is trained using unlabeled data. The trained network generates the clusters or structures in the unlabeled data. Semi-supervised learning uses partially labeled data sets, and it falls in between supervised and unsupervised learning approaches. Finally, reinforcement learning is a type of training that rewards positive behaviours while punishing undesirable ones. Reinforcement learning is bound to learn from its previous experience. The pictorial representation of the aforementioned deep learning approaches is shown in Fig. 5.
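To make Eqs. (1) and (2) concrete, the following minimal NumPy sketch evaluates a single artificial neuron and the mean squared error over its prediction. The input values, weights, and the choice of the sigmoid activation are illustrative assumptions, not details taken from any particular accelerator discussed in this survey.

import numpy as np

def sigmoid(z):
    # A common non-linear activation sigma(.) used in Eq. (1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    # Eq. (1): y = sigma( sum_n x[n] * w[n] + b )
    return sigmoid(np.dot(x, w) + b)

def mean_squared_error(y_true, y_pred):
    # Eq. (2): MSE = (1/n) * sum_i (y_i - y_hat_i)^2
    return np.mean((y_true - y_pred) ** 2)

# Illustrative example with n = 4 inputs (arbitrary values).
x = np.array([0.5, -1.2, 3.0, 0.7])
w = np.array([0.1, 0.4, -0.2, 0.8])
b = 0.05
y = neuron_output(x, w, b)

y_true = np.array([1.0])
print("neuron output:", y)
print("MSE:", mean_squared_error(y_true, np.array([y])))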
feature map can be paired with each output feature map. In a 2-D convolution operation between an input image matrix x (size R × C) and a filter f (size W × L), the convolution layer performs point-wise multiplication and addition of the corresponding pixels. The filter size is often smaller than the input matrix size. The filter multiplies the input matrix with the W × L sized block, accumulates the result, slides to the next block of the input matrix, and repeats the operation. The input matrix is processed one block at a time until it has processed all of the image's R × C elements. The 2-D convolution operation is given in Eq. (3), where y(r, c) signifies one output pixel in the output matrix y, with each pixel's coordinates expressed as (r, c). The iterators over the filter's length (L) and width (W) are l and w, respectively, in Eq. (3). Finally, the resulting feature maps apply non-linear activation functions such as the sigmoid, hyperbolic tangent, or rectified linear units.

y(r, c) = \sum_{w=0}^{W-1} \sum_{l=0}^{L-1} f(w, l)\, x\left(r + w - \frac{W}{2},\; c + l - \frac{L}{2}\right)   (3)

2) POOLING LAYER
The pooling layer shrinks the spatial dimensions of the input image after convolution, thereby reducing the computation and the number of parameters in the network. Pooling layers are also known as subsampling layers. In a CNN, the pooling layer is used between two convolution layers. The MAX operation is used to resize each slice of the input image spatially, on which the pooling layers operate individually. A pooling layer with filters of size 2×2 is found in many CNN topologies. Over the four samples in the filter, the pooling operation, which is nothing but the MAX operation, is done. The operation yielding the maximum value is retained while discarding the other values [123]. It is noteworthy that additional operations like the MIN operation and the AVG operation can also be used in the pooling layer, particularly in some CNNs [197]. The MAX and AVG pooling operations for filters of size 2 × 2 are shown in Fig. 7.

FIGURE 7. Various forms of pooling.

3) RECTIFIED LINEAR UNIT (ReLU) LAYER
In a CNN network, the ReLU layer is generally employed after the convolution and fully connected layers. By substituting all the negative valued outputs with 0, it introduces non-linearity into the CNN. Because of its computational simplicity, sparsity, and ability to converge faster than other activation functions like the hyperbolic tangent and sigmoid [72], [197], ReLU [160] has gained a lot of traction in recent years. The mathematical representation of ReLU is shown in Eq. (4). Some popular extensions of ReLU, for instance, the exponential LU [64], parametric ReLU [107], and leaky ReLU [149], are also being used in CNNs for improved performance and accuracy.

f(x) = max(0, x)   (4)

4) FULLY CONNECTED LAYER
Fully connected layers do the final classification in the CNN network after multiple convolution, ReLU, and pooling layers. Weights, biases, and neurons are all part of the fully connected layer. All input and output neurons are connected in the fully connected layer. A CNN network typically has one or more fully connected layers. The final output of the CNN comes from the last fully connected layer, often known as the classification layer. The fully connected layer in the CNN contains a large number of inputs and outputs. Therefore, it is challenging to implement fully connected layer operations on hardware platforms with limited resources.

5) DECONVOLUTION LAYER
To increase the size of the feature map, a deconvolution layer, also known as a transposed convolution layer, is employed [52]. Upsampling (inserting zeros in the feature map) and then convolving the upsampled feature maps with the kernel coefficients are used to accomplish this.

6) DILATED CONVOLUTION LAYER
The filter coefficients are up-sampled and convolved with the input image in a dilated convolution layer to capture a broader receptive field [114]. Image segmentation, for example, uses it to capture the larger global context in each output pixel.

With millions of weight coefficients, CNNs are extremely complex. They are computationally expensive and necessitate a significant amount of memory to store the input and output feature maps and the weight coefficients, causing CPUs to underperform. To boost the performance of CNNs, specific hardware accelerators are used. As a result, different techniques for implementing CNNs efficiently on hardware platforms must be explored in order to reduce resource and memory requirements.
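As a concrete illustration of the layer operations above, the short NumPy sketch below implements a naive 2-D convolution in the spirit of Eq. (3) (with unit stride and "valid" borders rather than the centered indexing of Eq. (3)), a 2 × 2 max pooling, and the ReLU of Eq. (4). The tensor sizes, stride, and padding choices are illustrative assumptions only.

import numpy as np

def conv2d_valid(x, f):
    # Naive 2-D convolution (correlation form), cf. Eq. (3): each output
    # pixel is the sum of element-wise products of the W x L filter with
    # the corresponding block of the input.
    R, C = x.shape
    W, L = f.shape
    out = np.zeros((R - W + 1, C - L + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + W, c:c + L] * f)
    return out

def max_pool_2x2(x):
    # 2x2 max pooling: keep the maximum of every non-overlapping 2x2 block.
    R, C = x.shape
    x = x[:R - R % 2, :C - C % 2]
    return x.reshape(R // 2, 2, C // 2, 2).max(axis=(1, 3))

def relu(x):
    # Eq. (4): f(x) = max(0, x)
    return np.maximum(0, x)

# Illustrative 6x6 input and 3x3 filter (arbitrary values).
x = np.arange(36, dtype=float).reshape(6, 6)
f = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
y = relu(max_pool_2x2(conv2d_valid(x, f)))
print(y)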
D. HARDWARE ARCHITECTURES FOR DNN ACCELERATION
DNNs have been increasingly popular in recent years, allowing for their development and deployment on a variety of hardware platforms. These hardware platforms are of various types, right from general-purpose architectures such as CPUs and GPUs, and programmable architectures (FPGAs), to special-purpose chips (ASICs). In many DNN models, multiply-accumulate (MAC) operations are the most important computations, and they can be easily parallelized. Since these MAC operations can be executed in parallel, hardware architectures that enable parallel operations are required to process DNNs. To achieve superior performance, highly parallel computing models, encompassing both spatial and temporal computing architectures, are often employed for DNN acceleration. The spatial and temporal architectures have a similar computational structure, with a set of Processing Elements (PEs). However, processing units can have internal control in a spatial architecture, whereas control in a temporal architecture is centralized, as shown in Fig. 8. Each PE can have a register file (RF) to store data in a spatial architecture; however, PEs do not have this memory capacity in a temporal architecture. The PEs can also be connected to exchange data in spatial computing designs. To summarize, the PEs in temporal architectures contain only Arithmetic and Logic Units (ALUs), whereas the PEs in spatial architectures consist of an ALU as a computation unit, an RF to store the data, and a control unit.

FIGURE 8. Spatial and temporal architectures.

1) TEMPORAL ARCHITECTURES
The temporal architectures exploit parallelism by supporting a variety of techniques, such as Single Instruction Multiple Threads (SIMT) or Single Instruction Multiple Data (SIMD). The temporal computing architectures appear mostly in CPUs and GPUs. In temporal designs, ALUs can only access data from the memory hierarchy and cannot communicate directly with one another. The memory (i.e., register file) and control are shared by all ALUs in the temporal architecture. In temporal architectures like CPUs or GPUs, all the convolution or fully connected operations are mapped to matrix multiplication. CPU cores are the least employed among the several temporal architectures for DNN training and inference. CPUs contain a small number of processing cores, ranging from one to ten. As a result, only a small number of processes can be performed in parallel, limiting throughput. GPUs are commonly used to train and infer DNNs. They have thousands of cores to run highly parallel algorithms efficiently, for instance, matrix multiplication. Throughput is enhanced by lowering the number of multiplications in both CPUs and GPUs. There are software libraries that optimize matrix multiplication for GPUs (e.g., cuBLAS, cuDNN [59], etc.) and CPUs (e.g., Intel MKL [2], OpenBLAS, etc.). Another well-known technique to reduce the matrix multiplications is the Fast Fourier Transform (FFT) [80], [151]. Furthermore, several techniques, such as Winograd's algorithm [132] and Strassen's algorithm [67], are used to reduce the matrix multiplications and thereby reduce the resource and memory requirements.

2) SPATIAL ARCHITECTURES
In spatial architectures, each ALU can have its own local memory and control logic. The local memory is also referred to as the register file. The development and deployment of DNNs on Field-Programmable Gate Arrays (FPGA) and Application-Specific Integrated Circuits (ASIC) comes under the category of spatial architectures. FPGAs are less expensive and have a faster time to market than ASICs, and the design flow is simpler. However, FPGAs are less energy-efficient and consume more power than ASICs since FPGAs, unlike ASICs, contain a significant chip area dedicated to reconfigurability. ASICs, on the other hand, are mainly designed for a particular application and cannot support reconfigurability. The design flow of ASICs is more complex than that of FPGAs [46]. ASIC chips are expensive, but they are highly optimized and energy-efficient and provide superior performance to FPGAs. Memory accesses are the real bottleneck in DNN computations; therefore, off-chip DRAM accesses must be minimized, as they have a high energy cost and delay. The off-chip memory accesses can be reduced by reusing data stored in smaller, quicker, and low-energy memories. In spatial computing architectures, weight stationary, row stationary, output stationary, and other specialized processing dataflows can be designed to improve data reuse from memories in the memory hierarchy and reduce energy dissipation. At each level of the memory hierarchy, the dataflow defines what data is read and when it is processed. In spatial architectures, dataflows can be classified as follows:

a: WEIGHT STATIONARY (WS)
In weight stationary dataflow, the weights are kept fixed and are stored in the register files of the PEs, whereas the inputs and partial sums are distributed across the PEs. Weight stationary dataflow maximizes filter and convolutional reuse of weights. Weight stationary dataflow examples are found in [168], [182], [195], and [50].

b: OUTPUT STATIONARY (OS)
Each partial sum is held fixed in a PE in the output stationary dataflow, and accumulation is done until the final total is obtained. In the meantime, the PEs' weights and inputs are dispersed in a variety of ways. The convolutional reuse is maximized with output stationary dataflow. This dataflow reduces the amount of energy used while writing and reading partial sums. Output stationary dataflow examples are found in [98] and [169].
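To illustrate how these dataflows differ, the following sketch contrasts a weight-stationary and an output-stationary loop ordering for a 1-D convolution. The loop nests are a simplified software analogy (a single PE's register is modeled by a local variable), not the RTL of any specific accelerator reviewed here.

import numpy as np

def conv1d_weight_stationary(x, w):
    # Weight stationary: each weight w[k] is held fixed ("pinned" in a PE's
    # register file) while all inputs that need it stream past, accumulating
    # into the partial-sum array.
    out = np.zeros(len(x) - len(w) + 1)
    for k, wk in enumerate(w):          # weight stays resident
        for i in range(len(out)):       # inputs and partial sums move
            out[i] += wk * x[i + k]
    return out

def conv1d_output_stationary(x, w):
    # Output stationary: each output (partial sum) is held fixed in a PE and
    # fully accumulated before being written back; weights and inputs stream in.
    out = np.zeros(len(x) - len(w) + 1)
    for i in range(len(out)):           # partial sum stays resident
        acc = 0.0
        for k, wk in enumerate(w):      # weights and inputs move
            acc += wk * x[i + k]
        out[i] = acc                    # written back exactly once
    return out

x = np.arange(10, dtype=float)
w = np.array([0.25, 0.5, 0.25])
assert np.allclose(conv1d_weight_stationary(x, w), conv1d_output_stationary(x, w))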
The implementation of speech recognition algorithms using FPGA-based accelerators is also presented in several earlier studies [62], [112], [137], [199].

Wang et al. [209] proposed a reconfigurable YOLOv3 FPGA hardware accelerator for object detection. In this context, YOLOv3 (You Only Look Once, Version 3) is a real-time object detection algorithm that detects specific objects in images or videos. The proposed accelerator is built using the ARM + FPGA architecture. Experimental results show that the FPGA-based YOLOv3 accelerator consumes less energy and achieves higher throughput than the GPU counterpart. The proposed accelerator is compatible with several frameworks, such as TensorFlow, Caffe, PyTorch, etc. The proposed accelerator is implemented on the Xilinx ZCU104 running at a frequency of 300 MHz. Several previous works [82], [148], [161] also used the FPGA to implement object detection algorithms.

Hamza et al. [125] proposed the FPGA-based accelerator named NPE to efficiently implement various Natural Language Processing (NLP) models. NPE provides a single framework for processing arbitrarily complex nonlinear functions with software-like programmability. NPE consumes 4× and 6× less power than the CPU and GPU, respectively. NPE is implemented on the Xilinx Zynq Z-7100 FPGA running at a frequency of 200 MHz.

Serkan et al. [184] developed an FPGA-based CNN accelerator to classify malaria disease cells. The proposed accelerator is implemented on a Xilinx Zynq-7000 FPGA running at a frequency of 168 MHz and achieves an accuracy of 94.76%. Zhu et al. [228] proposed an FPGA-based accelerator to recognize liver dynamic CT images. Xiong et al. [217] developed an FPGA-based CNN accelerator to improve the automatic segmentation of 3D brain tumors. FPGA-based accelerators are also used to implement various applications such as autonomous driving [105], [129], image classification [45], [70], fraud detection [128], cancer detection [186], etc. Table 2 summarizes the reviewed FPGA-based accelerators for specific applications.

VIP is deployed on the FPGA board, which is connected through a Peripheral Component Interconnect (PCI) interface. VIP uses low-accuracy arithmetic because of the limitations of resources on the Altera EPF81500 FPGA. Fortunately, recent FPGAs contain large numbers of computing units and memory resources and allow fast CNN implementations. FPGA implementations of DNNs mainly focused on accelerating the convolution operations, which are reported in [38] and [49].

Farabet et al. [85] presented the ConvNet Processor (CNP): an FPGA-based accelerator to implement CNNs. CNP uses a dedicated hardware convolver for the data processing and also uses a soft processor for control. CNP is designed on the Virtex4 SX35 FPGA and is also equipped with external memory to store the input and filter coefficients. CNP consists of Vector Arithmetic and Logic Units (VALU), one of the main components in the architecture that implements the CNN operations, viz. 2-D convolutions, sub-sampling, and non-linear activation functions. The implementation of 2-D convolution, represented using Eq. (6), is shown in Fig. 10 for K = 3, i.e., a 3 × 3 kernel. In Eq. (6), x_ij is the data in the input plane, w_mn is the weight value in the K × K kernel, y_ij is the partial sum, z_ij is the result in the output plane, and W is the width of the input image. At each clock cycle, the convolution module performs K² multiply-accumulate operations simultaneously.

z_{ij} = y_{ij} + \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} x_{i+m,\,j+n} \cdot w_{mn}   (6)

CNP uses First In First Out (FIFO) buffers between the external memory and the FPGA to provide a continuous flow of data in both directions. CNP uses a 32-bit soft processor that provides macro instructions, generally higher-level instructions than most traditional processors, to the VALU for implementing the basic CNN operations. CNP has a compiler that converts network implementations written in Torch directly into CNP instructions. The proposed architecture has been used to implement a face detection system.
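The sketch below mirrors the accumulation form of Eq. (6), in which a K × K block of products is added on top of an incoming partial-sum plane y to produce z. In the actual CNP convolver these K² multiply-accumulates happen in parallel in each clock cycle; here they are simply looped for clarity, and the array sizes are arbitrary illustrative choices.

import numpy as np

def conv_accumulate(y, x, w):
    # Eq. (6): z_ij = y_ij + sum_{m,n} x_{i+m, j+n} * w_{mn}
    # y : incoming partial-sum plane (e.g., from previous input feature maps)
    # x : input plane, w : K x K kernel
    K = w.shape[0]
    z = y.copy()
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            z[i, j] += np.sum(x[i:i + K, j:j + K] * w)  # K*K MACs per output
    return z

# Illustrative sizes: 8x8 input plane, 3x3 kernel -> 6x6 output plane.
x = np.random.rand(8, 8)
w = np.random.rand(3, 3)
y = np.zeros((6, 6))          # start from zero partial sums
z = conv_accumulate(y, x, w)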
A high-level synthesis (HLS) design flow is used to map the CNNs on the proposed accelerator, which enables the user to use the high-level accelerator description in C and to use HLS directives to specify the hardware configuration. The performance of the proposed accelerator will be improved with the use of a DMA controller.

In [171], an accelerator for DNNs is introduced, and it is implemented on the Xilinx Kintex 7 FPGA platform. It is built using a set of Neural Processing Units (NPUs), see Fig. 15. The number of NPUs in the proposed design depends on the available FPGA resources. NPUs are mainly used to compute the majority of operations (multiplications and additions) in parallel. A multiply and accumulate (MAC) unit and control logic are the essential components of each NPU. The proposed accelerator utilizes the available FPGA resources efficiently by using a pipelined architecture, a time division multiplexing (TDM) processing scheme, and a page-mirror algorithm. In the proposed accelerator, the NPUs get the inputs from the host computer through the Ethernet interface, and weight coefficients are fetched from the page-mirror memory. The serializer sends the output of the NPUs to the activation function blocks. For each sample, the proposed accelerator requires a long time to transfer the appropriate weight coefficients from the host computer to the accelerator core.

A scalable and low-power accelerator referred to as neural network next (nn-X) is presented in [91] to accelerate DNNs. The nn-X accelerator mainly contains a co-processor, a host processor, and external memory, as shown in Fig. 16. The host processor controls the input and configuration data transfer to the coprocessor, parses the DNN, and converts it into instructions for the coprocessor. The co-processor mainly contains an array of processing elements called collections, a configuration bus, and a memory router. The collections in the nn-X accelerator are mainly composed of convolution engines, pooling modules, and non-linear operators and are used to perform the most common CNN operations, such as convolution, sub-sampling, and activation functions. The memory router in the nn-X accelerator is used to transfer the data between the processing elements and the external memory, which provides independent data streams. The proposed architecture uses the weight stationary dataflow to improve energy efficiency. The nn-X accelerator is implemented using the Xilinx ZC706 platform, which has a dual ARM Cortex-A9 processor, a Xilinx Zynq XC7Z045 chip, and 1 GB DDR3 memory. The experimental results show that the nn-X can achieve a peak performance of 240 GOPS.

FIGURE 16. Architecture of nn-X system, adopted from [91].

Zhang et al. [222] proposed a roofline-based model [212] to implement CNNs on FPGAs. The authors analyzed the throughput and required bandwidth for a given CNN design using various optimization techniques, such as loop tiling and loop transformation. With the help of the roofline model, they identified the solutions with the best performance and the lowest FPGA resource requirement. This roofline-based model optimizes both the memory accesses as well as the computations in the convolutional layers. The accelerator design is implemented with the Vivado HLS tool, which enables the accelerator implementation in C language. The proposed accelerator achieves a maximum throughput of 61.62 GFLOPS (Giga Floating-point Operations Per Second).
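The roofline analysis referenced above can be summarized in a few lines of code: the attainable throughput of a candidate design is bounded either by the platform's peak compute or by the product of its memory bandwidth and the design's computation-to-communication ratio. The peak-GFLOPS and bandwidth figures below are placeholder values for illustration only, not the numbers reported in [222].

def roofline_attainable_gflops(ctc_ratio, peak_gflops, bandwidth_gb_s):
    # ctc_ratio      : computation-to-communication ratio (FLOP per byte
    #                  of off-chip traffic) of the candidate design.
    # peak_gflops    : computational roof of the platform.
    # bandwidth_gb_s : off-chip memory bandwidth roof.
    return min(peak_gflops, ctc_ratio * bandwidth_gb_s)

# Hypothetical platform and two candidate loop-tiling choices.
peak, bw = 100.0, 4.0          # GFLOPS, GB/s (illustrative values)
for ctc in (8.0, 40.0):        # FLOP/byte for two tilings
    attainable = roofline_attainable_gflops(ctc, peak, bw)
    print(f"CTC {ctc:5.1f} FLOP/B -> attainable {attainable:6.1f} GFLOPS")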
Implementing DNNs in embedded devices is tough due to resource and power constraints. In this regard, the authors in [172] have developed novel FPGA-based accelerators for implementing trained and fully connected DNNs. Since it is difficult to map a DNN with a large number of neurons and corresponding weights directly onto an FPGA, the authors in [172] used a time division multiplexing scheme. Batch processing is used in the proposed architecture, which distributes different weights over many input samples. In addition, the suggested accelerator employs a pipelined architecture to make the most of the FPGA resources while staying within power and resource limits. The concept of pruning has also been incorporated into the proposed architecture to reduce data transfer from the external memory to the accelerator [173]. Both batch processing and weight pruning can enhance the throughput of DNN accelerators.

Qiu et al. [177] proposed an FPGA-based CNN accelerator, which efficiently accelerates all the layers of the CNN, including the fully connected layers. The proposed accelerator improves bandwidth and resource usage by employing a dynamic-precision data quantization method and a unique design of the convolver hardware module. The proposed accelerator applies singular value decomposition (SVD) on the weight coefficients to minimize the memory footprint at the fully connected layer. The convolver hardware module can be used for both convolutional and fully connected layers to reduce resource consumption. The adder tree, convolver complex, non-linearity, max-pooling, bias shift, and data shift are the main elements of the convolver hardware module, as shown in Fig. 17. Convolutions and fully connected layer operations are both performed using the convolver complex module. The max pooling action is carried out using the max-pooling module. The CNN's non-linearity function is calculated using the non-linearity module. The convolver complex module generates partial sums, which are added by the adder tree. Finally, for dynamic quantization, bias shift and data shift modules are used. The proposed accelerator supports the Caffe deep learning framework and has been implemented on the Xilinx Zynq platform.

FIGURE 17. Convolver architecture, adopted from [177].

Wang et al. [208] proposed a scalable design called the Deep Learning Accelerator Unit (DLAU) for accelerating deep learning algorithms. DLAU utilizes the tiling technique to produce a scalable architecture. The proposed accelerator mainly contains modules such as the DMA, an embedded processor, the DLAU, and a DDR3 memory controller, as shown in Fig. 18. The DLAU module mainly contains three processing units, viz. the Partial Sum Accumulation Unit (PSAU), the Tiled Matrix Multiplication Unit (TMMU), and the Activation Function Acceleration Unit (AFAU). TMMU is used to perform multiplication operations and also generate partial sums. PSAU is used to add the partial sums derived from TMMU. Finally, AFAU is used to perform the non-linear activation functions, for instance, the sigmoid function. The DLAU module reads the tiled input data through the DDR3 memory. The embedded processor provides the programming interface to the users and communicates with the DLAU via JTAG-UART. The proposed architecture is implemented on the Xilinx Zynq Zedboard with ARM Cortex-A9 processors operating at 667 MHz.

FIGURE 18. DLAU accelerator architecture, adopted from [208].

Lian et al. [140] proposed a block-floating-point (BFP) arithmetic-based CNN accelerator for DNN inference. The proposed accelerator mainly contains three elements: the Processing Array (PEA), an on-chip buffer, and external memory, as shown in Fig. 19. The onboard DDR3 modules receive input data and network parameters from the host computer via PCIe3.0 × 8. The Conv PEA performs the convolutional operations, and the FC PEA performs the fully connected layer operations. The proposed accelerator uses 8-bit and 16-bit formats to represent the feature maps and model parameters (activations and weights), which can reduce off-chip bandwidth and memory compared to the 32-bit floating-point counterpart with only a tiny accuracy loss. The accelerator design is implemented with the Vivado HLS tool, and the proposed BFP arithmetic is conducted on the Caffe [119] scheme. The proposed accelerator is implemented on the Xilinx VC709 evaluation board, running at a frequency of 200 MHz, and achieves a throughput of 760.83 GOP/s.

Xiao et al. [216] presented a DNN accelerator architecture specially designed for sparse and compressed DNN models. The proposed DNN accelerator mainly contains PE arrays, a special function buffer, and on-chip input and weight buffers. The input and weight buffers are updated with input feature maps and weights. The PE arrays perform the convolution operations, whereas the special function buffer performs pooling, Batch Normalization (BN), and activation functions. The proposed accelerator uses the SOW (sparse optimization of weight) and CO (convolutional optimization) optimizations to reduce the sizes of weights and feature maps, respectively, which also minimizes the number of hardware resources needed. The proposed accelerator uses 16-bit, 8-bit, and 4-bit fixed-point formats to represent the feature maps, convolution (CONV) layer weights, and fully connected (FC) layer weights, respectively. This work uses the Xilinx Vivado HLS toolchain to convert C++ code to an RTL implementation. The proposed accelerator is implemented on a Xilinx Zynq 7020 FPGA board.

FIGURE 22. RCNN accelerator architecture, adopted from [92].
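As a brief illustration of the SVD-based compression used by Qiu et al. [177] for fully connected layers, the sketch below approximates a weight matrix with a rank-k truncated SVD and compares parameter counts. The matrix size and rank are arbitrary illustrative choices, not values taken from [177].

import numpy as np

def svd_compress_fc(W, k):
    # Truncated SVD: W (m x n) is approximated by (U_k * S_k) @ V_k,
    # so the FC layer y = W @ x becomes two thinner layers of rank k.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]          # m x k
    B = Vt[:k, :]                 # k x n
    return A, B

m, n, k = 256, 512, 32            # illustrative layer size and rank
W = np.random.randn(m, n)
A, B = svd_compress_fc(W, k)

x = np.random.randn(n)
err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
print("original params:", W.size, "compressed params:", A.size + B.size)
print("relative error on a random input: %.3f" % err)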
Table 3 summarizes the reviewed FPGA-based accelerators for a specific algorithm. The year the accelerator was introduced, the deep learning model used, the FPGA platform used, the precision used for input feature maps and weights, the clock frequency, the number of resources available in terms of DSPs, LUTs, BRAMs, and FFs, the percentage of resources utilized, the performance in GOPS, and finally, the power efficiency (GOPS/W) are all listed for each accelerator. Fig. 23 shows the power efficiency and throughput of various FPGA-based accelerators listed in Table 3.

FIGURE 23. Power efficiency and throughput of FPGA-based accelerators listed in Table 3.

C. ACCELERATOR FRAMEWORKS WITH HARDWARE TEMPLATES
Several frameworks for mapping AI models onto FPGAs have been developed in recent years. Venieris et al. [206] developed a framework called fpgaConvNet to map CNNs on FPGAs. The fpgaConvNet framework employs the synchronous dataflow (SDF) paradigm to capture the CNN workloads. The processing flow of fpgaConvNet is shown in Fig. 24. Firstly, the Deep Learning expert uses a domain-specific language to provide a high-level description of a ConvNet architecture as well as information on the target FPGA-based platform as inputs. The ConvNet description is passed through a DSL (Domain-Specific Language) processor, which parses the input script, populates the ConvNet's semantic model as a Directed Acyclic Graph (DAG), and also extracts platform-specific resource constraints. The ConvNet DAG is converted into an SDF hardware intermediate format, which corresponds to a fully parallel hardware implementation. After several transformations on the ConvNet's SDF hardware model, the design space is searched, and this procedure provides a set of hardware mappings of the ConvNet onto the specific FPGA-based platform. The fpgaConvNet front-end parser can examine models written in the Caffe and Torch machine-learning libraries. This framework accomplishes efficient design space exploration through graph segmentation, reconfiguration, folding, and weight reloading. This framework can be used to map small CNN models, for instance, LeNet-5, on FPGAs.

Wang et al. [211] developed a design automation tool referred to as DeepBurning that contains a library of building blocks that mimic the behavior of typical neural network components. The general design flow of the DeepBurning framework is shown in Fig. 25. The DeepBurning Neural Network Generator (NN-Gen) takes a model descriptive script (a Caffe-compatible script) as input, which describes a high-level view of the network topology and layer definitions. The DeepBurning NN-Gen also takes user-specified constraints such as area and power as input. DeepBurning NN-Gen consists of a hardware generator and a compiler that generate the control flow and data layout based on the user's specifications. The DeepBurning automation tool's hardware generator builds a neural network architecture for a given network structure by selecting and instantiating blocks from the library with the required interconnections. DeepBurning supports a wide range of NN models and simplifies the design flow of NN-based accelerators for machine learning applications.

A framework referred to as DNNWeaver is presented in [187] that generates the bitstream and host code to implement DNNs on various FPGA boards. DNNWeaver employs Caffe as its programming interface. DNNWeaver consists of three software components: the translator, the design weaver, and the integrator. The translator transforms the Caffe specification of a DNN into a macro data flow graph. The design weaver accepts the macro data flow graph as an input and generates a synthesizable Verilog implementation of the accelerator code. The integrator adds the memory interface code to the accelerator code. DNNWeaver generates accelerator

FIGURE 29. DPU architecture overview, adopted from [227], Vitis AI stack, and development flow.
IV. ASIC BASED ACCELERATORS
Application Specific Integrated Circuit (ASIC) is a powerful platform to accelerate DNNs. ASICs are customized chips designed for a specific application. They are smaller in size, consume less power, and provide higher speeds, making them suitable solutions for DNN acceleration [76]. ASIC-based hardware accelerators have limited computing resources, memory resources, and I/O bandwidths compared with GPU-based accelerators, but they can achieve moderate performance and consume less power [165]. Furthermore, ASICs exhibit better computation speed and energy efficiency than GPUs and FPGAs at the cost of reconfigurability. Many researchers are focused on building custom ASICs for accelerating CNN inference workloads to achieve the best performance and energy efficiency. In this section, we review the recent ASIC-based DNN accelerators.

There are three broad types of ASIC-based DNN accelerators depending on how the architecture has been optimized/designed: ALU (Arithmetic Logical Unit), Dataflow, and Sparsity-based accelerators. The main building block, the MAC unit (or an array of MAC units), in ALU-based accelerators is modified to have ample computational resources and flexibility to obtain the best performance with varying bit accuracy. In dataflow-based accelerators, the activations, weights, and partial sums are managed to reduce the energy needed to move data within the chip and achieve high arithmetic intensity. In sparsity-based accelerators, the unstructured sparse data is handled in such a way that the matrix multiplication units (2-D array of MAC units) can prevent zero multiplications. The following sections provide a comprehensive overview of ALU, Dataflow, and Sparsity-based accelerators.

A. ALU BASED ACCELERATORS
NeuFlow is an ASIC-based CNN accelerator presented in [170] to accelerate NNs and other ML algorithms. The architecture of the proposed accelerator is the same as the accelerator discussed in [84] and shown in Fig. 14, but is implemented using the IBM 45 nm Silicon-On-Insulator (SOI) process. The NeuFlow accelerator uses a compiler named luaFlow to process CNNs. The luaFlow compiler converts high-level data flow graph representations of deep learning algorithms in the Torch5 environment into machine code for NeuFlow. The proposed architecture provides higher power efficiency and is suitable for vision-based applications, such as autonomous vehicle navigation, driving assistance, etc. The proposed architecture achieves a maximum throughput of 320 GOPS with a power consumption of 0.6 W; in contrast, the NeuFlow architecture implemented on the Xilinx Virtex6 FPGA presented in [84] has a maximum throughput of 16 GOPS with a power consumption of 10 W.

Chen et al. [53] proposed an ASIC-based hardware accelerator, also called DianNao, to accelerate large-scale CNNs and DNNs. The proposed architecture provides quick and energy-efficient execution of the inference of large-scale CNNs and DNNs. The architecture contains the Neural Functional Unit (NFU), buffers, and a control processor (CP), see Fig. 30. The NFU module is used to perform the arithmetic operations (multiplications, additions, and activation functions) of the network layers.
ShiDianNao [78] improves performance and energy efficiency compared to DianNao. The design is implemented in Verilog and synthesized with the Design Compiler, and the IC Compiler is used to place and route the synthesized design. The energy cost of DRAM accesses is calculated using CACTI 6.0 [159]. The ShiDianNao accelerator will not support the acceleration of large-scale CNNs. The ShiDianNao accelerator is implemented using 65 nm CMOS technology. DianNao [53], DaDianNao [54], [146], PuDianNao [141], and ShiDianNao [78] are not built utilizing reconfigurable hardware; hence, they cannot be adapted to changing application demands such as NN sizes.

Lu et al. [145] proposed a flexible dataflow architecture called FlexFlow to accelerate CNNs, exploiting all kinds of parallelism, viz., inter-kernel, intra-kernel, and inter-output, on a two-dimensional array of PEs. FlexFlow has additional interconnections between on-chip memories and PEs, which provides the flexibility to fetch any neuron from any feature map. The proposed accelerator minimizes the interconnections between the PEs at the cost of energy because of data movement from on-chip memory to PEs. In FlexFlow, all the PEs are operated in parallel, therefore helping in improving the overall throughput. The proposed architecture has high scalability and supports different sizes of CNNs with stable resource utilization. FlexFlow only implements CNNs and is confined to within a layer rather than across layers. The design is simulated, synthesized, and placed & routed using Synopsys' tools. The FlexFlow accelerator is implemented using TSMC 65 nm technology.

Hardik et al. [188] developed a bit-level dynamically composable architecture called Bit Fusion for accelerating DNNs. Bit Fusion mainly consists of an array of bit-level computation elements, called BitBricks, that dynamically fuse to match the bit width of individual DNN layers and execute DNN operations with the required bit width, without any loss of accuracy. Furthermore, Bit Fusion supports the multiplication of 2, 4, 8, and 16 bits spatially. Bit Fusion decomposes a 16-bit multiplication into multiple 2-bit multiplications to achieve the flexibility to efficiently map various layers of a CNN with different bit widths and to minimize the computation and the communication with no loss of accuracy. The Bit Fusion architecture comes with an Instruction Set Architecture (ISA) that minimizes the data transfer and maximizes the parallelism in computations. The proposed design is implemented in Verilog and is synthesized using the Design Compiler, which estimates the area, frequency, and power. The proposed accelerator architecture is implemented on 45 nm CMOS technology. The Bit Fusion accelerator achieves 5.1× energy saving and 3.9× speedup over the Eyeriss accelerator.

Shin et al. [190] proposed the Deep Neural Processing Unit (DNPU) architecture to process CNNs and Recurrent Neural Networks (RNNs). DNPU is a SIMD MAC-based CNN/RNN accelerator that uses dynamic precision control to minimize kernel data size. DNPU consists of a convolutional layer processor (CP), a fully connected and RNN-LSTM layer processor (FRP), and a RISC controller. CP performs convolutional operations, and FRP performs matrix multiplication operations. DNPU is the first CNN/RNN accelerator with the highest energy efficiency of 8.1 TOPS/W on 65 nm CMOS technology. DNPU has some limitations; for instance, its area limits the number of processing elements (PEs) for convolutional layers (CLs) and recurrent layers (RLs). As a result, performance was sub-optimal in cases that just required CLs or RLs. Furthermore, DNPU only supports a limited number of weight-bit precisions, such as 4 bits, 8 bits, or 16 bits. Lee et al. [133] proposed the Unified Neural Processing Unit (UNPU) architecture to process CNNs and RNNs. UNPU contains a bit-serial MAC unit to perform the required computations. UNPU supports CLs, RLs, and fully connected layers (FCLs) with fully-variable weight bit-precision from 1 to 16 bits. UNPU achieves an energy efficiency of 3.08, 11.6, and 50.6 TOPS/W for the case of 16-bit, 4-bit, and 1-bit weights, respectively. UNPU achieves 1.43× higher energy efficiency than the DNPU for convolutional layers with 4-bit weights.

B. DATAFLOW BASED ACCELERATORS
The accelerators based on dataflow put a special emphasis on data management to minimize off-chip memory reads/writes. When it is feasible, reusing parameters between layers can enhance dataflow. For instance, in a convolutional layer, both activations and weights can be reused. In a fully connected layer, each neuron has a unique set of weights; as a result, weights cannot be reused, but input data may. In order to minimize data movement between a computing unit and higher-level memory, the reusable parameters are kept in local registers.

Cavigelli et al. [50] proposed the Origami CNN accelerator, which is scalable to different network sizes. The proposed architecture uses the Weight Stationary (WS) dataflow to improve energy efficiency during the acceleration process. The WS dataflow minimizes energy consumption by maximizing the access of weight coefficients. The WS dataflow used in Origami maximizes the convolution and filter reuse of weights. The proposed accelerator was implemented using UMC 65 nm CMOS technology and has a core area of 3.09 mm². The proposed CNN accelerator can achieve a throughput of 274 GOPS and a power efficiency of 369 GOPS/W with an external memory bandwidth of 525 MB/s full-duplex. The proposed architecture is only used to perform the convolution operation and is unsuitable for implementing the fully connected layer operations.

Eyeriss [56] is an ASIC-based CNN accelerator that uses a row-stationary (RS) dataflow that minimizes data movement energy consumption on a spatial computing architecture. The RS dataflow is adaptable to various CNN shapes and minimizes energy consumption by reusing the filter coefficients and input feature maps. The proposed accelerator mainly contains a 12 × 14 PE array, feature map compression units, a 108 KB global buffer, and ReLU units, as shown in Fig. 32. The global buffer enables the reuse of data loaded from off-chip DRAM.
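The bit-level composition used by Bit Fusion can be illustrated in software: a wide multiplication is rebuilt from products of 2-bit slices of the operands, each shifted by the appropriate amount before accumulation. The sketch below works on unsigned operands only and is merely a functional analogy of fusing BitBricks, not a model of the actual hardware datapath.

def split_2bit(value, num_slices):
    # Split an unsigned integer into little-endian 2-bit slices.
    return [(value >> (2 * i)) & 0b11 for i in range(num_slices)]

def fused_multiply(a, b, bits=16):
    # Rebuild a wide (e.g., 16x16) multiplication from 2-bit x 2-bit
    # partial products, shifted and accumulated -- the software analogy
    # of composing BitBricks into a wider multiplier.
    slices = bits // 2
    a_sl, b_sl = split_2bit(a, slices), split_2bit(b, slices)
    acc = 0
    for i, ai in enumerate(a_sl):
        for j, bj in enumerate(b_sl):
            acc += (ai * bj) << (2 * (i + j))   # each term is a tiny 2-bit MAC
    return acc

a, b = 50000, 61234                        # arbitrary 16-bit unsigned operands
assert fused_multiply(a, b) == a * b       # matches the full-width product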
In the TPU, the weights are fetched from off-chip memory through a weight FIFO (First-In, First-Out) register. The results from the previous layers and the input activations are stored in the unified local buffer. In order to perform a convolution operation on the matrix multiply unit, a systolic data setup block is used to rearrange the data. The efficient running of machine learning model tasks and inference tasks like search, image recognition, and language translation has been the focus of the first version of the TPU, called TPU1. Since 2015, TPU1 has been operational in Google's data centers. A second version, TPU2, also called Cloud TPU, is operational in data centers for the purpose of training and inference. Cloud TPU supports several frameworks, including TensorFlow, PyTorch, and JAX/FLAX.

FIGURE 35. Block Diagram of TPU, adopted from [120].

C. SPARSITY BASED ACCELERATORS
The fraction of zeros in a CNN layer's weights and input activation matrices is called sparsity. Since multiplying by zero always produces a zero, no actual work is required for these multiplications. As a result, typical layers can cut work by a factor of four, and in some instances, by a factor of ten. Also, the addition is not needed because the zero products will not add anything to the total of which they are a part. Moreover, data with many zeros can be compressed. These traits, when combined, open up a lot of possibilities for improvement. This section provides a comprehensive overview of accelerators that exploit sparsity.

A CNN accelerator referred to as Sparse CNN (SCNN) is presented in [167] for the inference of CNNs. SCNN employs a novel dataflow referred to as the sparse Planar-Tiled Input-Stationary Cartesian Product (PT-IS-CP-sparse) dataflow that maximizes the reuse of activations and weights, removes needless data transfers, and reduces storage and power requirements. The dataflow used in SCNN eliminates all multiplications with a zero and keeps both activations and weights in compressed form. SCNN mainly contains an array of processing elements arranged in a 2-D fashion with systolic connections to transfer partial sums. The proposed dataflow efficiently delivers activations and weights to the multiplier array to perform the required MAC operations. SCNN exploits all three kinds of parallelism, viz., inter-kernel, intra-kernel, and inter-output. SCNN requires additional optimization circuitry to implement the fully connected layer operations. SCNN improves performance by skipping the zeros in the input feature maps and weights. SCNN is implemented in SystemC, and the Catapult High-Level Synthesis (HLS) [30] tool is used to generate the Verilog RTL. The Synopsys Design Compiler synthesizes the Verilog version of the design. SCNN is implemented using TSMC 16 nm FinFET technology.

Eyeriss [56] also looked into input sparsity as a way to save energy. The gating mechanism deactivates MAC units that correspond to zero inputs. Gating saves energy while not increasing throughput. With sparse models, the processing speed and energy efficiency of Eyeriss V2 [57] have improved due to its ability to process sparse data directly in compressed format for both the weights and activations.

Zhang et al. [225] developed the Sparse Neural Acceleration Processor (SNAP) to exploit unstructured sparsity in DNNs. To ensure that data is distributed evenly throughout the MAC units, SNAP employs parallel associative search. SNAP is fabricated using 16 nm CMOS technology and achieves a peak energy efficiency of 21.55 TOPS/W (FP16) for CONV layers with 10% weight and activation density.

Lee et al. [135] proposed an energy-efficient on-chip accelerator called LNPU for sparse DNN model learning. In the LNPU accelerator, sparsity is exploited with intra-channel as well as inter-channel accumulation. The input load buffer module of the LNPU evenly distributes the workload among the PEs while considering irregular sparsity. LNPU uses the fine-grained mixed precision (FGMP) of FP8-FP16 that optimizes data precision while maintaining training accuracy. LNPU maintains an average hardware utilization of 100%. LNPU is fabricated using 65 nm CMOS technology and has an energy efficiency of 3.48 TFLOPS/W (FP8) at 0% sparsity and 25.3 TFLOPS/W (FP8) at 90% sparsity.

SIGMA is a scalable and flexible accelerator proposed in [176] to implement large, irregular, and sparse general matrix-matrix multiplications (GEMMs). The basic building block in SIGMA is the Flexible Dot Product Engine (Flex-DPE). All the Flex-DPE modules can be interconnected via a simple NoC. In SIGMA, all the Flex-DPE multipliers are arranged in a 1-D fashion, and it performs multiple variable-sized dot-products in parallel. SIGMA uses scalable interconnects to efficiently map the GEMMs of different dimensions and sparsity levels to the PEs. SIGMA outperforms systolic array architectures by 5.7× for irregular sparse matrices. SIGMA is implemented using 28 nm CMOS technology and achieves a throughput of 10.8 TFLOPS with a power dissipation of 22.33 W.

Zhang et al. [224] proposed an accelerator called GAMMA to perform sparse matrix-sparse matrix multiplication (spMspM) operations. The proposed accelerator uses Gustavson's algorithm [99] to compute the spMspM operations. The GAMMA accelerator mainly consists of an array of
Zhang et al. [224] proposed an accelerator called GAMMA to perform sparse matrix-sparse matrix multiplication (spMspM) operations. The accelerator uses Gustavson's algorithm [99] to compute the spMspM operations. GAMMA mainly consists of an array of processing elements (PEs), on-chip storage referred to as FiberCache, and a scheduler, as shown in Fig. 36. The PEs perform the required spMspM operations, combining sparse input rows to produce each output row. FiberCache is a specialized memory structure that stores the non-zero elements and their coordinates. The scheduler distributes computational workloads among the PEs to maximize resource efficiency while reducing unnecessary accesses to shared memory. GAMMA is implemented using 45 nm CMOS technology.
FIGURE 36. Block Diagram of GAMMA, adopted from [224].
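Gustavson's algorithm builds each output row by scaling and merging the rows of the second operand selected by the non-zeros of the corresponding row of the first operand, which is exactly the row-merging work GAMMA maps onto its PEs. A minimal Python sketch of this row-wise formulation (our illustration, with each sparse matrix stored as per-row {column: value} dictionaries) is given below.

# Gustavson's row-wise sparse-times-sparse matrix multiply (C = A @ B).
# Each matrix is stored row by row as {column_index: value} dictionaries,
# so only non-zero entries are ever touched. Illustrative sketch only.
def gustavson_spmspm(A, B):
    C = []
    for a_row in A:                       # for every row i of A
        c_row = {}                        # accumulate output row i of C
        for k, a_ik in a_row.items():     # each non-zero A[i][k] ...
            for j, b_kj in B[k].items():  # ... scales and merges row k of B
                c_row[j] = c_row.get(j, 0) + a_ik * b_kj
        C.append(c_row)
    return C

A = [{0: 2, 2: 1}, {1: 3}]                # 2 x 3 sparse matrix
B = [{0: 1}, {2: 4}, {0: 5, 1: 6}]        # 3 x 3 sparse matrix
print(gustavson_spmspm(A, B))             # [{0: 7, 1: 6}, {2: 12}]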
We summarize the reviewed ASIC-based accelerators for DNNs in Table 5. For each accelerator, we list the year the accelerator was introduced, the process technology, the clock frequency, the dataflow, the architecture type, the power dissipation, the area, the performance in GOPS, and finally, the power efficiency. Fig. 37 shows plots of various metrics, such as power, throughput, area, and power efficiency, for the ASIC-based accelerators.
V. GPU BASED ACCELERATORS
Over the last few decades, Graphics Processing Units (GPUs) have been widely used for training DL algorithms or CNNs for face recognition [109], object detection [220], [226], data mining [88], and other AI applications. GPUs support parallelism through the large number of parallel cores in the architecture and offer significant computation speed. They exploit large degrees of data-level parallelism in applications through the Single Instruction Multiple Thread (SIMT) execution model. The high computational capacity of GPUs makes them a primary choice for DNN acceleration. In this section, we review some of the recent GPU-based DNN accelerators.
The study of implementing a standard backpropagation algorithm for training multiple perceptrons simultaneously on a GPU using NVIDIA CUDA technology is presented in [100]. For a given program, the GPU-based implementation on an NVIDIA GTX 260 achieves a 50× to 150× speedup compared to the CPU-based implementation. A neurally accelerated architecture for GPUs, called NGPU (neurally accelerated GPU), is presented in [218] to enable scalable integration of neural acceleration with a large number of GPU cores. The proposed architecture brings the neural and GPU accelerators together without hampering the SIMT execution model. NGPU provides significant energy and performance benefits at the cost of reasonably low hardware overhead, achieving a 2.44× average speedup and a 2.8× average energy reduction compared to the baseline GPU architecture across different sets of benchmarks.
Danial et al. [196] presented a framework for accelerating the training and classification of arbitrary CNNs on the GPU. The proposed method improves performance by moving the computationally intensive tasks of a CNN to the GPU. Training and classification of a CNN on the GPU performs 2 to 24 times faster than on the CPU, depending on the network topology. Li et al. [139] proposed an efficient GPU implementation to accelerate the training process of large-scale Recurrent Neural Networks (RNNs). Compared to a CPU-based solution with Intel's Math Kernel Library (MKL), the proposed method yields a speedup of 2 to 11 times. Kim et al. [127] proposed a new memory management scheme to enhance overall GPU memory utilization in multi-GPU systems for accelerating deep learning algorithms. The authors extended the concept of vDNN to a multi-GPU environment employing the PCIe bus, where vDNN [179] virtualizes the GPU and CPU memory so that both can be used simultaneously to train DL algorithms in a hybrid fashion. The suggested memory scheme increases batch size by 60% in multi-GPU systems and enhances training throughput by 46.6%. A high-performance GPU-dedicated architecture referred to as TResNet is presented in [180] to accelerate CNNs; the proposed architecture effectively utilizes GPU resources and achieves better accuracy and efficiency.
Nvidia GPUs are the most popular for Deep Learning (DL) implementations. Table 6 lists the accelerators that Nvidia has released for the inference and training of DL algorithms; these devices integrate both a Central Processing Unit (CPU) and a GPU on a single chip.
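The speedups reported above come from moving the dense linear-algebra work of a CNN onto the GPU's parallel cores. The short PyTorch sketch below is an illustrative example of ours (not the framework used in the cited works): the model and one batch are placed on the GPU, and a single training step then runs entirely on the device.

# Minimal sketch of offloading one CNN training step to a GPU with PyTorch.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(                       # a tiny CNN classifier
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
).to(device)                                 # parameters now live in GPU memory

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(32, 3, 224, 224, device=device)   # batch created on the GPU
labels = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
loss = loss_fn(model(images), labels)        # forward pass runs on the GPU
loss.backward()                              # backpropagation runs on the GPU
optimizer.step()
print(loss.item())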
VI. CGRA-BASED ACCELERATORS
Coarse Grain Reconfigurable Architectures (CGRAs) primarily consist of an array of Processing Elements (PEs) connected using reconfigurable interconnects. When compared to FPGAs, CGRAs often have a shorter reconfiguration time. CGRAs have emerged as a popular option for real-time computing due to their low power consumption, high efficiency, fast reconfiguration, and ability to perform both spatial and temporal computation. In recent years, CGRAs have become increasingly significant in accelerating DNNs, particularly CNNs, thanks to their ability to combine FPGAs' flexibility with ASICs' efficiency. In this section, we review some of the recent CGRA-based DNN accelerators.
Jafri et al. [118] proposed a CGRA-based accelerator named NeuroCGRA to realize neural networks and digital signal processing applications. The authors investigated the viability of deploying neural networks on an actual CGRA using a Dynamically Reconfigurable Resource Array (DRRA). DRRA mainly consists of four elements, viz., Data Path Units (DPUs), register files (Reg-files), Switch Boxes (SBs), and sequencers, as shown in Fig. 38. The DPUs are the functional units that perform the required computations, and the Reg-files store the data for the DPUs. Interconnectivity between the various DRRA components is provided through the SBs, and the sequencers configure the DPUs, switch boxes, and register files. The Distributed Memory Architecture (DiMArch) is essentially a scratch pad that supplies data to the DRRA. The authors embedded dedicated hardware, known as the neuroDPU, with each DPU of the DRRA to implement neural networks on it, and proposed a neural network translator that provides a framework for mapping neural networks onto CGRAs. The translator takes three inputs, viz., the network model, weights, and network specifications, and generates three outputs: DPU, Reg-file, and SB instructions. NeuroCGRA is synthesized using 65 nm technology running at a frequency of 500 MHz.
A framework called FIST is presented in [162] that allows the NeuroCGRA [118] to realize both DSP applications and neural networks, depending on the target application. The authors implemented edge detection on the DRRA using the proposed framework.
EMAX is an energy-efficient, low-power CGRA architecture with on-chip distributed memory proposed in [202] to implement CNNs. EMAX supports both CNN training and inference. It is composed primarily of an array of PEs and an interconnection network, as shown in Fig. 39. Each PE is connected to its neighbors by local interconnections, and each row of the PE array has a shared bus. The results of calculations performed on the PEs are passed on to the PEs in the next row, and the PEs can access external memory (DRAM) via the memory interface. Each PE has two execution units that perform the arithmetic and logical operations, as well as a local memory to store the required data, reducing memory bandwidth pressure. Experimental results show that EMAX performs better than GPUs in terms of performance per memory bandwidth and per area.
A CGRA-based accelerator referred to as stream dual-track CGRA (SDT-CGRA), which targets the implementation of object inference algorithms, is presented in [83]. SDT-CGRA employs both static and dynamic configurations for stream processing. The accelerator mainly contains an array of PEs known as reconfigurable cells (RCs) and stream buffer units (SBUs), as shown in Fig. 40. The SDT-CGRA architecture is divided into two sections: global memory and the computing array. The global memory section is dynamically configured and stores data streams, whereas the computing array section operates in a static configuration mode; it comprises several RC columns and one special RC column. The special RCs are used for operations like power (represented as PRC in Fig. 40) and piece-wise functions (represented as IRC in Fig. 40). A crossbar switch serves as a bridge to connect the RC array and the SBUs. Data can be transferred from off-chip memory to the SBUs using the external direct memory access interface, and static and dynamic interfaces are used for the static and dynamic configurations, respectively. The proposed SDT-CGRA is realized in Verilog HDL, Synopsys Design Compiler is used to synthesize the design, and the implementation uses SMIC 55 nm CMOS technology. Experimental results show that SDT-CGRA outperforms EMAX by three times in terms of operations per memory bandwidth.
FIGURE 40. SDT-CGRA architecture, adopted from [83].
In [110], the authors proposed an efficient mapping of CNNs onto a Tightly Coupled Processor Array (TCPA). TCPA belongs to the class of CGRAs, containing an array of tightly coupled VLIW Processing Elements (PEs) [104]. TCPA offers multiple levels of parallelism, for instance, task-level, loop-level, iteration-level, and instruction-level parallelism. TCPAs are suited for accelerating computationally expensive nested loop programs exhibiting a high degree of parallelism, such as CNNs. CNN layers are based on matrix multiplications, which can be written as 6-dimensional nested loops, making them suitable for this kind of acceleration. It was demonstrated that TCPAs use techniques such as loop permutation, loop unrolling, and layer-parallel processing to exploit the parallelism offered by the TCPA architecture. Layer fusion allows multiple CNN layers to be processed in an overlapped fashion [33], which was exploited by the TCPA to save the intermediate memory needed between layers. Loop permutation allows the computation of multiple convolution filters in an interspersed way, and the TCPA allows the parallel execution of multiple layers by different PEs. A CNN model for the MNIST benchmark on an array of size 4×4 was evaluated, and the performance of the layer-parallel approach was compared against layer-by-layer processing.
FIGURE 41. TCPA accelerator showing PE array of size 4 × 4 and a CNN that is mapped onto it for recognizing digits from MNIST database.
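The nested-loop structure referred to above is easiest to see written out. The sketch below (a plain Python illustration of ours, with the batch dimension omitted) spells out the six loops of a convolutional layer; it is exactly this loop nest that techniques such as loop permutation and unrolling reorder and partition across the PE array.

# The 6-dimensional loop nest of a convolutional layer (batch dimension omitted).
# Outputs Y[m][y][x], inputs X[c][y][x], weights W[m][c][i][j]. Illustrative
# reference code; accelerators permute, unroll, and tile these loops.
def conv_layer(X, W, out_h, out_w):
    C = len(X)                 # input feature maps (channels)
    M = len(W)                 # output feature maps
    K = len(W[0][0])           # kernel height/width (assumed square)
    Y = [[[0] * out_w for _ in range(out_h)] for _ in range(M)]
    for m in range(M):                     # loop 1: output feature maps
        for y in range(out_h):             # loop 2: output rows
            for x in range(out_w):         # loop 3: output columns
                for c in range(C):         # loop 4: input feature maps
                    for i in range(K):     # loop 5: kernel rows
                        for j in range(K): # loop 6: kernel columns
                            Y[m][y][x] += W[m][c][i][j] * X[c][y + i][x + j]
    return Y

# 1 input channel, 2 output channels, 2x2 kernels on a 3x3 input -> 2x2 outputs
X = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
W = [[[[1, 0], [0, 0]]], [[[0, 0], [0, 1]]]]
print(conv_layer(X, W, 2, 2))   # [[[1, 2], [4, 5]], [[5, 6], [8, 9]]]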
A CGRA-based accelerator called Neural Processing CGRA (NP-CGRA) is presented in [134] to accelerate lightweight CNNs. The authors proposed a set of extensions to the baseline CGRA [152] to improve the performance of CGRAs and to efficiently implement depth-wise convolution (DWC) and pointwise convolution (PWC). Three architectural extensions are presented: a crossbar-style memory bus, a dual-mode MAC unit, and an operand reuse network. The crossbar-style memory bus contains horizontal and vertical buses, and each bus is accessible to all the PEs connected to it. The dual-mode MAC unit works in either MAC mode or MUL/ALU mode.
There is a lot of room for CGRA research to develop and expand as a topic of study for future architectures; this is especially true when developing high-performance CGRAs tailored to specialized or general-purpose computing. Some key issues that require further research in this area include developing tools to program the architecture efficiently, memory management, scalability, adaptability, productivity, virtualization, etc.
VII. EMBEDDED AI ACCELERATORS
AI hardware requirements are more critical in the edge environment, typically represented by Internet of Things (IoT) devices (e.g., smart speakers, mobiles, sensors, and actuators) with limited computing resources, as opposed to cloud infrastructure with relatively sufficient computing capability. For the sake of real-time immediacy, latency, offline capability, security, and privacy, AI models are increasingly required to be implemented on the edge. In this context, Small Form Factor (SFF) devices such as microcontrollers, which dominate the market, are of particular interest, and adding AI capabilities to these devices opens up a wide range of applications.
FIGURE 43. Prototyping boards from Coral [27] having edge TPU.
FIGURE 49. Ultra96-V2 [1] and PYNQ-Z2 [18] development boards from Xilinx.
These boards can also be used together with the PYNQ framework for creating a network with the desired number of layers, activation functions, etc. Vivado [22], Vitis [21], and Python can be used to work with the PYNQ board.
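As a sketch of that flow (an illustrative snippet of ours that assumes the pynq Python package and a bitstream named base.bit are available on the board; it is not taken from the cited works), PYNQ loads a hardware overlay onto the programmable logic and then exposes its IP blocks to Python code:

# Minimal sketch of working with a PYNQ-enabled board from Python.
# Assumes the pynq package and an overlay bitstream ("base.bit") are present
# on the board; the file name is illustrative.
from pynq import Overlay

overlay = Overlay("base.bit")     # program the FPGA fabric with the overlay
print(overlay.ip_dict.keys())     # list the IP blocks the overlay exposes

Accelerator overlays such as the DPU [7] are loaded in the same way and then driven from Python or a Jupyter notebook.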
Xilinx's Kria KV260 [23], [122] is an AI starter kit targeted at vision AI applications in smart cities, smart factories, robotics, home automation, etc., see Fig. 50. The KV260 includes a Zynq MPSoC, and it supports the Python-based PYNQ framework. Trained models can be implemented in the DPU [7] and are loaded with PYNQ using hardware overlays. In [122], the authors demonstrated pre-trained models based on the MNIST dataset, ResNet based on the Caffe framework, and InceptionV1 based on TensorFlow. Furthermore, to exercise the features of the KV260, many models from the Vitis AI Model Zoo [24] repository were implemented, and traffic detection, lane detection, and segmentation algorithms were also implemented and tested in real time. Silicon Labs has recently introduced the BG24/MG24 [19] SoCs with built-in AI accelerators and a new software toolkit; these new devices, with optimized hardware and software, will help execute AI/ML applications on battery-powered edge devices. The MAX78000 [14] from Maxim Integrated is an AI microcontroller that runs neural networks at extremely low power; it has a hardware-based CNN accelerator, enabling battery-powered applications to execute AI inference. AlphaICs' Gluon AI co-processor [9] is optimized for vision AI applications and comes with an SDK for easy porting of neural networks.
FIGURE 50. Xilinx's Kria KV260 SOM, adopted from [122].
Deep neural networks (DNNs) are increasingly being used on IoT-enabled devices like the Raspberry Pi to improve efficiency, security, and privacy. However, the size and complexity of the machine-learning (ML) model that can be deployed in such systems are limited by the available computational and memory resources. The Raspberry Pi is a low-cost, small, and portable computer board with built-in software that allows users to create scripts or programs in Python [229]. There are two main limitations to utilizing a Raspberry Pi for deep learning: 1) the small amount of memory available and 2) the slow processing speed. These limitations severely hamper the implementation of more complex neural networks. There are two ways to deploy deep learning at IoT end devices: 1) deploy the feature vector and model architecture on a server machine and call it from the IoT device through a Web-service API, or 2) deploy the feature vector and model architecture on a resource-constrained platform like the Raspberry Pi, also called on-device computing. The first method has network latency issues, security risks, and high communication costs. The second method has difficulty implementing large DNN models due to the limited memory and computational resources of IoT-enabled devices like the Raspberry Pi. Furthermore, devices with limited resources, such as the Raspberry Pi, are only used for DNN inference. The trained DNN model can be transferred to the Raspberry Pi through network connectivity; however, network connectivity can introduce delays, data loss, and other security concerns, limiting DNN deployment on the Raspberry Pi [41]. Bhosale et al. [42] proposed a Deep Convolutional Neural Network (DCNN) for Covid-19 classification; in this work, the DCNN architecture is deployed on the cloud and uses radiology X-ray images for classification. On the other hand, the authors in [41] proposed a lightweight Deep Learning model (LDC-Net) for Covid-19 classification with lung disease. In this work, LDC-Net was trained on High-Performance Computing (HPC) infrastructure, and the trained LDC-Net and its weights were then deployed on an IoT-enabled Raspberry Pi with network connectivity for Covid-19 classification.
FIGURE 51. Raspberry Pi computer [230].
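For the on-device path, a typical Raspberry Pi deployment runs a pre-trained, converted model with a lightweight interpreter. The snippet below is an illustrative sketch of ours (assuming the tflite-runtime package is installed and a converted model file named model.tflite has been copied to the device; it is not the pipeline of the cited works):

# Minimal sketch of on-device inference on a Raspberry Pi with a converted
# TensorFlow Lite model. File name and input are illustrative.
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_info = interpreter.get_input_details()[0]
output_info = interpreter.get_output_details()[0]

# Feed one dummy input of the shape/dtype the model expects.
x = np.zeros(input_info["shape"], dtype=input_info["dtype"])
interpreter.set_tensor(input_info["index"], x)
interpreter.invoke()                       # inference runs locally, no network needed

scores = interpreter.get_tensor(output_info["index"])
print(scores.shape)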
An Arm processor is a general-purpose processor that belongs to the family of CPUs and uses the Reduced Instruction Set Computer (RISC) architecture. Because of their efficiency and flexibility, Arm processors are used in many electronic products, including smartphones, tablets, and wearables. Arm's new portfolio of hardware solutions is now also aimed at Machine Learning (ML) and Deep Neural Network (DNN) applications. In recent times, Arm-based processors targeting the acceleration of machine learning applications have been developed by various manufacturers, viz., Marvell (ThunderX2), Fujitsu (A64FX), Huawei (Kunpeng 920), and Ampere (eMAG). With the help of its recently released Neural Processing Units (NPUs), Arm brings machine learning to low-end edge devices.
The Arm ML processor uses the Neural Network (NN) software development kit provided by the company to interface the ML software and the corresponding hardware [214]. The Arm-based ML accelerator consists of up to 16 compute engines, each of which includes a programmable layer engine and a MAC convolution engine, see Fig. 52. Each compute engine has its own local memory to process the ML models. The flow is typical of DNN implementations: weights are applied to the incoming data, processing happens in the MAC convolution engine, and the results are finally processed by the Programmable Layer Engine (PLE). There are 128 multiply-accumulate (MAC) units in the MAC convolution engine; it receives the input data from the input feature map read block and the weights from the weight decoder, and performs the required MAC operations. The result of the convolution is processed by the PLE, which is a vectorized microcontroller, more akin to a RISC platform designed to wrap up the processing of a layer for a piece of a DNN model with several layers; the PLE is in charge of tasks like pooling and activation. The throughput of the proposed ML processor is 4.6 TOPS. The design is implemented using 7 nm chip technology, is scalable, and can achieve a throughput of 150 TOPS for high-end applications.
The Arm AI platform, also known as Project Trillium, is a heterogeneous compute platform that includes Arm Cortex CPUs, Ethos NPUs, Mali GPUs, and microNPUs to accelerate ML algorithms [142]. Arm supports various ML frameworks such as TensorFlow Lite, Caffe, and PyTorch, and accelerates ML applications using software libraries including Arm NN, the Arm Compute Library, and the Common Microcontroller Software Interface Standard-NN (CMSIS-NN), together with hardware products such as Arm Cortex CPUs, Ethos NPUs, Mali GPUs, microNPUs, FPGAs, and DSPs. Arm's new Cortex-A55/A75 and Mali-G72 combination targets machine learning on edge computing devices.
Arm has developed its Ethos series of ML processors for machine learning applications. The Ethos series is classified into the N-series and the U-series [25]. The Ethos N-series was introduced in October 2019, containing NPUs identical to the Cortex family, while the Ethos U-series was introduced in early 2020 and contains microNPUs. MicroNPUs are paired with a CPU, like the Cortex-M55, to process the ML algorithms. The Ethos-U55 achieves a throughput of 0.5 TOPS, contains 32 to 256 8-bit MAC units [143], and supports 8-bit and 16-bit integer data types. The Ethos-U65 achieves a throughput of 1 TOPS, containing 256 to 312 8-bit MAC units. The Ethos-N57 achieves a throughput of 2 TOPS, containing 1024 8-bit MAC units. The Ethos-N77 is a highly efficient ML inference processor that achieves a throughput of 5 TOPS and is best suited for mobile devices; Ethos-N77 ML processors can be used for facial or object recognition applications. The Ethos-N78 is a scalable and efficient ML inference processor that achieves a throughput of 1 to 10 TOPS [144]. Arm's Cortex-M55 and the Ethos-U55 can be used together as an AI accelerator in edge computing devices [89]; this combination achieves a 32× improvement in ML processing compared to the base Cortex-M55 core.
Furthermore, TinyML [20] advancements have made it possible to use ML models on the microcontroller hardware found in household appliances, including printers, TVs, smartwatches, and pacemakers, which can now carry out tasks that were previously only possible on computers and smartphones. The machine learning and embedded ultra-low-power systems communities have joined forces to create the TinyML foundation, and this joint effort has paved the way for innovative and captivating alternative uses of on-device machine learning. TinyML supports various frameworks, including TensorFlow Lite Micro (TFLM), TensorFlow-Native, the Embedded Learning Library (ELL), Graph Lowering (GLOW), etc. Google developed an open-source framework called CFU Playground [174] for TinyML acceleration on FPGAs. The CFU Playground toolchain combines open-source software (TensorFlow), RTL generators (LiteX, Migen, etc.), and FPGA tools for synthesis (yosys) and place-and-route (vpr), and it makes it possible to investigate custom architectures for the acceleration of TinyML on embedded ML systems. TinyML is used in many applications, including medical face mask detection [156], eating detection [166], Li-Ion battery parameter estimation [69], etc. The most in-demand research areas in the TinyML community include sound recognition, computer vision, and the development of low-power, accurate ML models. Even though many applications have demonstrated TinyML's promise, more research is needed to fully comprehend its advantages and drawbacks; some key issues that require further research include developing benchmarks, memory constraints, energy, processor capacity, cost reduction, etc.
VIII. COMPARISON BETWEEN VARIOUS HARDWARE ARCHITECTURES FOR DNN ACCELERATION
The performance of the various hardware accelerators for DNN acceleration depends on the target application. However, researchers have defined some standard metrics, namely area, power, and throughput, to measure the performance of hardware accelerators for the development and deployment of DNNs. Here, area is the portion of silicon required for the DNN acceleration, generally expressed in square millimeters or square micrometers; it depends on the size of the on-chip memory and the technology used during hardware synthesis. Power is the amount of power consumed by the specific hardware during DNN acceleration; power consumption mainly depends on the off-chip and on-chip memories. Throughput measures the productivity of the hardware accelerator. The comparison between the various hardware accelerator architectures for DNN acceleration is shown in Table 7. Due to a lack of data on their footprint, power consumption, and throughput, CGRA-based accelerators are not represented in Table 7.
As expected, temporal or general-purpose architectures such as CPUs and GPUs have greater power consumption and area than special-purpose architectures such as FPGAs and ASICs because they are not tailored to a particular application. The essential hardware metrics like power, area, technology, and throughput are reported for each hardware architecture. In Table 8, we compare the embedded development boards discussed above with respect to the general-purpose CPUs/GPUs and specialized co-processors they contain, performance, power, SDKs, and supported ML frameworks.
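The power-efficiency figures reported throughout the preceding tables follow directly from the last two metrics; as a worked illustration with hypothetical numbers (not drawn from any accelerator in Table 7):

\[ \text{Power efficiency} \;=\; \frac{\text{Throughput}}{\text{Power}}, \qquad \text{e.g., } \frac{400\ \text{GOPS}}{2\ \text{W}} = 200\ \text{GOPS/W}. \]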
IX. FUTURE DIRECTIONS
In the future, hardware AI acceleration is set to become ubiquitous. In recent processors, AI accelerator hardware is becoming a standard feature, indicating that AI acceleration is now an essential general-purpose task. This paper reviewed several FPGA-based, ASIC-based, GPU-based, CGRA-based, and edge AI hardware accelerators. However, looking at the industry trends and startups in this space indicates that we are still in the early stage of the AI revolution. Many more energy-efficient architectures will emerge in the future. In particular, architectures with transprecision or approximate computing, high-bandwidth memories, and emerging non-volatile memories such as MRAM and ReRAM may appear in the market. Evolving architectures involving the Tsetlin machine are another promising future research direction.
Emerging technologies such as nanomaterials, optical computing, and DNA computing may also accelerate DNNs in the near future. Carbon nanomaterials, such as carbon nanotubes (CNTs) and graphene, are particularly intriguing due to their rapid electron transport [58]. CNTs and graphene have desirable switching and optical properties, making them well-suited to electronic and optical architectures [189]. New chip architectures become possible with the help of CNTs and other nanomaterials; researchers at MIT and Stanford have developed a new 3-D architecture based on a network of millions of carbon nanotubes [191]. Computations in optical computing technology can happen at the speed of light, much faster than in conventional electron-driven chips. MIT is driving research in advanced optical materials, switches, lasers, and nano-optics [106] to advance optical computing, and we may expect a greater deployment of optical chips in the future. DNA computing is a type of parallel computing in which many different DNA molecules are used to test many possibilities simultaneously [138]. The major advantage of DNA is its potential for memory storage: a single gram of DNA can store 215 petabytes (215 million gigabytes) [6]. Although DNA information storage has enormous application potential, many issues, such as the high cost of writing and reading information and the lack of techniques to erase and rewrite the information stored in DNA, must be addressed before its widespread use [75].
In FPGA-based architectures, the following future directions seem promising. The combination of FPGAs and cloud computing opens new avenues for developing deep learning applications; the FPGA cloud service is still in its early stages, and many imperfections must be investigated, such as the virtualization of FPGA hardware resources, task migration, etc. Most current research focuses on lowering the bandwidth requirements for off-chip memory access. The performance of multiple FPGA chips combined is favorable; however, dealing with processing scheduling and chip allocation remains a significant challenge. Future research could also focus on the development of in-memory-computing processors. Moreover, further improvements are required in the computation of the activation functions used in DNNs: because most studies focus on loop optimization, only a few researchers are currently working on activation function optimization. There will also be frameworks to integrate existing or new architectures, which will help quickly deploy DNNs on the target hardware.
[2] Accelerate Fast Math With Intel Oneapi Math Kernel Library. [29] (2022). Edge TPU Compiler. [Online]. Available: https://coral.
Accessed: Jun. 5, 2021. [Online]. Available: https://software.intel.com/ ai/docs/edgetpu/compiler/#system-requirements
content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.
[30] High-Level Synthesis & Verification. Accessed: Jul. 2022. [Online].
3595x9
Available: https://eda.sw.siemens.com/en-US/ic/ic-design/high-level-
[3] Advanced AI Embedded Systems: NVIDIA Jetson: The AI Platform for
synthesis-and-verification-platform/
Autonomous Machines. Accessed: Aug. 2, 2022. [Online]. Available: [31] NVIDIA Tesla T4 Specs. Accessed: Jun. 17, 2022. [Online]. Available:
https://www.nvidia.com/en-in/autonomous-machines/embedded- https://www.nvidia.com/en-us/data-center/tesla-t4/
systems/ [32] Vitis AI. Accessed: Jun. 17, 2022. [Online]. Available: https://www.
[4] BeagleBone AI: Fast Track to Embedded Artificial Intelligence. [Online]. xilinx.com/products/design-tools/vitis/vitis-ai.html
Available: https://beagleboard.org/AI Accessed: Jan. 2, 2022. [33] M. Alwani, H. Chen, M. Ferdman, and P. Milder, ‘‘Fused-layer CNN
[5] BitMain Neural Network SDK: Introduction. Accessed: Jan. 2, 2022. accelerators,’’ in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitec-
[Online]. Available: https://sophon-edge.gitbook.io/project/ ture (MICRO), Oct. 2016, pp. 1–12.
[6] R. F. Service, ‘‘DNA could store all of the world’s data in one [34] D. Amodei et al., ‘‘Deep speech 2: End-to-end speech recognition in
room,’’ Science, Mar. 2017. [Online]. Available: https://www.science. English and Mandarin,’’ in Proc. 33rd Int. Conf. Mach. Learn., vol. 48,
org/content/article/dna-could-store-all-worlds-data-one-room M. F. Balcan and K. Q. Weinberger, Eds. New York, NY, USA: arXiv,
[7] DPU for Convolutional Neural Network. Accessed: May 1, 2022. Jun. 2016, pp. 173–182.
[Online]. Available: https://www.xilinx.com/products/intellectual- [35] A. Argal, S. Gupta, A. Modi, P. Pandey, S. Shim, and C. Choo, ‘‘Intelligent
property/dpu.html travel chatbot for predictive recommendation in echo platform,’’ in Proc.
[8] Edge TPU Developer Board. [Online]. Available: https://www.sophon. IEEE 8th Annu. Comput. Commun. Workshop Conf. (CCWC), Jan. 2018,
ai/product/introduce/edb.html Accessed: Jan. 2, 2022. pp. 176–183.
[9] Gluon AI Co-Processor. Accessed: Jan. 10, 2022. [Online]. Available: [36] D. Bahdanau, K. Cho, and Y. Bengio, ‘‘Neural machine translation by
https://alphaics.ai/products/gluon-ai-accelerator/ jointly learning to align and translate,’’ 2015, arXiv:1409.0473.
[10] Intel Movidius Myriad X Vision Processing Unit. Accessed: Apr. 2, 2022. [37] Y. Bengio, ‘‘Learning deep architectures for AI,’’ Found. Trends Mach.
[Online]. Available: https://www.intel.com/content/www/us/en/products/ Learn., vol. 2, no. 1, pp. 1–127, 2009.
details/processors/movidius-vpu/movidius-myriad-x.html [38] K. Benkrid and S. Belkacemi, ‘‘Design and implementation of a 2D
[11] Jetson Nano Developer Kit. Accessed: Aug. 2, 2022. [Online]. Available: convolution core for video applications on FPGAs,’’ in Proc. 3rd Int.
https://www.nvidia.com/en-in/autonomous-machines/embedded- Workshop Digit. Comput. Video (DCV), 2002, pp. 85–92.
[39] M. Bergeron. Real-Time Face Recognition on Ultra96-V2.
systems/jetson-nano-developer-kit/
Accessed: Feb. 1, 2022. [Online]. Available: https://www.hackster.io/
[12] Kendryte K210. Accessed: Sep. 2, 2022. [Online]. Available:
AlbertaBeef/real-time-face-recognition-on-ultra96-v2-94de9b
https://canaan.io/product/kendryteai
[40] Y. H. Bhosale and K. S. Patnaik, ‘‘Application of deep learning techniques
[13] Maixduino. Accessed: Sep. 2, 2022. [Online]. Available: https:// in diagnosis of COVID-19 (Coronavirus): A systematic review,’’ Neural
www.seeedstudio.com/Sipeed-Maixduino-Kit-for-RISC-V-AI-IoT-p- Process. Lett., pp. 1–53, Sep. 2022.
4047.html [41] Y. H. Bhosale and K. Sridhar Patnaik, ‘‘IoT deployable lightweight
[14] MAX78000—Artificial Intelligence Microcontroller with Ultra- deep learning application for COVID-19 detection with lung diseases
Low-Power Convolutional Neural Network Accelerator. Accessed: using RaspberryPi,’’ in Proc. Int. Conf. IoT Blockchain Technol. (ICIBT),
Jan. 10, 2022. [Online]. Available: https://www.maximintegrated.com/en/ May 2022, pp. 1–6.
products/microcontrollers/MAX78000.html [42] Y. H. Bhosale, S. Zanwar, Z. Ahmed, M. Nakrani, D. Bhuyar, and
[15] Myriad 2 MA2x5x Vision Processor: Transforming Devices Through U. Shinde, ‘‘Deep convolutional neural network based COVID-19
Ultra Low-Power Machine Vision—Google Search. Accessed: classification from radiology X-ray images for IoT enabled devices,’’ in
Apr. 2, 2022. [Online]. Available: https://www.intel.com/content/www/ Proc. 8th Int. Conf. Adv. Comput. Commun. Syst. (ICACCS), Mar. 2022,
us/en/products/details/processors/movidius-vpu/movidius-myriad- pp. 1398–1402.
x.html,www.movidius.com [43] L. Bishnoi and S. N. Singh, ‘‘Artificial intelligence techniques used in
[16] (2020). Nvidia A100 Tensor Core GPU Architecture. Accessed: medical sciences: A review,’’ in Proc. 8th Int. Conf. Cloud Comput., Data
Jun. 13, 2021. [Online]. Available: https://www.nvidia.com/content/dam/ Sci. Eng. (Confluence), Jan. 2018, pp. 1–8.
en-zz/solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf [44] A. G. Blaiech, K. Ben Khalifa, C. Valderrama, M. A. Fernandes, and
[17] (2017). Nvidia Tesla V100 GPU Architecture. Accessed: M. H. Bedoui, ‘‘A survey and taxonomy of FPGA-based deep learning
Jun. 13, 2021. [Online]. Available: https://images.nvidia.com/content/ accelerators,’’ J. Syst. Archit., vol. 98, pp. 331–345, Sep. 2019.
technologies/volta/pdf/437317-volta-v100-ds-nv-us-web.pdf [45] S. Bouguezzi, H. B. Fredj, T. Belabed, C. Valderrama, H. Faiedh, and
[18] PYNQ-Z2. Accessed: Jan. 7, 2022. [Online]. Available: http://www. C. Souani, ‘‘An efficient FPGA-based convolutional neural network for
pynq.io/board.html classification: Ad-MobileNet,’’ Electronics, vol. 10, no. 18, p. 2272,
[19] Silicon Labs BG24 and MG24 SoCs. Accessed: Jan. 10, 2022. [Online]. Sep. 2021.
Available: https://www.silabs.com/wireless/zigbee/efr32mg24-series-2- [46] A. Boutros, S. Yazdanshenas, and V. Betz, ‘‘You cannot improve what you
socs do not measure: FPGA vs. ASIC efficiency gaps for convolutional neural
[20] TinyML Foundation. Accessed: Sep. 2, 2022. [Online]. Available: network inference,’’ ACM Trans. Reconfigurable Technol. Syst., vol. 11,
https://www.tinyml.org/ no. 3, pp. 1–23, Sep. 2018.
[21] Vitis Unified Software Platform. Accessed: Oct. 1, 2022. [Online]. [47] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf,
Available: https://www.xilinx.com/products/design-tools/vitis/vitis- ‘‘A programmable parallel accelerator for learning and classification,’’ in
platform.html Proc. 19th Int. Conf. Parallel Archit. Compilation Techn. (PACT), 2010,
pp. 273–283.
[22] Vivado. Accessed: Jan. 15, 2022. [Online]. Available: https://www.
[48] M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera, and
xilinx.com/products/design-tools/vivado.html
M. Martina, ‘‘An updated survey of efficient hardware architectures
[23] Xilinx Kria—Adaptive System-on-Module. Accessed: Oct. 1, 2022.
for accelerating deep convolutional neural networks,’’ Future Internet,
[Online]. Available: https://www.xilinx.com/products/som/kria.html
vol. 12, no. 7, p. 113, Jul. 2020.
[24] Xilinx Vitis AI Model Zoo. Accessed: Oct. 1, 2022. [Online]. Available: [49] F. Cardells-Tormo, P.-L. Molinet, J. Sempere-Agullo, L. Baldez, and
https://github.com/Xilinx/AI-Model-Zoo M. Bautista-Palacios, ‘‘Area-efficient 2D shift-variant convolvers for
[25] (Jul. 2021). Ethos—ARM—WikiChip. Accessed: Aug. 7, 2021. [Online]. FPGA-based digital image processing,’’ in Proc. Int. Conf. Field
Available: https://en.wikichip.org/wiki/arm_holdings/ethos Program. Log. Appl., 2005, pp. 578–581.
[26] (Aug. 2021). Jetson Xavier NX. Accessed: Jul. 15, 2022. [Online]. Avail- [50] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini,
able: https://www.nvidia.com/en-us/autonomous-machines/embedded- ‘‘Origami: A convolutional network accelerator,’’ in Proc. 25th, Ed.,
systems/jetson-xavier-nx/ Great Lakes Symp. (VLSI), New York, NY, USA, May 2015, pp. 199–204.
[27] (2022). Coral Products. [Online]. Available: https://coral.ai/products/ [51] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, ‘‘A dynam-
[28] (Jul. 2022). Deploy AI-Powered Autonomous Machines at Scale. ically configurable coprocessor for convolutional neural networks,’’ in
Accessed: Jul. 15, 2022. [Online]. Available: https://www.nvidia.com/en- Proc. 37th Annu. Int. Symp. Comput. Archit. (ISCA), New York, NY, USA,
gb/autonomous-machines/embedded-systems/jetson-agx-xavier/ 2010, pp. 247–257.
[52] J.-W. Chang and S.-J. Kang, ‘‘Optimizing FPGA-based convolutional [74] J. Domke, E. Vatai, A. Drozd, P. ChenT, Y. Oyama, L. Zhang, S. Salaria,
neural networks accelerator for image super-resolution,’’ in Proc. D. Mukunoki, A. Podobas, M. WahibT, and S. Matsuoka, ‘‘Matrix
23rd Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2018, engines for high performance computing: A paragon of performance or
pp. 343–348. grasping at straws?’’ in Proc. IEEE Int. Parallel Distrib. Process. Symp.
[53] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, (IPDPS). Los Alamitos, CA, USA: IEEE Computer Society, May 2021,
‘‘Diannao: A small-footprint high-throughput accelerator for ubiquitous pp. 1056–1065.
machine-learning,’’ in Proc. 19th Int. Conf. Architectural Support [75] Y. Dong, F. Sun, Z. Ping, Q. Ouyang, and L. Qian, ‘‘DNA storage:
Program. Lang. Operating Syst., vol. 14, New York, NY, USA, 2014, Research landscape and future prospects,’’ Nat. Sci. Rev., vol. 7, no. 6,
pp. 269–284. pp. 1092–1107, Jun. 2020.
[76] L. Du and Y. Du, ‘‘Hardware accelerator design for machine learning,’’
[54] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu,
in Machine Learning, H. Farhadi, Ed. Rijeka, Croatia: IntechOpen, 2018,
N. Sun, and O. Temam, ‘‘DaDianNao: A machine-learning supercom-
ch. 1.
puter,’’ in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture, [77] L. Du, Y. Du, Y. Li, J. Su, Y.-C. Kuan, C.-C. Liu, and
Dec. 2014, pp. 609–622. M.-C. F. Chang, ‘‘A reconfigurable streaming deep convolutional
[55] Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang, ‘‘A survey of accelerator neural network accelerator for Internet of Things,’’ IEEE Trans. Circuits
architectures for deep neural networks,’’ Engineering, vol. 6, no. 3, Syst. I, Reg. Papers, vol. 65, no. 1, pp. 198–208, Jan. 2018.
pp. 264–274, Mar. 2020. [78] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and
[56] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, ‘‘Eyeriss: An O. Temam, ‘‘Shidiannao: Shifting vision processing closer to the sensor,’’
energy-efficient reconfigurable accelerator for deep convolutional neural SIGARCH Comput. Archit. News, vol. 43, no. 3S, pp. 92–104, Jun. 2015.
networks,’’ IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, [79] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
Jan. 2017. and O. Temam, ‘‘ShiDianNao: Shifting vision processing closer to the
[57] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, ‘‘Eyeriss v2: A flexible sensor,’’ in Proc. 42nd Annu. Int. Symp. Comput. Archit., New York, NY,
accelerator for emerging deep neural networks on mobile devices,’’ USA, Jun. 2015, pp. 92–104.
IEEE J. Emerging Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, [80] C. Dubout and F. Fleuret, ‘‘Exact acceleration of linear object detectors,’’
Jun. 2019. in Proc. 12th Eur. Conf. Comput. Vis. Berlin, Germany: Springer-Verlag,
2012, pp. 301–311.
[58] Z. Chen, H.-S. Philip Wong, S. Mitra, A. Bol, L. Peng, G. Hills, and
[81] A. J. A. El-Maksoud, M. Ebbed, A. H. Khalil, and H. Mostafa, ‘‘Power
N. Thissen, ‘‘Carbon nanotubes for high-performance logic,’’ MRS Bull.,
efficient design of high-performance convolutional neural networks
vol. 39, no. 8, pp. 719–726, Aug. 2014.
hardware accelerator on FPGA: A case study with GoogLeNet,’’ IEEE
[59] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, Access, vol. 9, pp. 151897–151911, 2021.
and E. Shelhamer, ‘‘CuDNN: Efficient primitives for deep learning,’’ [82] H. Fan, S. Liu, M. Ferianc, H.-C. Ng, Z. Que, S. Liu, X. Niu, and
2014, arXiv:1410.0759. W. Luk, ‘‘A real-time object detection accelerator with compressed
[60] D. Chicco, P. Sadowski, and P. Baldi, ‘‘Deep autoencoder neural networks SSDLite on FPGA,’’ in Proc. Int. Conf. Field-Programmable Technol.
for gene ontology annotation predictions,’’ in Proc. 5th ACM Conf. (FPT), Dec. 2018, pp. 14–21.
Bioinf., Comput. Biol., Health Informat., New York, NY, USA, Sep. 2014, [83] X. Fan, D. Wu, W. Cao, W. Luk, and L. Wang, ‘‘Stream processing dual-
pp. 533–540. track CGRA for object inference,’’ IEEE Trans. Very Large Scale Integr.
[61] P.-S. Chiu, J.-W. Chang, M.-C. Lee, C.-H. Chen, and D.-S. Lee, (VLSI) Syst., vol. 26, no. 6, pp. 1098–1111, Jun. 2018.
‘‘Enabling intelligent environment by the design of emotionally aware [84] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and
virtual assistant: A case of smart campus,’’ IEEE Access, vol. 8, Y. Lecun, ‘‘NeuFlow: A runtime-reconfigurable dataflow processor for
pp. 62032–62041, 2020. vision,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
Workshops (CVPRW), Jun. 2011, pp. 109–116.
[62] Y.-k. Choi, K. You, J. Choi, and W. Sung, ‘‘A real-time FPGA-based 20
[85] C. Farabet, C. Poulet, J. Han, and Y. LeCun, ‘‘CNP: An FPGA-based
000-word speech recognizer with optimized DRAM access,’’ IEEE Trans.
processor for convolutional networks,’’ in Proc. 19th Int. Conf. Field
Circuits Syst. I, Reg. Papers, vol. 57, no. 8, pp. 2119–2131, Aug. 2010.
Program. Log. Appl., 2009, pp. 32–37.
[63] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, [86] X. Feng, H. Zhang, Y. Ren, P. Shang, Y. Zhu, Y. Liang, R. Guan, and
‘‘NVIDIA A 100 tensor core GPU: Performance and innovation,’’ IEEE D. Xu, ‘‘The deep learning—Based recommender system ‘pubmende’ for
Micro, vol. 41, no. 2, pp. 29–35, Mar. 2021. choosing a biomedical publication venue: Development and validation
[64] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, ‘‘Fast and accurate study,’’ J. Med. Internet Res., vol. 21, no. 5, May 2019, Art. no. e12957.
deep network learning by exponential linear units (ELUs),’’ 2015, [87] K. Fukushima, ‘‘Neocognitron: A hierarchical neural network capable
arXiv:1511.07289. of visual pattern recognition,’’ Neural Netw., vol. 1, no. 2, pp. 119–130,
[65] J. Cloutier, E. Cosatto, S. Pigeon, F. R. Boyer, and P. Y. Simard, ‘‘VIP: 1988.
An FPGA-based processor for image processing and neural networks,’’ [88] A. Gainaru, E. Slusanschi, and S. Trausan-Matu, ‘‘Mapping data mining
in Proc. 5th Int. Conf. Microelectron. Neural Netw., 1996, pp. 330–336. algorithms on a GPU architecture: A study,’’ in Proc. Found. Intell.
Syst. 19th Int. Symp., (ISMIS), in Lecture Notes in Computer Science,
[66] R. Collobert, K. Kavukcuoglu, and C. Farabet, ‘‘Torch7: A MATLAB-
vol. 6804. M. Kryszkiewicz, H. Rybinski, A. Skowron, and Z. W. Ras,
like environment for machine learning,’’ in Proc. NIPS, 2011, pp. 1–6.
Eds. Warsaw, Poland: Springer, Jun. 2011, pp. 102–112.
[67] J. Cong and B. Xiao, ‘‘Minimizing computation in convolutional neural [89] C. Gartenberg, ‘‘ARM’s new edge AI chips promise IoT devices
networks,’’ in Proc. ICANN, 2014, pp. 281–290. that won’t need the cloud,’’ Verge, Washington, DC, USA,
[68] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, Tech. Rep., Feb. 2020. [Online]. Available: https://www.theverge.com/
‘‘Binarized neural networks: Training deep neural networks with weights 2020/2/10/21130800/arm-new-edge-ai-chips-processing-npu-cortex-
and activations constrained to +1 or −1,’’ 2016, arXiv:1602.02830. m55-u55-iot
[69] G. Crocioni, D. Pau, J.-M. Delorme, and G. Gruosso, ‘‘Li-ion batteries [90] A. Ghaffari and Y. Savaria, ‘‘CNN2Gate: An implementation of
parameter estimation with tiny neural networks embedded on intelligent convolutional neural networks inference on FPGAs with automated
IoT microcontrollers,’’ IEEE Access, vol. 8, pp. 122135–122146, 2020. design space exploration,’’ Electronics, vol. 9, no. 12, p. 2200, Dec. 2020.
[70] D. Danopoulos, C. Kachris, and D. Soudris, ‘‘Acceleration of image [91] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, ‘‘A 240 G-
classification with Caffe framework using FPGA,’’ in Proc. 7th Int. Conf. ops/s mobile coprocessor for deep neural networks,’’ in Proc. IEEE Conf.
Modern Circuits Syst. Technol. (MOCAST), May 2018, pp. 1–4. Comput. Vis. Pattern Recognit. Workshops, Jun. 2014, pp. 696–701.
[92] K. M. V. Gowda, S. Madhavan, S. Rinaldi, P. B. Divakarachari, and
[71] L. Deng and D. Yu, ‘‘Deep learning: Methods and applications,’’ Found. A. Atmakur, ‘‘FPGA-based reconfigurable convolutional neural network
Trends Signal Process., vol. 7, nos. 3–4, pp. 197–387, Jun. 2014. accelerator using sparse and convolutional optimization,’’ Electronics,
[72] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. De Freitas, ‘‘Predicting vol. 11, no. 10, p. 1653, May 2022.
parameters in deep learning,’’ in Proc. 26th Int. Conf. Neural Inf. [93] H. Graf, S. Cadambi, V. Jakkula, M. Sankaradass, E. Cosatto,
Process. Syst., vol. 2. Red Hook, NY, USA: Curran Associates, 2013, S. Chakradhar, and I. Dourdanovic, ‘‘A massively parallel digital learning
pp. 2148–2156. processor,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 21, D. Koller,
[73] A. Deshpande, A Beginner’s Guide To Understanding Convolutional D. Schuurmans, Y. Bengio, and L. Bottou, Eds. Red Hook, NY, USA:
Neural Networks. Los Angeles, CA, USA: University of California, 2018. Curran Associates, 2009, pp. 1–8.
[94] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, ‘‘A survey of deep [115] Texas Instruments. (2015). Am5729 Sitara Processor. [Online]. Avail-
learning techniques for autonomous driving,’’ 2019, arXiv:1910.07738. able: https://www.ti.com/product/AM5729
[95] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, [116] H. Irmak, N. Alachiotis, and D. Ziener, ‘‘An energy-efficient FPGA-based
and J. Cong, ‘‘FP-DNN: An automated framework for mapping deep convolutional neural network implementation,’’ in Proc. 29th Signal
neural networks onto FPGAs with RTL-HLS hybrid templates,’’ in Proc. Process. Commun. Appl. Conf. (SIU), Jun. 2021, pp. 1–4.
IEEE 25th Annu. Int. Symp. Field-Program. Custom Comput. Mach. [117] H. Irmak, F. Corradi, P. Detterer, N. Alachiotis, and D. Ziener,
(FCCM), Apr. 2017, pp. 152–159. ‘‘A dynamic reconfigurable architecture for hybrid spiking and convo-
[96] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and lutional FPGA-based neural network designs,’’ J. Low Power Electron.
H. Yang, ‘‘Angel-Eye: A complete design flow for mapping CNN onto Appl., vol. 11, no. 3, p. 32, Aug. 2021.
embedded FPGA,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits [118] S. M. A. H. Jafri, T. N. Gia, S. Dytckov, M. Daneshtalab, A. Hemani,
Syst., vol. 37, no. 1, pp. 35–47, Jan. 2018. J. Plosila, and H. Tenhunen, ‘‘NeuroCGRA: A CGRA with support for
[97] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, ‘‘[DL] A survey neural networks,’’ in Proc. Int. Conf. High Perform. Comput. Simul.
of FPGA-based neural network inference accelerators,’’ ACM Trans. (HPCS), Jul. 2014, pp. 506–511.
Reconfigurable Technol. Syst., vol. 12, no. 1, pp. 1–26, 2019. [119] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick,
[98] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, ‘‘Deep S. Guadarrama, and T. Darrell, ‘‘Caffe: Convolutional architecture for fast
learning with limited numerical precision,’’ in Proc. 32nd Int. Conf. Mach. feature embedding,’’ 2014, arXiv:1408.5093.
Learn. (ICML), vol. 37, 2015, pp. 1737–1746.
[120] N. P. Jouppi et al., ‘‘In-datacenter performance analysis of a tensor
[99] F. G. Gustavson, ‘‘Two fast algorithms for sparse matrices: Multiplication
processing unit,’’ in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017,
and permuted transposition,’’ ACM Trans. Math. Softw., vol. 4, no. 3,
pp. 1–12.
pp. 250–269, Sep. 1978.
[121] D. Justus, J. Brennan, S. Bonner, and A. S. McGough, ‘‘Predicting the
[100] A. Guzhva, S. Dolenko, and I. Persiantsev, ‘‘Multifold acceleration
computational cost of deep learning models,’’ in Proc. IEEE Int. Conf.
of neural network computations using gpu,’’ in Proc. 19th Int. Conf.
Big Data (Big Data), Dec. 2018, pp. 3873–3882.
Artif. Neural Networks, I. Berlin, Germany: Springer-Verlag, 2009,
pp. 373–380. [122] S. Kalapothas, G. Flamis, and P. Kitsos, ‘‘Efficient edge-AI application
[101] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, deployment for FPGAs,’’ Information, vol. 13, no. 6, p. 279, May 2022.
U. Müller, and Y. LeCun, ‘‘Learning long-range vision for autonomous [123] A. Karpathy, ‘‘Convolutional neural networks for visual recognition,’’
off-road driving,’’ J. Field Robot., vol. 26, no. 2, pp. 120–144, Feb. 2009. GitHub, Tech. Rep., 2018. [Online]. Available: https://cs231n.github.
[102] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, io/convolutional-networks/
Y. Wang, H. Yang, and W. J. Dally, ‘‘ESE: Efficient speech recognition [124] M. Kavitha, R. Srinivasan, and R. Bhuvanya, Fake News Detection Using
engine with sparse LSTM on FPGA,’’ in Proc. ACM/SIGDA Int. Symp. Machine Learning Algorithms. Hoboken, NJ, USA: Wiley, 2022, ch. 10,
Field-Program. Gate Arrays, Feb. 2017, pp. 75–84. pp. 181–207.
[103] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, [125] H. Khan, A. Khan, Z. Khan, L. B. Huang, K. Wang, and L. He, ‘‘NPE:
‘‘EIE: Efficient inference engine on compressed deep neural network,’’ An FPGA-based overlay processor for natural language processing,’’ in
in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2021,
Jun. 2016, pp. 243–254. pp. 1–11.
[104] F. Hannig, V. Lari, S. Boppu, A. Tanase, and O. Reiche, ‘‘Invasive tightly- [126] J.-Y. Kim, ‘‘FPGA based neural network accelerators,’’ in Hardware
coupled processor arrays: A domain-specific architecture/compiler co- Accelerator Systems for Artificial Intelligence and Machine Learning
design approach,’’ ACM Trans. Embedded Comput. Syst., vol. 13, no. 4S, (Advances in Computers), vol. 122, S. Kim and G. C. Deka, Eds.
pp. 1–29, Jul. 2014.
[105] C. Hao, A. Sarwari, Z. Jin, H. Abu-Haimed, D. Sew, Y. Li, X. Liu, B. Wu, D. Fu, J. Gu, and D. Chen, "A hybrid GPU + FPGA system design for autonomous driving cars," in Proc. IEEE Int. Workshop Signal Process. Syst. (SiPS), Oct. 2019, pp. 121–126.
[106] L. Hardesty, "Researchers build an all-optical transistor," Massachusetts Inst. Technol., Cambridge, MA, USA, Tech. Rep., 2013. [Online]. Available: https://news.mit.edu/2013/computing-with-light-0704
[107] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.
[108] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[109] D. Hefenbrock, J. Oberg, N. T. N. Thanh, R. Kastner, and S. B. Baden, "Accelerating Viola–Jones face detection to FPGA-level using GPUs," in Proc. 18th IEEE Annu. Int. Symp. Field-Program. Custom Comput. Mach., May 2010, pp. 11–18.
[110] C. Heidorn, M. Witterauf, F. Hannig, and J. Teich, "Efficient mapping of CNNs onto tightly coupled processor arrays," J. Comput., vol. 14, no. 8, pp. 541–556, 2019.
[111] A. Howard and S. Gupta. (2020). Introducing the Next Generation of On-Device Vision Models: MobileNetV3 and MobileNetEdgeTPU. [Online]. Available: https://ai.googleblog.com/2019/11/introducing-next-generation-on-device.html
[112] H. Hu, J. Li, C. Wu, X. Li, and Y. Chen, "Design and implementation of intelligent speech recognition system based on FPGA," J. Phys., Conf., vol. 2171, no. 1, Jan. 2022, Art. no. 012010.
[113] A. S. Hussein, A. Anwar, Y. Fahmy, H. Mostafa, K. N. Salama, and M. Kafafy, "Implementation of a DPU-based intelligent thermal imaging hardware accelerator on FPGA," Electronics, vol. 11, no. 1, p. 105, Dec. 2021.
[114] D. Im, D. Han, S. Choi, S. Kang, and H.-J. Yoo, "DT-CNN: Dilated and transposed convolution neural network accelerator for real-time image segmentation on mobile devices," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2019, pp. 1–5.
Amsterdam, The Netherlands: Elsevier, 2021, pp. 135–165.
[127] Y. Kim, J. Lee, J.-S. Kim, H. Jei, and H. Roh, "Efficient multi-GPU memory management for deep learning acceleration," in Proc. IEEE 3rd Int. Workshops Found. Appl. Self Syst. (FASW), Sep. 2018, pp. 37–43.
[128] J. P. Klock, J. Correa, M. Bessa, J. Arias-Garcia, F. Barboza, and C. Meinertz, "A new automated energy meter fraud detection system based on artificial intelligence," in Proc. 11th Brazilian Symp. Comput. Syst. Eng. (SBESC), Nov. 2021, pp. 1–8.
[129] A. Kojima and Y. Nose, "Development of an autonomous driving robot car using FPGA," in Proc. Int. Conf. Field-Programmable Technol. (FPT), Dec. 2018, pp. 411–414.
[130] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[131] H. Kwon, A. Samajdar, and T. Krishna, "MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects," ACM Architectural Support Program. Lang. Operating Syst., vol. 53, pp. 461–475, Mar. 2018.
[132] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4013–4021.
[133] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2018, pp. 218–220.
[134] J. Lee and J. Lee, "NP-CGRA: Extending CGRAs for efficient processing of light-weight deep neural networks," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Feb. 2021, pp. 1408–1413.
[135] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, "LNPU: A 25.3 TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2019, pp. 142–144.
[136] J. Lee and H.-J. Yoo, "An overview of energy-efficient hardware accelerators for on-device deep-neural-network training," IEEE Open J. Solid-State Circuits Soc., vol. 1, pp. 115–128, 2021.
[137] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung, "FPGA-based low-power speech recognition with recurrent neural networks," in Proc. IEEE Int. Workshop Signal Process. Syst. (SiPS), Oct. 2016, pp. 230–235.
[138] D. I. Lewin, "DNA computing," Computing Sci. Eng., vol. 4, no. 3, pp. 5–8, May 2002.
[139] B. Li, E. Zhou, B. Huang, J. Duan, Y. Wang, N. Xu, J. Zhang, and H. Yang, "Large scale recurrent neural network on GPU," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2014, pp. 4062–4069.
[140] X. Lian, Z. Liu, Z. Song, J. Dai, W. Zhou, and X. Ji, "High-performance FPGA-based CNN accelerator with block-floating-point arithmetic," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, pp. 1874–1885, Aug. 2019.
[141] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A polyvalent machine learning accelerator," in Proc. 20th Int. Conf. Architectural Support Program. Lang. Operating Syst., New York, NY, USA, 2015, pp. 369–381.
[142] "Learn more about the Linaro machine learning initiative," Arm The Architecture for the Digital World, Linaro, Cambridge, U.K., Tech. Rep., Jan. 2019. [Online]. Available: https://www.linaro.org/news/linaro-announces-launch-of-machine-intelligence-initiative/
[143] (Aug. 2021). Ethos-U55 Arm Developer. Accessed: Aug. 7, 2021. [Online]. Available: https://developer.arm.com/Processors/Ethos-U55
[144] (Aug. 2021). High-Performing AI Solutions to Transform our Digital World. Accessed: Aug. 7, 2021. [Online]. Available: https://www.google.com/search?client=firefox-b-d&q=High-Performing+AI+Solutions+to+Transform+our+Digital+World
[145] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2017, pp. 553–564.
[146] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, and Y. Chen, "DaDianNao: A neural network supercomputer," IEEE Trans. Comput., vol. 66, no. 1, pp. 73–88, Jan. 2017.
[147] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. EMNLP. Lisbon, Portugal: Association for Computational Linguistics, Aug. 2015, pp. 1412–1421.
[148] P. Lv, W. Liu, and J. Li, "A FPGA-based accelerator implementation for YOLOv2 object detection using Winograd algorithm," in Proc. 5th Int. Conf. Mech., Control Comput. Eng. (ICMCCE), Dec. 2020, pp. 1894–1898.
[149] A. L. Maas, "Rectifier nonlinearities improve neural network acoustic models," Stanford Univ., Stanford, CA, USA, Tech. Rep., 2013.
[150] R. Machupalli, M. Hossain, and M. Mandal, "Review of ASIC accelerators for deep neural network," Microprocessors Microsyst., vol. 89, Mar. 2022, Art. no. 104441.
[151] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," 2014, arXiv:1312.5851.
[152] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, "ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix," in Field Programmable Logic and Application, P. Y. K. Cheung and G. A. Constantinides, Eds. Berlin, Germany: Springer, 2003, pp. 61–70.
[153] J. Misra and I. Saha, "Artificial neural networks in hardware: A survey of two decades of progress," Neurocomputing, vol. 74, nos. 1–3, pp. 239–255, Dec. 2010.
[154] S. Mittal, "A survey of FPGA-based accelerators for convolutional neural networks," Neural Comput. Appl., vol. 32, no. 4, pp. 1109–1139, Feb. 2020.
[155] S. Mittal and J. S. Vetter, "A survey of CPU-GPU heterogeneous computing techniques," ACM Comput. Surveys, vol. 47, no. 4, pp. 1–35, Jul. 2015.
[156] P. Mohan, A. J. Paul, and A. Chirania, "A tiny CNN architecture for medical face mask detection for resource-constrained endpoints," in Innovations in Electrical and Electronic Engineering (Lecture Notes in Electrical Engineering). Singapore: Springer, 2021, pp. 657–670.
[157] J. J. Moolayil, "A Layman's guide to deep neural networks," Medium, May 2020. [Online]. Available: https://towardsdatascience.com/a-laymans-guide-to-deep-neural-networks-ddcea24847fb
[158] D. Moolchandani, A. Kumar, and S. R. Sarangi, "Accelerating CNN inference on ASICs: A survey," J. Syst. Archit., vol. 113, Feb. 2021, Art. no. 101887.
[159] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0," in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Dec. 2007, pp. 3–14.
[160] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. Mach. Learn. Madison, WI, USA: Omnipress, 2010, pp. 807–814.
[161] D. T. Nguyen, T. N. Nguyen, H. Kim, and H. J. Lee, "A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, pp. 1861–1873, Aug. 2019.
[162] T. Ngyen, S. M. A. H. Jafri, M. Daneshtalab, A. Hemani, S. Dytckov, J. Plosila, and H. Tenhunen, "FIST: A framework to interleave spiking neural networks on CGRAs," in Proc. 23rd Euromicro Int. Conf. Parallel, Distrib., Network-Based Process., Mar. 2015, pp. 751–758.
[163] R. Nikhil, "Bluespec System Verilog: Efficient, correct RTL from high level specifications," in Proc. 2nd ACM IEEE Int. Conf. Formal Methods Models Co-Design, Jun. 2004, pp. 69–70.
[164] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr, "Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC," in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2016, pp. 77–84.
[165] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. O. G. Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and G. Boudoukh, "Can FPGAs beat GPUs in accelerating next-generation deep neural networks?" in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, New York, NY, USA, 2017, pp. 5–14.
[166] M. T. Nyamukuru and K. M. Odame, "Tiny Eats: Eating detection on a microcontroller," in Proc. IEEE 2nd Workshop Mach. Learn. Edge Sensor Syst. (SenSys-ML), Apr. 2020, pp. 19–23.
[167] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An accelerator for compressed-sparse convolutional neural networks," in Proc. 44th Annu. Int. Symp. Comput. Archit., New York, NY, USA, Jun. 2017, pp. 27–40.
[168] S.-W. Park, J. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, "An energy-efficient and scalable deep learning/inference processor with tetra-parallel MIMD architecture for big data applications," IEEE Trans. Biomed. Circuits Syst., vol. 9, no. 6, pp. 838–848, Dec. 2015.
[169] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, "Memory-centric accelerator design for convolutional neural networks," in Proc. IEEE 31st Int. Conf. Comput. Design (ICCD), Oct. 2013, pp. 13–19.
[170] P.-H. Pham, D. Jelaca, C. Farabet, B. Martini, Y. LeCun, and E. Culurciello, "NeuFlow: Dataflow vision processing system-on-a-chip," in Proc. IEEE 55th Int. Midwest Symp. Circuits Syst. (MWSCAS), Aug. 2012, pp. 1044–1047.
[171] M. Pietras, "Hardware conversion of neural networks simulation models for neural processing accelerator implemented as FPGA-based SoC," in Proc. 24th Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2014, pp. 1–4.
[172] T. Posewsky and D. Ziener, "Efficient deep neural network acceleration through FPGA-based batch processing," in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), Nov. 2016, pp. 1–8.
[173] T. Posewsky and D. Ziener, "Throughput optimizations for FPGA-based deep neural network inference," Microprocessors Microsyst., vol. 60, pp. 151–161, Jul. 2018.
[174] S. Prakash, T. Callahan, J. Bushagour, C. Banbury, A. V. Green, P. Warden, T. Ansell, and V. J. Reddi, "CFU playground: Full-stack open-source framework for tiny machine learning (tinyML) acceleration on FPGAs," 2022, arXiv:2201.01863.
[175] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. Horowitz, "Convolution engine: Balancing efficiency and flexibility in specialized computing," Commun. ACM, vol. 58, no. 4, pp. 85–93, Mar. 2015.
[176] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, "SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2020, pp. 58–70.
[177] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, New York, NY, USA, Feb. 2016, pp. 26–35.
[178] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "AI accelerator survey and trends," in Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC), Sep. 2021, pp. 1–9.
[179] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, "VDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO). Piscataway, NJ, USA: IEEE Press, Oct. 2016, pp. 1–13.
[180] T. Ridnik, H. Lawen, A. Noy, E. Ben Baruch, G. Sharir, and I. Friedman, "TResNet: High performance GPU-dedicated architecture," 2020, arXiv:2003.13630.
[181] S. Saha, "A comprehensive guide to convolutional neural networks—The ELI5 way," Towards Data Sci., Toronto, ON, Canada, Tech. Rep., 2018. [Online]. Available: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
[182] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, "A massively parallel coprocessor for convolutional neural networks," in Proc. 20th IEEE Int. Conf. Appl.-Specific Syst., Archit. Processors, Jul. 2009, pp. 53–60.
[183] V. Sati, S. M. Sánchez, N. Shoeibi, A. Arora, and J. M. Corchado, "Face detection and recognition, face emotion recognition through NVIDIA Jetson Nano," in Proc. Int. Symp. Ambient Intell. Cham, Switzerland: Springer, 2020, pp. 177–185.
[184] S. Saglam, F. Tat, and S. Bayar, "FPGA implementation of CNN algorithm for detecting malaria diseased blood cells," in Proc. Int. Symp. Adv. Electr. Commun. Technol. (ISAECT), Nov. 2019, pp. 1–5.
[185] U. Schmidt and S. Roth, "Shrinkage fields for effective image restoration," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2774–2781.
[186] D. Selvathi, R. D. Nayagam, D. J. Hemanth, and V. E. Balas, "FPGA implementation of on-chip ANN for breast cancer diagnosis," Intell. Decis. Technol., vol. 10, no. 4, pp. 341–352, Dec. 2016.
[187] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to FPGAs," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[188] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2018, pp. 764–775.
[189] R. Shi, H. Xu, B. Chen, Z. Zhang, and L.-M. Peng, "Scalable fabrication of graphene devices through photolithography," Appl. Phys. Lett., vol. 102, no. 11, Mar. 2013, Art. no. 113102.
[190] D. Shin, J. Lee, J. Lee, and H.-J. Yoo, "DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2017, pp. 240–241.
[191] M. M. Shulaker, G. Hills, R. S. Park, R. T. Howe, K. Saraswat, H.-S. P. Wong, and S. Mitra, "Three-dimensional integration of nanotechnologies for computing and data storage on a single chip," Nature, vol. 547, pp. 74–78, Jul. 2017.
[192] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[193] G. Smith and F. F. Leymarie, "The machine as artist: An introduction," Arts, vol. 6, no. 4, p. 5, Apr. 2017.
[194] S. Srinivas, R. K. Sarvadevabhatla, K. R. Mopuri, N. Prabhu, S. S. S. Kruthiventi, and R. V. Babu, "A taxonomy of deep convolutional neural nets for computer vision," Frontiers Robot. AI, vol. 2, p. 36, Jan. 2016.
[195] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, "Towards an embedded biologically-inspired machine vision processor," in Proc. Int. Conf. Field-Programmable Technol., Dec. 2010, pp. 273–278.
[196] D. Strigl, K. Kofler, and S. Podlipnig, "Performance and scalability of GPU-based convolutional neural networks," in Proc. 18th Euromicro Conf. Parallel, Distrib. Network-Based Process., Feb. 2010, pp. 317–324.
[197] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-S. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, New York, NY, USA, Feb. 2016, pp. 16–25.
[198] M. Svedin, S. W. D. Chien, G. Chikafa, N. Jansson, and A. Podobas, "Benchmarking the NVIDIA GPU lineage: From early K80 to modern A100 with asynchronous memory transfers," in Proc. 11th Int. Symp. Highly Efficient Accel. Reconfigurable Technol., Jun. 2021, pp. 1–6.
[199] D.-F. Syu, S.-W. Syu, S.-J. Ruan, Y.-C. Huang, and C.-K. Yang, "FPGA implementation of automatic speech recognition system in a car environment," in Proc. IEEE 4th Global Conf. Consum. Electron. (GCCE), Oct. 2015, pp. 485–486.
[200] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[201] M. A. Talib, S. Majzoub, Q. Nasir, and D. Jamal, "A systematic literature review on hardware implementation of artificial intelligence algorithms," J. Supercomput., vol. 77, no. 2, pp. 1897–1938, Feb. 2021.
[202] M. Tanomoto, S. Takamaeda-Yamazaki, J. Yao, and Y. Nakashima, "A CGRA-based approach for accelerating convolutional neural networks," in Proc. IEEE 9th Int. Symp. Embedded Multicore/Many-Core Syst. Chip, Sep. 2015, pp. 73–80.
[203] Y. Tkachenko, "Autonomous CRM control via CLV approximation with deep reinforcement learning in discrete and continuous action space," 2015, arXiv:1504.01840.
[204] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2017, pp. 65–74.
[205] A. Vasudevan, A. Anderson, and D. Gregg, "Parallel multi channel convolution using general matrix multiplication," in Proc. IEEE 28th Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2017, pp. 19–24.
[206] S. I. Venieris and C.-S. Bouganis, "FpgaConvNet: A framework for mapping convolutional neural networks on FPGAs," in Proc. IEEE 24th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), May 2016, pp. 40–47.
[207] T. V. Huynh, "FPGA-based acceleration for convolutional neural networks on PYNQ-Z2," Int. J. Comput. Digit. Syst., vol. 11, no. 1, pp. 441–449, Jan. 2022.
[208] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, "DLAU: A scalable deep learning accelerator unit on FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 36, no. 3, pp. 513–517, Mar. 2017.
[209] J. Wang and S. Gu, "FPGA implementation of object detection accelerator based on Vitis-AI," in Proc. 11th Int. Conf. Inf. Sci. Technol. (ICIST), May 2021, pp. 571–577.
[210] T. Wang, C. Wang, X. Zhou, and H. Chen, "A survey of FPGA based deep learning accelerators: Challenges and opportunities," 2018, arXiv:1901.04988.
[211] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, "DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family," in Proc. 53rd Annu. Design Autom. Conf., Jun. 2016, pp. 1–6.
[212] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, 2009.
[213] W. Vanderbauwhede and K. Benkrid, High-Performance Computing Using FPGAs. New York, NY, USA: Springer, 2013.
[214] W. G. Wong, "More details emerge about Arm's machine learning," Electron. Des. Mag., Hasbrouck Heights, NJ, USA, Tech. Rep., Jun. 2018. [Online]. Available: https://www.electronicdesign.com/industrial-automation/article/21806582/more-details-emerge-about-arms-machine-learning
[215] B. Wu, A. Wan, F. Iandola, P. H. Jin, and K. Keutzer, "SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 446–454.
[216] H. Xiao, K. Zhao, and G. Liu, "Efficient hardware accelerator for compressed sparse deep neural network," IEICE Trans. Inf. Syst., vol. 104, no. 5, pp. 772–775, May 2021.
[217] S. Xiong, G. Wu, X. Fan, X. Feng, Z. Huang, W. Cao, X. Zhou, S. Ding, J. Yu, L. Wang, and Z. Shi, "MRI-based brain tumor segmentation using FPGA-accelerated neural network," BMC Bioinf., vol. 22, no. 1, pp. 1–15, Dec. 2021.
[218] A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, and H. Esmaeilzadeh, "Neural acceleration for GPU throughput processors," in Proc. 48th Int. Symp. Microarchitecture, Dec. 2015, pp. 482–493.
[219] K. Seshadri, B. Akin, J. Laudon, R. Narayanaswami, and A. Yazdanbakhsh, "An evaluation of edge TPU accelerators for convolutional neural networks," 2021, arXiv:2102.10423.
[220] X. Yin, L. Chen, X. Zhang, and Z. Gao, "Object detection implementation and optimization on embedded GPU system," in Proc. IEEE Int. Symp. Broadband Multimedia Syst. Broadcast. (BMSB), Jun. 2018, pp. 1–5.
[221] R. Zanc, T. Cioara, and I. Anghel, "Forecasting financial markets using deep learning," in Proc. IEEE 15th Int. Conf. Intell. Comput. Commun. Process. (ICCP), Sep. 2019, pp. 459–466.
[222] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, Feb. 2015, pp. 161–170.
[223] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 11, pp. 2072–2085, Nov. 2019.
[224] G. Zhang, N. Attaluri, J. S. Emer, and D. Sánchez, "Gamma: Leveraging Gustavson's algorithm to accelerate sparse matrix multiplication," in Proc. 26th ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., Apr. 2021, pp. 687–701.
[225] J.-F. Zhang, C.-E. Lee, C. Liu, Y. S. Shao, S. W. Keckler, and Z. Zhang, "SNAP: A 1.67–21.55 TOPS/W sparse neural acceleration processor for unstructured sparse deep neural network inference in 16 nm CMOS," in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C306–C307.
[226] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019.
[227] J. Zhu, L. Wang, H. Liu, S. Tian, Q. Deng, and J. Li, "An efficient task assignment framework to accelerate DPU-based convolutional neural network inference on FPGAs," IEEE Access, vol. 8, pp. 83224–83237, 2020.
[228] J. Zhu, T. Yang, R. Liu, X. Xu, and X. Zhu, "Image recognition of CT diagnosis for cholangiocarcinoma treatment based on FPGA processor and neural network," Microprocessors Microsyst., vol. 81, Mar. 2021, Art. no. 103645.
[229] S. Monk, Programming the Raspberry Pi: Getting Started With Python. New York, NY, USA: McGraw-Hill Education, 2016.
[230] (Nov. 2022). Photos of the Raspberry Pi Through the Ages: From the Prototype to Pi3 B+. Accessed: Nov. 14, 2022. [Online]. Available: https://www.zdnet.com/pictures/photos-of-the-raspberry-pi-through-the-ages-from-the-prototype-to-pi-3/

M. SABARIMALAI MANIKANDAN (Senior Member, IEEE) received the B.E. degree in electronic and communication engineering from Bharathiar University, Coimbatore, India, the M.E. degree in microwave and optical engineering from Madurai Kamaraj University, Madurai, India, and the Ph.D. degree in cardiovascular signal processing from the Department of Electronics and Communication Engineering, IIT Guwahati, Guwahati, India. He was an Assistant Professor at Amrita Vishwa Vidyapeetham University, Ettimadai, India. He was the Chief Engineer at the Advanced Technology Group, Samsung India Electronic Pvt., Ltd., Noida, India. He was an Assistant Professor at the Biomedical System Laboratory, School of Electrical Sciences, IIT Bhubaneswar, India. He is currently an Associate Professor of electrical engineering with IIT Palakkad. He has published more than 70 research papers in reputed journals and conference proceedings. His research interests include signal and image processing, adaptive machine learning, the Internet of Things, VLSI signal processing, machine learning architectures, application system development for health (human, machine, structural) monitoring systems, audio and speech processing systems for human–machine interactions, biometric and data security for authentication and authorization, environmental monitoring systems for ambient assisted living, UAV-assisted IoT for smart surveillance systems, and context and quality aware pattern learning networks for event recognition. He was a recipient of the 2012 Outstanding Performance Award during his tenure at Samsung India Electronic Pvt., Ltd. He served as a Reviewer for many reputed journals of the IEEE, IET, Springer, Hindawi, PLOS One, Frontiers, and Elsevier.