0% found this document useful (0 votes)

172 views

Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey

Uploaded by

tippars

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

172 views

Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey

Uploaded by

tippars

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 41

Received 25 November 2022, accepted 11 December 2022, date of publication 15 December 2022,

date of current version 22 December 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3229767

Efficient Hardware Architectures for Accelerating

Deep Neural Networks: Survey
PUDI DHILLESWARARAO 1 , (Graduate Student Member, IEEE),
SRINIVAS BOPPU 1 , (Member, IEEE), M. SABARIMALAI MANIKANDAN 2, (Senior Member, IEEE),
AND LINGA REDDY CENKERAMADDI 3 , (Senior Member, IEEE)
1 School of Electrical Sciences, Indian Institute of Technology Bhubaneswar, Bhubaneswar 752050, India
2 Department of Electrical Engineering, Indian Institute of Technology Palakkad, Palakkad 678557, India
3 Department of ICT, University of Agder, 4879 Grimstad, Norway
Corresponding author: Linga Reddy Cenkeramaddi ([email protected])
This work was supported in part by the Indo-Norwegian Collaboration in Autonomous Cyber-Physical Systems (INCAPS) of the
International Partnerships for Excellent Education, Research and Innovation (INTPART) Program from the Research Council of Norway
under Project 287918, and in part by the Seed Grant of IIT Bhubaneswar (TAML: Timing Analysis with Machine Learning) under Project
SP088.

ABSTRACT In the modern-day era of technology, a paradigm shift has been witnessed in the areas involving
applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). Specifically,
Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications such as
computer vision, image and video processing, robotics, etc. In the context of developed digital technologies
and the availability of authentic data and data handling infrastructure, DNNs have been a credible choice
for solving more complex real-life problems. The performance and accuracy of a DNN is a way better than
human intelligence in certain situations. However, it is noteworthy that the DNN is computationally too
cumbersome in terms of the resources and time to handle these computations. Furthermore, general-purpose
architectures like CPUs have issues in handling such computationally intensive algorithms. Therefore, a lot
of interest and efforts have been invested by the research fraternity in specialized hardware architectures
such as Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific
Integrated Circuit (ASIC), and Coarse Grained Reconfigurable Array (CGRA) in the context of effective
implementation of computationally intensive algorithms. This paper brings forward the various research
works on the development and deployment of DNNs using the aforementioned specialized hardware
architectures and embedded AI accelerators. The review discusses the detailed description of the specialized
hardware-based accelerators used in the training and/or inference of DNN. A comparative study based on
factors like power, area, and throughput, is also made on the various accelerators discussed. Finally, future
research and development directions, such as future trends in DNN implementation on specialized hardware
accelerators, are discussed. This review article is intended to guide hardware architects to accelerate and
improve the effectiveness of deep learning research.

INDEX TERMS Machine learning, field programmable gate array (FPGA), deep neural networks (DNN),
deep learning (DL), application specific integrated circuits (ASIC), artificial intelligence (AI), central
processing unit (CPU), graphics processing unit (GPU), hardware accelerators.
I. INTRODUCTION term AI was coined in 1956 by John McCarthy, who defined
Deep neural networks (DNNs), also known as deep learning, it as ‘‘the science and engineering of making intelligent
are a subset of the Artificial Intelligence (AI) discipline. The machines’’. Machine learning is a broad topic of artificial
intelligence that was first defined by Arthur Samuel in
The associate editor coordinating the review of this manuscript and 1959 as the study of how computers may learn without being
approving it for publication was Vivek Kumar Sehgal . explicitly programmed. Machine Learning uses traditional

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
131788 VOLUME 10, 2022
P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

techniques to perform tasks like classification, regression, Second (GOPS) to process a single image of size 224×224
and clustering. Deep learning is a subfield of machine with a top-1 accuracy of 61%, while ResNet-152 [108] takes
learning that uses a multi-layered algorithm structure known 22.6 GOPS with a top-1 accuracy of 79.3%. DNN’s superior
as a neural network, which was developed mostly between accuracy and performance are due to its capacity to extract
2006 and 2010. The relationship between deep learning, more complex high-level features, such as objects and facial
machine learning, and AI is illustrated in Fig. 1. structures, from raw input data.
DNNs are computationally expensive and need lots
of computational resources and memory for training and
inference. CPUs inherently support a limited number of
parallel workloads, though they can context switch with
hyper-threading. They are not sequential in nature. CPUs may
have more resources than their counter architectures (like
GPUs or FPGAs). CPUs have a limited number of registers
to support concurrent threads. But they may have higher
cache sizes, larger branch control logic, and higher on-chip
bandwidth than GPUs. However, the limited number of cores
on the CPU limits its ability to process large amounts of data
in parallel, which is required for DNN acceleration. Although
CPUs dominate the IoT industry in DNN inference on low-
power edge devices, they struggle to realize complex DNNs.
Therefore, specialized hardware designs are required for the
acceleration of DNNs. DNNs can be implemented using
customized hardware accelerators instead of a CPU. The
heterogeneous computing platforms viz. Field Programmable
Gate Array (FPGA), Application-Specific Integrated Circuits
(ASIC), and Graphical Processing Units (GPU) are widely
FIGURE 1. AI vs. Machine Learning vs. Deep Learning.
used to accelerate DNNs. The specialized hardware-based
Nowadays, DNNs are used in many modern AI appli- DNN accelerators can be categorized into two classes:
cations, including bioinformatics [60], natural language the first class of accelerators efficiently implements the
processing [147], image restoration [185], speech recogni- computational primitives, such as convolutional operations,
tion [34], computer vision [194], machine translation [36], fully connected operations, etc., for the DNNs [85], [175] and
healthcare [43], finance [221], robotics [94], visual art the second class of DNN accelerators efficiently optimize the
processing [193], etc. Furthermore, the recent applications data movement and memory access [56], [177]. These two
of DNN include aerospace and defence, automated driving, generations of specialized hardware-based DNN accelerators
recommendation systems, and industrial automation [71], improve the speed and energy efficiency of running DNNs.
[86], [101], [215]. DNNs are also useful in a variety of appli- There are two ways to improve the performance of the
cations, such as news aggregation and fraud detection [124], DNN acceleration. The first method is optimizing the
virtual assistants [61], chatbots [35], and customer relation- DNN algorithm, and the second is optimizing the hardware
ship management systems [203]. In addition, DNNs have also architecture. Therefore, we need to co-design the algorithm
been used to diagnose Covid-19 by classifying it based on and the hardware to achieve superior performance.
different lung and chest imaging modalities [40]. Because of their high throughput and memory band-
DNNs contain many layers, and each layer is capable of width, GPUs are one of the most often employed hard-
detecting features at different levels. For instance, in pattern ware accelerators for improving inference and training
recognition, where the input is available in pixel form, the processes in DNNs [218]. In floating-point matrix-based
first layer of DNN extracts minor details of the image, such as calculations, GPU-based hardware accelerators are extremely
curves and edges. The outputs of this first layer act as inputs efficient [205]. GPU-based hardware accelerators, on the
to the second layer. The second layer extracts the image’s other hand, consume a lot of power. ASIC and FPGA-
primary details, such as squares and semi-circles. The outputs based hardware accelerators have limited computational and
of the second layer act as inputs to the third layer. The third memory resources compared to GPU-based accelerators.
layer extracts the part of objects. Furthermore, the subsequent Nevertheless, they can achieve a moderate performance level
layer uses the previous layer’s output and extracts more while using less energy [153]. ASIC-based DNN accelerators
aspects of the objects. As the number of layers increases, provide superior performance compared to GPU and FPGA
the DNN extracts increasingly complicated features and counterparts at the cost of reconfigurability. However, ASIC-
complete objects [73]. DNNs provide superior accuracy and based accelerators have some limitations, including the high
performance at the cost of high computational complexity. cost of development, long time to market, inflexibility,
For instance, AlexNet [130] takes 1.4 Giga Operations Per etc [77], [103]. FPGA-based accelerators can be used as

VOLUME 10, 2022 131789

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

an alternative to ASIC-based accelerators, and they can This survey is different and unique with respect to many exist-
provide superior performance at an affordable cost with ing papers in this area in the following ways. Few studies [44],
reconfigurability and low power dissipation [213]. FPGA, [97], [126], [154], [210] focused only on the developments
ASIC, and GPU-based AI accelerators have been the subject of FPGA-based accelerators. Few other studies [55], [136],
of numerous research [97], [150], [154], [155], [158], [210]. [150], [158] have presented the details of ASIC-based
This survey, however, also looks at various embedded AI accelerators. Some research reviews [48], [200], [201] have
accelerators for DNN acceleration. explored both FPGA and ASIC-based accelerators. Very
This survey supplements the existing work and contributes limited studies [178], [201] have dealt with the progress
towards providing the complete background on DNN accel- of GPU-based accelerators. On the other hand, studies on
eration using various specialized hardware architectures. The embedded AI and CGRA-based accelerators haven’t been
contributions of this survey can be summarized as follows: explored much. Many of these reviews do not mention the
1) The survey discusses the various research works carried compiler/mapping frameworks and SDKs available for these
out on the development and deployment of DNN using accelerators, making it difficult for someone to choose the
FPGA-based accelerators. appropriate accelerator. This review, therefore, aims to bring
2) The survey covers the work done in ASIC-based AI a comprehensive study of all the aforementioned hardware
accelerators in the last decade, from 2012 to 2022. accelerators in the context of the implementation of DNNs.
3) The survey describes the various GPU-based DNN Furthermore, this survey uniquely classifies the FPGA-
accelerators. and ASIC-based accelerators and briefly discusses the key
4) The survey provides a comprehensive overview of architectural features and the available compiler or mapping
CGRA-based accelerators for DNN implementation. frameworks. Accelerators for each category are summarized
5) The survey covers the research works carried out on the and compared. A comprehensive survey of GPU-based
implementation of DNNs on edge using embedded AI accelerators by Nvidia is also presented. The need for edge
accelerators. AI computing is emphasized and state-of-the-art embedded
6) The survey provides a comparative study of existing AI accelerators, including Arm-based accelerators, are also
hardware architectures: FPGAs, GPUs, ASICs, and discussed and compared. This survey also briefly discusses
embedded AI accelerators. the recent developments in tinyML. Table 1 compares this
7) The survey highlights the future research trends in survey paper with recently published review articles on DNN
DNN acceleration on specialized hardware architec- implementation using specialized hardware architectures.
tures, including FPGA, ASIC, GPU, CGRA, and Edge Researchers in the fields of artificial intelligence, system
AI accelerators. design, and hardware architecture are expected to benefit
from this survey.
A. SCOPE OF THE SURVEY
This paper focuses on research trends in FPGA, ASIC, and B. ORGANIZATION
GPU-based accelerators for implementing DNNs. We have
This paper is organized as follows: Section II provides a brief
also briefly discussed the current trends in Arm-based
overview of neural networks and DNNs, including the basic
machine learning processors and embedded edge AI accel-
architecture of hardware for DNN acceleration. Section III
erators. The review categorizes the FPGA-based accelerator
describes various architectures implemented on the FPGA
into three categories and briefly discusses the key features
platform for DNN acceleration. Section IV describes various
of the accelerators, including the frameworks available. The
ASIC-based accelerator architectures for DNN acceleration.
three categories include accelerators for a specific application
Section V shows a detailed review of GPU-based accelerators
such as speech recognition, object detection, natural language
for the acceleration of DNN. Section VI discusses various
processing, etc., accelerators for a specific algorithm such
CGRA-based accelerator architectures for DNN acceleration.
as CNN, RNN, etc., and accelerator frameworks with
Section VII discusses in detail the embedded edge AI
hardware templates. Furthermore, ASIC-based accelerators
accelerators for DNN acceleration. Section VIII provides
are categorized into three types: ALU-based accelerators,
the comparisons between the various hardware architectures
dataflow-based accelerators, and sparsity-based accelera-
used for the DNN acceleration. Section IX provides the
tors. A comparative study of these hardware accelerators
future research directions of various hardware architectures
based on performance metrics like power, throughput, and
for DNN acceleration. Finally, the conclusion of this review
area has been presented. The review also focuses on the
is presented in Section X.
mapping frameworks available for these accelerators and
briefly discusses the implementation details. In addition, the
recent research contributions in Arm-based machine learning II. BACKGROUND
processors, a few embedded AI hardware accelerators, and A. NEURAL NETWORKS
CGRA-based accelerators are discussed and compared in A Neural Network (NN) is a computational model inspired by
terms of their cores, performance, power, availability of Soft- biological neural networks. It is also known as an Artificial
ware Development Kits (SDKs), and supported frameworks. Neural Network (ANN). An ANN comprises hundreds or

131790 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

TABLE 1. Comparison among state-of-the-art surveys.

thousands of interconnected artificial neurons, also called n inputs from the input layer and generates the output y.
processing units. Three or more interconnected layers are These inputs are multiplied by the weight coefficients
formed by these neurons. The input neurons are in the first (w1 , w2 , . . . , wn ) and combined together with a bias value b
layer. The input neurons receive external signals and pass for each neuron. A non-linear function σ (.), also called as
them on to the subsequent layers, which eventually provide an activation function, is then used to calculate the neuron’s
the final output data to the final output layer. The intermediate output, see Eq. (1). In this scenario, the activation function
layers in the ANN are called as hidden layers. Fig. 2 depicts causes a neuron to produce an output only if the input to
the architecture of a typical NN, which includes an input it exceeds a specified threshold value. Common non-linear
layer, an output layer, and two hidden layers. functions used in NN are Sigmoid, Rectified Linear Unit
(ReLU), and Hyperbolic tangent. The graphical model and
mathematical representation of artificial neuron is shown
in Fig. 3 and Eq. (1), respectively.
N
X
y = σ( x[n]w[n] + b) (1)
n=1

FIGURE 2. An architecture of NN. FIGURE 3. A single ANN neuron with its elements (inputs, weights, bias,
summer, activation function, and output).

In NN shown in Fig. 3, the input layer contains n inputs In neural networks, weights are initialized with some
(x1 , x2 , . . . , xn ). The following layer (hidden layer) gets all random values. However, during the training process, all

VOLUME 10, 2022 131791

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

these weights get updated iteratively to predict the correct some input data has already been matched to the correct
output. The weights are updated using the cost function, output. Unsupervised learning is another learning technique
which is nothing more than the mean square error. The in which the network/model is trained using unlabeled data.
mathematical representation of mean square error is shown The trained network generates the clusters or structures in
in Eq. (2). Here, MSE is mean squared error, n represents the the unlabeled data. Semi-supervised learning uses partially
number of input data points, yi and ŷi are true and predicted labeled data sets and it falls in between supervised and
outputs, respectively. Once the neural network is trained, unsupervised learning approaches. Finally, reinforcement
it may be used for classification problems. learning is a type of training that rewards positive behaviours
while punishes undesirable ones. Reinforcement learning is
1X
MSE = (yi − ŷi )2 (2) bound to learn from its previous experience. The pictorial rep-
n resentation of the aforementioned deep learning approaches
i=1
is shown in Fig. 5.
B. DEEP NEURAL NETWORK (DNN)
The Deep Neural Network (DNN) is a type of neural network
that has more than three hidden layers and is well-suited to
complicated tasks [37]. In today’s DNN, the typical number
of layers used ranges from five to over a thousand. A DNN
with N hidden layers is shown in Fig. 4. In DNNs, the model
and its parameters are learned through an extensive training
process.

FIGURE 5. Deep learning approaches.

C. CONVOLUTIONAL NEURAL NETWORK (CNN)

Convolutional neural networks (CNNs) are a type of neural
network which have been widely used for image recognition
tasks. CNN is made up of several stages, each of which
is referred to as a layer. Each layer extracts a feature
from the data it receives. The identifying features get more
sophisticated or complex as we proceed. CNN structure was
first proposed by Fukushima in 1988 [87]. As shown in Fig. 6,
a CNN consists of four layers: convolution, fully connected
layer, pooling layer, and Rectified Linear Unit (ReLU) layer.
Optionally, CNN might also have non-traditional layers
FIGURE 4. DNN with N hidden layers [157]. such as dilated convolution layer [114] and deconvolution
layer [52]. The CNN’s overall design may be divided into two
Training and inference are the two critical phases in sections: feature learning and classification. Each layer of the
accelerating a task using DNN. Specific tasks such as object CNN gets data from the layer before it as input and delivers
detection, and pattern recognition etc. are part of the training, its output to the following layer as input during the feature
in which the DNN is taught to perform such specific tasks learning phase. The feature learning phase includes three
using available data. The known data is supplied to DNN types of layers: convolution, RELU, and pooling. At each
throughout the training process, allowing the network to node of the convolution layer, convolution operations on the
predict what the data represents. As a result, the prediction input nodes detect features from the input feature maps. The
error is used to adjust the weights of the neurons. The weights output of the feature extraction phase’s final layer is delivered
are adjusted till the predictions are made with a considerable to a fully connected network known as the classification layer.
degree of accuracy. Backpropagation is a popular method for The following sections discuss each type of layer in the CNN
updating weights, as mentioned before in the training phase. in brief.
DNN is ready to make predictions on fresh and unknown
data once it has been fully trained. This stage is known 1) CONVOLUTION LAYER
as inference, and it involves testing the trained model with The convolution layer is also known as the feature extraction
completely new and unknown data. layer since it extracts the features of images. The inputs
There are four types of deep learning approaches: super- and outputs of the convolution layer are defined as feature
vised, unsupervised, reinforcement, and semi-supervised maps (FMs) which are organized in two-dimensional grids.
learning. The labeled data is used in supervised learning to The FM from the previous layers of the convolution layer is
train or model the network. The labeled data indicates that convolved with the filter coefficients. More than one input

131792 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

FIGURE 6. CNN architecture (adopted from [181]).

feature map can be paired with each output feature map. In a FIGURE 7. Various forms of pooling.
2-D convolution operation between an input image matrix x
(size R × C) and a filter f (size W × L), the convolution
layer performs point-wise multiplication and addition of the ability to converge faster than other activation functions like
corresponding pixels. The filter size is often smaller than the hyperbolic tangent and sigmoid [72], [197], ReLU [160] has
input matrix size. The filter multiplies the input matrix with gained a lot of traction in recent years. The mathematical
the W × L sized block, accumulates the result, slides to the representation of ReLU is shown in Eq. (4). Some popular
next block of the input matrix, and repeats the operation. extensions of ReLU, for instance, exponential LU [64],
The input matrix is processed one block at a time until parametric ReLU [107], and leaky ReLU [149] are also being
it has processed all of the image’s R × C elements. The used in CNNs for improved performance and accuracy.
2-D convolution operation is given in Eq. (3) where y(r, c) f (x) = max(0, x) (4)
signifies one output pixel in the output matrix y, with each
pixel’s coordinates expressed as (r, c). The iterators over the 4) FULLY CONNECTED LAYER
filter’s length (L) and width (W ) are l and w, respectively, Fully connected layers do the final classification in the CNN
in Eq. (3). Finally, the resulting feature maps apply non-linear network after multiple convolutions, ReLU, and pooling
activation functions such as sigmoid, hyperbolic tangent, layers. Weights, biases, and neurons are all part of the fully
or rectified linear units. connected layer. All input and output neurons are connected
W
X −1 L−1
X
W

L in the fully connected layer. A CNN network typically has
y(r, c) = f (w, l)x(r + w − ,c + l − ) one or more fully connected layers. The final output of CNN
2 2 comes from the last fully connected layer, often known as
w=0 l=0
(3) the classification layer. The fully connected layer in the CNN
contains a large number of inputs and outputs. Therefore, it is
2) POOLING LAYER challenging to implement fully connected layer operations on
The pooling layer shrinks the spatial dimensions of the input hardware platforms with limited resources.
image after convolution, thereby reducing the computation
and number of parameters in the network. Pooling layers 5) DECONVOLUTION LAYER
are also known as subsampling layers. In CNN, the pooling To increase the size of the feature map, a deconvolu-
layer is used between two convolution layers. The MAX tion layer, also known as a transposed convolution layer,
operation is used to resize each slice of the input image is employed [52]. Upsampling (inserting zeros in the feature
spatially, on which the pooling layers operate individually. map) and then convolving the upsampled feature maps with
A pooling layer with filters of size 2×2 is found in many CNN the kernel coefficients are used to accomplish this.
topologies. Over the four samples in the filter, the pooling
operation, which is nothing but the MAX operation, is done. 6) DILATED CONVOLUTION LAYER
The operation yielding the maximum value is retained while The filter coefficients are up-sampled and convolved with the
discarding the other values [123]. It is noteworthy that input image in a dilated convolution layer to capture a broader
additional operations like MIN operation and AVG operation receptive field [114]. Image segmentation, for example,
can also be used in the pooling layer, particularly in some uses it to capture the larger global context in each output
CNNs [197]. The MAX and AVG pooling operations for the pixel.
filters of size 2 × 2 are shown in Fig. 7. With millions of weight coefficients, CNNs are extremely
complex. They are computationally expensive and necessitate
3) RECTIFIED LINEAR UNIT (ReLU) LAYER a significant amount of memory to store the input, output
In a CNN network, the ReLU layer is usually employed after feature maps, and weight coefficients, causing CPUs to
the convolution and fully connected layers. The ReLU layer underperform. To boost the performance of the CNNs,
is generally used after the convolution and fully connected specific hardware accelerators are used. As a result, different
layers in the CNN network. By substituting all the negative techniques for implementing CNNs efficiently on hardware
valued outputs with 0, it introduces non-linearity into the platforms must be explored in order to reduce resource and
CNN. Because of its computational simplicity, sparsity, and memory requirements.

VOLUME 10, 2022 131793

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

D. HARDWARE ARCHITECTURES FOR DNN result, only a small number of processes can be performed
ACCELERATION in parallel, limiting throughput. GPUs are commonly used
DNNs have been increasingly popular in recent years, to train and infer DNNs. They have thousands of cores
allowing for their development and deployment on a variety to run highly parallel algorithms efficiently, for instance,
of hardware platforms. These hardware platforms are of matrix multiplication. Throughput is enhanced by lowering
various types, right from general-purpose architectures such the number of multiplications in both CPUs and GPUs. There
as CPUs and GPUs, programmable architectures (FPGAs) are software libraries that optimize matrix multiplication
to special-purpose chips (ASICs). In many DNN models, for GPUs (e. g., cuBLAS, cuDNN [59], etc.) and CPUs
multiply-accumulate (MAC) operations are the most impor- (e. g., Intel MKL [2], OpenBLAS, etc.). Another well-
tant computations, and they can be easily parallelized. Since known technique to reduce the matrix multiplications is Fast
these MAC operations can be executed in parallel, hardware Fourier Transform (FFT) [80], [151]. Furthermore, several
architectures that enable parallel operations are required to techniques, such as Winogra’s algorithm [132] and Strassen’s
process DNNs. To achieve superior performance, highly algorithm [67], are used to reduce the matrix multiplications
parallel computing models, encompassing both spatial and and thereby reduce the resource and memory requirements.
temporal computing architectures, are often employed for
DNN acceleration. The spatial and temporal architectures 2) SPATIAL ARCHITECTURES
have a similar computational structure, with a set of In spatial architectures, each ALU can have its own local
Processing Elements (PEs). However, processing units can memory and control logic. The local memory is also
have internal control in a spatial architecture, whereas control referred to as the register file. The development and deploy-
in a temporal architecture is centralized, as shown in Fig. 8. ment of DNNs on Field-Programmable-Gate-Arrays (FPGA)
Each PE can have a register file (RF) to store data in spatial and Application-Specific-Integrated-Circuits (ASIC) comes
architecture; however, PEs do not have the memory capacity under the category of spatial architectures. FPGAs are less
in a temporal architecture. The PEs can also be connected to expensive and have a faster time to market than ASICs, and
exchange data in spatial computing designs. To summarize, the design flow is simpler. However, FPGAs are less energy-
the PEs in the temporal architectures contain only Arithmetic efficient and consume more power than ASICs since FPGAs,
and Logic Units (ALUs). The PEs consist of ALU as a unlike ASICs, contain a significant chip area dedicated to
computation unit, RF to store the data, and a control unit in reconfigurability. ASICs, on the other hand, are mainly
spatial architectures. designed for a particular application and cannot support
reconfigurability. The design flow of ASICs is more complex
than FPGAs [46]. ASIC chips are expensive, but they are
highly optimized and energy-efficient and provide superior
performance than FPGAs. Memory accesses are the real
bottleneck in DNN computations; therefore, off-chip DRAM
accesses must be minimized, as they have a high energy
cost and delay. The memory accesses (off-chip) can be
reduced by reusing data stored in smaller, quicker, and
low-energy memories. In spatial computing architectures,
weight stationary, row stationary, output stationary, and other
FIGURE 8. Spatial and temporal architectures. specialized processing dataflows can be designed to improve
data reuse from memories in the memory hierarchy and
reduce energy dissipation. At each level of the memory
1) TEMPORAL ARCHITECTURES hierarchy, the dataflow defines what data is read and when it is
The temporal architectures exploit parallelism by support- processed. In spatial architectures, dataflows can be classified
ing a variety of techniques, such as Single Instruction as follows:
Multiple Threads (SIMT) or Single Instruction Multiple
Data (SIMD). The temporal computing architectures appear a: WEIGHT STATIONARY (WS)
mostly in CPUs and GPUs. In temporal designs, ALUs can In weight stationary dataflow, the weights are kept fixed
only access data from the memory hierarchy and cannot and are stored in the register files of the PEs, whereas the
communicate directly with one another. The memory (i.e., inputs and partial sums are distributed across the PEs. Weight
register file) and control are shared by all ALUs in the stationary dataflow maximizes filter and convolutional reuse
temporal architecture. In temporal architectures like CPUs of weights. Weight stationary dataflow examples are found
or GPUs, all the convolution or fully connected operations in [168], [182], [195], and [50].
are mapped to matrix multiplication. CPU cores are the
least employed among the several temporal architectures b: OUTPUT STATIONARY (OS)
for DNN training and inference. CPUs contain a small Each partial sum is held fixed in a PE in the output stationary
number of processing cores, ranging from one to ten. As a dataflow, and accumulation is done until the final total is

131794 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

obtained. In the meantime, the PEs’ weights and inputs are hardware computation resources due to inefficient off-chip
dispersed in a variety of ways. The convolutional reuse is communication.
maximized with output stationary dataflow. This dataflow
reduces the amount of energy used while writing and reading
partial sums. Output stationary dataflow examples are found
in [98] and [169].

c: ROW STATIONARY (RS)

The operations of a row of convolution are mapped to the
same PE in row stationary dataflow, and the weights are kept
stationary inside the register file of the PEs. Row stationary
dataflow maximizes the convolutional reuse of input feature
maps, weights, and partial sums. Row stationary dataflow
examples are found in [53] and [57].

d: NO LOCAL REUSE (NLR)

In no local reuse dataflow, nothing is stationary inside
the PEs, and it is used to reduce the accelerator area by FIGURE 9. Roofline model, adopted from [222].
eliminating the register file from PEs. No local reuse dataflow
examples are found in [56] and [222].
All PEs in the spatial architectures can be connected in one III. FPGA-BASED ACCELERATORS
of two ways: 1-D systolic or 2-D systolic. The PEs in a 1-D The FPGA-based neural network accelerators are increas-
systolic architecture are arranged in one dimension, allowing ingly favored over CPUs because of their higher effi-
systolic data flow, but the PEs in a 2-D systolic architecture ciency [164]. FPGA supports parallelism and accelerates the
are arranged in two dimensions and can receive data from computations by mapping them to the parallel hardware;
both vertical and horizontal directions. Similarly, all PEs can i. e., multiple DNN structures are executing in parallel on
be connected in temporal architectures in one of two ways: FPGA. FPGA-based accelerators deliver up to several orders
1-D array or 2-D array. Data is received from the global buffer of magnitude speedup compared to the baseline CPU [85].
by the PEs in a 1-D array architecture, which are arrayed in FPGAs give designers the freedom to implement only the
one dimension. A 2-D array architecture has PEs that are required logic in the hardware based on the target application.
arrayed in two dimensions and receive data only from the FPGA-based DNN accelerator architectures mainly contain
global buffer. a host computer and an FPGA part to implement DNN
algorithms.
E. ROOFLINE MODEL In this section, we would like to review FPGA-based
The roofline model is basically a visual performance DNN accelerators, which can be broadly categorized into
model intended for floating point computations and mul- three types: accelerators for a specific application, such
ticore architectures [212]. The roofline model relates peak as speech recognition, object detection, natural language
performance provided by the hardware platform and off- processing, etc., accelerators for a specific algorithm, such as
chip memory traffic with system performance. For a CNN, RNN, etc., and accelerator frameworks with hardware
given compute-to-communication (CTC) ratio, the maximum templates. For the first two categories, the design complexity
attainable performance is the minimum of (1) peak com- of the accelerator is low, whereas the design complexity is
putational performance and (2) peak memory performance. relatively high for the final category.
Here, the CTC ratio, also called operational intensity, means
A. ACCELERATORS FOR A SPECIFIC APPLICATION
operations per byte of DRAM traffic. Eq. (5) formulates
the attainable performance of an application on a specific There exists many FPGA-hardware accelerators for specific
hardware platform. applications. Designing a custom accelerator for a given
application is a good fit for the problem and has a low
Attainable(Performance design complexity. Han et al. [102] proposed the FPGA-based
Peak Floating Point Performance accelerator named efficient speech recognition engine (ESE)
= min (5) to implement the LSTM algorithm for speech recognition.
Peak Memory Bandwidth × CTC ratio
Load-balanced sensing pruning method is used in the
The roofline model is illustrated in Fig. 9. Algorithm proposed design to compress the LSTM model. The proposed
2 in Fig. 9 has a better CTC ratio than Algorithm 1. accelerator uses a framework named Kaldi to implement
As a result, Algorithm 2 performs better than Algorithm LSTM algorithm for speech recognition. The ESE has a
1 because it effectively utilizes all of the hardware com- performance of 282 GOPS and is implemented in a Xilinx
putation resources. In contrast, Algorithm 1 under-utilizes XCKU060 FPGA running at 200 MHz. The implementation

VOLUME 10, 2022 131795

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

of speech recognition algorithms using FPGA-based acceler- the FPGA board, which is connected through Peripheral
ators is also presented in several earlier studies [62], [112], Component Interconnect (PCI) interface. VIP uses the low
[137], [199]. accuracy arithmetic because of the limitations of resources on
Wang et al. [209] proposed a reconfigurable YOLOv3 Altera EPF81500 FPGA. Fortunately, recent FPGAs contains
FPGA hardware accelerator for object detection. In this large numbers of computing units and memory resources and
context, YOLOv3 (You Only Look Once, Version 3) is a real- allow fast CNN implementations. FPGA implementations
time object detection algorithm that detects specific objects in of DNNs mainly focused on accelerating the convolution
images or videos. The proposed accelerator is built using the operations, which are reported in [38] and [49].
ARM + FPGA architecture. Experiment results show that the Farabet et al. [85] presented ConvNet Processor (CNP):
FPGA-based YOLOv3 accelerator consumes less energy and an FPGA-based accelerator to implement the CNNs. CNP
achieves higher throughput than the GPU counterpart. The uses dedicated hardware convolver for the data processing
proposed accelerator is compatible with several frameworks, and also uses soft-processor for controlling. CNP is designed
such as Tensorflow, Caffe, PyTorch, etc. The proposed on the Virtex4 SX35 FPGA and also equipped with external
accelerator is implemented on Xilinx ZCU104 running at a memory to store the input and filter coefficients. CNP
frequency of 300 MHz. Several previous works [82], [148], consists of Vector Arithmetic and Logic Units (VALU), one
[161] also used the FPGA to implement object detection of the main components in the architecture that implements
algorithms. the CNN operations viz. 2-D convolutions, sub-sampling, and
Hamza et al. [125] proposed the FPGA-based acceler- non-linear activation functions. The implementation of 2-D
ator named NPE to efficiently implement various Natural convolution, represented using Eq. (6), is shown in Fig. 10 for
Language Processing (NLP) models. NPE provides a single K = 3, i. e. 3 × 3 kernel. In Eq. (6), xij is the data in the input
framework for processing arbitrarily complex nonlinear func- plane, wmn is the weight value in K ×K kernel, yij is the partial
tions with software-like programmability. NPE consumes 4× sum, zij is the result in the output plane, and W is the width of
and 6× less power than CPU and GPU. NPE is implemented the input image. At each clock cycle, the convolution module
on the Xilinx Zynq Z-7100 FPGA running at a frequency performs k 2 multiply-accumulate operations simultaneously.
of 200 MHz. CNP uses the First In First Out (FIFO) buffers between the
Serkan et al. [184] developed an FPGA-based CNN accel- external memory and FPGA to provides the continuous flow
erator to classify malaria disease cells. The proposed acceler- of data in both directions. CNP uses the 32-bit soft processor
ator is implemented on Xilinx Zynq-7000 FPGA running at that provides the macro instructions, generally higher level
a frequency of 168 MHz. The proposed accelerator achieves instructions than most traditional processors, to the VALU for
an accuracy of 94.76%. Zhu et al. [228] proposed an FPGA- implementing the basic CNN operations. CNP has a compiler
based accelerator to recognize liver dynamic CT images. that converts network implementations with Torch directly
Xiong et al. [217] developed an FPGA-based CNN accel- into CNP instructions. The proposed architecture has been
erator to improve the automatic segmentation of 3D brain used to implement the face detection system.
tumors. FPGA-based accelerators are also used to implement K −1 K −1
various applications such as autonomous driving [105], [129], X X
zij = yij + xi+m,j+n · wmn (6)
image classification [45], [70], fraud detection [128], cancer
m=0 n=0
detection [186], etc. Table 2 summarizes the reviewed FPGA-
based accelerators for specific applications.

B. ACCELERATORS FOR A SPECIFIC ALGORITHM

A prominent topic of research in the realm of accelerators is
the use of FPGA-based accelerators for a particular neural
network algorithm. Since the accelerator is intended to
address a specific problem, its operation typically requires
minimal adjustments to a few parameters to operate effec-
tively. Cloutier et al. [65] proposed a hardware accelerator,
referred to as Virtual Image Processor (VIP) to implement
the CNNs. The Altera EPF81500 FPGA platform is used
to implement the proposed design. VIP primarily consists
of Processing Elements (PEs) connected by a 2-D systolic
architecture and supports the SIMD paradigm. VIP is
designed to perform the following vector and matrix oper-
ations: matrix multiplication, matrix-vector multiplication,
FIGURE 10. 2-D convolution module for 3 × 3 kernel, adopted from [85].
scalar multiplication, matrix addition, matrix-vector addition,
vector addition, 1-D convolution, 2-D convolution, etc. The Sankaradas et al. [182] presented a massively parallel
host computer is used to provide the configuration data to co-processor for accelerating CNNs. This co-processor is

131796 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

TABLE 2. Summary of FPGA-based accelerators for specific application.

designed using the Virtex5 LX330T FPGA platform and

four DDR2 (Double Data Rate 2) memory banks totalling
1 GB. The proposed co-processor mainly consists of clusters
of Vector Processing Elements (VPE) connected in parallel.
Each cluster consists of 2-D convolver units, sub-samplers,
Look-Up Tables (LUT) and performs convolution, pooling,
and non-linear functions. The co-processor is coupled with
DDR2 memory banks to store the intermediate data. Each
VPE in the proposed co-processor exploits parallelism by
supporting SIMD stream. The primitive 2-D convolver of
the proposed design is shown in Fig. 11. It contains k × k
convolution units along with k 2 + k VPEs, and the
final column of VPEs is used to add partial results. The
coprocessor operates in collaboration with a host, which can
control the coprocessor through an Application Programming
Interface (API). The proposed design uses low precision
data representation to improve the throughput and memory
bandwidth. The proposed architecture has been used to
implement the full face recognition application using CNN
with four convolution layers. The proposed accelerator FIGURE 11. 2-D convolver unit of CNN co-processor, adopted from [182].

can not be used to realize the full CNNs, which contain

convolution and fully connected layers. Graf et al. [93] used
a similar approach to accelerate the Support Vector Machines see Fig. 12. Furthermore, the output of each PE is connected
(SVM) and the proposed design contains VPEs instead of to the PE on its right. The PEs are arranged as clusters,
VPE clusters. But the accelerator proposed in [93] provides where each cluster has a separate off-chip memory block
low performance while accelerating DNNs compared to the that creates independent data streams for memory-processor
co-processor proposed in [182]. computations. MAPLE processing core can be organized as
A programmable parallel accelerator called MAPLE is H clusters, and each cluster contains M PEs. So, the total
presented in [47] to accelerate the several learning and number of PEs in MAPLE core equals H × M. MAPLE
classification algorithms such as Support Vector Machine also uses smart memory banks to process the intermediate
(SVM), K_means, CNN, etc. MAPLE contains hundreds data and to perform secondary reduction operations such
of simple PEs arranged in a 2-D grid fashion as shown as, aggregation, finding minimum or maximum, and array
in Fig. 12 MAPLE can be used to perform vector and ranking. Authors developed a tool to map the applications
matrix operations in parallel. In MAPLE, each PE has on the MAPLE. For the given input matrices and reduction
local storage to perform the computations efficiently. Each functions, the tool generates the assembly code needed to
PE has two operands; one operand comes from its local program MAPLE. The authors created a C++ simulator
storage, and another operand comes from the PE on its left, that estimates how long MAPLE will take to execute from

VOLUME 10, 2022 131797

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

the input assembly code and an architectural configuration

file that details the processor layout and off-chip memory
architecture. MAPLE processor is connected to the host
computer through Peripheral Component Interconnect (PCI).
MAPLE is implemented on Xilinx Virtex 5 SX240T FPGA
with 512 PEs organized as 2 cores, 32 chains per core,
and 8 PEs per chain. The experimental results show that
MAPLE with 512 PEs is 1.5 to 10 times faster than a quad-
core Xeon processor with a clock frequency of 2.5 GHz
despite running at 125 MHz.

FIGURE 13. DC-CNN co-processor architecture, adopted from [51].

are connected to a local data line using reconfigurable routes.

FIGURE 12. MAPLE’s processing core architecture, adopted from [47]. A multiplexer connects the local data line to a global data
line by which a PT is connected to the four neighbouring
Chakradhar et al. [51] presented a dynamically reconfig- PTs. Data is transferred from the off-chip memory to the tiles
urable architecture for CNNs on Virtex 5 FPGA platform. using a Smart Direct Memory Access Module (Smart DMA).
The proposed system consists of a dynamically configurable The control unit configures each tile for the computation
CNN (DC-CNN) processing core and three bank memory and connections between the tiles. Data streams from the
sub-system. The DC-CNN processing core continuously Smart DMA are processed in tiles, and the results are passed
communicates with the host computer that executes the main to the neighbouring tiles, or back to the Smart DMA. This
application. In the proposed accelerator, the host computer 2-D grid can be used to perform arbitrary computations
transfers the complete CNN structure and input images to the on streams of data and plain unary operations to complex
co-processor. The DC-CNN processing core, responsible for nested operations. Using FIFOs input/output flow can be
executing CNN applications, mainly contains computational managed, and operators can easily be cascaded and connected
units (2-D convolvers), subsampling, and non-linearity units, across tiles. The NeuFlow accelerator uses a compiler named
adders, input and output switches, as shown in Fig. 13. luaFlow to process CNNs. The luaFlow compiler converts
The co-processor uses the three bank memory sub-system to high-level data flow graph representations of deep learning
store input images, kernels, and intermediate data. The DC- algorithms in the Torch5 environment into machine code
CNN uses the Torch7 [66] software for CNN implementa- for NeuFlow. The proposed accelerator has been used to
tion. The proposed dynamically reconfigurable architecture implement a real-time street scene parser.
supports ‘‘inter-output’’ and ‘‘intra-output’’ parallelism. The Peeman et al. [169] presented a memory-centric design
performance of the proposed dynamically reconfigurable method for CNN accelerator. The proposed memory-centric
architecture with 20 convolvers, 128-bit memory port width is accelerator is implemented on a Virtex 6 FPGA board.
4 to 8 times faster than CNP presented in [85]. The proposed This accelerator minimizes the bandwidth requirements by
architecture can be used to accelerate CNN with only three exploiting the data reuse in complex access patterns. The
convolutional layers. The proposed accelerator is not capable memory-centric accelerator uses the loop transformation and
of realizing the full CNNs, which contain convolution and Block RAM (BRAM)-based multi-bank on-chip buffers to
fully connected layers. maximize the efficiency of on-chip memories for better data
A CNN accelerator, referred to as NeuFlow is proposed locality. The memory-centric accelerator uses SIMD type
in [84]. NeuFlow is implemented on a Xilinx Virtex 6 FPGA of PEs to accelerate the convolutional layers. The proposed
platform. NeuFlow contains a 2-D grid of Processing Tiles accelerator design mainly focused on the maximization of the
(PTs), as shown in Fig. 14. A PT contains a bank of reuse of on-chip data. The proposed accelerator is connected
processing operators where an operator can be anything from to a MicroBlaze host processor and is communicated through
memory (FIFO) to an arithmetic operator. All the operators Fast Simplex Link (FSL) connections. Vivado HLS tool is

131798 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

FIGURE 14. 2-D grid of Processing Tiles (PTs) in NeuFlow architecture,

FIGURE 15. NPU-based neural network accelerator architecture, adopted
adopted from [84].
from [171].

used to map the CNNs on the proposed accelerator, which used to transfer the data between the processing elements
enables the user to use the high-level accelerator description and the external memory, which provides independent data
in C and to use HLS directives to specify the hardware streams. The proposed architecture uses the weight stationary
configuration. The performance of the proposed accelerator dataflow to improve energy efficiency. The nn-X accelerator
will be improved with the use of the DMA controller. is implemented using the Xilinx ZC706 platform, which has a
In [171], an accelerator for DNNs is introduced, and it dual ARM Cortex-A9 processor, Xilinx Zynq XC7Z045 chip,
is implemented on the Xilinx Kintex 7 FPGA platform. and 1 GB DDR3 memory. The experimental results show that
It is built using a set of Neural Processing Units (NPUs), the nn-X can achieve a peak performance of 240 GOPS.
see Fig. 15. The number of NPUs in the proposed design
depends on the available FPGA resources. NPUs are mainly
used to compute the majority of operations (multiplications
and additions) in parallel. A multiply and accumulate (MAC)
unit and control logic are the essential components of each
NPU. The proposed accelerator utilizes the available FPGA
resources efficiently by using pipelined architecture, time
division multiplexing (TDM) processing scheme, and page-
mirror algorithm. In the proposed accelerator, NPUs get
the inputs from the host computer through the Ethernet
interface, and weight coefficients are fetched from page
mirror memory. The serializer sends the output of NPUs to
the activation function blocks. For each sample, the proposed
accelerator requires a long time to transfer the appropriate
weight coefficient from the host computer to the accelerator FIGURE 16. Architecture of nn-X system, adopted from [91].
core.
A scalable and low-power accelerator referred to as neural Zhang et.al. [222] proposed a roofline-based model [212]
network next (nn-X) is presented in [91] to accelerate the to implement CNNs on FPGAs. The authors analyzed the
DNNs. The nn-X accelerator mainly contains a co-processor, throughput and required bandwidth for a given CNN design
a host processor, and external memory as shown in Fig. 16. using various optimization techniques, such as loop tiling
The host processor controls the input and configuration and loop transformation. With the help of the roofline
data transfer to the coprocessor, parses the DNN, and model, they identified the solutions with the best performance
converts it into instructions for the coprocessor. The co- and lowest FPGA resource requirement. This roofline-based
processor mainly contains an array of processing elements model optimizes both the memory accesses as well as
called collections, configuration bus, and memory router. computations in the convolutional layers. The accelerator
The collections in the nn-X accelerator are mainly composed design is implemented with the Vivado HLS tool, which
of convolution engines, pooling modules, and non-linear enables the accelerator implementation in C language.
operators and are used to perform the most common CNN The proposed accelerator achieves a maximum throughput
operations, such as convolution, sub-sampling, and activation of 61.62 GFLOPS (Giga Floating-point Operations Per
functions. The memory router in the nn-X accelerator is Second).

VOLUME 10, 2022 131799

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

Implementing DNN in embedded devices is tough due mainly contains modules such as DMA, embedded processor,
to resource and power constraints. In this regard, authors DLAU, and DDR3 memory controller as shown in Fig. 18.
in [172] have developed novel FPGA-based accelerators for The DLAU module mainly contains three processing units,
implementing trained and fully connected DNNs. Since it viz. Partial Sum Accumulation Unit (PSAU), Tiled Matrix
is difficult to map a DNN with a large number of neurons Multiplication Unit (TMMU), and Activation Function
and corresponding weights, directly onto an FPGA, the Acceleration Unit (AFAU). TMMU is used to perform
authors in [172] used a time division multiplexing scheme. multiplication operations and also generate partial sums.
Batch processing is used in the proposed architecture, PSAU is used to add the partial sums derived from TMMU.
which distributes different weights over many input samples. Finally, AFAU is used to perform the non-linear activation
In addition, the suggested accelerator employs a pipelined functions, for instance, the sigmoid function. The DLAU
architecture to make the most of the FPGA resources while module reads the tiled input data through the DDR3 memory.
staying within power and resource limits. The concept The embedded processor provides the programming interface
of pruning has also been incorporated into the proposed to the users and communicates with DLAU via JTAG-UART.
architecture to reduce data transfer from the external memory The proposed architecture is implemented on Xilinx Zynq
to the accelerator [173]. Both Batch processing and weight Zedboard with ARM Cortex-A9 processors operating at
pruning can enhance the throughput of DNN accelerators. 667 MHz.
Qiu et al. [177] proposed FPGA based CNN accelerator,
which will efficiently accelerate all the layers of CNN,
including the fully connected layers. The proposed acceler-
ator improves bandwidth and resource usage by employing
a dynamic-precision data quantization method and a unique
design of the convolver hardware module. The proposed
accelerator applies singular value decomposition (SVD) on
weight coefficients to minimize the memory footprint at the
fully connected layer. The convolver hardware module can
be used for both convolutional and fully connected layers
to reduce resource consumption. The adder tree, convolver
complex, non-linearity, max-pooling, bias shift, and data shift
are the main elements of the convolver hardware module,
as shown in Fig. 17. Convolutions and fully connected layer
operations are both performed using the convolver complex
module. The max pooling action is carried out using the max-
pooling module. CNN’s non-linearity function is calculated FIGURE 18. DLAU accelerator architecture, adopted from [208].
using the non-linearity module. The convolver complex
module generates partial sums, which are added by the adder Lian et al. [140] proposed a block-floating-point (BFP)
tree. Finally, for dynamic quantization, bias shift and data arithmetic-based CNN accelerator for DNN inference. The
shift modules are used. The proposed accelerator supports the proposed accelerator mainly contains three elements: Pro-
Caffe deep learning framework. The proposed accelerator has cessing Array (PEA), on-chip buffer, and external memory,
been implemented on the Xilinx Zynq platform. as shown in Fig. 19. The onboard DDR3 modules receive
input data and network parameters from the host computer
via PCIe3.0 × 8. Conv PEA performs the convolutional
operations, and FC PEA performs the fully connected
layer operations. The proposed accelerator uses 8-bit and
16-bit formats to represent the feature maps and modal
parameters (activations and weights), which can reduce
off-chip bandwidth and memory compared to the 32-bit
floating point counterpart with only a tiny accuracy loss.
The accelerator design is implemented with the Vivado
HLS tool, and the proposed BFP arithmetic is conducted
on the Caffe [119] scheme. The proposed accelerator is
implemented on the Xilinx VC709 evaluation board, running
FIGURE 17. Convolver architecture, adopted from [177].
at a frequency of 200 MHz, and achieves a throughput of
Wang et al. [208] proposed a scalable design called Deep 760.83 GOP/s.
Learning Accelerator Unit (DLAU) for accelerating deep Xiao et al. [216] presented the DNN accelerator architec-
learning algorithms. DLAU utilizes the tiling technique to ture specially designed for the sparse and compressed DNN
produce a scalable architecture. The proposed accelerator models. The proposed DNN accelerator mainly contains

131800 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

FIGURE 20. DNN accelerator architecture proposed in [216], adopted

from [216].

over the Intel Core-i7 and NVidia GTX 1080Ti, respectively.

FIGURE 19. Block diagram of BFP arithmetic-based CNN accelerator, The proposed accelerator has been implemented on the
adopted from [140]. Virtex-7 FPGA running at a frequency of 200 MHz.
The low-power, energy-efficient FPGA-based accelerator
is presented in [116] to accelerate the LeNet CNNs. The
a PE array, RLC encoder, controller, and on-chip buffers, proposed accelerator uses 8-bit, 16-bit, and 32-bit fixed point
as shown in Fig. 20. In the proposed DNN accelerator, all formats to represent the weights, activations, and biases,
the weights and non-linear activation functions are kept in respectively. The proposed accelerator supports pipelining
Run-length Coding (RLC) compressed form and are stored in and implements LeNet with the minimal resources possible
off-chip DRAM memory. The PE array contains 64 PEs and without affecting the throughput. This work uses the Xilinx
performs the multiply-accumulate (MAC) operations of the Vitis HLS tool to convert the C++ code to RTL implementa-
fully connected layer. The proposed accelerator uses a novel tion. The proposed accelerator is implemented on the Nexys
circuit-level processing scheme to process the sparse data DDR 4 FPGA evaluation board and achieves a throughput of
directly in the compressed domain without decompressing, 14K images/sec while using just 628 mW of power.
which leads to improvement in efficiency and performance. An FPGA-based dynamically reconfigurable architecture
The circuit-level process scheme used in this architecture is is presented in [117] to accelerate neural networks. Dynamic
dataflow independent, and thus, applies to both CNN and Partial Reconfiguration (DPR) is used in the proposed
fully connected layers. In this architecture, a new dataflow accelerator to realize different types of neural network
is proposed to facilitate the reuse of input activations across architectures. Dynamic Partial Reconfiguration (DPR) allows
the fully connected layers, which leads to exploits parallelism the proposed architecture to switch between networks and
and maximizes the utilization of PEs. In this work, the Xilinx applications without sacrificing precision or throughput.
Vivado HLS toolchain is used to convert C code to RTL The proposed accelerator mainly contains a PE array and
implementation, and then Xilinx SDSoC is used to compile configurable switches, as shown in Fig. 21. PE is a high-
the source code to generate the bit stream. The proposed level generic block that can implement the layers of a neural
architecture is implemented on the Xilinx Virtex-7 FPGA network accelerator and has three predefined interfaces: data
platform and achieves the performance of 1.34 GOP/s. interface, I/O interface, and memory interface. DPR allows
Ahmed et al. [81] proposed an FPGA-based Low Power each PE to implement many functionalities with the same
CNN (LP-CNN) accelerator based on GoogLeNet CNN. The hardware. Any PE can communicate with any other PE, CPU,
proposed accelerator uses quantization and weight pruning or I/O port of the FPGA through configurable switches. The
techniques to reduce memory size. The LP-CNN accelerator hard/soft processor controls all PE connections using the
is a time-sharing processor designed to process the CNN memory interface. This work uses the Xilinx Vitis HLS tool to
model layer by layer, and it enables pipelining. The proposed convert the C++ code to RTL implementation. The proposed
accelerator only uses the on-chip memory to store the accelerator is implemented on the Xilinx Zynq 7020 FPGA
activations and weights instead of offline DRAM memory. board.
Moreover, the proposed architecture replaces multiplication Gowda et al. [92] proposed an FPGA-based reconfig-
operations with shifting operations and uses no DSP units. urable convolutional neural network (RCNN) accelerator.
The LP-CNN accelerator is implemented in Verilog RTL, Unlike the existing structures, the RCNN accelerator con-
and the Vivado power analyzer has been used to calculate tains configuration registers to reconfigure the architecture
the power. The experimental results show that the LP-CNN according to the configuration instructions stored in the
accelerator provides 49.5 and 7.8 times power improvement Double Data Rate (DDR), as shown in Fig. 22. The image

VOLUME 10, 2022 131801

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

FIGURE 21. Accelerator architecture proposed in [117], adopted

from [117].

and weight buffers are updated with input feature maps and
weights. The PE arrays perform the convolution operations,
whereas the special function buffer performs pooling, Batch
Normalization (BN), and activation functions. The proposed
accelerator uses the SOW (sparse optimization of weight) and
CO (convolutional optimization) optimizations to reduce the
sizes of weights and feature maps, respectively, which also
minimizes the number of hardware resources needed. The
proposed accelerator uses 16-bit, 8-bit, and 4-bit fixed point
formats to represent the feature maps, convolution (CONV)
layer weights, and fully connected (FC) layer weights,
respectively. This work uses the Xilinx Vivado HLS toolchain
to convert C++ code to RTL implementation. The proposed
accelerator is implemented on Xilinx Zynq 7020 FPGA FIGURE 22. RCNN accelerator architecture, adopted from [92].
board.
Table 3 summarizes the reviewed FPGA-based accelera-
tors for a specific algorithm. The year the accelerator was DAG is converted into an SDF hardware intermediate
introduced, the deep learning model used, the FPGA platform format, which corresponds to an utterly parallel hardware
used, the precision used for input feature maps and weights, implementation. After several transformations on ConvNet’s
the clock frequency, the number of resources available in SDF hardware model, the design space is searched, and
terms of DSPs, LUTs, BRAMs, and FFs, the percentage of this procedure provides a set of hardware mappings of
resources utilized, the performance in GOPS, and finally, the the ConvNet onto the specific FPGA-based platform. The
power efficiency (GOPS/W) are all listed for each accelerator. fpgaConvNet front-end parser can examine models written
Fig. 23 shows the power efficiency and throughput of various in the Caffe and Torch machine-learning libraries. This
FPGA-based accelerators listed in Table 3. framework accomplishes efficient design space explorations
through graph segmentation, reconfiguration, folding, and
C. ACCELERATOR FRAMEWORKS WITH HARDWARE weight reloading. This framework can be used to map small
TEMPLATES CNN models, for instance, LeNet-5 on FPGAs.
Several frameworks for mapping AI models onto FPGAs Wang et al. [211] developed a design automation tool
have been developed in recent years. Venieris et al. [206] referred to as DeepBurning that contains a library of building
developed a framework called fpgaConvNet to map CNNs blocks that mimic the behavior of typical neural network
on FPGAs. The fpgaConvNet framework employs the components. The general design flow of the DeepBurning
synchronous dataflow (SDF) paradigm to capture the CNN framework is shown in Fig. 25. The DeepBurning Neural
workloads. The processing flow of fpgaConvNet is shown Network Generator (NN-Gen) takes a model descriptive
in Fig. 24. Firstly, the Deep Learning expert uses a domain- script ( Caffe-compatible script) as input, which describes a
specific language to provide a high-level description of a high-level view of network topology and layer definition. The
ConvNet architecture as well as information on the target DeepBurning NN-Gen also takes user-specified constraints
FPGA-based platform as inputs. The ConvNet description is such as area and power as input. DeepBurning NN-Gen
passed through a DSL (Domain-Specific Language) proces- consists of a hardware generator and compiler that generate
sor, which parses the input script and populates the ConvNet’s the control flow and data layout based on the user’s
semantic model as a Directed Acyclic Graph (DAG), and also specifications. The DeepBurning automation tool’s hardware
extracts platform-specific resource constraints. The ConvNet generator builds a neural network architecture for a given

131802 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

TABLE 3. Summary of FPGA-based accelerators for specific algorithm.

FIGURE 23. Power efficiency and throughput of FPGA-based accelerators listed in Table 3.

network structure by selecting and instantiating blocks from Caffe as its programming interface. DNNWeaver consists of
the library with the required interconnections. DeepBurning three software components: translator, design weaver, and
supports a wide range of NN models and simplifies the integrator. The translator transforms the Caffe specification
design flow of NN-based accelerators for machine learning of a DNN into a macro data flow graph. Design weaver
applications. accepts macro data flow graph as an input and generates
A framework referred to as DNNWeaver is presented a synthesizable Verilog implementation of the accelerator
in [187] that generates bitstream and host code to implement code. The integrator adds the memory interface code to
DNNs on various FPGA boards. DNNWeaver employs the accelerator code. DNNWeaver generates accelerator

VOLUME 10, 2022 131803

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

and well-designed communication-optimized algorithms,

FP-DNN performs model inference for DNNs.

FIGURE 24. Processing flow of fpgaConvNet, adopted from [206].

FIGURE 26. FP-DNN framework, adopted from [95].

Umuroglu et al. [204] proposed FINN, a framework that

maps trained Binarized Neural Networks (BNNs) onto
an FPGA. FINN generates a synthesizable C++ network
description of a flexible heterogeneous streaming architec-
ture. The architecture mainly contains pipelined compute
engines that communicate via on-chip data streams. Each
BNN layer has been implemented using dedicated compute
engines with 1-bit values for FMs and weights. To evaluate
FIGURE 25. Design flow of DeepBurning framework, adopted from [211]. FINN, the authors implemented CNV, a convolutional
network topology inspired by BinaryNet [68] and VGG-
16 [192], on a Xilinx Zynq-7000 FPGA board running
code from a series of scalable and customizable hand- at 200 MHz to accelerate BNN inference.
optimized template designs, resulting in high performance Guo et al. [96] proposed a flexible and programmable
and efficiency. CNN accelerator, referred to as Angle-Eye, together with the
Guan et al. [95] proposed a framework called Field Pro- compilation tool and the data quantization scheme. The data
grammable DNN (FP-DNN) to accelerate DNNs efficiently quantization scheme can be used to reduce the bit-width down
on FPGAs. FP-DNN Framework is shown in Fig. 26. The to 8-bit with insignificant accuracy loss. The compilation tool
model description is generated by TensorFlow and is fed is responsible for mapping a given CNN model efficiently
into a Symbolic Compiler. The compiler generates a C++ onto the hardware architecture. The proposed accelerator
program and an FPGA programming bit stream for model supports the acceleration of various CNNs on different FPGA
inference, executed by the host and device, respectively. platforms. The overall architecture of the Angel-Eye is shown
Model mapper examines the model description, extracts the in Fig. 27. Angle-Eye accelerator mainly consists of a PE
target model’s topological structure and operations, and sends array, controller, on-chip buffer, and external memory. The
the hardware kernel schedule and configuration to the code PE array is used to perform the convolution operations,
generators. Software generator generates the host code in and it supports three levels of parallelism: input channel
C++ using the kernel scheduling. The host code is compiled parallelism, kernel-level parallelism, and output channel
using a commercial C++ compiler to create host programs. parallelism. The on-chip buffer can isolate the PE array
Hardware generator creates device codes by instantiating from external memory, allowing simultaneous convolution
RTL-HLS hybrid templates based on kernel configuration. and data I/O operations. All network parameters and the
The hardware code is compiled using commercial synthesis results of each layer can be saved to external memory. The
tools to generate the programming files for the hardware controller is responsible for receiving, decoding, and issuing
implementation. With a high-performance computing engine instructions to the other three components and monitoring

131804 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

each component’s work status. Angle-Eye accelerator is

implemented on the Zynq XC7Z045 platform.

FIGURE 27. Angel-Eye accelerator architecture, adopted from [96].

Zhang et al. [223] proposed a software/hardware co-design

library called Caffeine to accelerate CNNs efficiently on
FPGAs. The authors developed a uniform convolutional
matrix-multiplication representation for both convolutional FIGURE 28. CNN2Gate, adopted from [90].
and fully connected layers. Caffeine synthesizes Caffe
models comprising convolutional layers and fully connected
layers for FPGAs. The Caffeine framework effectively into DPU instruction code before deploying them to the
handles weights and biases reconfiguration in off-chip target DPU platform. For efficient DNNs implementations,
DRAM to maximize the underlying memory bandwidth the Xilinx DPU offers a tailored and scalable overlay with
utilization. The authors integrated the Caffeine with a deep ISA architecture. Xilinx Vitis AI framework supports various
learning framework Caffe and implemented AlexNet and frameworks such as Caffe, PyTorch, TensorFlow, etc., and
VGG networks on multiple FPGA platforms, viz., Xilinx efficiently implements deep learning tasks, CNN, and RNN,
KU060 FPGA board, and Virtex7 690t FPGA board. Caffeine see Fig. 29. The internal architecture of the DPU contains
achieved better energy efficiency than 12-core CPU and GPU an Instruction Unit (IU), a Compute Array (CA), and a
by 43.5 × and 1.5 ×, respectively. Global Memory Pool (GMP). The IU fetches the DPU
Ghaffari et al. [90] developed a general framework called instructions associated with the model, decodes it, and drives
CNN2Gate, which allows mapping CNN models on FPGAs the PEs present in compute array. It also manages the
with automated design space exploration. The CNN2Gate data/instructions transfer among the PEs and the memory.
overall architecture consists of an Open Neural Network The GMP acts as buffer for the input and output data as
eXchange (ONNX) format parser, a design-space exploration well as intermediate output from the DPU, resulting in high
module, and leverages automated high-level synthesis is throughput [113]. The DPU can be configured to meet the
shown in Fig. 28. CNN2gate can parse CNN models using requirements of a specific CNN architecture, and the Vitis
ONNX parser from several popular high-level machine AI stack contains all the necessary libraries to generate the
learning libraries, such as Caffe2, Keras, TensorFlow, etc. instructions for the DPU. The development flow is described
The computation flow of network layers and their weights in Fig. 29, where a trained model is compiled using the Vitis
and biases are retrieved in CNN2Gate, and a fixed-point AI compiler. The Vitis AI tools provide a model quantizer
quantization is used. To undertake design space exploration to reduce the precision of weights without losing accuracy.
for deeply pipeline OpenCL kernels of CNN, the authors used An Xmodel file is generated by the Vitis Compiler consisting
time-limited reinforcement learning. of domain-specific instructions for the DPU unit, which are
Xilinx Vitis AI [32] is a framework for implementing used to configure the DPU. During inference, a Python script
deep learning inference on Xilinx FPGAs and SoCs. It uses running on the PS acts as the interface, and it is responsible
an Intellectual Property (IP) core called the Deep Learning for transferring the data from the on-chip memory to the
Processor Unit (DPU) to implement ample essential functions DPU memory buffers. Examples of CNNs that have been
of deep learning on FPGAs, see Fig. 29. Xilinx Inc. released implemented using DPU include, but are not limited to, VGG,
the DPU, a programmable engine designed for DNNs. Xilinx ResNet, GoogLeNet, YOLO, SSD, MobileNet, and FPN.
Vitis AI framework enables the compression of DNN models Table 4 summarizes the reviewed FPGA-based accelerator
without sacrificing accuracy and compiling DNN models frameworks for the implementation of DNNs.

VOLUME 10, 2022 131805

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

FIGURE 29. DPU architecture overview, adopted from [227], Vitis AI stack, and development flow.

TABLE 4. Summary of FPGA-based accelerator frameworks.

IV. ASIC BASED ACCELERATORS prevent zero multiplications. The following sections provide
Application Specific Integrated Circuit (ASIC) is a powerful a comprehensive overview of ALU, Dataflow, and Sparsity-
platform to accelerate the DNNs. ASICs are customized based accelerators.
chips designed for a specific application. They are smaller
in size, consume less power, and provide higher speeds, A. ALU BASED ACCELERATORS
making them suitable solutions for DNN acceleration [76]. NeuFlow is the ASIC based CNN accelerator presented
ASIC based hardware accelerators have limited computing in [170] to accelerate the NNs and other ML algorithms. The
resources, memory resources, and I/O bandwidths compared architecture of the proposed accelerator is the same as the
with GPU based accelerators, but they can achieve moderate accelerator discussed in [84] and shown in Fig. 14, but is
performance and consume less power [165]. Furthermore, implemented using IBM 45 nm Silicon-On-Insulator (SOI)
ASIC exhibits the best computation speed and energy effi- process. The NeuFlow accelerator uses a compiler named
ciency than GPU and FPGA at the cost of reconfigurability. luaFlow to process CNNs. The luaFlow compiler converts
Many researchers are focused on building custom ASICs for high-level data flow graph representations of deep learning
accelerating CNNs inference workloads to achieve the best algorithms in the Torch5 environment into machine code for
performance and energy efficiency. In this section, we would Neuflow. The proposed architecture provides higher power
like to review the recent ASIC-based DNN accelerators. efficiency and is suitable for vision-based applications, such
There are three broad types of ASIC-based DNN accel- as autonomous vehicle navigation, driving assistance, etc.
erators depending on how the architecture has been opti- The proposed architecture achieves the maximum throughput
mized/designed: ALU (Arithmetic Logical Unit), Dataflow, of 320 GOPS with a power consumption of 0.6 W; in
and Sparsity-based accelerators. The main building block, contrast, the NeuFlow architecture implemented on Xilinx
the MAC unit (or an array of MAC units), in ALU- Virtex6 FPGA presented in [84] has a maximum throughput
based accelerators is modified to have ample computational of 16 GOPS with power consumption of 10 W.
resources and flexibility to obtain the best performance with Chen et al. [53] proposed the ASIC-based hardware
varying bit accuracy. In dataflow-based accelerators, the accelerator, also called DianNao, to accelerate the large-
activations, weights, and partial sums are managed to reduce scale CNNs and DNNs. The proposed architecture provides
the energy needed to move data within the chip and achieve the quick and energy-efficient execution of the inference
high arithmetic intensity. In Sparsity-based accelerators, the of large-scale CNNs and DNNs. The architecture contains
unstructured sparse data is handled in such a way that the the Neural Functional Unit (NFU), buffers, and control
matrix multiplication units (2-D array of MAC units) can processor (CP), see Fig. 30. The NFU module is used to

131806 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

perform the computations needed to determine the output

of the neuron in the fashion, i. e., in the first stage, NFU
performs the multiplication of input neuron values with
weight coefficients. In the second stage, NFU accumulates
these products using the adder trees. In the third stage,
NFU calculates the activation functions. Buffers are used
to store the input/output neuron values and weights. The
proposed architecture contains three buffers viz., an input
buffer to store the input neuron values (NBin), an output
buffer to store the output neuron values (NBout), and a third
buffer to store the weights (SB). Different computational
operators are invoked in each stage depending on the type
of the layer (convolution, activation function, pooling, etc.)
For architecture exploration, the author developed a C++
simulator that evaluates execution time and serves as a speci-
fication for the Verilog implementation. The Verilog version
of the accelerator is synthesized using Synopsys’ design
FIGURE 30. DianNao accelerator architecture, adopted from [53].
compiler, and the generated design is placed and routed
by Synopsys’ ICC compiler. The design is simulated using
Synopsys’ VCS, while PrimeTime PX is used to determine adder, divider, multiplier, and converters for the 16-bit float
the power. The proposed architecture was implemented using to 32-bit float and 32-bit float to 16-bit float. It may also
65 nm CMOS technology. The experimental results show that be used to compute estimates using the Taylor expansion of
DianNao achieves an average performance of 452 GOPS with log (1-x). HotBuf (8 KB) and ColdBuf (16 KB) store the
485 mW of power consumption. The proposed accelerator input data with short and longer reuse distances, respectively.
has scalability issues due to the bandwidth constraints of the OutputBuf (8 KB) is used to store the output data or
memory system. The DaDianNao accelerator [54] and [146] intermediate results. The authors implemented an in-house
are extensions of the DianNao accelerator [53]. DaDianNao C simulator of PuDianNao; it acts as a specification for the
has enough on-chip memory to hold all of CNN’s weights. Verilog implementation and also measures the performance
DaDianNao also uses 16-bit fixed-point representation in the of PuDianNao on large-scale datasets. The design compiler
inference process like DianNao. However, it is implemented synthesizes the design, and the ICC compiler is used to
using 28 nm CMOS technology. The design compiler synthe- generate the layout. The energy, area, and critical path are
sizes the Verilog version of the DaDianNao accelerator, and obtained after the layout. The design is simulated using
the ICC compiler is used to generate the layout. The energy, Synopsys VCS, and PrimeTime PX is used to determine
area, and critical path are obtained after the layout. The design the power using the Value Change Dump (VCD) file. The
is simulated using VCS, while PrimeTime PX is used to proposed architecture has been implemented using TSMC
determine the power. The proposed architecture uses eDRAM 65 nm CMOS technology.
to store all the data related to a CNN, i. e., input feature
maps, weight kernels, output kernels, etc. The DaDianNao
accelerator gives better performance while accelerating the
CNNs, but provides moderate to low performance while
accelerating large-scale CNNs.
Liu et al. [141] proposed the machine learning accel-
erator referred to as PuDianNao, that supports multiple
machine learning scenarios (e.g., regression, classification,
and clustering) as well as many machine learning techniques,
including k-means, k-nearest neighbors, linear regression,
classification tree, naive bayes, support vector machine, and
DNNs. The PuDianNao mainly contains various Functional
Units (FUs) and three types of data buffers: ColdBuf,
HotBuf, and OutputBuf, an instruction buffer (InstBuf), FIGURE 31. PuDianNao accelerator architecture, adopted from [141].
and a DMA, and a control module, see Fig. 31. The FU
contains a Machine Learning Functional Unit (MLU) and Du et al. [78] proposed a CNN accelerator referred
an Arithmetic Logic Unit (ALU). The MLU can be used to as ShiDianNao to improve the energy efficiency and
to perform several computational primitives, including dot scalability of DianNao [53] design discussed above. The
product, counting, sorting, distance calculations, non-linear ShiDianNao accelerator does not access the main memory
functions, for instance, sigmoid and so on. The ALU has an while executing a CNN and achieves more energy efficiency

VOLUME 10, 2022 131807

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

compared to DianNao. The design is implemented in Verilog convolutional operations, and FRP performs matrix multipli-
and synthesized by the design compiler, and IC compiler cation operations. DNPU is the first CNN/RNN accelerator
is used to place and route the synthesized design. The with the highest energy efficiency of 8.1 TOPS/W on 65 nm
energy cost of DRAM accesses is calculated using CACTI CMOS technology. DNPU has some limitations; for instance,
6.0 [159]. The ShiDianNao accelerator will not support its area limits the number of processing elements (PEs)
the acceleration of large-scale CNNs. The ShiDianNao for convolutional layers (CL) and recurrent layers (CL).
accelerator is implemented using 65 nm CMOS technology. As a result, performance was sub-optimal in cases that just
DianNao [53], DaDianNao [54], [146], PuDianNao [141], required CLs or RLs. Furthermore, DNPU only supports
and ShiDianNao [78] are not built utilizing reconfigurable a limited number of weight-bit precisions, such as 4 bits,
hardware, hence they cannot be adapted to changing 8 bits, or 16 bits. Lee et al. [133] proposed the Unified
application demands such as NN sizes. Neural Processing Unit (UNPU) architecture to process
Lu et al. [145] proposed a flexible dataflow architecture CNNs and RNNs. UNPU contains a bit-serial MAC unit
called FlexFlow to accelerate the CNNs, exploiting all kinds to perform the required computations. UNPU supports CLs,
of parallelisms viz., inter-kernel, intra-kernel, and inter- RLs, and fully connected layers (FCLs) with fully-variable
output on a two-dimensional array of PEs. FlexFlow has weight bit-precision from 1 to 16 bits. UNPU achieves an
additional interconnections between on-chip memories and energy efficiency of 3.08, 11.6, and 50.6 TOPS/W for the
PEs, which provides the flexibility to fetch any neuron case of 16-bit, 4-bit, and 1-bit weights, respectively. UNPU
from any feature map. The proposed accelerator minimizes achieves 1.43× higher energy efficiency than the DNPU for
the interconnections between the PEs at the cost of energy convolutional layers with 4-bit weights.
because of data movement from on-chip memory to PEs.
In FlexFlow, all the PEs are operated in parallel, therefore, B. DATAFLOW BASED ACCELERATOR
helping in improving the overall throughput. The proposed The accelerators based on dataflow put a special emphasis on
architecture has high scalability and supports different sizes data management to minimize off-chip memory reads/writes.
of CNNs with stable resource utilization. FlexFlow only When it is feasible, reusing parameters between layers can
implements CNNs and is confined to within a layer rather enhance dataflow. For instance, in a convolutional layer, both
than across layers. The design is simulated, synthesized, activations and weights can be reused. In a fully connected
placed & routed using Synopsys’ tools. The FlexFlow layer, each neuron has a unique set of weights; as a result,
accelerator is implemented using TSMC 65 nm technology. weights cannot be reused, but input data may. In order to
Hardik et al. [188] developed a bit-level dynamically minimize data movement between a computing unit and
composable architecture called Bit Fusion for accelerating higher-level memory, the reusable parameters are kept in
DNNs. Bit fusion mainly consists of an array of bit-level local registers.
computation elements, called BitBricks, that dynamically Cavigelli et al. [50] proposed the Origami CNN accel-
fuse to match the bit width of individual DNN layers and erator, which is scalable to different network sizes. The
execute DNN operations with the required bit width, without proposed architecture uses the Weight Stationary (WS)
any loss of accuracy. Furthermore, Bit Fusion supports dataflow to improve energy efficiency during the acceleration
the multiplication of 2, 4, 8, and 16 bits spatially. Bit process. WS dataflow minimizes energy consumption by
Fusion decomposes a 16-bit multiplication into multiple maximizing the access of weight coefficients. WS dataflow
2-bit multiplications to achieve the flexibility to efficiently used in the Origami maximizes the convolution and filter
map various layers of CNN with different bit widths and reuse of weights. The proposed accelerator was implemented
minimize the computation and the communication with no using UMC 65 nm CMOS technology and having a core
loss of accuracy. Bit Fusion architecture comes with an area of 3.09 mm2 . The proposed CNN accelerator can
Instruction Set Architecture (ISA) that minimizes the data achieve a throughput of 274 GOPS and a power efficiency
transfer and maximizes the parallelism in computations. of 369 GOPS/W with an external memory bandwidth of
The proposed design is implemented in Verilog and is 525 MB/S full-duplex. The proposed architecture is only used
synthesized using the Design Compiler, which estimates to perform the convolution operation and is unsuitable for
the area, frequency, and power. The proposed accelerator implementing the fully connected layer operations.
architecture is implemented on 45 nm CMOS technology. Bit Eyeriss [56] is an ASIC based CNN accelerator that
Fusion accelerator achieves 5.1× energy saving and 3.9× uses a row-stationary (RS) dataflow that minimizes data
speedup over Eyeriss accelerator. movement energy consumption on a spatial computing
Shin et al. [190] proposed Deep Neural Processing Unit architecture. RS dataflow is adaptable to various CNN shapes
(DNPU) architecture to process CNNs and Recurrent Neural and minimizes energy consumption by reusing the filter
Networks (RNNs). DNPU is a SIMD MAC-based CNN/RNN coefficients and input feature maps. The proposed accelerator
accelerator that uses dynamic precision control to minimize mainly contains a 12 × 14 PE array, feature map compression
kernel data size. DNPU consists of a convolutional layer units, and a 108 KB global buffer; ReLU as shown in Fig. 32.
processor (CP), a fully connected and RNN-LSTM layer The global buffer enables the reuse of loaded data from off-
processor (FRP), and a RISC controller. CP performs chip DRAM and the generated results by PEs and is also

131808 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

responsible for returning the final results to the off-chip

DRAM. In the Eyeriss accelerator, the PEs are connected
via a Network on Chip (NoC). The NoC used in Eyeriss
only supports multi-cast. The authors proposed an analysis
framework for calculating the energy efficiency of various
CNN data flows under the same hardware constraints. The
proposed accelerator is implemented using 65 nm CMOS
technology.

FIGURE 33. Eyeriss v2 top-level architecture, adopted from [57].

presented in [131]. MAERI contains a set of multiply adder

computation units, each augmented with tiny configurable
switches that can be configured to support various kinds
of dataflows, see Fig. 34. The prefetch buffer stores the
input activations, intermediate partial sums, weights, and
output activations. Acceleration units mainly contain look-up
FIGURE 32. Eyeriss DNN accelerator, adopted from [56]. tables (LUT) and perform activation functions. MAERI uses
two configurable interconnect networks, namely, distributed
Chen et al. [57] proposed a DNN accelerator architecture network and augmented reduction network. To assist the
referred to as Eyeriss v2 to accelerate compact and sparse effective mapping of the irregular dataflows and to provide
DNNs. Like Eyeriss [56], Eyeriss v2 is composed of an high resource utilization, MAERI offers non-blocking com-
array of PEs to perform MAC operations, global buffers, and munication via reconfigurable links with large bandwidth.
local scratchpad (SPad) memory to support data reuse. In the The proposed accelerator can accelerate various operations
Eyeriss v2 accelerator, PEs and global buffers (GLB) are viz., convolution, pooling, fully connected layer, and LSTM.
grouped into clusters to support a flexible Network On Chip The proposed accelerator also supports sparsity and cross-
(NoC), as shown in Fig. 33. The main difference between layer mapping. MAERI is implemented in Bluespec System
Eyeriss and Eyeriss v2 is that Eyeriss v2 uses a hierarchical Verilog (BSV) [163] and is synthesized with TSMC 28 nm
mesh NoC (HM-NoC) to connect the global buffers to the standard cell and SRAM library at 200 MHz.
PEs; in contrast, the Eyeriss uses multicast NoC between the
global buffer and PEs. Furthermore, the Eyeriss v2 accel-
erator uses separate NoCs to transfer the input activations,
weights, and partial sums between the global buffer and
PEs. The hierarchical mesh NoC used in the Eyeriss v2
accelerator supports unicast, multicast, and broadcast. The
HM-NoC can be configured into various modes ranging from
high data reuse to high bandwidth. The proposed architecture
supports various CNN layer dimensions and sizes because
of the flexible hierarchical mesh NoC. The authors proposed
an analysis framework named EYEXAM for evaluating the
performance of various CNN dataflows. The Eyeriss v2
accelerator has higher hardware utilization than Eyeriss but
has a large area overhead. The experimental results show
that Eyeriss v2 reaches 11.3× and 42.5× improvement
FIGURE 34. MAERI architecture, adopted from [57].
in energy efficiency and throughput, respectively, with the
sparse AlexNet, compared to Eyeriss running with the Tensor Processing Unit (TPU) is developed by Google in
AlexNet. It also achieves 2.5× and 12.6× improvement in order to implement machine learning algorithms. A matrix
energy efficiency and throughput, respectively, with sparse multiplication unit as a systolic array of 256 × 256 units
MobileNet compared to Eyeriss running with MobileNet. is used in the TPU architecture [120]. Fig. 35 shows the
Multiply-Accumulate Engine with Reconfigurable Inter- block diagram of the TPU. The mentioned systolic array
connect (MAERI) is a DNN accelerator containing a set structure is basically built with weight-stationary dataflow
of configurable building blocks to support various CNN and as a 2-D SIMD architecture. Extracting from the DRAM,
partitions and mapping by configuring the tiny switches the weights can then be stored in the weight FIFO (First-In,

VOLUME 10, 2022 131809

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

First-Out) register. The results from the previous layers and array to perform the required MAC operations. SCNN
the input activation function are stored in the unified local exploits all three kinds of parallelisms viz., inter-kernel,
buffer. In order to perform a convolution operation on a intra-kernel, and inter-output. SCNN requires additional
matrix multiply unit, a systolic data setup block is used in optimization circuitry to implement the fully connected layer
order to rearrange the data. Efficient running of machine operations. SCNN improves the performance by skipping
learning model tasks and inference tasks like search and the zeros in the input feature maps and weights. SCNN is
image recognition, and language translation have been the implemented in system C and Catapult High-Level Synthesis
focus of the first version of TPU, called TPU1. Since 2015, (HLS) [30] tool is used to generate the Verilog RTL. Synopsys
TPU1 has been operational in Google’s data center. A second Design Compiler synthesizes the Verilog version of the
version TPU2, also called Cloud TPU is operational in data design. SCNN is implemented using TSMC 16 nm FinFET
centers for the purpose of training and interference. Cloud technology.
TPU supports several frameworks, including TensorFlow, Eyeriss [56] also looked into input sparsity as a way
PyTorch, and JAX/FLAX. to save energy. The gating mechanism deactivates MAC
units that correspond to zero inputs. Gating saves energy
while not increasing throughput. With sparse models, The
processing speed and energy efficiency of Eyeriss V2 [57]
have improved due to its ability to process sparse data directly
in compressed format for both the weights and activations.
Zhang et al. [225] developed Sparse Neural Acceleration
Processor (SNAP) to exploit unstructured sparsity in DNNs.
To ensure that data is distributed evenly throughout the MAC
units, SNAP employs parallel associative search. SNAP is
fabricated using 16 nm CMOS technology and achieves a
peak energy efficiency of 21.55 TOPS/W (FP16) for CONV
layers with 10% weight and activation density.
Lee et al. [135] proposed an energy-efficient on-chip
accelerator called LNPU for sparse DNN model learning.
In the LNPU accelerator, Sparsity is exploited with intra-
channel as well as inter-channel accumulation. The input
load buffer module of the LNPU evenly distributes workload
FIGURE 35. Block Diagram of TPU, adopted from [120].
among the PEs while considering irregular sparsity. LNPU
uses the fine-grained mixed precision (FGMP) of FP8-FP16
that optimizes data precision while maintaining training
C. SPARSITY BASED ACCELERATORS accuracy. LNPU maintains an average hardware utilization
The fraction of zeros in a CNN layer’s weights and input of 100%. LNPU is fabricated using 65 nm CMOS technology
activation matrices is called sparsity. Since multiplying by and has an energy efficiency of 3.48 TFLOPS/W (FP8) at 0%
zero should produce a zero, there should be no effort required. sparsity and 25.3 TFLOPS/W (FP8) at 90% sparsity.
As a result, typical layers can cut work by a factor of four, and SIGMA is a scalable and flexible accelerator proposed
in some instances, by a factor of ten. Also, the addition is not in [176] to implement large, irregular, and sparse general
needed because the zero products won’t add anything to the matrix-matrix multiplications (GEMMs). The basic building
total of which they are a part. Moreover, data with many zeros block in SIGMA is the Flexible Dot Product Engine (Flex-
can be compressed—these traits, when combined, open up a DPE). All the Flex-DPE modules can be interconnected via
lot of possibilities for improvement. This section provides a simple NoC. In SIGMA, all the Flex-DPE multipliers are
comprehensive overview of accelerators that explore sparsity. arranged in a 1-D fashion, and it performs the multiple
A CNN accelerator referred to as Sparse CNN (SCNN) is variable-sized dot-products in parallel. SIGMA uses scalable
presented in [167] for inference of CNNs. SCNN employs interconnects to efficiently map the GEMMs of different
a novel dataflow referred to as sparse Planar-Tiled Input- dimensions and sparsity levels to the PEs. SIGMA outper-
Stationary Cartesian Product (PT-IS-CP-sparse) dataflow forms systolic array architectures by 5.7× for irregular sparse
that maximizes the reuse of activations and weights and matrices. SIGMA is implemented using the 28 nm CMOS
removes needless data transfers and reduces storage and technology and achieves a throughput of 10.8 TFLOPS with
power requirements. The dataflow used in SCNN eliminates a power dissipation of 22.33 W.
all multiplications with a zero and keeps both activations and Zhang et al. [224] proposed an accelerator called GAMMA
weights in compressed form. SCNN mainly contains an array to perform the Sparse matrix-sparse matrix multiplication
of processing elements arranged in a 2-D fashion with systolic (spMspM) operations. The proposed accelerator uses Gus-
connections to transfer partial sums. The proposed dataflow tavson’s algorithm [99] to compute the spMspM operations.
efficiently delivers activations and weights to the multiplier GAMMA accelerator mainly consists of an array of

131810 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

processing elements(PEs), on-chip storage referred to as accelerators together without hampering the SIMT execution
FiberCache, and a scheduler, as shown in Fig. 36. The PEs model. NGPU provides significant energy and performance
are used to perform the required spMspM operations that benefits at the cost of reasonably low hardware overhead.
combine sparse input rows to produce each output row. NGPU achieves 2.44× average speedup and 2.8× average
FiberCache is a specialized memory structure that stores energy reduction compared to the baseline GPU architecture
the non-zero elements and their coordinates. The scheduler across different sets of benchmarks.
distributes computational workloads among PEs to maximize Danial et al. [196] presented a framework for accelerating
resource efficiency while reducing unnecessary access to the training and classification of arbitrary CNNs on the
shared memory. GAMMA is implemented using 45nm GPU. The proposed method improves the performance
CMOS technology. by moving the computationally intensive tasks of a CNN
to the GPU. Training and classification of CNN on the
GPU performs 2 to 24 times faster than on the CPU
based on the network topology. Li et al. [139] proposed
an efficient GPU implementation to accelerate the training
process of large-scale Recurrent Neural Networks (RNN).
When compared to the CPU-based solution with Intel’s
Math Kernel Library (MKL), the proposed method yields
a speedup of 2 to 11 times. Kim et al. [127] proposed a
new memory management scheme to enhance the overall
GPU memory utilization in multi-GPU systems for deep
learning algorithms acceleration. The authors extended the
concept of vDNN to a multi-GPU environment employing
PCIe-bus, where vDNN [179] virtualizes the GPU and
FIGURE 36. Block Diagram of GAMMA, adopted from [224].
memory of the CPU so that it can be used simultaneously
We summarized the reviewed ASIC-based accelerators to train DL algorithms in a hybrid fashion. The suggested
for DNN in Table 5. For each accelerator, we list the year memory scheme increases batch size by 60% in multi-
the accelerator was introduced, the process technology, the GPU systems and enhances training throughput by 46.6%.
clock frequency, the dataflow, the architecture type, the High-performance GPU dedicated architecture referred to
power dissipation, the area, the performance in GOPS, and as TResNet is presented in [180] to accelerate CNNs. The
finally, the power efficiency. Fig. 37 shows the plots of proposed architecture effectively utilizes GPU resources and
various metrics, such as power, throughput, area, and power achieves better accuracy and efficiency.
efficiency of ASIC-based accelerators. Nvidia GPUs are the most popular for Deep Learning (DL)
implementations. Table 6 lists the accelerators that Nvidia has
V. GPU BASED ACCELERATORS released, which are used for the inference and training of deep
Over the last few decades, Graphics Processing Units (GPUs) learning (DL) algorithms and have both a Central Processing
are widely used in training DL algorithms or CNNs for Unit (CPU) and a GPU integrated on a single chip.
face recognition [109], object detection [220], [226], data
mining [88], and other AI applications. GPU supports VI. CGRA-BASED ACCELERATORS
parallelism due to lots of parallel cores in the architecture Coarse Grain Reconfigurable Architectures (CGRAs) pri-
and offers significant computation speed. GPU exploits marily consist of an array of Processing Elements (PEs)
large degrees of data-level parallelism in the applications connected using reconfigurable interconnects. When com-
through the Single Instruction Multiple Thread (SIMT) pared to FPGAs, CGRAs often have a shorter reconfiguration
execution models. The high computational capacity of the time. CGRAs have emerged as a popular option for real-
GPUs makes them the primary choice for DNN acceleration. time computing due to their low power consumption, high
In this section, we would like to review some of the recent efficiency, fast reconfiguration time, and ability to perform
GPU-based DNN accelerators. both spatial and temporal calculations. In recent years,
The study of implementing a standard backpropagation CGRAs have become increasingly significant in accelerating
algorithm for training multiple perceptrons simultaneously DNNs, particularly CNNs, thanks to their ability to combine
on GPU using NVIDIA CUDA technology is presented FPGAs’ flexibility with ASICs’ efficiency. In this section,
in [100]. For a given program, GPU-based implementation we would like to review some of the recent CGRA-based
on NVIDIA GTX 260 GPU achieves 50× to 150× speedup DNN accelerators.
compared to the CPU-based implementation. A neurally Jafri et al. [118] proposed a CGRA-based accelerator
accelerated architecture for GPU, called NGPU (neurally named NeuroCGRA to realize neural networks and digital
accelerated GPU) is presented in [218] to enable scalable signal processing applications. The authors have opted to
integration of neural acceleration with a large number of GPU investigate the viability of deploying neural networks on an
cores. The proposed architecture brings the neural and GPU actual CGRA using a Dynamically Reconfigurable Resource

VOLUME 10, 2022 131811

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

TABLE 5. Summary of ASIC-based accelerators.

TABLE 6. GPU-based accelerators developed by Nvidia.

Array (DRRA). DRRA mainly consists of four elements, viz., The authors have implemented edge detection on DRRA
Data Path Units (DPUs), register files (Reg-files), Switch using the proposed framework.
Boxes (SB), and sequencers, as shown in Fig. 38. The DPUs EMAX is an energy-efficient, low-power CGRA architec-
are the functional units that perform the required computa- ture with on-chip distributed memory proposed in [202] to
tions. The Reg-files store the data for the DPUs. Intercon- implement CNNs. EMAX supports both CNN training and
nectivity between various DRRA components is provided inference. EMAX is composed primarily of an array of PEs
through SBs. The sequencers configure the DPU, switch and an interconnection network, as shown in Fig. 39. Each
boxes, and register files. Distributed Memory Architecture PE is connected to its neighbors by local interconnections,
(DiMArch) is essentially a scratch pad providing enough and each row of the PE array has a shared bus. The results
data to the DRRA. The authors have embedded dedicated of calculations performed on the PEs are passed on to the
hardware, known as neuroDPU, with each DPU of DRRA PEs that exist in the next row. The PEs can access external
to implement neural networks on it. The authors proposed memory (DRAM) via the memory interface. Each PE has
a neural network translator that provides a framework for two execution units that perform the arithmetic and logical
mapping neural networks onto CGRAs. The translator takes operations. Each PE also has a local memory to store the
three inputs, viz., network model, weights, and network required data, reducing the memory bandwidth pressure.
specifications, and generates three outputs: DPU, Reg- Experimental results show that EMAX performs better than
file, and SB instructions. NeuroCGRA is synthesized using GPUs in terms of per memory bandwidth and per area.
65nm technology running at a frequency of 500 MHz. A CGRA-based accelerator referred to as stream dual-
A framework called FIST is presented in [162] that allows track CGRA (SDT-CGRA), which targets the implementation
the NeuroCGRA [118] to realize both DSP applications of object inference algorithms, is presented in [83]. SDT-
and neural networks, depending on the target applications. CGRA employs stream processing and uses both static and

131812 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

FIGURE 37. Performance metrics of ASIC-based accelerators.

dynamic configurations for stream processing. The SDT- compiler is used to synthesize the design. The proposed SDT-
CGRA accelerator mainly contains an array of PEs known CGRA is implemented using SMIC 55nm CMOS technology.
as reconfigurable cells (RS) and stream buffer units (SBUs), Experimental results show that SDT-CGRA outperforms
as shown in Fig. 40. The SDT-CGRA architecture is divided EMAX by three times in terms of operations per memory
into two sections: global memory and computing array. bandwidth.
The global memory section is dynamically configured and In [110], the authors proposed mapping of CNNs onto
stores data streams. On the other hand, the computing array Tightly Coupled Processor Array (TCPA) efficiently. TCPA
section operates in a static configuration mode. It comprises belongs to the class of CGRA, containing an array of tightly
several RC columns and one special RC column. The Special coupled VLIW Processing Elements (PEs) [104]. TCPA
RC is used for operations like power (represented as PRC offers multiple levels of parallelism, for instance, task-level,
in Fig. 40) and piece-wise functions (represented as IRC loop-level, iteration-level, instruction-level parallelism, etc.
in Fig. 40). The crossbar switch serves as a bridge to connect TCPAs are suited for accelerating computationally expen-
the RC array and SBUs. Data can be transferred from off-chip sive nested loop programs exhibiting a high degree of
memory to SBUs using the external direct memory access parallelism, such as CNNs. CNN layers are based on
interface. Static and dynamic interfaces are used for static and matrix multiplications which can be written as 6-dimensional
dynamic configurations, respectively. The proposed SDT- nested loops, making them suitable for acceleration. It was
CGRA is realized in Verilog HDL, and Synopsys design demonstrated that TCPAs use techniques such as loop

VOLUME 10, 2022 131813

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

multiplication and accumulation operations are chained

together in the MAC mode to realize the function. On the
other hand, in the MUL/ALU mode, a PE can choose
either a multiplication or an addition operation for each
cycle. Operand reuse network offers input-to-input routing
instead of output-to-input routing. The proposed NP-CGRA
is realized in Verilog HDL, and Synopsys design compiler
is used to synthesize the design. The proposed NP-CGRA
is implemented using Samsung 65nm CMOS technology.
Experimental results show that the area-delay product of NP-
CGRA is 8-18 times better than that of baseline CGRA.

FIGURE 38. DRRA computation layer [118].

FIGURE 39. EMAX architecture, adopted from [202].

permutation, loop unrolling, and layer-parallel processing to FIGURE 40. SDT-CGRA architecture, adopted from [83].
exploit the parallelism offered by the TCPA architecture.
Layer fusion allows the processing of multiple layers of There is a lot of room for CGRA research to develop and
CNN in the overlapped fashion [33], which was exploited by expand as a topic of study for future architecture; this is
TCPA to save the intermediate memory needed between the especially true when developing high-performance CGRAs
layers. Loop permutation allows the computation of multiple tailored to specialized or general-purpose computing. Some
convolution filters in an interspersed way. TCPA allows the key issues that require further research in this area include
parallel execution of multiple layers by different PEs. A CNN developing tools to program the architecture efficiently,
model for the MNIST benchmark on an array of size 4×4 was memory management, scalability, adaptability, productivity,
evaluated and the performance of the layer-parallel approach virtualization, etc.
over layer-by-layer processing was compared.
A CGRA-based accelerator called Neural Processing VII. EMBEDDED AI ACCELERATORS
CGRA (NP-CGRA) is presented in [134] to accelerate The AI hardware requirements are more critical in the
lightweight CNNs. The authors have proposed a set of edge environment, typically represented by the Internet of
extensions to the baseline CGRA [152] to improve the Things (IoT) devices (e.g., smart speaker, mobile, sensors
performance of CGRAs and to efficiently implement depth- and actuators) with limited computing resources, as opposed
wise convolution (DWC) and pointwise convolution (PWC). to cloud infrastructure with relatively sufficient computing
The authors have presented three architectural extensions: capability. For the sake of real-time immediacy, latency,
a crossbar-style memory bus, dual-mode MAC unit, and offline capabilities, security, and privacy, AI models are
operand reuse network. The crossbar-style memory bus increasingly required to be implemented on edge. In this
contains horizontal and vertical buses, and each bus is context, Small Form Factor (SFF) devices such as micro-
accessible to all the PEs connected to it. Dual-mode MAC controllers, which dominate the market, are of particular
unit works in MAC mode and MUL/ALU mode. The interest, and having AI capabilities on these devices can

131814 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

FIGURE 41. TCPA accelerator showing PE array of size 4 × 4 and a CNN that is mapped onto it for recognizing digits from MNIST database.

help many applications. Many industrial solutions require

products with SFF and Size, Weight, and Power (SWaP)
enhanced embedded systems. In this section, we review some
of the latest embedded AI accelerators.
Fig. 42 shows the architecture of Edge TPU from
Google, which is used in products such as Coral and
Pixel Phones [111]. Edge TPUs are designed to give high-
performance acceleration while staying within strict physical
and power constraints [219]. Edge TPU is organized in a
2-D array of Processing Elements (PEs) where each PE
performs computations in a SIMD fashion. Data is transferred
from off-chip memory and PEs via an on-chip controller.
Activation and parameters are loaded into the on-chip staging
buffers by the controller. In addition, the controller reads
in the low-level instructions executed on the PEs (e. g.,
convolution, pooling, etc.). Each PE may contain single
or multiple cores, each having multiple compute lanes to
support operation in SIMD fashion. A memory is shared
across all cores, PE Memory is used to model activations,
partial results, and outputs are all stored in a shared memory,
FIGURE 42. Overall architecture of Edge TPU [219].
which is labeled PE Memory, see Fig. 42. Each PE’s cores
have a core memory that is mostly used to store model
parameters. Each compute lane has multi-way MAC units
to perform computations between activations and model framework is particularly developed for mapping various
parameters. A few prototyping boards, see Fig. 43 from Coral neural network operations onto the Edge TPU [29]. The
are available for the community to try and deploy ML apps Edge TPU coprocessor can compute 4 trillion operations per
at the edge, including the Dev Board, USB accelerator, Dev second (TOPS) while consuming just 0.5 watts for each TOPS
Board Mini, and Dev Board Macro [27]. TensorFlow Lite (2 TOPS per watt) [27].

VOLUME 10, 2022 131815

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

FIGURE 43. Prototyping boards from Coral [27] having edge TPU.

FIGURE 45. BeagleBone AI board [4].

see Fig. 46. The Streaming Hybrid Architecture Vector

Engines (SHAVE) are 12 highly parallelizable vector pro-
cessors in the Myriad 2 VPU, whose parallelism and
ISA allow good performance efficiency across a range of
computer vision applications, even those with low latency
requirements. The Neural Compute Engine, a dedicated
hardware AI accelerator for deep neural network deep-
learning inferences, is included in the Myriad X VPU,
FIGURE 44. NVIDIA’s Jetson Nano. Movidius’ third generation of VPUs. Myriad X is popular
for on-device DNNs and computer vision applications thanks
to the Neural Compute Engine, 16 SHAVE cores, and ultra-
NVIDIA’s Jetson Nano [3], [183] is an embedded board high throughput. The Myriad X VPU includes a native 4K
suitable for edge AI applications. It contains a 64-bit quad- image processor pipeline and can directly link up to eight HD
core Arm Cortex-A57 CPU running at 1.43 GHz, NVIDIA sensors. The Myriad Development Kit (MDK), which offers
Maxwell GPU with 128 CUDA cores, and 4GB LPDDR4 development tools, frameworks, and APIs to implement
memory. Jetson Nano runs Linux and offers 472 GFLOPS computer vision, imaging, and DNN workloads on the chip,
of FP16 computation performance while consuming only can be used to program both the Myriad 2 and Myriad X
5-10 W of power. NVIDIA also provides the developer VPUs.
kit with examples to map the multiple neural networks
applications such as object detection, segmentation, image
classification, and speech processing [183]. NVIDIA also
provides many embedded boards such as Jetson AGX Orin,
Jetson Orin NX, Jetson Xavier NX Series, Jetson TX2 Series
in various combinations of form-factor, power-efficiency, and
performance to address various industry segments [3], [11].
BeagleBone AI [4], built around Texas Instruments’ (TI)
AM 5729 Sitara SoC [115], is yet another board for AI at the
edge. This SoC has two 32-bit Arm Cortex-A15 cores, two
Image Processing Units (IPUs) that each having two Cortex-
M4 cores, two C66x DSP cores, two PowerVER SGX5443D
GPUs, and four Embedded Vision Engines (EVEs). It also
has 15 GB of eMMC flash, 1 GB of RAM, Wi-Fi, Bluetooth
support, and USB connectors for power and data transfer. The
BeagleBone AI runs Linux, and TI Deep Learning (TIDL)
framework can be used to develop real-time ML applications.
A Vision Processing Unit (VPU) [10] is a processor
optimized to perform inference tasks at the edge with ultra-
low power without compromising performance. Movidius FIGURE 46. Myriad 2 VPU architectural block diagram, adopted from [15].
Myriad 2 VPU is based on the Intel Neural Compute
Stick (NCS) platform, designed as a 28 nm co-processor Sipeed Maixduino is like an Arduino for machine learning
that provides the high-performance tensor acceleration, projects. It has MAIX SoC [13], which includes Kendryte

131816 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

K210 KPU (Knowledge Processing Unit, also called Network

Processing Unit), a powerful chip suited for visual and
semantic recognition [12]. MAIX SoC block diagram is
shown in Fig. 47, which includes K210 featuring two RISC-
V 64-bit CPU cores, an APU (Audio Processing Unit, also
called Audio Accelerator), and KPU optimized for running
CNNs. KPU offers 0.25 [email protected] W,400 MHz; when
overclocking to 800 MHz, it offers 0.5 TOPS. It means,
we can do object recognition 60 fps@VGA. MAIX also
includes a Fast Fourier Transform (FFT) unit, making it
useful for signal processing. In addition, it supports a wide
range of other peripherals, see Fig. 47. TensorFlow Lite FIGURE 48. Bitmain Sophon(TM) Edge Developer Board, adopted from [8].
framework is supported by this board, and platforms such as
Arduino IDE and PlatformIO can be used for development.

FIGURE 49. Ultra96-V2 [1] and PYNQ-Z2 [18] development boards from
Xilinx.

between the PL and the PS. As the latest version of

Xilinx’s all-programmable System-on-Chip (SoC) families,
the Zynq architecture combines a dual-core ARM Cortex-
A9 processor with a conventional processor (FPGA). The
Advanced eXtensible Interface (AXI) standard is used
to connect the various pieces of the Zynq architecture,
allowing for high bandwidth and low latency connections.
Vivado Design Suite [22] is used to map programs on
FIGURE 47. Block diagram of KPU and MAIX SoC, adopted from [13].
to Ultra96-V2 board and is widely used in AI and ML
Sophon’s edge developer board, see Fig. 48, is envisioned projects. For instance, authors in [39] implemented real-
as a rapid prototype development board for ML applications. time face recognition on Ultra96-V2. Designers may use the
It contains a powerful BM1880 capable of efficiently imple- Python language and libraries to use Zynq’s programmable
menting DNN/CNN/RNN/LSTM models using a tailored logic and microprocessors to create more capable and
tensor processing unit. It also features two Arm Cortex-A53 intriguing embedded systems. PYNQ (Python On Zynq) is a
CPUs and a RISC-V CPU. TPU can perform 1 TOPS for Xilinxr open-source project that makes designing embedded
8-bit integer data. This board is mainly used for surveillance systems with Zynqr Systems on Chips simple. PYNQ-
cameras, BM1880 [5] is designed using 28 nm process Z2 is an FPGA development board based on the ZYNQ
and dissipated 2.5 W. Frameworks such as TensorFlow, XC7Z020 FPGA, which has been meticulously developed
PyTorch, ONNX, Caffe, etc. are supported by this board. to support PYNQ. By combining PL and PS, designers
However, BITMAIN has its own framework called BITMAIN can create more powerful embedded systems using ZYNQ.
Neural Network Software Development Kit (BMNNSDK) [5] Furthermore, the SoCs may be programmed in Python, and
and recommends it to achieve high inference throughput the code can be developed and tested on the PYNQ-Z2
and efficiency. BMNET and BMRunTime are included in directly. In the same manner that software libraries are
the BMNNSDK. BMNET is a DNN compiler for TPU imported and programmed, programmable logic circuits are
processors on edge. It translates CNN-like algorithms into imported as hardware libraries and programmed through
TPU instructions. APIs. PYNQ-Z2 board has many interfaces such as user
Ultra96-V2 [1] and PYNQ-Z2 [18] are embedded AI boards LEDs, push-buttons, switches, MIC input, Ethernet, HDMI
using FPGAs, see Fig. 49. Ultra96-V2 features a Zynq Input/Output, MIC Input, Audio Output, Arduino as well
UltraScale+ MPSoC ZU3EG device. Xilinx’s Zynq devices as Rasberry Pi interfaces etc. PYNQ takes advantage of
contain both Processor System (PS) and Programmable the greatest features of both ZYNQ and Python. Machine
Logic (PL), where PS consists of hardcore processors while learning research and prototyping have made extensive use
PL contains the FPGA. Before the Zynq, processors were of it. For instance, authors in [207] used this board for
connected to an FPGA, which complicated communication implementing CNNs. Xilinx’s configurable DPU IP [7] can

VOLUME 10, 2022 131817

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

also be used together with PYNQ board for creating a network of more complex neural networks. There are two ways
with the desired number of layers, activation functions etc.. to deploy deep learning at IoT end devices. 1) Deploy
Vivado [22], Vitis [21] and Python can be used to work with feature vector and model architecture on the Server machine
PYNQ board. and call with API using Web service to IoT. 2) Deploy
feature vector and model architecture on resource-constraint
platforms like Raspberry Pi, also called on-device computing.
The first method has network latency issues, security
risks, and high communication costs. The second method
has difficulty in implementing large DNN models due to
the limited memory and computational resources of IoT-
enabled devices like Raspberry Pi. Furthermore, devices
with limited resources, such as the Raspberry Pi, are
only used for DNN inference. The trained DNN model
can be transferred to the Raspberry Pi through network
connectivity. However, network connectivity can introduce
delays, data loss, and other security concerns, limiting DNN
deployment on the Raspberry Pi [41]. Bhosale et al. [42]
proposed Deep Convolutional Neural Network (DCNN) for
FIGURE 50. Xilinx’s Kria KV260 SOM, adopted from [122]. Covid-19 classification. In this work, the DCNN architecture
is deployed on the cloud and uses radiology x-ray images
Xilinx’s Kria KV260 [23], [122] is an AI starter kit for classification. On the other hand, the authors in [41]
targeted for vision AI applications in smart cities, smart fac- proposed a lightweight Deep Learning model (LDC-Net) for
tories, robotics, home automation, etc., see Fig. 50. KV260 Covid-19 classification with lung disease. In this work, LDC-
includes a Zynq MPSoC, and it supports the Python-based Net was trained on High-Performance Computing (HPC).
PYNQ framework. The trained models can be implemented Furthermore, the trained LDC-Net and weights have been
in the DPU [7] and are loaded with PYNQ using hardware deployed in an IoT-enabled Raspberry Pi with network
overlays. In [122], authors have demonstrated pre-trained connectivity for Covid-19 classification.
models based on the MNIST dataset, RESNET based on
Caffe framework, and InceptionV1 based on Tensorflow.
Furthermore, to exercise the features of KV260, many models
from Vitis AI Model Zoo [24] repository are implemented.
Traffic detection, lane detection and segmentation algorithms
were also implemented and tested in real-time. Silicon
Lab has recently introduced BG24/MG24 [19] SoCs with
built-in AI accelerators and a new software toolkit. These
new devices with optimized hardware and software will
help execute AL/ML applications on battery-powered edge
devices. The MAX78000 [14] from Maxim Integrated is an
AI microcontroller that runs neural networks at extremely low
power. It has a hardware-based CNN accelerator, enabling
the battery-powered applications to execute AI inferences. FIGURE 51. Raspberry Pi computer [230].
AlphaICs’ Gluon AI co-processor [9] is optimized for vision
AI applications. It comes with an SDK for easy porting of An Arm processor is a general-purpose processor that
neural networks. belongs to the family of CPUs and uses Reduced Instruc-
Deep neural networks (DNN) are increasingly being used tion Set Computer (RISC) architecture. Because of their
on IoT-enabled devices like the Raspberry Pi to improve efficiency and flexibility, Arm processors are used in many
efficiency, security, and privacy. However, the size and electronic products, including smartphones, tablets, and
complexity of the machine-learning (ML) model that can wearables. Arm’s new portfolio of hardware solutions is now
be deployed in such systems are limited by the available aimed toward Machine Learning (ML) and Deep Neural
computational and memory resources. The Raspberry Pi Network (DNN) applications. In recent times, ARM-based
is a low-cost, small, and portable computer board with processors have been developed for the acceleration of
built-in software that allows users to create scripts or machine learning applications from various manufacturers
programs in Python [229]. There are two main limitations viz., Marvell (ThunderX2), Fujitsu (A64FX), Huawei (Kun-
to utilizing a Raspberry Pi for deep learning: 1) the small peng 920), and Ampere (eMAG). With the help of its recently
amount of memory available and 2) the slow processing released Neural Processing Units (NPUs), Arm processors
speed. These limitations severely hamper the implementation bring machine learning to low-end edge devices.

131818 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

Arm ML processor uses the Neural Network (NN) software computing devices [89]. This combination achieves a 32×
development kit provided by the company to interface the improvement in ML processing compared to the base Cortex-
ML software and corresponding hardware [214]. The Arm- M55 core. Furthermore, TinyML [20] advancements have
based ML accelerator consists of a number of computing made it possible to use ML models on the microcontroller
engines up to 16, each of which includes a programmable hardware found in our household appliances, including
layer engine and a MAC convolution engine, see Fig. 52. printers, TVs, smartwatches, and pacemakers, which can now
Each computing engine has its own local memory to carry out tasks that were previously only capable of being
process the ML models. Starting with weights applied to done by computers and smartphones. The machine learning
incoming data, processing via the MAC convolution engine, and embedded ultra-low power systems communities have
and finally, results processed by the Programmable Layer joined forces to create TinyML foundation. This joint effort
Engine (PLE), the flow is typical of DNN implementations. has paved the way for innovative and captivating alternative
There are 128 multiple-accumulate (MAC) units in the MAC uses of on-device machine learning. TinyML supports various
convolution engine. MAC convolution engine receives the frameworks, including TensorFlow Lite Micro (TFLM),
input data from the input feature map read block, weights TensorFlow-Native, Embedded Learning Library (ELL),
from the weight decoder, and performs the required MAC Graph Lowering (GLOW), etc. Google developed an open-
operation. The result of the convolution is processed by the source framework called CFU Playground [174] for TinyML
PLE, which is a vectorized microcontroller. It is more akin to acceleration on FPGA. CFU playground toolchain com-
a RISC platform designed to wrap up the processing of a layer bines open-source software (TensorFlow), RTL generators
for a piece of a DNN model with several layers. The PLE is (LiteX, Migen, etc.), and FPGA tools for synthesis (yosys),
in charge of tasks like pooling and activation. The throughput place, and route (vpr). The CFU playground framework
of the proposed ML processor is 4.6 TOPS. The proposed makes it possible to investigate custom architectures for
design is implemented using 7 nm chip technology, and it is the acceleration of Tiny ML for embedded ML systems.
scalable, and can achieve the throughput of 150 TOPS for TinyML is used in many applications, including medical
high-end applications. face mask detection [156], eating detection [166], Li-Ion
The Arm AI platform, also known as Project Trillium, batteries parameter estimation [69], etc. The most in-demand
is a heterogeneous compute platform that includes Arm research areas among the TinyML community include sound
Cortex CPUs, Ethos NPUs, Mali GPUs, and microNPUs to recognition, computer vision, and the development of low-
accelerate the ML algorithms [142]. Arm supports various power accurate ML models. More research is needed to
ML frameworks such as TensorFlow Lite, Caffe, PyTorch, fully comprehend the advantages and drawbacks of the
etc. and accelerates the ML applications using software topics under discussion, even if many applications have
libraries including arm NN, arm COMPUTE LIBRARY, demonstrated TinyML’s promise. Some key issues that
and Common Microcontroller Software Interface Standard- require further research in this area include developing
NN (CMSIS-NN). The hardware products such as Arm benchmarks, memory constraints, energy, processor capacity,
Cortex CPUs, Ethos NPUs, Mali GPUs, microNPUs, FPGAs, cost reduction, etc.
DSPs, etc. ARM’s new Cortex-A55/A75 and Mali-G72
combination targets machine learning on edge computing VIII. COMPARISON BETWEEN VARIOUS HARDWARE
devices. ARCHITECTURES FOR DNN ACCELERATION
Arm has developed its Ethos series of ML processors for The performance of the various hardware accelerators for
machine learning applications. Ethos series is classified into the DNN acceleration depends on the target application.
N-series and U-series [25]. Ethos N-series was introduced However, researchers defined some standard metrics, namely,
in October 2019, containing NPUs identical to the Cortex area, power, and throughput, to measure the performance
family. Ethos U-series was introduced in early 2020, and it of the hardware accelerators for the development and
contains microNPUs. MicroNPUs are paired with the CPU, deployment of DNNs. Here, the area is nothing but the
like the Cortex-M55, to process the ML algorithms. Ethos- portion of silicon required for the DNN acceleration, which
U55 achieves a throughput of 0.5 TOPS, containing 32 to is generally represented in squared millimeters or squared
256 8-bit MAC units [143]. Ethos-U55 supports 8-bit and 16- micrometers. The area depends on the size of the on-
bit integer data types. Ethos-U65 achieves a throughput of chip memory and the technology used during the hardware
1 TOPS, containing 256 to 312 8-bit MAC units. Ethos-N57 synthesis process. Power is nothing but the amount of
achieves a throughput of 2 TOPS, containing 1024 8-bit MAC power consumed by the specific hardware during the DNN
units. Ethos-N77 is a highly efficient ML inference processor acceleration. The power consumption mainly depends on off-
that achieves a throughput of 5 TOPS and is best suitable chip and on-chip memories. Throughput is used to measure
for mobile devices. Ethos-N77 ML processors can be used the productivity of the hardware accelerator. The comparison
for facial or object recognition applications. Ethos-N78 is a between the various hardware accelerator architectures for
scalable and efficient ML inference processor that achieves DNN acceleration is shown in Table 7. Due to a lack of
a throughput of 1 to 10 TOPS [144]. Arm’s Cortex-M55 data on their footprint, power consumption, and throughput,
and the Ethos-U55 can be used as an AI accelerator in edge CGRA-based accelerators are not represented in Table 7.

VOLUME 10, 2022 131819

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

FIGURE 52. Arm based ML processor [214].

As expected, temporal or general purpose architectures and other nanomaterials. Researchers at MIT and Stanford
such as CPU and GPU have greater power consumption have developed a new 3-D architecture based on a network
and area than special purpose architectures such as FPGA of millions of carbon nanotubes [191]. Computations in
and ASIC because they are not tailored for a particular optical computing technology can happen at the speed of
application. The essential hardware metrics like power, light, much faster than conventional electron-driven chips.
area, technology, and throughput are reported for each MIT is driving research in advanced optical materials,
hardware architecture. In Table 8, we have compared the few switches, lasers, and nano-optics [106] to advance optical
embedded development boards discussed above concerning computing. We may expect a greater deployment of optical
general-purpose CPUs/GPUs, specialized co-processors they chips in the future. DNA computing is a type of parallel
contain, performance, power, SDKs, and supported ML computing in which many different DNA molecules are
frameworks. used to test many possibilities simultaneously [138]. The
major advantage of DNA is its potential for memory storage.
IX. FUTURE DIRECTIONS A single gram of DNA can store 215 petabytes (215
In the future, hardware AI acceleration is set to become ubiq- million gigabytes) [6]. Although DNA information storage
uitous. In recent processors, some AI accelerator hardware has enormous application potential, many issues, such as the
becoming a standard feature, indicating that AI acceleration high cost of writing and reading information and techniques
is an essential general-purpose task. This paper reviewed sev- to erase and rewrite the information in DNA that is still
eral FPGA-based, ASIC-based, GPU-based, CGRA-based, unknown, must be addressed before its widespread use [75].
and edge AI hardware accelerators. However, looking at In FPGA-based architectures, following future directions
the industry trends and startups in this space indicates that seems to be promising. The combination of FPGAs and
we are still in the early stage of the AI revolution. Many cloud computing opens new avenues for developing deep
more energy-efficient architectures will emerge in the future. learning applications. The FPGA cloud service is still in
In particular, architectures with transprecision or approximate its early stages. Many imperfections must be investigated,
computing, high-bandwidth memories, and emerging non- such as the virtualization of FPGA hardware resources, task
volatile memories such as MRAM and ReRAM may migration, etc. Most current research focuses on lowering
appear in the market. Evolving architectures involving the bandwidth requirements for off-chip memory access.
the Tsetlin machine are another promising future research The performance of multiple FPGA chips combined is
direction. favorable. However, dealing with processing scheduling
Emerging technologies such as nanomaterials, optical and chip allocation remains a significant challenge. Future
computing, and DNA computing may accelerate DNNs in the research could focus on the development of in-memory-
near future. Carbon nanomaterials, such as carbon nanotubes computing processors. Moreover, further improvements are
(CNTs) and graphene, are particularly intriguing due to required in the computation of the activation functions used
their rapid electron transport [58]. CNT and graphene have in DNNs. Because most studies focus on loop optimization,
desirable switching and optical properties, making them only a few researchers are currently working on activation
well-suited to electronic and optical architectures [189]. New function optimization. There will be frameworks to integrate
chip architectures become possible with the help of CNTs existing or new architectures, which will help quickly deploy

131820 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

TABLE 7. Comparison among accelerators implemented on different hardware platforms.

VOLUME 10, 2022 131821

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

TABLE 8. Comparison of various embedded edge AI development boards.

applications. Most importantly, FPGA-based accelerator X. CONCLUSION

research will be towards training and not inference. Deep Neural Networks have recently gained popularity in a
In ASIC-based hardware accelerators, following future variety of applications. They are, however, computationally
research trends are suggested. TPU is already a standard demanding, making them difficult to handle by general-
in the field of deep learning. More capable replacements purpose architectures. In this context, a detailed review
are likely to emerge in the coming years. There will be of recent advances in DNN acceleration on specialized
entirely new architectures to target low-latency and low- hardware architectures such as FPGA, ASIC, GPU, and
power applications. Most current studies assume a trained CGRA is presented. Furthermore, embedded AI accelerators
DNN and focus on increasing the speed of its inference. for the edge environment have been thoroughly discussed.
There have been only a few studies on accelerator design The review begins with a detailed background of DNNs,
for DNN training. Therefore, there will be more emphasis focusing on their key operations and applications. CNNs,
on developing ASIC-based DNN training accelerators in the which have a wide range of applications, have also been
future. More research and breakthroughs in CPU-GPU het- included in the review. To improve the performance of
erogeneous architectures are required for more efficient DNN the hardware accelerator, we discussed various computing
implementations. Special-purpose or data center system-on- architectures, such as temporal and spatial architectures,
chips (SoCs) with embedded FPGA or GPU-based machine as well as different dataflow patterns. The review focused
learning accelerators appear to be gaining traction. In CGRA- on recent advancements in the acceleration of DNNs on
based accelerators, architectures driven by programming FPGA, ASIC, GPU, CGRA, and Embedded AI accelerators.
might be interesting. Other directions include introducing The review divided the FPGA-based accelerators into three
process-in-memory into CGRA architectures to address the categories and briefly discussed their key features, including
data movement bottleneck. Further improvements are needed the frameworks available for each. Similarly, ASIC-based
for the architectures that support dynamic configuration, accelerators are classified, and the review summarizes the
as it is an important step towards the widespread use accelerators available in the literature based on area, power
of CGRAs. dissipation, throughput, resource utilization, and so on.
The following trends may be observed in the future A comprehensive review of Nvidia’s GPU-based accelerators
development of Edge AI accelerators. Edge AI operates in was also presented. Furthermore, the review compared the
a heterogeneous environment where the data at the edge and various popular FPGA/ASIC/GPU-based accelerators. It has
the preprocessing techniques required for each sensor vary been observed that temporal architectures, such as CPU and
greatly between applications. Therefore, more customized, GPU, dissipate more power than spatial architectures, such as
powerful, and energy-efficient chips for specific edge ML FPGA and ASIC; however, they have higher throughput than
applications will be developed. Multimodal deep learning is FPGA and ASIC. As a result, it is difficult to say that one
a major development that pulls data from multiple sources architecture is superior to another because it depends on the
to extract more granular features. Using these multimodal target application and requirements. Furthermore, the survey
techniques, the car’s make and model can be pinpointed presented and compared recent research contributions in
instead of just recognizing a car. Other potential research Arm-based machine learning processors and a few embedded
directions include using distributed ML algorithms to speed AI hardware accelerators in terms of their cores, performance,
up ML algorithm training and reduce the amount of memory power, availability of Software Development Kits (SDKs),
required for processing. ML applications at the edge require and supported frameworks. Finally, the review suggests
high accuracy. Therefore, methods for implementing cutting- future research directions for DNN acceleration using var-
edge models at the edge while maintaining accuracy based on ious hardware architectures, including FPGA, ASIC, GPU,
deep learning model pruning and quantization are among the CGRA, and Edge AI accelerators.
new research directions. We will also see the development of
REFERENCES
customized and general SDK frameworks targeting specific
[1] ULTRA96-V2. Accessed: Jan. 7, 2022. [Online]. Available: https://www.
or multiple edge accelerators for easy deployment of neural avnet.com/opasdata/d120001/medias/docus/198/5365-pb-ultra96-v2-
network applications. v10b.pdf

131822 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

[2] Accelerate Fast Math With Intel Oneapi Math Kernel Library. [29] (2022). Edge TPU Compiler. [Online]. Available: https://coral.
Accessed: Jun. 5, 2021. [Online]. Available: https://software.intel.com/ ai/docs/edgetpu/compiler/#system-requirements
content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.
[30] High-Level Synthesis & Verification. Accessed: Jul. 2022. [Online].
3595x9
Available: https://eda.sw.siemens.com/en-US/ic/ic-design/high-level-
[3] Advanced AI Embedded Systems: NVIDIA Jetson: The AI Platform for
synthesis-and-verification-platform/
Autonomous Machines. Accessed: Aug. 2, 2022. [Online]. Available: [31] NVIDIA Tesla T4 Specs. Accessed: Jun. 17, 2022. [Online]. Available:
https://www.nvidia.com/en-in/autonomous-machines/embedded- https://www.nvidia.com/en-us/data-center/tesla-t4/
systems/ [32] Vitis AI. Accessed: Jun. 17, 2022. [Online]. Available: https://www.
[4] BeagleBone AI: Fast Track to Embedded Artificial Intelligence. [Online]. xilinx.com/products/design-tools/vitis/vitis-ai.html
Available: https://beagleboard.org/AI Accessed: Jan. 2, 2022. [33] M. Alwani, H. Chen, M. Ferdman, and P. Milder, ‘‘Fused-layer CNN
[5] BitMain Neural Network SDK: Introduction. Accessed: Jan. 2, 2022. accelerators,’’ in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitec-
[Online]. Available: https://sophon-edge.gitbook.io/project/ ture (MICRO), Oct. 2016, pp. 1–12.
[6] R. F. Service, ‘‘DNA could store all of the world’s data in one [34] D. Amodei et al., ‘‘Deep speech 2: End-to-end speech recognition in
room,’’ Science, Mar. 2017. [Online]. Available: https://www.science. English and Mandarin,’’ in Proc. 33rd Int. Conf. Mach. Learn., vol. 48,
org/content/article/dna-could-store-all-worlds-data-one-room M. F. Balcan and K. Q. Weinberger, Eds. New York, NY, USA: arXiv,
[7] DPU for Convolutional Neural Network. Accessed: May 1, 2022. Jun. 2016, pp. 173–182.
[Online]. Available: https://www.xilinx.com/products/intellectual- [35] A. Argal, S. Gupta, A. Modi, P. Pandey, S. Shim, and C. Choo, ‘‘Intelligent
property/dpu.html travel chatbot for predictive recommendation in echo platform,’’ in Proc.
[8] Edge TPU Developer Board. [Online]. Available: https://www.sophon. IEEE 8th Annu. Comput. Commun. Workshop Conf. (CCWC), Jan. 2018,
ai/product/introduce/edb.html Accessed: Jan. 2, 2022. pp. 176–183.
[9] Gluon AI Co-Processor. Accessed: Jan. 10, 2022. [Online]. Available: [36] D. Bahdanau, K. Cho, and Y. Bengio, ‘‘Neural machine translation by
https://alphaics.ai/products/gluon-ai-accelerator/ jointly learning to align and translate,’’ 2015, arXiv:1409.0473.
[10] Intel Movidius Myriad X Vision Processing Unit. Accessed: Apr. 2, 2022. [37] Y. Bengio, ‘‘Learning deep architectures for AI,’’ Found. Trends Mach.
[Online]. Available: https://www.intel.com/content/www/us/en/products/ Learn., vol. 2, no. 1, pp. 1–127, 2009.
details/processors/movidius-vpu/movidius-myriad-x.html [38] K. Benkrid and S. Belkacemi, ‘‘Design and implementation of a 2D
[11] Jetson Nano Developer Kit. Accessed: Aug. 2, 2022. [Online]. Available: convolution core for video applications on FPGAs,’’ in Proc. 3rd Int.
https://www.nvidia.com/en-in/autonomous-machines/embedded- Workshop Digit. Comput. Video (DCV), 2002, pp. 85–92.
[39] M. Bergeron. Real-Time Face Recognition on Ultra96-V2.
systems/jetson-nano-developer-kit/
Accessed: Feb. 1, 2022. [Online]. Available: https://www.hackster.io/
[12] Kendryte K210. Accessed: Sep. 2, 2022. [Online]. Available:
AlbertaBeef/real-time-face-recognition-on-ultra96-v2-94de9b
https://canaan.io/product/kendryteai
[40] Y. H. Bhosale and K. S. Patnaik, ‘‘Application of deep learning techniques
[13] Maixduino. Accessed: Sep. 2, 2022. [Online]. Available: https:// in diagnosis of COVID-19 (Coronavirus): A systematic review,’’ Neural
www.seeedstudio.com/Sipeed-Maixduino-Kit-for-RISC-V-AI-IoT-p- Process. Lett., pp. 1–53, Sep. 2022.
4047.html [41] Y. H. Bhosale and K. Sridhar Patnaik, ‘‘IoT deployable lightweight
[14] MAX78000—Artificial Intelligence Microcontroller with Ultra- deep learning application for COVID-19 detection with lung diseases
Low-Power Convolutional Neural Network Accelerator. Accessed: using RaspberryPi,’’ in Proc. Int. Conf. IoT Blockchain Technol. (ICIBT),
Jan. 10, 2022. [Online]. Available: https://www.maximintegrated.com/en/ May 2022, pp. 1–6.
products/microcontrollers/MAX78000.html [42] Y. H. Bhosale, S. Zanwar, Z. Ahmed, M. Nakrani, D. Bhuyar, and
[15] Myriad 2 MA2x5x Vision Processor: Transforming Devices Through U. Shinde, ‘‘Deep convolutional neural network based COVID-19
Ultra Low-Power Machine Vision—Google Search. Accessed: classification from radiology X-ray images for IoT enabled devices,’’ in
Apr. 2, 2022. [Online]. Available: https://www.intel.com/content/www/ Proc. 8th Int. Conf. Adv. Comput. Commun. Syst. (ICACCS), Mar. 2022,
us/en/products/details/processors/movidius-vpu/movidius-myriad- pp. 1398–1402.
x.html,www.movidius.com [43] L. Bishnoi and S. N. Singh, ‘‘Artificial intelligence techniques used in
[16] (2020). Nvidia A100 Tensor Core GPU Architecture. Accessed: medical sciences: A review,’’ in Proc. 8th Int. Conf. Cloud Comput., Data
Jun. 13, 2021. [Online]. Available: https://www.nvidia.com/content/dam/ Sci. Eng. (Confluence), Jan. 2018, pp. 1–8.
en-zz/solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf [44] A. G. Blaiech, K. Ben Khalifa, C. Valderrama, M. A. Fernandes, and
[17] (2017). Nvidia Tesla V100 GPU Architecture. Accessed: M. H. Bedoui, ‘‘A survey and taxonomy of FPGA-based deep learning
Jun. 13, 2021. [Online]. Available: https://images.nvidia.com/content/ accelerators,’’ J. Syst. Archit., vol. 98, pp. 331–345, Sep. 2019.
technologies/volta/pdf/437317-volta-v100-ds-nv-us-web.pdf [45] S. Bouguezzi, H. B. Fredj, T. Belabed, C. Valderrama, H. Faiedh, and
[18] PYNQ-Z2. Accessed: Jan. 7, 2022. [Online]. Available: http://www. C. Souani, ‘‘An efficient FPGA-based convolutional neural network for
pynq.io/board.html classification: Ad-MobileNet,’’ Electronics, vol. 10, no. 18, p. 2272,
[19] Silicon Labs BG24 and MG24 SoCs. Accessed: Jan. 10, 2022. [Online]. Sep. 2021.
Available: https://www.silabs.com/wireless/zigbee/efr32mg24-series-2- [46] A. Boutros, S. Yazdanshenas, and V. Betz, ‘‘You cannot improve what you
socs do not measure: FPGA vs. ASIC efficiency gaps for convolutional neural
[20] TinyML Foundation. Accessed: Sep. 2, 2022. [Online]. Available: network inference,’’ ACM Trans. Reconfigurable Technol. Syst., vol. 11,
https://www.tinyml.org/ no. 3, pp. 1–23, Sep. 2018.
[21] Vitis Unified Software Platform. Accessed: Oct. 1, 2022. [Online]. [47] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf,
Available: https://www.xilinx.com/products/design-tools/vitis/vitis- ‘‘A programmable parallel accelerator for learning and classification,’’ in
platform.html Proc. 19th Int. Conf. Parallel Archit. Compilation Techn. (PACT), 2010,
pp. 273–283.
[22] Vivado. Accessed: Jan. 15, 2022. [Online]. Available: https://www.
[48] M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera, and
xilinx.com/products/design-tools/vivado.html
M. Martina, ‘‘An updated survey of efficient hardware architectures
[23] Xilinx Kria—Adaptive System-on-Module. Accessed: Oct. 1, 2022.
for accelerating deep convolutional neural networks,’’ Future Internet,
[Online]. Available: https://www.xilinx.com/products/som/kria.html
vol. 12, no. 7, p. 113, Jul. 2020.
[24] Xilinx Vitis AI Model Zoo. Accessed: Oct. 1, 2022. [Online]. Available: [49] F. Cardells-Tormo, P.-L. Molinet, J. Sempere-Agullo, L. Baldez, and
https://github.com/Xilinx/AI-Model-Zoo M. Bautista-Palacios, ‘‘Area-efficient 2D shift-variant convolvers for
[25] (Jul. 2021). Ethos—ARM—WikiChip. Accessed: Aug. 7, 2021. [Online]. FPGA-based digital image processing,’’ in Proc. Int. Conf. Field
Available: https://en.wikichip.org/wiki/arm_holdings/ethos Program. Log. Appl., 2005, pp. 578–581.
[26] (Aug. 2021). Jetson Xavier NX. Accessed: Jul. 15, 2022. [Online]. Avail- [50] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini,
able: https://www.nvidia.com/en-us/autonomous-machines/embedded- ‘‘Origami: A convolutional network accelerator,’’ in Proc. 25th, Ed.,
systems/jetson-xavier-nx/ Great Lakes Symp. (VLSI), New York, NY, USA, May 2015, pp. 199–204.
[27] (2022). Coral Products. [Online]. Available: https://coral.ai/products/ [51] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, ‘‘A dynam-
[28] (Jul. 2022). Deploy AI-Powered Autonomous Machines at Scale. ically configurable coprocessor for convolutional neural networks,’’ in
Accessed: Jul. 15, 2022. [Online]. Available: https://www.nvidia.com/en- Proc. 37th Annu. Int. Symp. Comput. Archit. (ISCA), New York, NY, USA,
gb/autonomous-machines/embedded-systems/jetson-agx-xavier/ 2010, pp. 247–257.

VOLUME 10, 2022 131823

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

[52] J.-W. Chang and S.-J. Kang, ‘‘Optimizing FPGA-based convolutional [74] J. Domke, E. Vatai, A. Drozd, P. ChenT, Y. Oyama, L. Zhang, S. Salaria,
neural networks accelerator for image super-resolution,’’ in Proc. D. Mukunoki, A. Podobas, M. WahibT, and S. Matsuoka, ‘‘Matrix
23rd Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2018, engines for high performance computing: A paragon of performance or
pp. 343–348. grasping at straws?’’ in Proc. IEEE Int. Parallel Distrib. Process. Symp.
[53] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, (IPDPS). Los Alamitos, CA, USA: IEEE Computer Society, May 2021,
‘‘Diannao: A small-footprint high-throughput accelerator for ubiquitous pp. 1056–1065.
machine-learning,’’ in Proc. 19th Int. Conf. Architectural Support [75] Y. Dong, F. Sun, Z. Ping, Q. Ouyang, and L. Qian, ‘‘DNA storage:
Program. Lang. Operating Syst., vol. 14, New York, NY, USA, 2014, Research landscape and future prospects,’’ Nat. Sci. Rev., vol. 7, no. 6,
pp. 269–284. pp. 1092–1107, Jun. 2020.
[76] L. Du and Y. Du, ‘‘Hardware accelerator design for machine learning,’’
[54] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu,
in Machine Learning, H. Farhadi, Ed. Rijeka, Croatia: IntechOpen, 2018,
N. Sun, and O. Temam, ‘‘DaDianNao: A machine-learning supercom-
ch. 1.
puter,’’ in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture, [77] L. Du, Y. Du, Y. Li, J. Su, Y.-C. Kuan, C.-C. Liu, and
Dec. 2014, pp. 609–622. M.-C. F. Chang, ‘‘A reconfigurable streaming deep convolutional
[55] Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang, ‘‘A survey of accelerator neural network accelerator for Internet of Things,’’ IEEE Trans. Circuits
architectures for deep neural networks,’’ Engineering, vol. 6, no. 3, Syst. I, Reg. Papers, vol. 65, no. 1, pp. 198–208, Jan. 2018.
pp. 264–274, Mar. 2020. [78] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and
[56] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, ‘‘Eyeriss: An O. Temam, ‘‘Shidiannao: Shifting vision processing closer to the sensor,’’
energy-efficient reconfigurable accelerator for deep convolutional neural SIGARCH Comput. Archit. News, vol. 43, no. 3S, pp. 92–104, Jun. 2015.
networks,’’ IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, [79] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
Jan. 2017. and O. Temam, ‘‘ShiDianNao: Shifting vision processing closer to the
[57] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, ‘‘Eyeriss v2: A flexible sensor,’’ in Proc. 42nd Annu. Int. Symp. Comput. Archit., New York, NY,
accelerator for emerging deep neural networks on mobile devices,’’ USA, Jun. 2015, pp. 92–104.
IEEE J. Emerging Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, [80] C. Dubout and F. Fleuret, ‘‘Exact acceleration of linear object detectors,’’
Jun. 2019. in Proc. 12th Eur. Conf. Comput. Vis. Berlin, Germany: Springer-Verlag,
2012, pp. 301–311.
[58] Z. Chen, H.-S. Philip Wong, S. Mitra, A. Bol, L. Peng, G. Hills, and
[81] A. J. A. El-Maksoud, M. Ebbed, A. H. Khalil, and H. Mostafa, ‘‘Power
N. Thissen, ‘‘Carbon nanotubes for high-performance logic,’’ MRS Bull.,
efficient design of high-performance convolutional neural networks
vol. 39, no. 8, pp. 719–726, Aug. 2014.
hardware accelerator on FPGA: A case study with GoogLeNet,’’ IEEE
[59] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, Access, vol. 9, pp. 151897–151911, 2021.
and E. Shelhamer, ‘‘CuDNN: Efficient primitives for deep learning,’’ [82] H. Fan, S. Liu, M. Ferianc, H.-C. Ng, Z. Que, S. Liu, X. Niu, and
2014, arXiv:1410.0759. W. Luk, ‘‘A real-time object detection accelerator with compressed
[60] D. Chicco, P. Sadowski, and P. Baldi, ‘‘Deep autoencoder neural networks SSDLite on FPGA,’’ in Proc. Int. Conf. Field-Programmable Technol.
for gene ontology annotation predictions,’’ in Proc. 5th ACM Conf. (FPT), Dec. 2018, pp. 14–21.
Bioinf., Comput. Biol., Health Informat., New York, NY, USA, Sep. 2014, [83] X. Fan, D. Wu, W. Cao, W. Luk, and L. Wang, ‘‘Stream processing dual-
pp. 533–540. track CGRA for object inference,’’ IEEE Trans. Very Large Scale Integr.
[61] P.-S. Chiu, J.-W. Chang, M.-C. Lee, C.-H. Chen, and D.-S. Lee, (VLSI) Syst., vol. 26, no. 6, pp. 1098–1111, Jun. 2018.
‘‘Enabling intelligent environment by the design of emotionally aware [84] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and
virtual assistant: A case of smart campus,’’ IEEE Access, vol. 8, Y. Lecun, ‘‘NeuFlow: A runtime-reconfigurable dataflow processor for
pp. 62032–62041, 2020. vision,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
Workshops (CVPRW), Jun. 2011, pp. 109–116.
[62] Y.-k. Choi, K. You, J. Choi, and W. Sung, ‘‘A real-time FPGA-based 20
[85] C. Farabet, C. Poulet, J. Han, and Y. LeCun, ‘‘CNP: An FPGA-based
000-word speech recognizer with optimized DRAM access,’’ IEEE Trans.
processor for convolutional networks,’’ in Proc. 19th Int. Conf. Field
Circuits Syst. I, Reg. Papers, vol. 57, no. 8, pp. 2119–2131, Aug. 2010.
Program. Log. Appl., 2009, pp. 32–37.
[63] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, [86] X. Feng, H. Zhang, Y. Ren, P. Shang, Y. Zhu, Y. Liang, R. Guan, and
‘‘NVIDIA A 100 tensor core GPU: Performance and innovation,’’ IEEE D. Xu, ‘‘The deep learning—Based recommender system ‘pubmende’ for
Micro, vol. 41, no. 2, pp. 29–35, Mar. 2021. choosing a biomedical publication venue: Development and validation
[64] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, ‘‘Fast and accurate study,’’ J. Med. Internet Res., vol. 21, no. 5, May 2019, Art. no. e12957.
deep network learning by exponential linear units (ELUs),’’ 2015, [87] K. Fukushima, ‘‘Neocognitron: A hierarchical neural network capable
arXiv:1511.07289. of visual pattern recognition,’’ Neural Netw., vol. 1, no. 2, pp. 119–130,
[65] J. Cloutier, E. Cosatto, S. Pigeon, F. R. Boyer, and P. Y. Simard, ‘‘VIP: 1988.
An FPGA-based processor for image processing and neural networks,’’ [88] A. Gainaru, E. Slusanschi, and S. Trausan-Matu, ‘‘Mapping data mining
in Proc. 5th Int. Conf. Microelectron. Neural Netw., 1996, pp. 330–336. algorithms on a GPU architecture: A study,’’ in Proc. Found. Intell.
Syst. 19th Int. Symp., (ISMIS), in Lecture Notes in Computer Science,
[66] R. Collobert, K. Kavukcuoglu, and C. Farabet, ‘‘Torch7: A MATLAB-
vol. 6804. M. Kryszkiewicz, H. Rybinski, A. Skowron, and Z. W. Ras,
like environment for machine learning,’’ in Proc. NIPS, 2011, pp. 1–6.
Eds. Warsaw, Poland: Springer, Jun. 2011, pp. 102–112.
[67] J. Cong and B. Xiao, ‘‘Minimizing computation in convolutional neural [89] C. Gartenberg, ‘‘ARM’s new edge AI chips promise IoT devices
networks,’’ in Proc. ICANN, 2014, pp. 281–290. that won’t need the cloud,’’ Verge, Washington, DC, USA,
[68] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, Tech. Rep., Feb. 2020. [Online]. Available: https://www.theverge.com/
‘‘Binarized neural networks: Training deep neural networks with weights 2020/2/10/21130800/arm-new-edge-ai-chips-processing-npu-cortex-
and activations constrained to +1 or −1,’’ 2016, arXiv:1602.02830. m55-u55-iot
[69] G. Crocioni, D. Pau, J.-M. Delorme, and G. Gruosso, ‘‘Li-ion batteries [90] A. Ghaffari and Y. Savaria, ‘‘CNN2Gate: An implementation of
parameter estimation with tiny neural networks embedded on intelligent convolutional neural networks inference on FPGAs with automated
IoT microcontrollers,’’ IEEE Access, vol. 8, pp. 122135–122146, 2020. design space exploration,’’ Electronics, vol. 9, no. 12, p. 2200, Dec. 2020.
[70] D. Danopoulos, C. Kachris, and D. Soudris, ‘‘Acceleration of image [91] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, ‘‘A 240 G-
classification with Caffe framework using FPGA,’’ in Proc. 7th Int. Conf. ops/s mobile coprocessor for deep neural networks,’’ in Proc. IEEE Conf.
Modern Circuits Syst. Technol. (MOCAST), May 2018, pp. 1–4. Comput. Vis. Pattern Recognit. Workshops, Jun. 2014, pp. 696–701.
[92] K. M. V. Gowda, S. Madhavan, S. Rinaldi, P. B. Divakarachari, and
[71] L. Deng and D. Yu, ‘‘Deep learning: Methods and applications,’’ Found. A. Atmakur, ‘‘FPGA-based reconfigurable convolutional neural network
Trends Signal Process., vol. 7, nos. 3–4, pp. 197–387, Jun. 2014. accelerator using sparse and convolutional optimization,’’ Electronics,
[72] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. De Freitas, ‘‘Predicting vol. 11, no. 10, p. 1653, May 2022.
parameters in deep learning,’’ in Proc. 26th Int. Conf. Neural Inf. [93] H. Graf, S. Cadambi, V. Jakkula, M. Sankaradass, E. Cosatto,
Process. Syst., vol. 2. Red Hook, NY, USA: Curran Associates, 2013, S. Chakradhar, and I. Dourdanovic, ‘‘A massively parallel digital learning
pp. 2148–2156. processor,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 21, D. Koller,
[73] A. Deshpande, A Beginner’s Guide To Understanding Convolutional D. Schuurmans, Y. Bengio, and L. Bottou, Eds. Red Hook, NY, USA:
Neural Networks. Los Angeles, CA, USA: University of California, 2018. Curran Associates, 2009, pp. 1–8.

131824 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

[94] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, ‘‘A survey of deep [115] Texas Instruments. (2015). Am5729 Sitara Processor. [Online]. Avail-
learning techniques for autonomous driving,’’ 2019, arXiv:1910.07738. able: https://www.ti.com/product/AM5729
[95] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, [116] H. Irmak, N. Alachiotis, and D. Ziener, ‘‘An energy-efficient FPGA-based
and J. Cong, ‘‘FP-DNN: An automated framework for mapping deep convolutional neural network implementation,’’ in Proc. 29th Signal
neural networks onto FPGAs with RTL-HLS hybrid templates,’’ in Proc. Process. Commun. Appl. Conf. (SIU), Jun. 2021, pp. 1–4.
IEEE 25th Annu. Int. Symp. Field-Program. Custom Comput. Mach. [117] H. Irmak, F. Corradi, P. Detterer, N. Alachiotis, and D. Ziener,
(FCCM), Apr. 2017, pp. 152–159. ‘‘A dynamic reconfigurable architecture for hybrid spiking and convo-
[96] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and lutional FPGA-based neural network designs,’’ J. Low Power Electron.
H. Yang, ‘‘Angel-Eye: A complete design flow for mapping CNN onto Appl., vol. 11, no. 3, p. 32, Aug. 2021.
embedded FPGA,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits [118] S. M. A. H. Jafri, T. N. Gia, S. Dytckov, M. Daneshtalab, A. Hemani,
Syst., vol. 37, no. 1, pp. 35–47, Jan. 2018. J. Plosila, and H. Tenhunen, ‘‘NeuroCGRA: A CGRA with support for
[97] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, ‘‘[DL] A survey neural networks,’’ in Proc. Int. Conf. High Perform. Comput. Simul.
of FPGA-based neural network inference accelerators,’’ ACM Trans. (HPCS), Jul. 2014, pp. 506–511.
Reconfigurable Technol. Syst., vol. 12, no. 1, pp. 1–26, 2019. [119] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick,
[98] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, ‘‘Deep S. Guadarrama, and T. Darrell, ‘‘Caffe: Convolutional architecture for fast
learning with limited numerical precision,’’ in Proc. 32nd Int. Conf. Mach. feature embedding,’’ 2014, arXiv:1408.5093.
Learn. (ICML), vol. 37, 2015, pp. 1737–1746.
[120] N. P. Jouppi et al., ‘‘In-datacenter performance analysis of a tensor
[99] F. G. Gustavson, ‘‘Two fast algorithms for sparse matrices: Multiplication
processing unit,’’ in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017,
and permuted transposition,’’ ACM Trans. Math. Softw., vol. 4, no. 3,
pp. 1–12.
pp. 250–269, Sep. 1978.
[121] D. Justus, J. Brennan, S. Bonner, and A. S. McGough, ‘‘Predicting the
[100] A. Guzhva, S. Dolenko, and I. Persiantsev, ‘‘Multifold acceleration
computational cost of deep learning models,’’ in Proc. IEEE Int. Conf.
of neural network computations using gpu,’’ in Proc. 19th Int. Conf.
Big Data (Big Data), Dec. 2018, pp. 3873–3882.
Artif. Neural Networks, I. Berlin, Germany: Springer-Verlag, 2009,
pp. 373–380. [122] S. Kalapothas, G. Flamis, and P. Kitsos, ‘‘Efficient edge-AI application
[101] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, deployment for FPGAs,’’ Information, vol. 13, no. 6, p. 279, May 2022.
U. Müller, and Y. LeCun, ‘‘Learning long-range vision for autonomous [123] A. Karpathy, ‘‘Convolutional neural networks for visual recognition,’’
off-road driving,’’ J. Field Robot., vol. 26, no. 2, pp. 120–144, Feb. 2009. GitHub, Tech. Rep., 2018. [Online]. Available: https://cs231n.github.
[102] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, io/convolutional-networks/
Y. Wang, H. Yang, and W. J. Dally, ‘‘ESE: Efficient speech recognition [124] M. Kavitha, R. Srinivasan, and R. Bhuvanya, Fake News Detection Using
engine with sparse LSTM on FPGA,’’ in Proc. ACM/SIGDA Int. Symp. Machine Learning Algorithms. Hoboken, NJ, USA: Wiley, 2022, ch. 10,
Field-Program. Gate Arrays, Feb. 2017, pp. 75–84. pp. 181–207.
[103] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, [125] H. Khan, A. Khan, Z. Khan, L. B. Huang, K. Wang, and L. He, ‘‘NPE:
‘‘EIE: Efficient inference engine on compressed deep neural network,’’ An FPGA-based overlay processor for natural language processing,’’ in
in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2021,
Jun. 2016, pp. 243–254. pp. 1–11.
[104] F. Hannig, V. Lari, S. Boppu, A. Tanase, and O. Reiche, ‘‘Invasive tightly- [126] J.-Y. Kim, ‘‘FPGA based neural network accelerators,’’ in Hardware
coupled processor arrays: A domain-specific architecture/compiler co- Accelerator Systems for Artificial Intelligence and Machine Learning
design approach,’’ ACM Trans. Embedded Comput. Syst., vol. 13, no. 4S, (Advances in Computers), vol. 122, S. Kim and G. C. Deka, Eds.
pp. 1–29, Jul. 2014. Amsterdam, The Netherlands: Elsevier, 2021, pp. 135–165.
[105] C. Hao, A. Sarwari, Z. Jin, H. Abu-Haimed, D. Sew, Y. Li, X. Liu, B. Wu, [127] Y. Kim, J. Lee, J.-S. Kim, H. Jei, and H. Roh, ‘‘Efficient multi-GPU
D. Fu, J. Gu, and D. Chen, ‘‘A hybrid GPU + FPGA system design for memory management for deep learning acceleration,’’ in Proc. IEEE 3rd
autonomous driving cars,’’ in Proc. IEEE Int. Workshop Signal Process. Int. Workshops Found. Appl. Self Syst. (FASW), Sep. 2018, pp. 37–43.
Syst. (SiPS), Oct. 2019, pp. 121–126. [128] J. P. Klock, J. Correa, M. Bessa, J. Arias-Garcia, F. Barboza, and
[106] L. Hardesty, ‘‘Researchers build an all-optical transistor,’’ Massachusetts C. Meinertz, ‘‘A new automated energy meter fraud detection system
Inst. Technol., Cambridge, MA, USA, Tech. Rep., 2013. [Online]. based on artificial intelligence,’’ in Proc. 11th Brazilian Symp. Comput.
Available: https://news.mit.edu/2013/computing-with-light-0704 Syst. Eng. (SBESC), Nov. 2021, pp. 1–8.
[107] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Delving deep into rectifiers: [129] A. Kojima and Y. Nose, ‘‘Development of an autonomous driving robot
Surpassing human-level performance on ImageNet classification,’’ in car using FPGA,’’ in Proc. Int. Conf. Field-Programmable Technol.
Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034. (FPT), Dec. 2018, pp. 411–414.
[108] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for
[130] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification
image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
with deep convolutional neural networks,’’ Commun. ACM, vol. 60, no. 6,
(CVPR), Jun. 2016, pp. 770–778.
pp. 84–90, May 2017.
[109] D. Hefenbrock, J. Oberg, N. T. N. Thanh, R. Kastner, and S. B. Baden,
[131] H. Kwon, A. Samajdar, and T. Krishna, ‘‘MAERI: Enabling flexible
‘‘Accelerating Viola–Jones face detection to FPGA-level using GPUs,’’
dataflow mapping over DNN accelerators via reconfigurable intercon-
in Proc. 18th IEEE Annu. Int. Symp. Field-Program. Custom Comput.
nects,’’ ACM Architectural Support Program. Lang. Operating Syst.,
Mach., May 2010, pp. 11–18.
vol. 53, pp. 461–475, Mar. 2018.
[110] C. Heidorn, M. Witterauf, F. Hannig, and J. Teich, ‘‘Efficient mapping of
CNNs onto tightly coupled processor arrays,’’ J. Comput., vol. 14, no. 8, [132] A. Lavin and S. Gray, ‘‘Fast algorithms for convolutional neural
pp. 541–556, 2019. networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
[111] A. Howard and S. Gupta. (2020). Introducing the Next Generation Jun. 2016, pp. 4013–4021.
of on-Device Vision Models: Mobilenetv3 and Mobilenetedgetpu. [133] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, ‘‘UNPU:
[Online]. Available: https://ai.googleblog.com/2019/11/introducing- A 50.6 TOPS/W unified deep neural network accelerator with 1 b-to-
next-generation-on-device.html 16 b fully-variable weight bit-precision,’’ in Proc. IEEE Int. Solid-State
[112] H. Hu, J. Li, C. Wu, X. Li, and Y. Chen, ‘‘Design and implementation of Circuits Conf. (ISSCC), Feb. 2018, pp. 218–220.
intelligent speech recognition system based on FPGA,’’ J. Phys., Conf., [134] J. Lee and J. Lee, ‘‘NP-CGRA: Extending CGRAs for efficient processing
vol. 2171, no. 1, Jan. 2022, Art. no. 012010. of light-weight deep neural networks,’’ in Proc. Design, Autom. Test Eur.
[113] A. S. Hussein, A. Anwar, Y. Fahmy, H. Mostafa, K. N. Salama, and Conf. Exhib. (DATE), Feb. 2021, pp. 1408–1413.
M. Kafafy, ‘‘Implementation of a DPU-based intelligent thermal imaging [135] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, ‘‘LNPU: A
hardware accelerator on FPGA,’’ Electronics, vol. 11, no. 1, p. 105, 25.3 TFLOPS/W sparse deep-neural-network learning processor with
Dec. 2021. fine-grained mixed precision of FP8-FP16,’’ in Proc. IEEE Int. Solid-
[114] D. Im, D. Han, S. Choi, S. Kang, and H.-J. Yoo, ‘‘DT-CNN: Dilated and State Circuits Conf. (ISSCC), Feb. 2019, pp. 142–144.
transposed convolution neural network accelerator for real-time image [136] J. Lee and H.-J. Yoo, ‘‘An overview of energy-efficient hardware
segmentation on mobile devices,’’ in Proc. IEEE Int. Symp. Circuits Syst. accelerators for on-device deep-neural-network training,’’ IEEE Open J.
(ISCAS), May 2019, pp. 1–5. Solid-State Circuits Soc., vol. 1, pp. 115–128, 2021.

VOLUME 10, 2022 131825

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

[137] M. Lee, K. Hwang, J. Park, S. Choi, S. Shin, and W. Sung, ‘‘FPGA-based [159] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, ‘‘Optimizing
low-power speech recognition with recurrent neural networks,’’ in Proc. NUCA organizations and wiring alternatives for large caches with CACTI
IEEE Int. Workshop Signal Process. Syst. (SiPS), Oct. 2016, pp. 230–235. 6.0,’’ in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture
[138] D. I. Lewin, ‘‘DNA computing,’’ Computing Sci. Eng., vol. 4, no. 3, (MICRO), Dec. 2007, pp. 3–14.
pp. 5–8, May 2002. [160] V. Nair and G. E. Hinton, ‘‘Rectified linear units improve restricted
[139] B. Li, E. Zhou, B. Huang, J. Duan, Y. Wang, N. Xu, J. Zhang, and H. Yang, Boltzmann machines,’’ in Proc. 27th Int. Conf. Int. Conf. Mach. Learn.
‘‘Large scale recurrent neural network on GPU,’’ in Proc. Int. Joint Conf. Madison, WI, USA: Omnipress, 2010, pp. 807–814.
Neural Netw. (IJCNN), Jul. 2014, pp. 4062–4069. [161] D. T. Nguyen, T. N. Nguyen, H. Kim, and H. J. Lee, ‘‘A high-throughput
[140] X. Lian, Z. Liu, Z. Song, J. Dai, W. Zhou, and X. Ji, ‘‘High-performance and power-efficient FPGA implementation of YOLO CNN for object
FPGA-based CNN accelerator with block-floating-point arithmetic,’’ detection,’’ IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27,
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, no. 8, pp. 1861–1873, Aug. 2019.
pp. 1874–1885, Aug. 2019. [162] T. Ngyen, S. M. A. H. Jafri, M. Daneshtalab, A. Hemani, S. Dytckov,
[141] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, J. Plosila, and H. Tenhunen, ‘‘FIST: A framework to interleave spiking
and Y. Chen, ‘‘Pudiannao: A polyvalent machine learning accelerator,’’ neural networks on CGRAs,’’ in Proc. 23rd Euromicro Int. Conf. Parallel,
in Proc. 20th Int. Conf. Architectural Support Program. Lang. Operating Distrib., Network-Based Process., Mar. 2015, pp. 751–758.
Syst., New York, NY, USA, 2015, pp. 369–381. [163] R. Nikhil, ‘‘Bluespec System Verilog: Efficient, correct RTL from high
[142] ‘‘Learn more about the Linaro machine learning initiative,’’ Arm level specifications,’’ in Proc. 2nd ACM IEEE Int. Conf. Formal Methods
The Architecture for the Digital World, Linaro, Cambridge, U.K., Models Co-Design, Jun. 2004, pp. 69–70.
Tech. Rep., Jan. 2019. [Online]. Available: https://www.linaro.org/ [164] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr,
news/linaro-announces-launch-of-machine-intelligence-initiative/ ‘‘Accelerating binarized neural networks: Comparison of FPGA, CPU,
[143] (Aug. 2021). Ethos-U55 Arm Developer. Accessed: Aug. 7, 2021. GPU, and ASIC,’’ in Proc. Int. Conf. Field-Program. Technol. (FPT),
[Online]. Available: https://developer.arm.com/Processors/Ethos-U55 Dec. 2016, pp. 77–84.
[144] (Aug. 2021). High-Performing AI Solutions to Transform our Digital [165] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. O. G. Hock,
World. Accessed: Aug. 7, 2021. [Online]. Available: https://www.google. Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and G. Boudoukh,
com/search?client=firefox-b-d&q=High-Performing+AI+Solutions+ ‘‘Can FPGAs beat GPUs accelerating next-generation deep neural
to+Transform+our+Digital+World networks?’’ in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate
[145] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, ‘‘FlexFlow: A flexible Arrays, New York, NY, USA, 2017, pp. 5–14.
dataflow accelerator architecture for convolutional neural networks,’’ in [166] M. T. Nyamukuru and K. M. Odame, ‘‘Tiny Eats: Eating detection on a
Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2017, microcontroller,’’ in Proc. IEEE 2nd Workshop Mach. Learn. Edge Sensor
pp. 553–564. Syst. (SenSys-ML), Apr. 2020, pp. 19–23.
[146] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen, Z. Xu, O. Temam, and [167] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
Y. Chen, ‘‘Dadiannao: A neural network supercomputer,’’ IEEE Trans. B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, ‘‘SCNN: An
Comput., vol. 66, no. 1, pp. 73–88, Jan. 2017. accelerator for compressed-sparse convolutional neural networks,’’ in
[147] M.-T. Luong, H. Pham, and C. D. Manning, ‘‘Effective approaches Proc. 44th Annu. Int. Symp. Comput. Archit., New York, NY, USA,
to attention-based neural machine translation,’’ in Proc. EMNLP. Jun. 2017, pp. 27–40.
Lisbon, Portugal: Association for Computational Linguistics, Aug. 2015, [168] S.-W. Park, J. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo,
pp. 1412–1421. ‘‘An energy-efficient and scalable deep learning/inference processor with
[148] P. Lv, W. Liu, and J. Li, ‘‘A FPGA-based accelerator implementaion tetra-parallel MIMD architecture for big data applications,’’ IEEE Trans.
for YOLOv2 object detection using Winograd algorithm,’’ in Proc. Biomed. Circuits Syst., vol. 9, no. 6, pp. 838–848, Dec. 2015.
5th Int. Conf. Mech., Control Comput. Eng. (ICMCCE), Dec. 2020, [169] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, ‘‘Memory-
pp. 1894–1898. centric accelerator design for convolutional neural networks,’’ in Proc.
[149] A. L. Maas, ‘‘Rectifier nonlinearities improve neural network acoustic IEEE 31st Int. Conf. Comput. Design (ICCD), Oct. 2013, pp. 13–19.
models,’’ Stanford Univ., Stanford, CA, USA, Tech. Rep., 2013. [170] P.-H. Pham, D. Jelaca, C. Farabet, B. Martini, Y. LeCun, and
[150] R. Machupalli, M. Hossain, and M. Mandal, ‘‘Review of ASIC E. Culurciello, ‘‘NeuFlow: Dataflow vision processing system-on-a-
accelerators for deep neural network,’’ Microprocessors Microsyst., chip,’’ in Proc. IEEE 55th Int. Midwest Symp. Circuits Syst. (MWSCAS),
vol. 89, Mar. 2022, Art. no. 104441. Aug. 2012, pp. 1044–1047.
[151] M. Mathieu, M. Henaff, and Y. LeCun, ‘‘Fast training of convolutional [171] M. Pietras, ‘‘Hardware conversion of neural networks simulation models
networks through FFTs,’’ 2014, arXiv:1312.5851. for neural processing accelerator implemented as FPGA-based SoC,’’
[152] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, ‘‘Adres: in Proc. 24th Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2014,
An architecture with tightly coupled VLIW processor and coarse-grained pp. 1–4.
reconfigurable matrix,’’ in Field Programmable Logic and Application, [172] T. Posewsky and D. Ziener, ‘‘Efficient deep neural network acceleration
P. Y. K. Cheung and G. A. Constantinides, Eds. Berlin, Germany: through FPGA-based batch processing,’’ in Proc. Int. Conf. ReConFig-
Springer, 2003, pp. 61–70. urable Comput. FPGAs (ReConFig), Nov. 2016, pp. 1–8.
[153] J. Misra and I. Saha, ‘‘Artificial neural networks in hardware: A survey [173] T. Posewsky and D. Ziener, ‘‘Throughput optimizations for FPGA-based
of two decades of progress,’’ Neurocomputing, vol. 74, nos. 1–3, deep neural network inference,’’ Microprocessors Microsyst., vol. 60,
pp. 239–255, Dec. 2010. pp. 151–161, Jul. 2018.
[154] S. Mittal, ‘‘A survey of FPGA-based accelerators for convolutional [174] S. Prakash, T. Callahan, J. Bushagour, C. Banbury, A. V. Green,
neural networks,’’ Neural Comput. Appl., vol. 32, no. 4, pp. 1109–1139, P. Warden, T. Ansell, and V. J. Reddi, ‘‘CFU playground: Full-stack open-
Feb. 2020. source framework for tiny machine learning (tinyML) acceleration on
[155] S. Mittal and J. S. Vetter, ‘‘A survey of CPU-GPU heterogeneous FPGAs,’’ 2022, arXiv:2201.01863.
computing techniques,’’ ACM Comput. Surveys, vol. 47, no. 4, pp. 1–35, [175] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and
Jul. 2015. M. Horowitz, ‘‘Convolution engine: Balancing efficiency and flexibility
[156] P. Mohan, A. J. Paul, and A. Chirania, ‘‘A tiny CNN architecture for in specialized computing,’’ Commun. ACM, vol. 58, no. 4, pp. 85–93,
medical face mask detection for resource-constrained endpoints,’’ in Mar. 2015.
Innovations in Electrical and Electronic Engineering (Lecture Notes in [176] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul,
Electrical Engineering). Singapore: Springer, 2021, pp. 657–670. and T. Krishna, ‘‘SIGMA: A sparse and irregular GEMM accelerator with
[157] J. J. Moolayil, ‘‘A Layman’s guide to deep neural networks—Towards flexible interconnects for DNN training,’’ in Proc. IEEE Int. Symp. High
data science,’’ Medium, May 2020. [Online]. Available: https:// Perform. Comput. Archit. (HPCA), Feb. 2020, pp. 58–70.
towardsdatascience.com/a-laymans-guide-to-deep-neural-networks- [177] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu,
ddcea24847fb S. Song, Y. Wang, and H. Yang, ‘‘Going deeper with embedded FPGA
[158] D. Moolchandani, A. Kumar, and S. R. Sarangi, ‘‘Accelerating CNN platform for convolutional neural network,’’ in Proc. ACM/SIGDA Int.
inference on ASICs: A survey,’’ J. Syst. Archit., vol. 113, Feb. 2021, Symp. Field-Program. Gate Arrays, New York, NY, USA, Feb. 2016,
Art. no. 101887. pp. 26–35.

131826 VOLUME 10, 2022

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

[178] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, [199] D.-F. Syu, S.-W. Syu, S.-J. Ruan, Y.-C. Huang, and C.-K. Yang,
‘‘AI accelerator survey and trends,’’ in Proc. IEEE High Perform. Extreme ‘‘FPGA implementation of automatic speech recognition system in a
Comput. Conf. (HPEC), Sep. 2021, pp. 1–9. car environment,’’ in Proc. IEEE 4th Global Conf. Consum. Electron.
[179] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, (GCCE), Oct. 2015, pp. 485–486.
‘‘VDNN: Virtualized deep neural networks for scalable, memory- [200] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, ‘‘Efficient processing
efficient neural network design,’’ in Proc. 49th Annu. IEEE/ACM Int. of deep neural networks: A tutorial and survey,’’ Proc. IEEE, vol. 105,
Symp. Microarchitecture (MICRO). Piscataway, NJ, USA: IEEE Press, no. 12, pp. 2295–2329, Dec. 2017.
Oct. 2016, pp. 1–13. [201] M. A. Talib, S. Majzoub, Q. Nasir, and D. Jamal, ‘‘A systematic literature
[180] T. Ridnik, H. Lawen, A. Noy, E. Ben Baruch, G. Sharir, and review on hardware implementation of artificial intelligence algorithms,’’
I. Friedman, ‘‘TResNet: High performance GPU-dedicated architecture,’’ J. Supercomput., vol. 77, no. 2, pp. 1897–1938, Feb. 2021.
2020, arXiv:2003.13630. [202] M. Tanomoto, S. Takamaeda-Yamazaki, J. Yao, and Y. Nakashima,
[181] S. Saha, ‘‘A comprehensive guide to convolutional neural networks—The ‘‘A CGRA-based approach for accelerating convolutional neural net-
ELI5 way,’’ Towards Data Sci., Toronto, ON, Canada, Tech. Rep., 2018. works,’’ in Proc. IEEE 9th Int. Symp. Embedded Multicore/Many-Core
[Online]. Available: https://towardsdatascience.com/a-comprehensive- Syst. Chip, Sep. 2015, pp. 73–80.
guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 [203] Y. Tkachenko, ‘‘Autonomous CRM control via CLV approximation with
[182] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, deep reinforcement learning in discrete and continuous action space,’’
E. Cosatto, and H. P. Graf, ‘‘A massively parallel coprocessor for 2015, arXiv:1504.01840.
convolutional neural networks,’’ in Proc. 20th IEEE Int. Conf. Appl.- [204] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre,
Specific Syst., Archit. Processors, Jul. 2009, pp. 53–60. and K. Vissers, ‘‘FINN: A framework for fast, scalable binarized neural
[183] V. Sati, S. M. Sánchez, N. Shoeibi, A. Arora, and J. M. Corchado, ‘‘Face network inference,’’ in Proc. ACM/SIGDA Int. Symp. Field-Program.
detection and recognition, face emotion recognition through NVIDIA Gate Arrays, Feb. 2017, pp. 65–74.
Jetson Nano,’’ in Proc. Int. Symp. Ambient Intell. Cham, Switzerland: [205] A. Vasudevan, A. Anderson, and D. Gregg, ‘‘Parallel multi channel
Springer, 2020, pp. 177–185. convolution using general matrix multiplication,’’ in Proc. IEEE 28th
[184] S. Saglam, F. Tat, and S. Bayar, ‘‘FPGA implementation of CNN Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2017,
algorithm for detecting malaria diseased blood cells,’’ in Proc. Int. Symp. pp. 19–24.
Adv. Electr. Commun. Technol. (ISAECT), Nov. 2019, pp. 1–5. [206] S. I. Venieris and C.-S. Bouganis, ‘‘FpgaConvNet: A framework for
[185] U. Schmidt and S. Roth, ‘‘Shrinkage fields for effective image restora- mapping convolutional neural networks on FPGAs,’’ in Proc. IEEE
tion,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, 24th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM),
pp. 2774–2781. May 2016, pp. 40–47.
[186] D. Selvathi, R. D. Nayagam, D. J. Hemanth, and V. E. Balas, ‘‘FPGA [207] T. V. Huynh, ‘‘FPGA-based acceleration for convolutional neural
implementation of on-chip ANN for breast cancer diagnosis,’’ Intell. networks on PYNQ-Z2,’’ Int. J. Comput. Digit. Syst., vol. 11, no. 1,
Decis. Technol., vol. 10, no. 4, pp. 341–352, Dec. 2016. pp. 441–449, Jan. 2022.
[187] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra,
[208] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, ‘‘DLAU: A scalable
and H. Esmaeilzadeh, ‘‘From high-level deep neural models to FPGAs,’’
deep learning accelerator unit on FPGA,’’ IEEE Trans. Comput.-Aided
in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO),
Design Integr. Circuits Syst., vol. 36, no. 3, pp. 513–517, Mar. 2017.
Oct. 2016, pp. 1–12.
[209] J. Wang and S. Gu, ‘‘FPGA implementation of object detection
[188] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and
accelerator based on Vitis-AI,’’ in Proc. 11th Int. Conf. Inf. Sci. Technol.
H. Esmaeilzadeh, ‘‘Bit fusion: Bit-level dynamically composable archi-
(ICIST), May 2021, pp. 571–577.
tecture for accelerating deep neural network,’’ in Proc. ACM/IEEE 45th
Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2018, pp. 764–775. [210] T. Wang, C. Wang, X. Zhou, and H. Chen, ‘‘A survey of FPGA
based deep learning accelerators: Challenges and opportunities,’’ 2018,
[189] R. Shi, H. Xu, B. Chen, Z. Zhang, and L.-M. Peng, ‘‘Scalable fabrication
arXiv:1901.04988.
of graphene devices through photolithography,’’ Appl. Phys. Lett.,
vol. 102, no. 11, Mar. 2013, Art. no. 113102. [211] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, ‘‘DeepBurning: Automatic
[190] D. Shin, J. Lee, J. Lee, and H.-J. Yoo, ‘‘DNPU: An 8.1 TOPS/W generation of FPGA-based learning accelerators for the neural network
reconfigurable CNN-RNN processor for general-purpose deep neural family,’’ in Proc. 53rd Annu. Design Autom. Conf., Jun. 2016, pp. 1–6.
networks,’’ in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), [212] S. Williams, A. Waterman, and D. Patterson, ‘‘Roofline: An insightful
Feb. 2017, pp. 240–241. visual performance model for multicore architectures,’’ Commun. ACM,
[191] M. M. Shulaker, G. Hills, R. S. Park, R. T. Howe, K. Saraswat, vol. 52, no. 4, pp. 65–76, 2009.
H.-S. P. Wong, and S. Mitra, ‘‘Three-dimensional integration of nan- [213] W. Vanderbauwhede and K. Benkrid, High-Performance Computing
otechnologies for computing and data storage on a single chip,’’ Nature, Using FPGAs. New York, NY, USA: Springer, 2013.
vol. 547, pp. 74–78, Jul. 2017. [214] W. G. Wong, ‘‘More details emerge about arm’s machine learning,’’
[192] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for Electron. Des. Mag., Hasbrouck Heights, NJ, USA, Tech. Rep., Jun. 2018.
large-scale image recognition,’’ 2014, arXiv:1409.1556. [Online]. Available: https://www.electronicdesign.com/industrial-
[193] G. Smith and F. F. Leymarie, ‘‘The machine as artist: An introduction,’’ automation/article/21806582/more-details-emerge-about-arms-machine-
Arts, vol. 6, no. 4, p. 5, Apr. 2017. learning
[194] S. Srinivas, R. K. Sarvadevabhatla, K. R. Mopuri, and N. Prabhu, [215] B. Wu, A. Wan, F. Iandola, P. H. Jin, and K. Keutzer, ‘‘SqueezeDet:
‘‘Kruthiventi, and R. V. Babu,’’ A taxonomy of deep convolutional neural Unified, small, low power fully convolutional neural networks for
nets for computer vision,’’ Frontiers Robot. AI, vol. 2, p. 36, Jan. 2016. real-time object detection for autonomous driving,’’ in Proc. IEEE
[195] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, ‘‘Towards an embedded Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017,
biologically-inspired machine vision processor,’’ in Proc. Int. Conf. Field- pp. 446–454.
Programmable Technol., Dec. 2010, pp. 273–278. [216] H. Xiao, K. Zhao, and G. Liu, ‘‘Efficient hardware accelerator for
[196] D. Strigl, K. Kofler, and S. Podlipnig, ‘‘Performance and scalability of compressed sparse deep neural network,’’ IEICE Trans. Inf. Syst.,
GPU-based convolutional neural networks,’’ in Proc. 18th Euromicro vol. 104, no. 5, pp. 772–775, May 2021.
Conf. Parallel, Distrib. Network-Based Process., Feb. 2010, pp. 317–324. [217] S. Xiong, G. Wu, X. Fan, X. Feng, Z. Huang, W. Cao, X. Zhou, S. Ding,
[197] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J. Yu, L. Wang, and Z. Shi, ‘‘MRI-based brain tumor segmentation using
J.-S. Seo, and Y. Cao, ‘‘Throughput-optimized OpenCL-based FPGA FPGA-accelerated neural network,’’ BMC Bioinf., vol. 22, no. 1, pp. 1–15,
accelerator for large-scale convolutional neural networks,’’ in Proc. Dec. 2021.
ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, New York, [218] A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, and
NY, USA, Feb. 2016, pp. 16–25. H. Esmaeilzadeh, ‘‘Neural acceleration for GPU throughput processors,’’
[198] M. Svedin, S. W. D. Chien, G. Chikafa, N. Jansson, and A. Podobas, in Proc. 48th Int. Symp. Microarchitecture, Dec. 2015, pp. 482–493.
‘‘Benchmarking the NVIDIA GPU lineage: From early K 80 to modern [219] K. Seshadri, B. Akin, J. Laudon, R. Narayanaswami, and
A 100 with asynchronous memory transfers,’’ in Proc. 11th Int. Symp. A. Yazdanbakhsh, ‘‘An evaluation of edge TPU accelerators for
Highly Efficient Accel. Reconfigurable Technol., Jun. 2021, pp. 1–6. convolutional neural networks,’’ 2021, arXiv:2102.10423.

VOLUME 10, 2022 131827

P. Dhilleswararao et al.: Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey

[220] X. Yin, L. Chen, X. Zhang, and Z. Gao, ‘‘Object detection implementation M. SABARIMALAI MANIKANDAN (Senior
and optimization on embedded GPU system,’’ in Proc. IEEE Int. Symp. Member, IEEE) received the B.E. degree in
Broadband Multimedia Syst. Broadcast. (BMSB), Jun. 2018, pp. 1–5. electronic and communication engineering from
[221] R. Zanc, T. Cioara, and I. Anghel, ‘‘Forecasting financial markets using Bharathiar University, Coimbatore, India, the M.E.
deep learning,’’ in Proc. IEEE 15th Int. Conf. Intell. Comput. Commun. degree in microwave and optical engineering
Process. (ICCP), Sep. 2019, pp. 459–466. from Madurai Kamaraj University, Madurai, India,
[222] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, ‘‘Optimizing and the Ph.D. degree in cardiovascular signal
FPGA-based accelerator design for deep convolutional neural networks,’’
processing from the Department of Electronics
in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays,
and Communication Engineering, IIT Guwahati,
Feb. 2015, pp. 161–170.
[223] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, ‘‘Caffeine: Guwahati, India. He was an Assistant Professor at
Toward uniformed representation and acceleration for deep convolutional Amrita Vishwa Vidyapeetham University, Ettimadai, India. He was the Chief
neural networks,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits Engineer at the Advanced Technology Group, Samsung India Electronic
Syst., vol. 38, no. 11, pp. 2072–2085, Nov. 2019. Pvt., Ltd., Noida, India. He was an Assistant Professor at the Biomedical
[224] G. Zhang, N. Attaluri, J. S. Emer, and D. Sánchez, ‘‘Gamma: Leveraging System Laboratory, School of Electrical Sciences, IIT Bhubaneswar, India.
Gustavson’s algorithm to accelerate sparse matrix multiplication,’’ in He is currently an Associate Professor of electrical engineering with IIT
Proc. 26th ACM Int. Conf. Architectural Support Program. Lang. Palakkad. He has published more than 70 research papers in reputed journals
Operating Syst., Apr. 2021, pp. 687–701. and conference proceedings. His research interests include signal and image
[225] J.-F. Zhang, C.-E. Lee, C. Liu, Y. S. Shao, S. W. Keckler, and Z. Zhang, processing, adaptive machine learning, the Internet of Things, VLSI signal
‘‘SNAP: A 1.67—21.55 TOPS/W sparse neural acceleration processor for processing, machine learning architectures, application system development:
unstructured sparse deep neural network inference in 16 nm CMOS,’’ in health (human, machine, structural) monitoring systems, audio and speech
Proc. Symp. VLSI Circuits, Jun. 2019, pp. C306–C307. processing systems for human–machine interactions, biometric and data
[226] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, ‘‘Object detection with deep security for authentication and authorization, environmental monitoring
learning: A review,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 30,
systems for ambient assisted living, UAV-assisted IoT for smart surveillance
no. 11, pp. 3212–3232, Nov. 2019.
systems, and context and quality aware pattern learning networks for event
[227] J. Zhu, L. Wang, H. Liu, S. Tian, Q. Deng, and J. Li, ‘‘An efficient
task assignment framework to accelerate DPU-based convolutional neural recognition. He was a recipient of the 2012 Outstanding Performance Award
network inference on FPGAs,’’ IEEE Access, vol. 8, pp. 83224–83237, during his tenure at Samsung India Electronic Pvt., Ltd. He served as a
2020. Reviewer for many reputed journals of the IEEE, IET, Springer, Hindawi,
[228] J. Zhu, T. Yang, R. Liu, X. Xu, and X. Zhu, ‘‘Image recognition of CT PLOS One, Frontiers, and Elsevier.
diagnosis for cholangiocarcinoma treatment based on FPGA processor
and neural network,’’ Microprocessors Microsyst., vol. 81, Mar. 2021,
Art. no. 103645.
[229] S. Monk, Programming the Raspberry Pi: Getting Started With Python.
New York, NY, USA: McGraw-Hill Education, 2016.
[230] (Nov. 2022). Photos of the Raspberry Pi Through the Ages: From the
Prototype to Pi3 B+. Accessed: Nov. 14, 2022. [Online]. Available:
https://www.zdnet.com/pictures/photos-of-the-raspberry-pi-through-the-
ages-from-the-prototype-to-pi-3/

PUDI DHILLESWARARAO (Graduate Student

Member, IEEE) received the M.Tech. degree in
microelectronics and VLSI from the National
Institute of Technology, Durgapur, India, in 2014.
He is currently pursuing the Ph.D. degree with
the School of Electrical Sciences, Indian Institute
LINGA REDDY CENKERAMADDI (Senior
of Technology, Bhubaneswar, India. His current
Member, IEEE) received the master’s degree in
research interests include programmable hardware
electrical engineering from the Indian Institute
accelerators, compilers, and high-level synthesis.
of Technology Delhi (IIT Delhi), New Delhi,
India, in 2004, and the Ph.D. degree in electrical
engineering from the Norwegian University of
SRINIVAS BOPPU (Member, IEEE) received Science and Technology (NTNU), Trondheim,
the international M.Sc. degree in IC design Norway, in 2011.
from Nanyang Technological University, Singa- He worked on mixed-signal circuit design
pore, and the Technical University of Munich, at Texas Instruments. He worked on radiation
Germany, and the Ph.D. degree from the Chair imaging for an atmosphere-space interaction monitor (ASIM mission to the
for Hardware/Software Co-Design, Department International Space Station) at the University of Bergen, Bergen, Norway,
of Computer Science, University of Erlangen– from 2010 to 2012. He is currently the Leader of the Autonomous and
Nuremberg, Germany in 2015. He was a Senior Cyber-Physical Systems (ACPS) Research Group and a Professor with the
Consultant at Infineon Technologies Munich, University of Agder, Grimstad, Norway. Several of his master’s students
Germany. He also worked at Freescale Semicon- received the Best Master Thesis Awards in information and communication
ductors India and ST Microelectronics as a Physical Design Engineer. He has technology (ICT). He has coauthored over 120 research publications that
been an Assistant Professor with the School of Electrical Sciences, IIT have been published in prestigious international journals and standard
Bhubaneswar, since October 2017. He has more than 15 years of experience conferences. His main scientific interests include cyber-physical systems,
in both academia and industry in the field of the VLSI domain. He received autonomous systems, and wireless embedded systems.
the Full Scholarship from Infineon Technologies Asia Pacific Pte. Ltd., Dr. Cenkeramaddi is a member of the editorial boards of various
for M.Sc. degree. His research interests include high-level synthesis, international journals and the technical program committees of several IEEE
programmable hardware accelerators, compilers, scheduling and mapping conferences. He is the Principal Investigator and a Co-Principal Investigator
approaches, low-power VLSI design, SoC design, and design automation of of many research grants from the Norwegian Research Council.
integrated circuits.

131828 VOLUME 10, 2022

Artificial Intelligence Hardware Design - Challenges and Solutions
100% (2)
Artificial Intelligence Hardware Design - Challenges and Solutions
233 pages
Individual Paper - Nina Luksha - ITEC 625 9080 - Updated
No ratings yet
Individual Paper - Nina Luksha - ITEC 625 9080 - Updated
11 pages
Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
No ratings yet
Efficient Hardware Architectures For Accelerating Deep Neural Networks Survey
42 pages
Embedded_Deep_Learning_Accelerators_A_Survey_on_Recent_Advances
No ratings yet
Embedded_Deep_Learning_Accelerators_A_Survey_on_Recent_Advances
19 pages
Futureinternet 12 00113 v2
No ratings yet
Futureinternet 12 00113 v2
22 pages
Capra 2020
No ratings yet
Capra 2020
48 pages
Embedded Deep Learning Accelerators - A Survey On Recent Advances
No ratings yet
Embedded Deep Learning Accelerators - A Survey On Recent Advances
19 pages
FPGA CNN Project Paper
No ratings yet
FPGA CNN Project Paper
31 pages
Deep NN - Theory, Tutorial and Survey
No ratings yet
Deep NN - Theory, Tutorial and Survey
32 pages
FP-DNN An Automated Framework For Mapping
No ratings yet
FP-DNN An Automated Framework For Mapping
8 pages
An Empirical Approach To Enhance Performance For Scalable CORDIC-Based Deep Neural Networks
No ratings yet
An Empirical Approach To Enhance Performance For Scalable CORDIC-Based Deep Neural Networks
32 pages
thesis-2
No ratings yet
thesis-2
144 pages
Power Efficientreconfigurable Accelerator For Deep Convolutional Neural Networks
No ratings yet
Power Efficientreconfigurable Accelerator For Deep Convolutional Neural Networks
6 pages
14280
No ratings yet
14280
47 pages
make-04-00004-v3
No ratings yet
make-04-00004-v3
37 pages
Hardware For Deep Learning Acceleration
No ratings yet
Hardware For Deep Learning Acceleration
20 pages
Advanced Intelligent Systems - 2024 - Song - Hardware for Deep Learning Acceleration
No ratings yet
Advanced Intelligent Systems - 2024 - Song - Hardware for Deep Learning Acceleration
20 pages
EXSY Apr 21 455 Proof Hi
No ratings yet
EXSY Apr 21 455 Proof Hi
14 pages
w1--Machine Learning Hardware Design for Efficiency, Flexibility, and Scalability [Feature]
No ratings yet
w1--Machine Learning Hardware Design for Efficiency, Flexibility, and Scalability [Feature]
19 pages
Efficient Hardware Architectures For Deep Convolutional Neural Network
No ratings yet
Efficient Hardware Architectures For Deep Convolutional Neural Network
13 pages
MythicWhitepaper-2019oct31
No ratings yet
MythicWhitepaper-2019oct31
9 pages
A Deep Learning Prediction Process Accelerator Based FPGA PDF
No ratings yet
A Deep Learning Prediction Process Accelerator Based FPGA PDF
4 pages
Applsci 12 10771 v2
No ratings yet
Applsci 12 10771 v2
44 pages
Accelerating Deep Neural Networks Implem
No ratings yet
Accelerating Deep Neural Networks Implem
18 pages
An_Overview_of_Efficient_Interconnection_Networks_for_Deep_Neural_Network_Accelerators
No ratings yet
An_Overview_of_Efficient_Interconnection_Networks_for_Deep_Neural_Network_Accelerators
15 pages
V Ersion: A Survey On Deep Learning Hardware Accelerators For Heterogeneous HPC Platforms
No ratings yet
V Ersion: A Survey On Deep Learning Hardware Accelerators For Heterogeneous HPC Platforms
58 pages
FPGA - Based Accelerators of Deep LearningNetworks For Learning and Classification
100% (1)
FPGA - Based Accelerators of Deep LearningNetworks For Learning and Classification
37 pages
Design and implementation of deep neural network hardware chip and its performance analysis
No ratings yet
Design and implementation of deep neural network hardware chip and its performance analysis
10 pages
inbound6702194954077661265
No ratings yet
inbound6702194954077661265
42 pages
We Are Intechopen, The World'S Leading Publisher of Open Access Books Built by Scientists, For Scientists
No ratings yet
We Are Intechopen, The World'S Leading Publisher of Open Access Books Built by Scientists, For Scientists
15 pages
Design Possibilities and Challenges of DNN Models
No ratings yet
Design Possibilities and Challenges of DNN Models
61 pages
Quantization and Deployment Od DNN On Microcontroller
No ratings yet
Quantization and Deployment Od DNN On Microcontroller
34 pages
Introduction To Hardware Accelerator Systems For Artificial Intelligence and Machine Learning
No ratings yet
Introduction To Hardware Accelerator Systems For Artificial Intelligence and Machine Learning
21 pages
Hardware Accleration For ML
No ratings yet
Hardware Accleration For ML
26 pages
A_Survey_Comparing_Specialized_Hardware_And_Evolution_In_TPUs_For_Neural_Networks
No ratings yet
A_Survey_Comparing_Specialized_Hardware_And_Evolution_In_TPUs_For_Neural_Networks
7 pages
Hardware Architectures For Deep Neural Networks: ISCA Tutorial June 24, 2017
No ratings yet
Hardware Architectures For Deep Neural Networks: ISCA Tutorial June 24, 2017
290 pages
DaDianNao_A_Machine-Learning_Supercomputer
No ratings yet
DaDianNao_A_Machine-Learning_Supercomputer
14 pages
EECS251Leture-JennyHuang 2021
No ratings yet
EECS251Leture-JennyHuang 2021
67 pages
Accelerating Binarized Neural Networks Comparison of FPGA CPU GPU and ASIC
No ratings yet
Accelerating Binarized Neural Networks Comparison of FPGA CPU GPU and ASIC
8 pages
DNN Accelerators For Heterogeneous HPC
No ratings yet
DNN Accelerators For Heterogeneous HPC
53 pages
Architecture Design For Highly Flexible and Energy-Efficient Deep Neural Network Accelerators
No ratings yet
Architecture Design For Highly Flexible and Energy-Efficient Deep Neural Network Accelerators
147 pages
Systolic Array Architecture For Educational Use
No ratings yet
Systolic Array Architecture For Educational Use
6 pages
Efficient Deep Learning Infrastructures For Embedded Computing Systems: A Comprehensive Survey and Future Envision
No ratings yet
Efficient Deep Learning Infrastructures For Embedded Computing Systems: A Comprehensive Survey and Future Envision
101 pages
Benchmarking_Contemporary_Deep_Learning_Hardware_and_Frameworks_A_Survey_of_Qualitative_Metrics
No ratings yet
Benchmarking_Contemporary_Deep_Learning_Hardware_and_Frameworks_A_Survey_of_Qualitative_Metrics
8 pages
Tutorial On DNN 1 of 9 Background of DNNs
No ratings yet
Tutorial On DNN 1 of 9 Background of DNNs
65 pages
Download full Artificial Intelligence Hardware Design: Challenges and Solutions 1st Edition Albert Chun-Chen Liu ebook all chapters
100% (4)
Download full Artificial Intelligence Hardware Design: Challenges and Solutions 1st Edition Albert Chun-Chen Liu ebook all chapters
40 pages
L-0017398760-pdf
No ratings yet
L-0017398760-pdf
24 pages
Hardware Architectures For Deep Neural Networks-MIT'16
No ratings yet
Hardware Architectures For Deep Neural Networks-MIT'16
300 pages
Full Download Artificial Intelligence Hardware Design: Challenges and Solutions 1st Edition Albert Chun-Chen Liu PDF DOCX
100% (1)
Full Download Artificial Intelligence Hardware Design: Challenges and Solutions 1st Edition Albert Chun-Chen Liu PDF DOCX
50 pages
Isvlsi2019 SS
No ratings yet
Isvlsi2019 SS
7 pages
NoC Based DNN Accelerators
No ratings yet
NoC Based DNN Accelerators
8 pages
02-92
No ratings yet
02-92
15 pages
Engineering: Yiran Chen, Yuan Xie, Linghao Song, Fan Chen, Tianqi Tang
No ratings yet
Engineering: Yiran Chen, Yuan Xie, Linghao Song, Fan Chen, Tianqi Tang
11 pages
CNN hw1
No ratings yet
CNN hw1
13 pages
Deep Physical Neural Networks Enabled by A Backpro
No ratings yet
Deep Physical Neural Networks Enabled by A Backpro
60 pages
Paper 8
No ratings yet
Paper 8
7 pages
A Mixed-Pruning Based Framework for Embedded Convolutional Neural Network Acceleration
No ratings yet
A Mixed-Pruning Based Framework for Embedded Convolutional Neural Network Acceleration
10 pages
1119810450-3
No ratings yet
1119810450-3
6 pages
AI-Focused Hardware
From Everand
AI-Focused Hardware
Kai Turing
No ratings yet
CUDA Programming in C: From Basics to Expert Proficiency
From Everand
CUDA Programming in C: From Basics to Expert Proficiency
William Smith
No ratings yet
Heterogeneous Multicore Processor Technologies for Embedded Systems Compress
No ratings yet
Heterogeneous Multicore Processor Technologies for Embedded Systems Compress
233 pages
Chapter 5.1-5.6 Memory
No ratings yet
Chapter 5.1-5.6 Memory
26 pages
DSP ch05 S1,2P
No ratings yet
DSP ch05 S1,2P
37 pages
DSP ch04 S6,7P
No ratings yet
DSP ch04 S6,7P
70 pages
DSP ch03 S9P
No ratings yet
DSP ch03 S9P
17 pages
Accelerating DNN Training in Wireless Federated Edge Learning
No ratings yet
Accelerating DNN Training in Wireless Federated Edge Learning
30 pages
Q1M2 Illustrating and Simplifying RAE
No ratings yet
Q1M2 Illustrating and Simplifying RAE
25 pages
Als CG
No ratings yet
Als CG
35 pages
Vehicle To Vehicle Communication Whitepaper
100% (1)
Vehicle To Vehicle Communication Whitepaper
13 pages
75 Digital Tools and Apps Teachers Can Use To Support Formative Assessment in The Classroom
100% (1)
75 Digital Tools and Apps Teachers Can Use To Support Formative Assessment in The Classroom
5 pages
Modeling Business Objectives
No ratings yet
Modeling Business Objectives
20 pages
Job Entry Tree Vi
No ratings yet
Job Entry Tree Vi
3 pages
Why To Learn C++: C++ Is A Middle-Level Programming Language Developed by Bjarne Stroustrup Starting
No ratings yet
Why To Learn C++: C++ Is A Middle-Level Programming Language Developed by Bjarne Stroustrup Starting
12 pages
Ucs422 Partc Group Assingment
No ratings yet
Ucs422 Partc Group Assingment
10 pages
unit 6.TIẾNG ANH BỔ TRỢ (GRAMMAR+READING)
No ratings yet
unit 6.TIẾNG ANH BỔ TRỢ (GRAMMAR+READING)
21 pages
Ilovepdf Merged
0% (1)
Ilovepdf Merged
20 pages
Gate Project
No ratings yet
Gate Project
77 pages
CH 06
No ratings yet
CH 06
75 pages
Paradox REM101
No ratings yet
Paradox REM101
2 pages
Microwave Line of Sight
No ratings yet
Microwave Line of Sight
10 pages
V6.15.Draft RFP For FMS and O&M of Non IT Equipment of DR Site Jodhpur - PD - 29.05.23
No ratings yet
V6.15.Draft RFP For FMS and O&M of Non IT Equipment of DR Site Jodhpur - PD - 29.05.23
135 pages
IBM Annual Report 2023
No ratings yet
IBM Annual Report 2023
128 pages
Advance Presentation Skills
No ratings yet
Advance Presentation Skills
11 pages
CHS NC II - Training Regulations
No ratings yet
CHS NC II - Training Regulations
69 pages
Colour Theory and Colour Matching Optional
No ratings yet
Colour Theory and Colour Matching Optional
2 pages
Esc201: Introduction To Electronics: Amit Verma Dept. of Electrical Engineering Iit Kanpur
No ratings yet
Esc201: Introduction To Electronics: Amit Verma Dept. of Electrical Engineering Iit Kanpur
30 pages
history-of-cryptography-and-cryptanalysis-codes-ciphers-and-their-algorithms-1nbsped-3319904426-978-3319904429_compress (1) (1)
No ratings yet
history-of-cryptography-and-cryptanalysis-codes-ciphers-and-their-algorithms-1nbsped-3319904426-978-3319904429_compress (1) (1)
300 pages
Chapter One Introduction and Literature Survey
No ratings yet
Chapter One Introduction and Literature Survey
34 pages
ITPLUS2-Module 3
No ratings yet
ITPLUS2-Module 3
3 pages
Programmable Logic Controller: Engr - Jama Adam Salah
No ratings yet
Programmable Logic Controller: Engr - Jama Adam Salah
158 pages
GRC & Open-AudIT - Final
No ratings yet
GRC & Open-AudIT - Final
69 pages
S6 Sem Syllabus Computer Science (Old Scheme)
No ratings yet
S6 Sem Syllabus Computer Science (Old Scheme)
6 pages
MBE - MBE900 (2007 & Newer) .MBE900 EPA07 PDF
100% (1)
MBE - MBE900 (2007 & Newer) .MBE900 EPA07 PDF
11 pages
Crime Type and Occurrence Prediction Using Machine Learning
No ratings yet
Crime Type and Occurrence Prediction Using Machine Learning
28 pages
Universal Verification Methodology Based Verification Environment For PCIE Data Link Layer
No ratings yet
Universal Verification Methodology Based Verification Environment For PCIE Data Link Layer
5 pages
Session 6
No ratings yet
Session 6
19 pages