Design and Implementation of Hardware Computation for Convolutional Neural Networks


1st Given Name Surname
dept. name of organization (of Aff.)
name of organization (of Aff.)
City, Country
email address or ORCID

Abstract—Convolutional Neural Networks (CNNs) are vital in artificial intelligence and machine learning, especially for image processing and recognition. They are widely used in facial recognition, object detection, and image classification, significantly improving system performance and accuracy. However, deploying CNNs on hardware poses challenges due to their high computational and memory requirements and the complex computations arising from the weight-sharing mechanism used in CNNs. Designing efficient hardware accelerators involves balancing speed, power consumption, and resource usage. In this research, a computation unit for CNNs is designed and implemented, comprising a convolutional accelerator, a max-pooling layer, fully connected layers, and a softmax activation function. The study uses a dataflow called weight stationary (WS) to minimize data movement and to reuse partial sums on a spatial architecture with an array of processing elements. In particular, the softmax activation function is implemented with a Look-Up Table (LUT) technique to build a complete AlexNet (batch size N = 1) for the handwritten digit recognition task on the MNIST dataset, with a fixed-point representation for all data. The design utilizes 125,021 Flip-Flops, 624 Distributed RAM (LUTRAM), 92,727 Look-Up Tables (LUTs), 269 Input/Output (I/O) pins, 2 Global Buffers (BUFG), 162.5 Block RAMs (BRAM), and 183 Digital Signal Processors (DSPs) at a frequency of 100 MHz on the ZCU104 board. The system achieves an accuracy of 98% in software and 95% after hardware simulation, executes the convolutional layers at 33.5 frames per second, and consumes a total of 4.87 W for the entire network.

Index Terms—Convolutional neural networks (CNNs), FPGA, weight stationary, spatial architecture.

I. INTRODUCTION

Convolutional Neural Networks (CNNs) [1], a specialized form of Deep Neural Networks (DNNs) [2], have revolutionized the field of artificial intelligence by significantly enhancing the capability of computers to interpret and analyze visual data. Unlike traditional neural networks, CNNs are designed to automatically and adaptively learn spatial hierarchies [3] of features from images through convolutional layers. This ability to capture local patterns and spatial relationships makes CNNs particularly effective for image processing tasks such as facial recognition, object detection, and image classification.

However, state-of-the-art CNNs [4] require tens to hundreds of megabytes of parameters and involve billions of operations per inference pass. This demands substantial data movement between on-chip and off-chip memory to support the computation. Since data movement can consume more energy than the computation itself [5], optimizing CNN processing involves not only achieving high parallelism for increased throughput but also improving the efficiency of data movement across the system. To address these challenges, it is crucial to design a compute scheme, called a dataflow, that supports a highly parallel compute paradigm while optimizing the energy cost of both on-chip and off-chip data movement. The cost of data movement is reduced by exploiting data reuse in a multilevel memory hierarchy.

Almost all existing FPGA-based CNN implementations have focused on the limitations of memory bandwidth and computing parallelism. Works such as [5] and [6] alleviate pressure on off-chip memory by reducing the precision of the neural network parameters, as lower numerical precision has been shown to be sufficient for CNNs. Other studies, such as [6] and [7], have exploited fixed-point quantization, loop and task pipelining, loop unrolling, parallelization, and loop tiling to increase throughput and memory bandwidth while minimizing FPGA resource requirements. Regarding energy efficiency, reference [8] employs a binary weight method, converting CNN computations into multiplication-free processing. In [9] and [10], all layers are processed in a computing unit called a matrix multiplication engine, and a hierarchical memory structure with on-chip buffers reduces the bandwidth limitations of off-chip memory. However, these studies have not yet established a comprehensive dataflow that effectively addresses the challenges of data movement and energy efficiency in CNN processing.

In this study, a hardware computation unit for CNNs is implemented, including convolutional computation (CONV), max-pooling (POOL), fully connected layers (FC), and a softmax activation function. Most of the computation in CNNs comes from the convolutional layers. To optimize performance, the key contributions of this work are:

(1) A spatial architecture using an array of PEs whose size depends on the size of the kernel matrix.
(2) A dataflow called weight stationary, in which the weights are kept fixed within the array of Processing Elements (PEs).
(3) A hierarchical memory structure and asynchronous FIFO on-chip buffers that reduce off-chip memory accesses and enable data reuse.
(4) A fixed-point representation that reduces computational complexity and improves hardware efficiency.
(5) Activation function approximation using a lookup table method: by precomputing and storing the values of the softmax function in a LUT, the need for complex calculations during inference is significantly reduced.

TABLE I
SHAPE PARAMETERS OF A CNN LAYER

Shape parameter   Description
N                 batch size
M                 number of filters / ofmap channels
C                 number of ifmap channels
H/W               ifmap height/width
R/S               filter height/width
E/F               ofmap height/width
This paper is organized as follows. Section II provides fundamental background on the 3-D convolution operation, Section III covers the workflow of the software-hardware co-design, and Section IV describes the architecture of the accelerator and the main characteristics of this study, including the dataflow, the processing element array, and the asynchronous FIFOs.

II. BACKGROUND OF CNNs


CNNs are constructed from multiple computational layers organized as a directed acyclic graph (DAG) [5]. Each layer extracts an abstraction of the data provided by the previous layer, which is referred to as a feature map (fmap). The most common layers in CNNs are convolution (CONV), pooling (POOL), and fully connected (FC) layers. In CONV layers, as illustrated in Fig. 3, two-dimensional (2-D) filters slide over the input images or feature maps (ifmaps), performing convolution operations to extract feature characteristics from local regions and generating output images or feature maps (ofmaps). In the case of three-dimensional (3-D) convolution, a batch of 3-D ifmaps is processed by a group of 3-D filters in a layer. In addition, a 1-D bias is added to the filtering results. Given the shape parameters in Table I, the computation of a layer is defined as

O[z][u][x][y] = ReLU( B[u] + Σ_{k=0}^{C−1} Σ_{i=0}^{R−1} Σ_{j=0}^{S−1} I[z][k][U·x + i][U·y + j] · W[u][k][i][j] ),   (1)

0 ≤ z < N,  0 ≤ u < M,  0 ≤ y < E,  0 ≤ x < F,
E = (H − R + U) / U,  F = (W − S + U) / U,   (2)

where O, I, W, and B are the matrices of the ofmaps, ifmaps, filters, and biases, respectively, and U is a given stride size. Fig. 4 shows a visualization of this computation (ignoring biases). After the convolutions, activation functions, such as the rectified linear unit (ReLU) [6], are applied to introduce nonlinearity.
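For reference, the layer computation in (1)–(2) can be written directly as a nested loop. The sketch below is a plain software model using the shape parameters of Table I; it only illustrates the arithmetic and is not the hardware implementation.

```python
import numpy as np

def conv_layer(ifmap, weights, bias, U=1):
    """Naive evaluation of the CONV layer in Eqs. (1)-(2).
    ifmap:   (N, C, H, W) input feature maps
    weights: (M, C, R, S) filters
    bias:    (M,)         1-D biases
    U: stride. Returns the ofmaps with shape (N, M, E, F)."""
    N, C, H, W = ifmap.shape
    M, _, R, S = weights.shape
    E = (H - R + U) // U                        # Eq. (2)
    F = (W - S + U) // U                        # Eq. (2)
    ofmap = np.zeros((N, M, E, F))
    for z in range(N):                          # batch
        for u in range(M):                      # filter / ofmap channel
            for e in range(E):                  # ofmap row
                for f in range(F):              # ofmap column
                    acc = bias[u]
                    for k in range(C):          # ifmap channel
                        for i in range(R):      # filter row
                            for j in range(S):  # filter column
                                acc += ifmap[z, k, U*e + i, U*f + j] * weights[u, k, i, j]
                    ofmap[z, u, e, f] = max(acc, 0.0)   # ReLU
    return ofmap
```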
III. HARDWARE AND SOFTWARE CO-DESIGN

The block diagram in Fig. 1 gives an overview of the workflow of this proposal. The co-design approach in this project combines software and hardware to implement a fine-tuned AlexNet model for handwritten digit recognition. On the software side, the model is trained on the MNIST dataset, where a modified AlexNet architecture is used to improve recognition accuracy for this specific task. The model's weights, obtained from training, are converted into a fixed-point representation to be compatible with the hardware requirements.

On the hardware side, the architecture is designed and implemented in RTL (Register Transfer Level), with IP verification performed to ensure the accuracy and reliability of the hardware model. The fixed-point weights from the software are transferred to the hardware environment, where they are integrated into the AlexNet network architecture. This complete hardware network is then used for real-time recognition of handwritten digit images.

The software and hardware components are connected through a feedback loop for image recognition, accuracy calculation, and performance comparison, as illustrated in Fig. 1. This collaborative framework enables efficient processing and verification, leveraging both the flexibility of software and the performance of hardware to achieve optimal results.

Fig. 1. Software and hardware co-design.

IV. SYSTEM DESIGN

A. Architecture Overview

Figure 2 illustrates the block diagram of the architecture and the memory hierarchy of the convolutional accelerator, which includes a PE array, a ping-pong buffer, a controller block, and a ReLU activation function. This block is responsible for the convolution operations, max pooling, ReLU, and the fully connected layers. The weights, biases, and input feature maps are stored in off-chip DRAM and are read into the accelerator via buffers to reduce the latency of off-chip memory accesses. The memory hierarchy consists of three levels: off-chip DRAM, a global buffer (ping-pong buffer), and registers within each PE.

Fig. 2. System architecture.

Each PE in the PE array is responsible for computing a convolution operation or max-pooling and for accumulating the result through the internal PE register and a ping-pong buffer. The ping-pong buffer is closely associated with the rows of the PE array. The accelerator is controlled by a finite state machine (FSM) in the controller block.
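The POOL layers handled by this block simply reduce each window of a feature map to its maximum value. The following is a minimal software reference for that operation; the 2×2 window and stride are illustrative choices, not values taken from the paper.

```python
import numpy as np

def max_pool2d(fmap, k=2, stride=2):
    """Reference max-pooling over a (C, H, W) feature map with a k x k window."""
    C, H, W = fmap.shape
    E = (H - k) // stride + 1
    F = (W - k) // stride + 1
    out = np.empty((C, E, F), dtype=fmap.dtype)
    for c in range(C):
        for e in range(E):
            for f in range(F):
                window = fmap[c, e*stride:e*stride + k, f*stride:f*stride + k]
                out[c, e, f] = window.max()
    return out
```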

B. Dataflow
Data flow is a major challenge when designing computing units for convolutional layers, as the computations in these layers are highly complex and involve a large amount of memory access. To optimize data movement, we use a dataflow called weight stationary. In this dataflow, the filter weights are stored statically in small local memories, such as the registers inside each PE, forming a PE array of size R×S, corresponding to the size of the kernel matrix. The input feature map (activations) is streamed in row by row with a bandwidth of one pixel per cycle, broadcasting activations and accumulating partial sums spatially across the PE array. Each activation is multiplied and accumulated with the weight stored statically in the PE. Each primitive multiplication result needs to be stored and accumulated with others to form partial sums. By using a ping-pong buffer, we can store and reuse these primitive results for subsequent references. The number of buffers needed equals the number of rows of the weight matrix, and the FIFO size depends on the row size of the input feature map (IFM). The size of the buffer is calculated as follows:

Fifo_size = (W + 2p − k) / s + 1,   (3)

where W is the IFM row size, p the padding, k the kernel size, and s the stride.
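As a quick check of (3), the sizing can be expressed as a one-line helper; the example values below (a 28-pixel ifmap row, no padding, a 3×3 kernel, stride 1) are purely illustrative.

```python
def fifo_size(W, p, k, s):
    """FIFO depth per PE-array row, Eq. (3): (W + 2p - k) / s + 1."""
    return (W + 2 * p - k) // s + 1

# Example: W = 28, p = 0, k = 3, s = 1  ->  (28 - 3) / 1 + 1 = 26 entries per row FIFO.
print(fifo_size(W=28, p=0, k=3, s=1))  # 26
```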
This architecture reduces the energy required for weight reads, maximizes the convolutional operations, and enables efficient reuse of the filter:
• Filter reuse: each filter weight is reused E × F times within one input feature map (ifm) channel.
• IFM reuse: each input feature map value is reused R × S times.

The PE array performs a 2-D convolution between the kernel and the IFM window, with each row of the PE array executing a 1-D convolution as described in Fig. 6. Initially, the first pixel, ifm1, of row 1 is pushed into the PE array, at which point psum_in in all PEs is initialized to 0. The result of the multiplication is stored in the register within each PE and passed to the adjacent PE via its psum_in signal. After W cycles, the first row has been fully read and the FIFOs are filled, ready to push one value per cycle to the psum_in input of the first PE in the row below. The sliding window shifts downward until all H rows of the IFM have been read, at which point the values in FIFO_END represent the result of the 2-D convolution of the kernel (R×S) with the IFM (R×W). However, this is not the final result: in 3-D convolution, accumulation also occurs along the depth dimension, using a buffer of size E×F to temporarily store the 2-D convolution results of each channel. The mux selects input from the FIFO for the first channel's computation and alternates for the other channels.

Fig. 5. Convolutional architecture for a kernel size of m×n.

Fig. 6. 1-D convolution.

1-D convolution primitive PE array: the weight stationary dataflow first divides the computation in (1) into 1-D convolution primitives that can all run in parallel. Each primitive operates on one filter weight and one ifmap value, producing one multiply-accumulate result. The results from different primitives are then further accumulated to generate the partial sums and ofmap values.
Fig. 3. 3-D convolutional operator.

Fig. 4. 3-D convolutional operator.

By mapping each primitive to a single PE for processing, the computation of each pair of values (ifm and weight) remains stationary within the PE. Due to the sliding-window processing of each primitive, as shown in Fig. 6, each PE can utilize local scratchpad memory (spads) for both convolutional data reuse and psum accumulation. Since only a sliding window of data needs to be retained at any given time, a single entry is required in each PE.
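To make this mapping concrete, the sketch below emulates an R×S weight-stationary PE array in software: each PE holds one weight, each PE row computes a 1-D convolution over a streamed ifmap row, and the row results are accumulated through a per-row buffer, mirroring Figs. 5 and 6. It is a behavioral model for illustration only, not the RTL.

```python
import numpy as np

def conv2d_weight_stationary(ifm, kernel, stride=1):
    """Behavioral model of the R x S weight-stationary PE array."""
    R, S = kernel.shape
    H, W = ifm.shape
    E = (H - R) // stride + 1
    F = (W - S) // stride + 1
    out = np.zeros((E, F))
    for e in range(E):
        psum_row = np.zeros(F)              # plays the role of the row FIFO
        for i in range(R):                  # PE-array row i holds kernel[i, :]
            ifm_row = ifm[e * stride + i]
            for f in range(F):              # activations streamed one pixel per cycle
                mac = 0.0
                for j in range(S):          # S PEs per row, one stationary weight each
                    mac += ifm_row[f * stride + j] * kernel[i, j]
                psum_row[f] += mac          # accumulate partial sums across PE rows
        out[e] = psum_row
    return out

# Sanity check on a small ifmap
ifm = np.arange(25, dtype=float).reshape(5, 5)
print(conv2d_weight_stationary(ifm, np.ones((3, 3))))
```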

C. Fixed-point representation

To deploy a neural network model onto hardware, all weights must be converted into fixed-point representations. To determine the exact number of bits needed for the fixed-point representation, we need to identify the output range of each layer in the AlexNet network. The weights in the network layers are trained based on the AlexNet model, which has been fine-tuned for the handwritten digit recognition task on the MNIST dataset. This set of weights achieves a recognition accuracy of 98.62% on the MNIST test set. Figure 8 visualizes the value range of the weights across the layers of the pre-trained network. From the figure, it can be observed that most of the weights in all layers of the network fall within the range [-1, 1]. Therefore, only one bit is needed to represent the sign, and no bits are required for the integer part.

TABLE II
FIXED-POINT REPRESENTATION FOR DATA

Data type                 Sign bit   Integer part   Fraction part
Parameter                 1          0              12
Activation / image input  1          9              12
Input of softmax          1          3              13
Output of softmax         1          15             16
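As an illustration of this conversion, the formats of Table II can be applied to the trained weights in software before they are exported to the hardware. The rounding and saturation policy below is an assumption made for the sketch; the paper does not specify it.

```python
import numpy as np

def to_fixed(x, int_bits, frac_bits):
    """Quantize x to a signed fixed-point code with 1 sign bit,
    int_bits integer bits and frac_bits fraction bits (Table II)."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits))        # most negative code
    hi = (1 << (int_bits + frac_bits)) - 1     # most positive code
    return np.clip(np.round(np.asarray(x) * scale), lo, hi).astype(np.int32)

def from_fixed(code, frac_bits):
    return code.astype(np.float64) / (1 << frac_bits)

# Parameters (weights) use 1 sign bit, 0 integer bits and 12 fraction bits,
# since the trained weights lie in [-1, 1].
w = np.array([-0.9987, 0.0421, 0.7303])
w_q = to_fixed(w, int_bits=0, frac_bits=12)
print(w_q, from_fixed(w_q, frac_bits=12))
```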
D. Softmax function
The softmax function is commonly used in the final layer of a CNN and plays a crucial role in the hardware implementation of the CNN. The function is given by formula (4), which shows that the highest computational cost in the hardware implementation of the softmax function lies in the computation of the exponential function:

softmax(x_i) = e^{x_i} / Σ_{j=1}^{n} e^{x_j}.   (4)

A simple way to implement the exponential function is by using a look-up table (LUT), which helps avoid the need for division in hardware. Instead of calculating the softmax function directly, one can compute the inverse of the softmax function:

1 / softmax(x)_i = Σ_{j=1}^{N} e^{x_j − x_i}.   (5)

Because of normalization, the input data of the softmax layer in a DNN is generally not too large. In this study's model, the input data range is [-5, 5], and the total number of input data points is 81,920. As described in Table II, the hardware input data is fixed to 17 bits, with 1 sign bit, 3 integer bits, and 13 fraction bits, and the output is 32 bits. The absolute error between the output and the floating-point result from software calculations does not exceed 4.5 × 10⁻⁶, and the relative error does not exceed 0.88%.

The default address, without the data index, is 0. According to this method, the fixed-point number of each input data value x can be used as its lookup-table address to index its exponential value e^x. The detailed mapping relationship between the input fixed-point numbers and the lookup-table addresses is shown in Table III below.

Fig. 7. Overall structure of the softmax function.

TABLE III
LOOK-UP TABLE FOR THE EXPONENTIAL FUNCTION

Address (17 bit)        Fixed-point number (32 bit)         True value
1 011 0000000000000     0000000000000000000000000000001     Exp(-5.00)
...                     ...                                 ...
1 110 1010100011110     00000000000000000001000110001100    Exp(-2.32)
...                     ...                                 ...
0 001 0001100111110     00000000000010010000110010000000    Exp(1.10)
...                     ...                                 ...
0 001 001111111110      00000000011001111111011010111101    Exp(2.32)
...                     ...                                 ...
0 101 0000000000000     01010110000010100111011100000000    Exp(5.00)
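A software model of this LUT-based scheme, following (5) and the formats in Table II, is sketched below. It assumes the stated input range of [-5, 5]; the clamping of out-of-range differences and the plain fixed-point indexing are simplifications of the address mapping in Table III.

```python
import numpy as np

FRAC_IN, FRAC_OUT = 13, 16                        # fraction bits of softmax input/output
LO, HI = -5 * 2**FRAC_IN, 5 * 2**FRAC_IN          # representable input codes for [-5, 5]

# Precompute e^x for every representable fixed-point input (the role of Table III).
EXP_LUT = np.exp(np.arange(LO, HI + 1) / 2**FRAC_IN)

def exp_lut(code):
    """Look up e^code for a fixed-point integer code, clamped to the table range."""
    return EXP_LUT[int(np.clip(code, LO, HI)) - LO]

def softmax_lut(x):
    """Softmax via the inverse form of Eq. (5): 1/softmax(x)_i = sum_j e^(x_j - x_i)."""
    q = np.round(np.asarray(x) * 2**FRAC_IN).astype(int)
    inv = np.array([sum(exp_lut(xj - xi) for xj in q) for xi in q])
    return np.round((1.0 / inv) * 2**FRAC_OUT) / 2**FRAC_OUT

logits = np.array([1.10, -2.32, 0.50])
print(softmax_lut(logits))                        # LUT-based result
print(np.exp(logits) / np.exp(logits).sum())      # floating-point reference
```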
V. EXPERIMENTAL SETUP AND RESULTS

The design, training, and extraction of the post-training parameters of the network were carried out on Google Colab with GPU support (Tesla T4), using the PyTorch library, and all network weights are of the float data type. The model used for this experimental task is based on the AlexNet architecture, which has been fine-tuned to meet the requirements of the task, as described in Fig. 9; the details of the model are summarized in Table VII. The model was trained using the Stochastic Gradient Descent (SGD) method with the following configuration parameters: image dataset: MNIST; number of training samples: 60,000 images; learning rate: 0.005; momentum: 0.8; batch size: 32; epochs: 20.
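This training configuration corresponds to a standard PyTorch setup along the following lines. The sketch is illustrative: the stock torchvision AlexNet with resized, channel-replicated MNIST inputs stands in for the paper's fine-tuned variant (Fig. 9), whose exact layer modifications are not reproduced here.

```python
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# MNIST (60,000 training images), resized and replicated to 3 channels for an AlexNet-style input.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
])
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)     # batch size 32

model = models.alexnet(num_classes=10)            # stand-in for the fine-tuned AlexNet
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.005, momentum=0.8)     # SGD, lr 0.005, momentum 0.8

for epoch in range(20):                           # 20 epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```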
An independent test dataset, separate from the training set and consisting of 10,000 images of the digits 0 to 9, is used for evaluation. The software testing is performed by a Python-based program. According to Table IV, the pre-trained neural network with float-type weights achieves an accuracy of 98.6%.

The deployed architecture operates at a frequency of 100 MHz on the Zynq UltraScale+ ZCU104 FPGA board, achieving a maximum processing speed of 33.5 FPS with a power consumption of 4.8 W. The per-class accuracy of the RTL-based recognition on the FPGA is reported in Table V, and the total resource usage of the entire network is summarized in Table VI.

Table VII provides a detailed summary of the resource utilization of each convolutional and fully connected layer of the CNN. It is evident that as the sliding-window size of a convolutional layer increases, so does the number of LUTs, FFs, and DSPs. Moreover, although the CONV3, CONV4, and CONV5 layers have different numbers of input and output channels, they consume a similar amount of resources. This is because each channel computation is stored in a fixed buffer, allowing this data to be reused for subsequent channel passes, highlighting the data reuse capability of partial sums across channels.
Fig. 8. Range of weight values across hidden layers.

Table IV
ACCURACY OF CNN-BASED RECOGNITION ON SOFTWARE AFTER TRAINING

Class 0 1 2 3 4 5 6 7 8 9 Average
Accuracy 99.18% 99.30% 98.55% 98.02% 97.96% 98.88% 98.23% 99.03% 98.56% 98.41% 98.6%

Table V
ACCURACY OF RTL-BASED RECOGNITION ON FPGA

Class 0 1 2 3 4 5 6 7 8 9 Average
Accuracy 96.67% 96.64% 96.75% 94.44% 94.31% 89.05% 96.34% 98.40% 95.65% 94.51% 95.3%

Table VI
SUMMARY OF TOTAL RESOURCE USAGE FOR THE ENTIRE NETWORK ON FPGA

Resource   Used     Available   Utilization
LUT 93727 230400 40.68%
LUTRAM 624 101760 0.61%
FF 125021 460800 27.13%
BRAM 162.5 312 52.08%
I/O 269 360 74.72%
BUFG 2 544 0.37%
DSP 183 1728 10.59%

Table VII
SUMMARY TABLE OF METRICS FOR THE FINE-TUNED ALEXNET NETWORK

Layer       Power (mW)   Num. of MAC   Num. of PE   LUT     FF       DSP
CONV1       333          11.71M        121          34303   73793    121
CONV2       84           37.32M        25           8166    18143    25
CONV3       26           12.46M        9            2301    4551     9
CONV4       26           24.92M        9            2301    4551     9
CONV5       26           12.46M        9            2301    4549     9
Sum (all)   4.8 W        104.65M       183          93727   125021   183

Fig. 9. Fine-tuned AlexNet network architecture.

VI. CONCLUSION

In this study, we presented a hardware architecture tailored for Convolutional Neural Networks (CNNs) with a focus on efficient computation, data movement reduction, and resource management. By implementing a weight-stationary dataflow on a spatially arranged array of processing elements, we minimized data transfers and maximized filter reuse, leading to a significant reduction in energy requirements. Our design incorporates a fixed-point representation and a Look-Up-Table-based softmax function to streamline computation and optimize hardware efficiency. The architecture was successfully validated using the fine-tuned AlexNet model on the MNIST dataset, achieving a recognition accuracy of 95% on hardware compared to 98% in software simulation, with a real-time performance of 33.5 frames per second. The proposed CNN accelerator efficiently utilizes the available FPGA resources, as demonstrated by the experimental results, which confirm its potential for deploying CNNs on low-power embedded systems and pave the way for further advances in energy-efficient AI hardware solutions.

REFERENCES

[1] Y.-H. Chen, T. Krishna, and J. S. Emer, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[2] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," in Proceedings of the 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.
[3] S. I. Venieris and C.-S. Bouganis, "fpgaConvNet: Automated mapping of convolutional neural networks on FPGAs," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017, pp. 291–292.
[4] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017, pp. 45–54.
[5] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2016, pp. 26–35.
[6] A. Aimar, H. Mostafa, E. Calabrese, A. Rios-Navarro, R. Tapiador-Morales, I.-A. Lungu, M. B. Milde, F. Corradi, A. Linares-Barranco, S.-C. Liu, and T. Delbruck, "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps."
[7] R. Zhao, X. Niu, and W. Luk, "Automatic optimising CNN with depthwise separable convolution on FPGA: (abstract only)," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2018, pp. 285–285.
[8] X. Dong, X. Zhu, and D. Ma, "Hardware implementation of softmax function based on piecewise LUT," in 2019 IEEE International Workshop on Future Computing (IWOFC), 2019.
