Design and Implementation of Hardware Computation For Convolutional Neural Networks
Abstract—Convolutional Neural Networks (CNNs) are vital in artificial intelligence and machine learning, especially for image processing and recognition. They are widely used in facial recognition, object detection, and image classification, significantly improving system performance and accuracy. However, deploying CNNs on hardware poses challenges due to their high computational and memory requirements and the complex computations arising from the weight-sharing mechanism used in CNNs. Designing efficient hardware accelerators involves balancing speed, power consumption, and resource usage. In this research, the design and implementation of a computation unit for CNNs include a convolutional accelerator, a max-pooling layer, fully connected layers, and a softmax activation function. This study utilizes a dataflow called weight stationary (WS) to minimize data movement and reuse partial sums, based on a spatial architecture with an array of processing elements. Specifically, a softmax activation function is implemented using a Look-Up Table (LUT) technique to construct a complete AlexNet (batch size N = 1) for the handwritten digit recognition task using the MNIST dataset and a fixed-point representation for data. The design utilizes 125,021 Flip-Flops, 624 Distributed RAM (LUTRAM), 92,727 Look-Up Tables (LUTs), 269 Input/Output (I/O) pins, 2 Global Buffers (BUFG), 162.5 Block RAMs (BRAM), and 183 Digital Signal Processors (DSPs) at a frequency of 100 MHz on the ZCU104 board. The system achieves an accuracy of 98% in software and 95% after hardware simulation. The system executes the convolutional layer at 33.5 frames per second, with the total power consumption of the entire network being 4.87 W.

Index Terms—Convolutional neural networks (CNNs), FPGA, weight stationary, spatial architecture.

I. INTRODUCTION

Convolutional Neural Networks (CNNs) [1], a specialized form of Deep Neural Networks (DNNs) [2], have revolutionized the field of artificial intelligence by significantly enhancing the capability of computers to interpret and analyze visual data. Unlike traditional neural networks, CNNs are designed to automatically and adaptively learn spatial hierarchies [3] of features from images through convolutional layers. This ability to capture local patterns and spatial relationships makes CNNs particularly effective for image processing tasks such as facial recognition, object detection, and image classification.

However, state-of-the-art CNNs [4] require tens to hundreds of megabytes of parameters and involve billions of operations per inference pass. This demands substantial data movement between on-chip and off-chip memory to support computation. Since data movement can consume more energy than the computation itself [5], optimizing CNN processing involves not only achieving high parallelism for increased throughput but also enhancing the efficiency of data movement across the system. To address these challenges, it is crucial to design a compute scheme, called a dataflow, that can support a highly parallel compute paradigm while optimizing the energy cost of both on-chip and off-chip data movement. The cost of data movement is reduced by exploiting data reuse in a multilevel memory hierarchy.

Almost all existing FPGA-based CNN implementations have focused on exploring the limitations of memory bandwidth and computing parallelism. Works such as [5] and [6] alleviate pressure on off-chip memory by reducing the precision of the neural network parameters, as lower numerical precision has been shown to be sufficient for CNNs. Other studies, such as [6] and [7], have exploited fixed-point quantization, loop and task pipelining, loop unrolling, parallelization, and loop tiling to enhance throughput and memory bandwidth while lowering FPGA resource requirements. Regarding energy efficiency, reference [8] emphasizes this aspect by employing a binary-weight method, converting CNN computations into multiplication-free processing. In [9] and [10], all layers are processed in a computing unit called a matrix multiplication engine, and the utilization of a hierarchical memory structure and on-chip buffers reduces the bandwidth limitations of off-chip memory. However, these studies have not yet established a comprehensive dataflow that effectively addresses the challenges of data movement and energy efficiency in CNN processing.

In this study, a hardware computation unit for CNNs is implemented, including convolutional computation (CONV), max-pooling (POOL), fully connected layers (FC), and a softmax activation function. Most of the computation in a CNN comes from the convolutional layers. To optimize performance, the key contributions of this work are:

(1) A spatial architecture using an array of PEs, where the size of the array depends on the size of the kernel matrix.
(2) A dataflow called weight stationary, in which weights are kept fixed within the array of Processing Elements (PEs).
(3) The utilization of a hierarchical memory structure and asynchronous FIFO on-chip buffers, which reduces off-chip memory accesses and enables data reuse.
(4) A fixed-point representation to reduce computational complexity and improve hardware efficiency.
(5) Activation function approximation using a lookup table method: by precomputing and storing the values of the softmax function in a LUT, we can significantly reduce the need for complex calculations during inference.
This paper is organized as follows. Section II provides fundamental knowledge of the 3-D convolution operation, Section III covers the workflow of the software-hardware co-design, and Section IV describes the architecture of the accelerator and the characteristics of this study, including the dataflow, the processing element array, and the asynchronous FIFO.
II. 3-D CONVOLUTION OPERATION

Table I
SHAPE PARAMETERS OF A CNN LAYER

Shape parameter    Description
N                  batch size
M                  number of filters / ofmap channels
C                  number of ifmap channels
H/W                ifmap height/width
R/S                filter height/width
E/F                ofmap height/width

A convolutional layer computes each output feature map (ofmap) element from the input feature maps (ifmaps), filters, and biases as

O[z][u][x][y] = B[u] + \sum_{k=0}^{C-1} \sum_{i=0}^{R-1} \sum_{j=0}^{S-1} I[z][k][Ux+i][Uy+j] \cdot W[u][k][i][j],    (2)

0 \le z < N,  0 \le u < M,  0 \le y < E,  0 \le x < F,

E = (H - R + U)/U,    F = (W - S + U)/U,

where O, I, W, and B are the matrices of the ofmaps, ifmaps, filters, and biases, respectively, and U is a given stride size. Fig. 3 shows a visualization of this computation (ignoring biases). After the convolutions, activation functions, such as the rectified linear unit (ReLU) [6], are applied to introduce nonlinearity.

Fig. 3. 3-D convolutional operator.
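For reference, the following is a minimal NumPy sketch that evaluates Eq. (2) directly using the shape parameters of Table I. The function name, argument layout, and the application of ReLU at the end are our illustrative choices; this is a behavioral reference, not the accelerator's implementation.

```python
import numpy as np

def conv_layer(I, W, B, U=1):
    """Direct evaluation of Eq. (2).
    I: ifmaps of shape (N, C, H, W_in); W: filters of shape (M, C, R, S); B: biases of shape (M,)."""
    N, C, H, W_in = I.shape
    M, _, R, S = W.shape
    E = (H - R) // U + 1          # ofmap height, as in E = (H - R + U) / U
    F = (W_in - S) // U + 1       # ofmap width,  as in F = (W - S + U) / U
    O = np.zeros((N, M, E, F))
    for z in range(N):            # batch
        for u in range(M):        # output channel (filter)
            for e in range(E):    # ofmap row
                for f in range(F):  # ofmap column
                    window = I[z, :, U*e:U*e + R, U*f:U*f + S]   # (C, R, S) receptive field
                    O[z, u, e, f] = B[u] + np.sum(window * W[u])
    return np.maximum(O, 0)       # ReLU applied after the convolution, as described above
```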
III. HARDWARE AND SOFTWARE CO-DESIGN

The block diagram in Fig. 1 gives an overview of the workflow of this proposal. In the co-design approach of this project, the parameters obtained in software are transferred to the hardware environment, where they are integrated into the AlexNet network architecture. This complete hardware network is then used for real-time recognition of handwritten digit images.

The software and hardware components are connected through a feedback loop for image recognition, accuracy calculation, and performance comparison, as illustrated in Figure 1. This collaborative framework enables efficient processing and verification, leveraging both the flexibility of software and the performance of hardware to achieve optimal results.

IV. SYSTEM DESIGN

A. Architecture Overview

Figure 2 illustrates the block diagram of the architecture and memory hierarchy of the convolutional accelerator, which includes a PE array, a ping-pong buffer, a controller block, and a ReLU activation function. This block is responsible for convolution operations, max pooling, ReLU, and fully connected layers. The weights, biases, and input feature maps are stored in off-chip DRAM and are read into the accelerator via buffers to reduce latency when accessing off-chip memory. The memory hierarchy consists of three levels: off-chip DRAM, a global buffer (ping-pong buffer), and registers within each PE.

Fig. 2. System architecture.

Each PE in the PE array is responsible for computing a convolution operation or max-pooling and for accumulating the result through the internal PE register and a ping-pong buffer. The ping-pong buffer is closely associated with the rows of the PE array. The accelerator is controlled by a finite state machine (FSM) in the controller block.
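To make the role of the global (ping-pong) buffer concrete, the sketch below models double buffering in Python: the PE array reads one bank while the other bank is refilled from DRAM, and the two banks swap roles after each tile. This is a behavioral illustration only; the class and method names are assumed for this sketch and do not come from the RTL.

```python
class PingPongBuffer:
    """Behavioral model of a double buffer: one bank is read by the PE array
    while the other bank is being filled from off-chip DRAM."""

    def __init__(self, depth):
        self.banks = [[None] * depth, [None] * depth]
        self.active = 0  # bank currently read by the PE array

    def fill(self, data):
        """Write the next tile into the inactive (shadow) bank: the DRAM-side port."""
        shadow = 1 - self.active
        self.banks[shadow][:len(data)] = data

    def read(self):
        """The PE array consumes the active bank: the compute-side port."""
        return self.banks[self.active]

    def swap(self):
        """Swap roles once the PE array finishes the current tile."""
        self.active = 1 - self.active


# Usage: overlap computation on tile k with prefetching of tile k + 1.
buf = PingPongBuffer(depth=8)
tiles = [[i] * 8 for i in range(3)]
buf.fill(tiles[0]); buf.swap()
for k in range(len(tiles)):
    if k + 1 < len(tiles):
        buf.fill(tiles[k + 1])                              # prefetch next tile
    _ = sum(x for x in buf.read() if x is not None)         # PE array works on the active bank
    buf.swap()
```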
B. Dataflow

Dataflow is a major challenge when designing computing units for convolutional layers, as computations in these layers are highly complex and involve a large amount of memory access. To optimize data movement, we use a dataflow called weight stationary. In this dataflow, the filter weights are stored statically in small local memories, such as the registers inside each PE, forming a PE array of size RxS corresponding to the size of the kernel matrix. The input feature map (activations) is streamed row by row with a bandwidth of 1 pixel per cycle, broadcasting activations and accumulating partial sums spatially across the PE array. Each activation is multiplied and accumulated with the weight stored statically in the PE. Each primitive multiplication result needs to be stored and accumulated with others to form partial sums. By using a ping-pong buffer, we can store and reuse the primitive results for subsequent references. The number of buffers needed is equal to the number of rows of the weight matrix, and the size of each FIFO depends on the row size of the input feature map (IFM). The size of the buffer is calculated as follows:

FIFO_size = (W + 2p - k)/s + 1    (3)

Fig. 5. Convolutional architecture for a kernel size of 3x3.

This architecture reduces the energy required for weight reads, maximizes convolutional operations, and enables efficient reuse of the filter:

• Filter reuse: each filter weight is reused E x F times within one input feature map (ifm) channel.
• IFM reuse: each input feature map (ifm) pixel is reused R x S times.
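The following Python sketch models this weight-stationary dataflow at the functional level: each PE row holds one kernel row stationary, performs a 1-D convolution over one IFM row, and forwards its partial sums to the row below, reproducing the 2-D result that is drained from the last FIFO. It is a functional model under the assumption of an unpadded input; it is not a cycle-accurate description of the RTL, and the function name is ours.

```python
import numpy as np

def conv2d_weight_stationary(ifm, kernel, stride=1):
    """Functional model of the WS dataflow: PE row r holds kernel row r
    stationary, convolves it with one IFM row, and passes its partial sums
    to the next PE row, where they are accumulated."""
    H, W_in = ifm.shape
    R, S = kernel.shape
    E = (H - R) // stride + 1
    F = (W_in - S) // stride + 1
    ofm = np.zeros((E, F))
    for e in range(E):                       # output row produced by the array
        psum = np.zeros(F)                   # partial sums flowing down the PE rows
        for r in range(R):                   # PE row r holds kernel row r
            row = ifm[e * stride + r]
            for f in range(F):               # 1-D convolution of one IFM row
                psum[f] += np.dot(row[f * stride:f * stride + S], kernel[r])
        ofm[e] = psum                        # values drained from the last FIFO
    return ofm

# For a 3-D CONV layer, the per-channel 2-D results produced this way are
# further accumulated across the C input channels, as described in Section V.
```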
V. SYSTEM DESIGN

The PE array performs a 2-D convolution between the kernel and the IFM window, with each row of the PE array executing a 1-D convolution, as described in Fig. 6. Initially, the first pixel, ifm1, from row 1 is pushed into the PE array, at which point psum_in in all PEs is initialized to 0. The result of the multiplication is stored in the register within each PE and passed to the adjacent PE via the psum_in signal of each PE. After W cycles, the first row has been fully read, and the FIFOs are filled, ready to push one value per cycle to the psum_in input of the first PE in the row below. The sliding window shifts downward until all H rows of the IFM are read, at which point the values in FIFO_END represent the result of the 2-D convolution of the kernel (RxS) with the IFM (RxW). However, this is not the final result: in 3-D convolution, accumulation also occurs along the depth dimension, so a buffer of size ExF temporarily stores the 2-D convolution results of each channel. The mux selects input from the FIFO for the first channel's computation and alternates for the other channels.

Fig. 6. 1-D convolution.

A. Fixed-Point Representation

To deploy a neural network model onto hardware, all weights must be converted into fixed-point representations. To determine the exact number of bits needed for the fixed-point representation, we need to identify the output range of each layer in the AlexNet network. The weights in the network layers are trained based on the AlexNet model, which has been fine-tuned for the handwritten digit recognition task on the MNIST dataset. This set of weights achieves a recognition accuracy of 98.62% on the MNIST test set. Figure 8 below visualizes the value range of weights across the layers of the pre-trained network.

Instead of calculating the softmax function directly, one can compute the inverse of the softmax function, which is given by:

1/softmax(x)_i = \sum_{j=1}^{N} e^{x_j - x_i}    (5)
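A small Python sketch of Eq. (5) evaluated with a precomputed exponential LUT is shown below. The LUT range and step, and the restriction to the predicted class (for which every exponent argument x_j - x_i is non-positive), are our assumptions for illustration; the actual table contents used in hardware are those summarized in Table III.

```python
import numpy as np

# Assumed LUT parameters (illustrative only; the hardware table is given in
# Table III): exponent arguments clipped to [-8, 0] and quantized with a
# step of 1/32, mimicking a fixed-point index into the LUT.
LUT_MIN, LUT_STEP = -8.0, 1.0 / 32
EXP_LUT = np.exp(np.arange(LUT_MIN, 0.0 + LUT_STEP, LUT_STEP))

def exp_lut(d):
    """Look up e^d for d <= 0 in the precomputed table."""
    d = max(LUT_MIN, min(0.0, d))
    return EXP_LUT[int(round((d - LUT_MIN) / LUT_STEP))]

def top1_softmax_via_inverse(logits):
    """Evaluate Eq. (5) for the predicted class i* = argmax(logits):
    1/softmax(x)_{i*} = sum_j e^(x_j - x_{i*}). Because x_{i*} is the maximum
    logit, every exponent argument is <= 0, so a LUT covering only
    [LUT_MIN, 0] is sufficient."""
    i_star = int(np.argmax(logits))
    inv = sum(exp_lut(xj - logits[i_star]) for xj in logits)
    return i_star, 1.0 / inv

# Example: class index and its softmax probability for a 10-class output.
print(top1_softmax_via_inverse(np.array([0.5, 2.0, -1.0, 0.0, 1.2, -0.3, 0.7, 1.9, -2.0, 0.1])))
```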
Table II
FIXED-POINT REPRESENTATION FOR DATA

Table III
LOOK-UP TABLE FOR EXPONENTIAL FUNCTION

Class      0        1        2        3        4        5        6        7        8        9        Average
Accuracy   99.18%   99.30%   98.55%   98.02%   97.96%   98.88%   98.23%   99.03%   98.56%   98.41%   98.6%

Table V
ACCURACY OF RTL-BASED RECOGNITION ON FPGA

Class      0        1        2        3        4        5        6        7        8        9        Average
Accuracy   96.67%   96.64%   96.75%   94.44%   94.31%   89.05%   96.34%   98.40%   95.65%   94.51%   95.3%

Table VII
SUMMARY TABLE OF METRICS FOR FINE-TUNED ALEXNET NETWORK
REFERENCES

[1] Y.-H. Chen, T. Krishna, and J. S. Emer, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[2] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, “Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs,” in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.
[3] S. I. Venieris and C.-S. Bouganis, “fpgaConvNet: Automated mapping of convolutional neural networks on FPGAs,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017, pp. 291–292.
[4] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, “Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2017, pp. 45–54.
[5] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA