
IEEE EMBEDDED SYSTEMS LETTERS, VOL. 13, NO. 1, MARCH 2021

Bactran: A Hardware Batch Normalization Implementation for CNN Training Engine

Yang Zhijie, Wang Lei, Luo Li, Li Shiming, Guo Shasha, and Wang Shuquan

Abstract—In recent years, convolutional neural networks (CNNs) have been widely used. However, their ever-increasing number of parameters makes it challenging to train them on GPUs, which is time and energy expensive. This has prompted researchers to turn their attention to training on more energy-efficient hardware. The batch normalization (BN) layer is widely used in state-of-the-art CNNs because it is an indispensable layer in the acceleration of CNN training. As the amount of computation in the convolutional layers declines, its importance continues to increase. However, traditional CNN training accelerators do not pay attention to the efficient hardware implementation of the BN layer. In this letter, we design an efficient CNN training architecture based on a systolic array whose processing elements support the BN functions in both the training process and the inference process. The BN function implemented is an improved, hardware-friendly BN algorithm, range batch normalization (RBN). The experimental results show that the RBN implementation saves 10% of hardware resources and reduces power by 10.1% and delay by 4.6% on average. We implement the accelerator on the field-programmable gate array VU440, and the power consumption of its core computing engine is 8.9 W.

Index Terms—Accelerator, batch normalization, convolutional neural network (CNN), systolic array, training.

Manuscript received November 4, 2019; revised December 15, 2019; accepted February 15, 2020. Date of publication February 19, 2020; date of current version February 26, 2021. This work was supported by the National Key Research and Development Program of China under Grant 2018YFB2202603, and in part by the National Natural Science Foundation of China under Grant 61802427 and Grant 61832018. This manuscript was recommended for publication by Y. Chen. (Corresponding author: Yang Zhijie.) The authors are with the College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]). Digital Object Identifier 10.1109/LES.2020.2975055

I. INTRODUCTION

Compared to inference, training a convolutional neural network (CNN) is more complex and more vital, since it requires much time and energy. As an example of training with GPUs, it takes 29 h to train ResNet-50 on ImageNet with eight Tesla P100 GPUs, each consuming 250 W [6]. If the training were executed only once, its cost could be amortized; however, the variety of datasets leads to frequent training. These facts illustrate the importance of optimizing and accelerating CNN training. Since the high power consumption and long training time of GPU-based training have become intolerable, researchers have turned to more energy-efficient hardware, such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs).

A few works accelerate CNN training with hardware, including Google's TPU V2 [13], an accelerator based on the systolic array and FPGA [2], and wavecore [9]. However, these accelerators have not focused on nonconvolution layers in training.

Accelerating CNN training is inseparable from the support of batch normalization (BN). In 2015, Ioffe and Szegedy [7] proposed adding a BN layer to the CNN model. The BN layer adjusts the distribution of the input data to avoid problems such as gradient explosion or gradient vanishing, and it thus accelerates the convergence of CNN training.

The proportion of computation spent in convolution layers is declining as the kernels of advanced CNNs become smaller and are used more frequently, and the BN layer is the most crucial part of the nonconvolution layers [6]. For example, DenseNet-121 [120 convolution (CONV) layers plus one FC layer] spends 58.5% of its execution time on nonconvolution layers, and when training it, the operations of the BN layer account for more than 90% of all nonconvolution-layer operations. This is because the BN layer computes fundamental statistics such as the mean, variance, and standard deviation, which require a large number of multiplication, division, and square operations and therefore cost much time and power. In contrast, other nonconvolution layers, such as pooling and the nonlinear activation function ReLU, only involve simple comparison operations.

However, to the best of our knowledge, no accelerator architecture supports the BN functions of both CNN training and inference. In prior work, the BN layer calculations could only be handled by software, which introduced much overhead. Although [12] integrated the inference-time BN and the MAC into a digital signal processor (DSP), it did not support BN in training.

Naturally, the complex operations of BN lead to a complicated hardware implementation, especially for BN in training. Fortunately, Banner et al. [1] proposed a simplified method, range batch normalization (RBN), with which the standard deviation can be easily approximated while achieving an effect similar to that of the standard BN implementation. RBN thus eliminates the variance and square-root calculations, which simplifies the hardware implementation.

It is well known that the systolic array supports convolution layers efficiently. By studying the characteristics of BN, we find that the systolic array is also convenient for BN operations, because each processing element (PE) is responsible for calculating one pixel of the output feature map, which is exactly an input of the BN layer. A systolic array with BN function can therefore support both the convolution layers and the most significant nonconvolution layer, BN, which means this design can accelerate the majority of calculations in CNN training and inference.

Therefore, this letter focuses on the efficient hardware implementation of the BN layer in CNN training and inference.

To the best of our knowledge, this is the first time that the BN function has been combined with a systolic array and that both the training and the inference BN functions are implemented in hardware. The main contributions of this letter are as follows.

1) We implement an efficient hardware engine for CNN training. Its core unit is a systolic array with BN function.

2) We implement an efficient PE in the systolic array that supports BN functions in both training and inference using the RBN algorithm, which decreases the hardware implementation overhead and latency when compared with the BN implementation baseline.

The experimental results show that the power of the computing engine is 8.9 W on Xilinx's FPGA VU440. The RBN implementation saves 10.0% of hardware resources, 10.1% of power consumption, and 4.6% of latency on average when compared with the BN implementation baseline.

II. BACKGROUND

A. Batch Normalization

Ioffe and Szegedy [7] first proposed the BN layer. The main idea is to counter the variation of the internal covariate distribution through batch normalization. In other words, batch normalization alleviates the problems of gradient explosion and gradient vanishing during the backpropagation of training, which significantly accelerates the convergence of CNN training.

Algorithm 1 Batch Normalization [7]
Input: values of x over a mini-batch, B = {x_1, ..., x_m}; parameters to be learned: γ, β
Output: {y_i = BN_γ,β(x_i)}
    μ_B = (1/m) · Σ_{i=1..m} x_i                    // mini-batch mean
    σ_B^2 = (1/m) · Σ_{i=1..m} (x_i − μ_B)^2        // mini-batch variance
    x̂_i = (x_i − μ_B) / sqrt(σ_B^2 + ε)             // normalize
    y_i = γ · x̂_i + β ≡ BN_γ,β(x_i)                 // scale and shift

Batch normalization's operations are shown in Algorithm 1. Typically, the BN layer is placed after the convolutional layer and before the activation layer. The input of the BN layer is the accumulated sum computed by the previous convolutional layer on each of the feature maps in one mini batch, i.e., x_1, ..., x_m. The parameters of the BN layer are γ and β; they are updated by the chain rule of the backpropagation algorithm at the end of each mini-batch computation. The parameters obtained by training with different datasets differ, because the distributions of the datasets differ. The output of the BN layer is y_1, ..., y_m, obtained by normalizing the input x_1, ..., x_m. The entire process redirects the distribution of the data toward the linear region of the activation function, thereby alleviating the problems of gradient explosion and gradient vanishing.
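For reference, the forward pass of Algorithm 1 can be written as a short software golden model of the kind commonly used to check an RTL implementation. The NumPy sketch below is illustrative only; the epsilon value, shapes, and function name are our assumptions, not taken from the letter.

import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Reference forward pass of Algorithm 1 over one mini-batch.

    x: array of shape (m,) holding the m accumulated convolution results
       that feed one BN input; gamma, beta: learned scalars.
    """
    mu = x.mean()                           # mini-batch mean
    var = ((x - mu) ** 2).mean()            # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift

# Example: a mini batch of 64 values, matching the batch size used in the letter.
y = bn_forward(np.random.randn(64).astype(np.float32), gamma=1.0, beta=0.0)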
B. Range Batch Normalization

Banner et al. [1] first proposed the RBN idea. The main idea is that the standard deviation σ can be approximated by the product of an adjustment parameter C(n) and the range of the input values, because if the input is assumed to follow a Gaussian distribution, the range of the input is highly correlated with the standard deviation σ

    x̂_i = (x_i − μ) / (C(n) · range(x_i − μ)).    (1)

As shown in (1), C(n) = sqrt(2 · ln(N)), where N is the mini-batch size, and range(x) = max(x) − min(x). With RBN, we do not need to calculate the variance. RBN not only avoids data overflow in the variance computation but also eliminates a large number of multiplications and the square-root operation. Thus, the RBN implementation saves hardware resources.
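Under the same assumptions as the BN sketch above, the range-based approximation in (1) takes only a few lines. This is an illustrative NumPy model, not the letter's RTL; the constant C(n) = sqrt(2 · ln(N)) is taken directly from the letter's description.

import numpy as np

def rbn_forward(x, gamma, beta):
    """Illustrative range batch normalization (RBN) over one mini-batch.

    The standard deviation is approximated by C(n) * range(x - mu), as in (1),
    so no variance accumulation or square root over the data is needed.
    """
    n = x.shape[0]
    mu = x.mean()
    # C(n) depends only on the mini-batch size, so in hardware it can be a
    # precomputed constant rather than a runtime square root.
    c_n = np.sqrt(2.0 * np.log(n))
    rng = (x - mu).max() - (x - mu).min()   # range(x - mu) = max - min
    x_hat = (x - mu) / (c_n * rng)          # normalize using the range-based std
    return gamma * x_hat + beta             # scale and shift, as in standard BN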
III. ARCHITECTURE

A. Top Design

Fig. 1. Block diagram of the overall architecture.

The overall architecture is shown in Fig. 1. It contains a unified buffer, a weight buffer, a buffer for inputs and results, a systolic array, and a backpropagation (BP) engine.

The entire system works as follows. Through the data bus, data are exchanged between the external host and the on-chip unified buffer using the AXI protocol. After being rearranged by the data setup unit, the data are mapped into the corresponding buffer. Once the computation begins, the data are transferred from the buffers to the array for processing. The operations include MAC and BN. The results flow out of the systolic array and are stored in the buffers. Finally, they are transferred to the outside through the unified buffer.

During the training process, the backpropagation engine is responsible for calculating errors and gradients and for updating weights. When performing error propagation, the BP engine leaves the gradient calculation (mainly matrix multiplication) to the systolic array, and the systolic array transfers the results back to the BP engine to complete the subsequent error propagation.

B. Data Mapping

Because the core computation unit is a systolic array, it can efficiently compute convolutions during both inference and training. Data need to be rearranged and mapped before being injected into the systolic array to ensure correctness. First, we use the data arrangement strategy in [14], which can significantly reduce data-exchange traffic and improve data reuse. We then use a series of mapping strategies [15] based on the output-stationary (OS) dataflow [11] to map the data onto our fixed-size systolic array.


C. Processing Element

Fig. 2. PE with MAC and BN. The numerical serial numbers indicate the sequence of the training process. The part with the red background, called PE, performs the MAC, and the part with the blue background, called PE_BN, performs the BN function.

The PE_BN, which integrates BN and MAC, is shown in Fig. 2. It has three modes: 1) training with BN; 2) inference with BN; and 3) inference or training without BN. Through the selection of the data path, the PE_BN can switch among these three modes.

Mode 1: When the mode is training with BN, the sequence of the entire process is indicated by the numerical serial numbers in Fig. 2. The results of the MAC operations in the forward pass of training are temporarily stored in x_ram. Then the sum is accumulated, and the maximum and minimum of the input data are updated and stored. After all the inputs have been stored, PE_BN calculates their mean and standard deviation. After that, the stored results x_i are read from x_ram, normalized, and output to the buffers. At the same time, we maintain a global mean and a global variance of all activation distributions, which are updated with the partial activation distribution of each mini batch using an exponential moving average. Because there is only one input feature map at a time in the inference process, we use the mean and variance accumulated globally throughout training as its normalization parameters. Finally, the parameters γ and β of the BN layer are updated according to the chain rule in the backpropagation process.

Mode 2: When the working mode is inference with BN, the input data are processed with the MAC and BN functions and are transported out of the systolic array.

Mode 3: When the BN function is not required in inference or training, the entire PE_BN acts as a MAC unit. Only the PE part works, and there is no difference between it and a systolic array that accelerates only the convolution layers.

Note that the difference between BN in training and BN in inference is that, in each mini batch during training, the parameters of the BN layer, i.e., γ and β, are updated in the BP engine according to the chain rule, while they are not changed in inference.

The PE_BN is pipelined into four stages. The first three stages correspond to mode 1 and work at 50 MHz; they perform the tasks of calculating the standard deviation and mean, the normalized values, and the global standard deviation and global mean in the training process, respectively. The last stage corresponds to modes 2 and 3 and works at 100 MHz; through data path selection, it performs the inference-time BN (mode 2) or the MAC (mode 3), respectively.

Thus, not all of the modules in PE_BN work when PE_BN is triggered; most of the time, only some of the pipeline stages are executed in a loop. Compared with the PE, PE_BN is triggered only when the PE has finished calculating a pixel of the output feature map. If we define a trigger as any of the stages being executed, then its trigger probability is

    (1 / kernel_size^2) * trigger_probability_of_PE.    (2)

For a 3 × 3 kernel, for example, PE_BN is triggered once for every nine PE triggers. Therefore, although the implementation of BN brings power overhead, its triggering probability is much lower than that of the PE.
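As a behavioral reference for mode 1, the per-PE bookkeeping described above (store each forward MAC result in x_ram, track the running max/min, then normalize once the mini batch is complete) can be sketched in software. The class below is a hypothetical Python model, not the RTL: the 64-entry buffer and the range-based finalization follow the letter's description, while names such as push and finalize are ours, and the global mean/variance bookkeeping and the γ/β update (done in the BP engine) are omitted.

import numpy as np

class PEBNModel:
    """Behavioral sketch of one PE_BN in training mode (mode 1)."""

    def __init__(self, mini_batch=64):
        self.mini_batch = mini_batch
        self.x_ram = []              # one MAC result per image in the mini batch
        self.x_min = float("inf")
        self.x_max = float("-inf")

    def push(self, mac_result):
        """Forward pass: store the MAC result and update the running max/min."""
        self.x_ram.append(mac_result)
        self.x_min = min(self.x_min, mac_result)
        self.x_max = max(self.x_max, mac_result)

    def finalize(self, gamma, beta):
        """After the whole mini batch is stored, normalize with the RBN statistics."""
        assert len(self.x_ram) == self.mini_batch
        x = np.asarray(self.x_ram, dtype=np.float32)
        mu = x.mean()
        c_n = np.sqrt(2.0 * np.log(self.mini_batch))  # precomputable constant C(n)
        sigma = c_n * (self.x_max - self.x_min)       # range-based std approximation
        return gamma * (x - mu) / sigma + beta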
IV. EVALUATION

A. Experimental Setup

The experimental platform used in this letter is Xilinx's FPGA VU440, and the whole design is implemented in RTL-level code. In this implementation, all operands are 16 bits wide, and x_ram is 128 B because it needs to store 64 (the mini-batch size) 16-bit results. Each PE stores only the elements at the same position of the activation vector within a mini batch, and the whole activation vector is distributed across multiple PEs. In general, the x_ram size is calculated as

    x_ram_size = operand_width * mini_batch.    (3)

According to the functions integrated in the PE, we implement three types of systolic array: SA_MAC with the MAC function only, SA_BN with the BN and MAC functions, and SA_RBN with the RBN and MAC functions. To show the additional resource overhead of implementing BN, we implement these three structures and compare their synthesis results and performance. Synthesis results show that the maximum array size that does not exceed the on-chip resources is 32 × 16, so this size is selected as our array size; it provides a reference for future work on CNN training acceleration.

We also select three layers that contain BN operations from each of three state-of-the-art CNNs: MobileNet V1 [4], DenseNet-121 [5], and ResNet-18 [3]. We test them on the systolic arrays integrating BN and RBN, respectively, and compare the performance of the two implementations. We set the mini-batch size to 64 for testing.
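As a quick sanity check of (3) with the letter's parameters (16-bit operands, mini batch of 64), the lines below reproduce the 128-B x_ram figure; the helper name is ours, not from the letter.

def x_ram_size_bytes(operand_width_bits: int, mini_batch: int) -> int:
    """x_ram_size = operand_width * mini_batch, converted from bits to bytes."""
    return operand_width_bits * mini_batch // 8

assert x_ram_size_bytes(16, 64) == 128  # matches the 128-B x_ram in the letter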
B. Experimental Results

1) Synthesis Results: As shown in Table I, we are the first to implement a dedicated BN engine that supports the BN function in both CNN training and inference. The on-chip resource utilization and power of the three systolic array implementations and their surrounding RAMs are compared in Table I. Compared with the BN implementation, RBN reduces both the hardware resource requirements and the power consumption.

In addition, the power of the BP engine is 0.4 W, and the power of the unified buffer is 1.4 W. Because most of the computations in the backpropagation process (matrix multiplication) are done by the systolic array in this design, the BP engine consumes less power than in traditional designs. As shown in Fig. 3, due to bandwidth limitations, most of the energy is still consumed by data movement.


TABLE I. On-chip resource utilization and power consumption of the w/o-BN, BN, and RBN implementations.

Fig. 3. Energy breakdown of modules over layer types.

2) Performance: To compare the performance of BN and RBN in terms of delay, we test the layers containing BN from MobileNet, DenseNet, and ResNet on the two implementations and measure the number of clock cycles they require. As shown in Fig. 4, RBN performs better than BN in all three cases, because in the RBN computation, once all the accumulated sums in a mini batch are obtained, the mean and standard deviation are available at once. In BN, by contrast, it is necessary to wait for the mean to be calculated and then read the saved inputs from x_ram to calculate the variance; only after the variance is obtained can the standard deviation finally be computed.

Fig. 4. Normalized running time of BN and RBN implementations over part of the layers in MobileNet V1, DenseNet-121, and ResNet-18.

TABLE II. Comparison of training performance with related work.

As a result, the RBN implementation saves 10.0% of hardware resources, 10.1% of power consumption, and 4.6% of latency on average when compared with the BN implementation. Compared to a CPU (Core i7-8750H, 2.2 GHz, 45 W), our design achieves a maximum speedup of 17.9× on ResNet-18, a minimum speedup of 7.8× on MobileNet V1, and a 64× energy-efficiency improvement. The performance comparison between this letter and related work is shown in Table II. The work in [10] uses more (nearly 4×) logic to accelerate convolution, so in MobileNet and ResNet, where the BN layer computation does not exceed the convolution, this letter is not as good as theirs.
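The cycle-count argument above can be made concrete with a toy latency model: standard BN needs a second pass over x_ram to form the variance after the mean is known, while RBN finishes its statistics in the same pass that accumulates the sum and tracks the max/min. The numbers below are illustrative only (one read or accumulate per cycle, all other bookkeeping ignored); they are not measurements from the letter.

def bn_stats_cycles(m: int) -> int:
    """Toy model: pass 1 accumulates the sum (m cycles), pass 2 re-reads x_ram
    to accumulate (x - mu)^2 for the variance (m cycles), plus one square root."""
    return m + m + 1

def rbn_stats_cycles(m: int) -> int:
    """Toy model: a single pass accumulates the sum and updates max/min (m cycles);
    the range-based std needs no second pass and no square root over the data."""
    return m + 1

m = 64  # mini-batch size used in the letter's experiments
print(bn_stats_cycles(m), rbn_stats_cycles(m))  # 129 vs 65 cycles in this toy model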
V. CONCLUSION

Because the BN layer takes a large proportion of the total CNN training time and energy consumption, it is necessary to develop a more energy-efficient hardware solution to replace the software-based method. In this letter, we focused on the efficient implementation of BN in CNN training, since BN is the most significant nonconvolution layer and an indispensable part of training. We implement an 8.9-W accelerator for CNN training whose core unit is a systolic array with BN function, and the BN algorithm used is an improved algorithm, RBN. The design is implemented in RTL-level code on the FPGA VU440 platform. The experimental results show that RBN has more advantages than BN, as validated by lower hardware resource utilization, lower latency, and lower power consumption.

REFERENCES

[1] R. Banner, I. Hubara, E. Hoffer, and D. Soudry, "Scalable methods for 8-bit training of neural networks," in Proc. Neural Inf. Process. Syst., 2018, pp. 5145–5153.
[2] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1737–1746.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[4] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017. [Online]. Available: arXiv:1704.04861.
[5] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 2261–2269.
[6] W. Jung, D. Jung, B. Kim, S. Lee, W. Rhee, and J. H. Ahn, "Restructuring batch normalization to accelerate CNN training," 2018. [Online]. Available: arXiv:1807.01702.
[7] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., 2015, pp. 448–456.
[8] Z. Liu, Y. Dou, J. Jiang, Q. Wang, and P. Chow, "An FPGA-based processor for training convolutional neural networks," in Proc. Int. Conf. Field Program. Technol. (ICFPT), 2017, pp. 207–210.
[9] S. Lym, A. Behroozi, W. Wen, G. Li, Y. Kwon, and M. Erez, "Mini-batch serialization: CNN training with inter-layer data reuse," 2018. [Online]. Available: arXiv:1810.00307.
[10] C. Luo, M.-K. Sit, H. Fan, S. Liu, W. Luk, and C. Guo, "Towards efficient deep neural network training by FPGA-based batch-level parallelism," in Proc. 27th IEEE Int. Symp. Field Program. Custom Comput. Mach. (FCCM), 2019, pp. 45–52.
[11] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "SCALE-Sim: Systolic CNN accelerator simulator," 2018. [Online]. Available: arXiv:1811.02883v2.
[12] T. Sledevic, "Adaptation of convolution and batch normalization layer for CNN implementation on FPGA," in Proc. Open Conf. Elect. Electron. Inf. Sci. (eStream), 2019, pp. 1–4.
[13] D. Tran et al., "Simple, distributed, and accelerated probabilistic programming," in Proc. Neural Inf. Process. Syst., 2018, pp. 7598–7609.
[14] S. Wang et al., "PRTSM: Hardware data arrangement mechanisms for convolutional layer computation on the systolic array," in Proc. Netw. Parallel Comput., 2019, pp. 69–81.
[15] Z. Yang et al., "Systolic array based accelerator and algorithm mapping for deep learning algorithms," in Proc. Netw. Parallel Comput., 2018, pp. 153–158.

