
IEEE EMBEDDED SYSTEMS LETTERS, VOL. 13, NO. 1, MARCH 2021

Bactran: A Hardware Batch Normalization Implementation for CNN Training Engine

Yang Zhijie, Wang Lei, Luo Li, Li Shiming, Guo Shasha, and Wang Shuquan

Abstract—In recent years, convolutional neural networks (CNNs) have been widely used. However, their ever-increasing number of parameters makes it challenging to train them on GPUs, which is time and energy expensive. This has prompted researchers to turn their attention to training on more energy-efficient hardware. The batch normalization (BN) layer is widely used in state-of-the-art CNNs because it is an indispensable layer in the acceleration of CNN training. As the amount of computation in the convolutional layers declines, its importance continues to increase. However, traditional CNN training accelerators do not pay attention to the efficient hardware implementation of the BN layer. In this letter, we design an efficient CNN training architecture based on a systolic array whose processing elements support the BN functions in both the training process and the inference process. The BN function implemented is an improved, hardware-friendly BN algorithm, range batch normalization (RBN). The experimental results show that the RBN implementation saves 10% of hardware resources and reduces power by 10.1% and delay by 4.6% on average. We implement the accelerator on the field-programmable gate array VU440, and the power consumption of its core computing engine is 8.9 W.

Index Terms—Accelerator, batch normalization, convolutional neural network (CNN), systolic array, training.

Manuscript received November 4, 2019; revised December 15, 2019; accepted February 15, 2020. Date of publication February 19, 2020; date of current version February 26, 2021. This work was supported by the National Key Research and Development Program of China under Grant 2018YFB2202603, and in part by the National Natural Science Foundation of China under Grant 61802427 and Grant 61832018. This manuscript was recommended for publication by Y. Chen. (Corresponding author: Yang Zhijie.) The authors are with the College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]). Digital Object Identifier 10.1109/LES.2020.2975055

I. INTRODUCTION

Compared to inference, training a convolutional neural network (CNN) is more complex and more vital, since it requires much time and energy. As an example of training with GPUs, it takes 29 h to train ResNet-50 on ImageNet with eight Tesla P100 GPUs, each consuming 250 W [6]. If the training were executed only once, its cost could be amortized; however, the variety of datasets leads to frequent training. These facts illustrate the importance of optimizing and accelerating CNN training. Since the high power consumption and long training time of GPU-based training have become intolerable, researchers have turned to more energy-efficient hardware, such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs).

A few works accelerate CNN training with hardware, including Google's TPU V2 [13], an accelerator based on the systolic array and FPGA [2], and wavecore [9]. However, these accelerators have not focused on nonconvolution layers in training.

Accelerating CNN training is inseparable from the support of batch normalization (BN). In 2015, Ioffe and Szegedy [7] proposed adding a BN layer to the CNN model. The BN layer adjusts the distribution of the input data to avoid problems such as gradient explosion or gradient vanishing, and it thus accelerates the convergence of CNN training.

The proportion of computation spent in convolution layers is declining as the kernels of advanced CNNs become smaller and are used more frequently, and the BN layer is the most crucial part of the nonconvolution layers [6]. For example, DenseNet-121 [120 convolution (CONV) layers plus one FC layer] spends 58.5% of its execution time on nonconvolution layers, and when training it, the operations of the BN layer account for more than 90% of all nonconvolution-layer operations. This is because the BN layer computes fundamental statistics such as the mean, variance, and standard deviation, which require a large number of multiplication, division, and square operations and therefore cost much time and power. In contrast, other nonconvolution layers, such as pooling and the nonlinear activation function ReLU, only involve simple comparison operations.

However, to the best of our knowledge, no accelerator architecture supports the BN functions of both CNN training and inference. In prior work, the BN layer calculations could only be handled by software, which introduced much overhead. Although [12] integrated the inference-time BN and the MAC into a digital signal processor (DSP), it did not support BN in training.

Naturally, the complex operations of BN lead to a complicated hardware implementation, especially for BN in training. Fortunately, Banner et al. [1] proposed a simplified method, range batch normalization (RBN), with which the standard deviation can be easily approximated while achieving an effect similar to that of the standard BN implementation. RBN thus eliminates the variance and square-root calculations, which simplifies the hardware implementation.

It is well known that the systolic array supports convolution layers efficiently. By studying the characteristics of BN, we find that the systolic array is also convenient for BN operations, because each processing element (PE) is responsible for calculating one pixel of the output feature map, which is exactly an input of the BN layer. A systolic array with BN function can therefore support both the convolution layers and the most significant nonconvolution layer, BN, which means this design can accelerate the majority of calculations in CNN training and inference.

Therefore, this letter focuses on the efficient hardware implementation of the BN layer in CNN training and inference.

To the best of our knowledge, this is the first time that the BN function has been combined with a systolic array and that both the training and the inference BN functions are implemented in hardware. The main contributions of this letter are as follows.

1) We implement an efficient hardware engine for CNN training. Its core unit is a systolic array with BN function.

2) We implement an efficient PE in the systolic array that supports BN functions in both training and inference using the RBN algorithm, which decreases the hardware implementation overhead and latency when compared with the BN implementation baseline.

The experimental results show that the power of the computing engine is 8.9 W on Xilinx's FPGA VU440. The RBN implementation saves 10.0% of hardware resources, 10.1% of power consumption, and 4.6% of latency on average when compared with the BN implementation baseline.

II. BACKGROUND

A. Batch Normalization

Ioffe and Szegedy [7] first proposed the BN layer. The main idea is to counter the variation of the internal covariate distribution through batch normalization. In other words, batch normalization alleviates the problems of gradient explosion and gradient vanishing during the backpropagation of training, which significantly accelerates the convergence of CNN training.

Algorithm 1 Batch Normalization [7]
Input: values of x over a mini-batch, B = {x_1, ..., x_m}; parameters to be learned: γ, β
Output: {y_i = BN_γ,β(x_i)}
    μ_B = (1/m) · Σ_{i=1..m} x_i                    // mini-batch mean
    σ_B^2 = (1/m) · Σ_{i=1..m} (x_i − μ_B)^2        // mini-batch variance
    x̂_i = (x_i − μ_B) / sqrt(σ_B^2 + ε)             // normalize
    y_i = γ · x̂_i + β ≡ BN_γ,β(x_i)                 // scale and shift

Batch normalization's operations are shown in Algorithm 1. Typically, the BN layer is placed after the convolutional layer and before the activation layer. The input of the BN layer is the accumulated sum computed by the previous convolutional layer on each of the feature maps in one mini batch, i.e., x_1, ..., x_m. The parameters of the BN layer are γ and β; they are updated by the chain rule of the backpropagation algorithm at the end of each mini-batch computation. The parameters obtained by training with different datasets differ, because the distributions of the datasets differ. The output of the BN layer is y_1, ..., y_m, obtained by normalizing the input x_1, ..., x_m. The entire process redirects the distribution of the data toward the linear region of the activation function, thereby alleviating the problems of gradient explosion and gradient vanishing.
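For reference, the forward pass of Algorithm 1 can be written as a short software golden model of the kind commonly used to check an RTL implementation. The NumPy sketch below is illustrative only; the epsilon value, shapes, and function name are our assumptions, not taken from the letter.

import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Reference forward pass of Algorithm 1 over one mini-batch.

    x: array of shape (m,) holding the m accumulated convolution results
       that feed one BN input; gamma, beta: learned scalars.
    """
    mu = x.mean()                           # mini-batch mean
    var = ((x - mu) ** 2).mean()            # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # scale and shift

# Example: a mini batch of 64 values, matching the batch size used in the letter.
y = bn_forward(np.random.randn(64).astype(np.float32), gamma=1.0, beta=0.0)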
B. Range Batch Normalization

Banner et al. [1] first proposed the RBN idea. The main idea is that the standard deviation σ can be approximated by the product of an adjustment parameter C(n) and the range of the input values, because if the input is assumed to follow a Gaussian distribution, the range of the input is highly correlated with the standard deviation σ

    x̂_i = (x_i − μ) / (C(n) · range(x_i − μ)).    (1)

As shown in (1), C(n) = sqrt(2 · ln(N)), where N is the mini-batch size, and range(x) = max(x) − min(x). With RBN, we do not need to calculate the variance. RBN not only avoids data overflow in the variance computation but also eliminates a large number of multiplications and the square-root operation. Thus, the RBN implementation saves hardware resources.
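Under the same assumptions as the BN sketch above, the range-based approximation in (1) takes only a few lines. This is an illustrative NumPy model, not the letter's RTL; the constant C(n) = sqrt(2 · ln(N)) is taken directly from the letter's description.

import numpy as np

def rbn_forward(x, gamma, beta):
    """Illustrative range batch normalization (RBN) over one mini-batch.

    The standard deviation is approximated by C(n) * range(x - mu), as in (1),
    so no variance accumulation or square root over the data is needed.
    """
    n = x.shape[0]
    mu = x.mean()
    # C(n) depends only on the mini-batch size, so in hardware it can be a
    # precomputed constant rather than a runtime square root.
    c_n = np.sqrt(2.0 * np.log(n))
    rng = (x - mu).max() - (x - mu).min()   # range(x - mu) = max - min
    x_hat = (x - mu) / (c_n * rng)          # normalize using the range-based std
    return gamma * x_hat + beta             # scale and shift, as in standard BN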
III. ARCHITECTURE

A. Top Design

Fig. 1. Block diagram of the overall architecture.

The overall architecture is shown in Fig. 1. It contains a unified buffer, a weight buffer, a buffer for inputs and results, a systolic array, and a backpropagation (BP) engine.

The entire system works as follows. Through the data bus, data are exchanged between the external host and the on-chip unified buffer using the AXI protocol. After being rearranged by the data setup unit, the data are mapped into the corresponding buffer. Once the computation begins, the data are transferred from the buffers to the array for processing. The operations include MAC and BN. The results flow out of the systolic array and are stored in the buffers. Finally, they are transferred to the outside through the unified buffer.

During the training process, the backpropagation engine is responsible for calculating errors and gradients and for updating weights. When performing error propagation, the BP engine leaves the gradient calculation (mainly matrix multiplication) to the systolic array, and the systolic array transfers the results back to the BP engine to complete the subsequent error propagation.

B. Data Mapping

Because the core computation unit is a systolic array, it can efficiently compute convolutions during both inference and training. Data need to be rearranged and mapped before being injected into the systolic array to ensure correctness. First, we use the data arrangement strategy in [14], which can significantly reduce data-exchange traffic and improve data reuse. We then use a series of mapping strategies [15] based on the output-stationary (OS) dataflow [11] to map the data onto our fixed-size systolic array.


C. Processing Element

Fig. 2. PE with MAC and BN. The numerical serial numbers indicate the sequence of the training process. The part with the red background, called PE, performs the MAC, and the part with the blue background, called PE_BN, performs the BN function.

The PE_BN, which integrates BN and MAC, is shown in Fig. 2. It has three modes: 1) training with BN; 2) inference with BN; and 3) inference or training without BN. Through the selection of the data path, the PE_BN can switch among these three modes.

Mode 1: When the mode is training with BN, the sequence of the entire process is indicated by the numerical serial numbers in Fig. 2. The results of the MAC operations in the forward pass of training are temporarily stored in x_ram. Then the sum is accumulated, and the maximum and minimum of the input data are updated and stored. After all the inputs have been stored, PE_BN calculates their mean and standard deviation. After that, the stored results x_i are read from x_ram, normalized, and output to the buffers. At the same time, we maintain a global mean and a global variance of all activation distributions, which are updated with the partial activation distribution of each mini batch using an exponential moving average. Because there is only one input feature map at a time in the inference process, we use the mean and variance accumulated globally throughout training as its normalization parameters. Finally, the parameters γ and β of the BN layer are updated according to the chain rule in the backpropagation process.

Mode 2: When the working mode is inference with BN, the input data are processed with the MAC and BN functions and are transported out of the systolic array.

Mode 3: When the BN function is not required in inference or training, the entire PE_BN acts as a MAC unit. Only the PE part works, and there is no difference between it and a systolic array that accelerates only the convolution layers.

Note that the difference between BN in training and BN in inference is that, in each mini batch during training, the parameters of the BN layer, i.e., γ and β, are updated in the BP engine according to the chain rule, while they are not changed in inference.

The PE_BN is pipelined into four stages. The first three stages correspond to mode 1 and work at 50 MHz; they perform the tasks of calculating the standard deviation and mean, the normalized values, and the global standard deviation and global mean in the training process, respectively. The last stage corresponds to modes 2 and 3 and works at 100 MHz; through data path selection, it performs the inference-time BN (mode 2) or the MAC (mode 3), respectively.

Thus, not all of the modules in PE_BN work when PE_BN is triggered; most of the time, only some of the pipeline stages are executed in a loop. Compared with the PE, PE_BN is triggered only when the PE has finished calculating a pixel of the output feature map. If we define a trigger as any of the stages being executed, then its trigger probability is

    (1 / kernel_size^2) * trigger_probability_of_PE.    (2)

For a 3 × 3 kernel, for example, PE_BN is triggered once for every nine PE triggers. Therefore, although the implementation of BN brings power overhead, its triggering probability is much lower than that of the PE.
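As a behavioral reference for mode 1, the per-PE bookkeeping described above (store each forward MAC result in x_ram, track the running max/min, then normalize once the mini batch is complete) can be sketched in software. The class below is a hypothetical Python model, not the RTL: the 64-entry buffer and the range-based finalization follow the letter's description, while names such as push and finalize are ours, and the global mean/variance bookkeeping and the γ/β update (done in the BP engine) are omitted.

import numpy as np

class PEBNModel:
    """Behavioral sketch of one PE_BN in training mode (mode 1)."""

    def __init__(self, mini_batch=64):
        self.mini_batch = mini_batch
        self.x_ram = []              # one MAC result per image in the mini batch
        self.x_min = float("inf")
        self.x_max = float("-inf")

    def push(self, mac_result):
        """Forward pass: store the MAC result and update the running max/min."""
        self.x_ram.append(mac_result)
        self.x_min = min(self.x_min, mac_result)
        self.x_max = max(self.x_max, mac_result)

    def finalize(self, gamma, beta):
        """After the whole mini batch is stored, normalize with the RBN statistics."""
        assert len(self.x_ram) == self.mini_batch
        x = np.asarray(self.x_ram, dtype=np.float32)
        mu = x.mean()
        c_n = np.sqrt(2.0 * np.log(self.mini_batch))  # precomputable constant C(n)
        sigma = c_n * (self.x_max - self.x_min)       # range-based std approximation
        return gamma * (x - mu) / sigma + beta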
IV. EVALUATION

A. Experimental Setup

The experimental platform used in this letter is Xilinx's FPGA VU440, and the whole design is implemented in RTL-level code. In this implementation, all operands are 16 bits wide, and x_ram is 128 B because it needs to store 64 (the mini-batch size) 16-bit results. Each PE stores only the elements at the same position of the activation vector within a mini batch, and the whole activation vector is distributed across multiple PEs. In general, the x_ram size is calculated as

    x_ram_size = operand_width * mini_batch.    (3)

According to the functions integrated in the PE, we implement three types of systolic array: SA_MAC with the MAC function only, SA_BN with the BN and MAC functions, and SA_RBN with the RBN and MAC functions. To show the additional resource overhead of implementing BN, we implement these three structures and compare their synthesis results and performance. Synthesis results show that the maximum array size that does not exceed the on-chip resources is 32 × 16, so this size is selected as our array size; it provides a reference for future work on CNN training acceleration.

We also select three layers that contain BN operations from each of three state-of-the-art CNNs: MobileNet V1 [4], DenseNet-121 [5], and ResNet-18 [3]. We test them on the systolic arrays integrating BN and RBN, respectively, and compare the performance of the two implementations. We set the mini-batch size to 64 for testing.
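As a quick sanity check of (3) with the letter's parameters (16-bit operands, mini batch of 64), the lines below reproduce the 128-B x_ram figure; the helper name is ours, not from the letter.

def x_ram_size_bytes(operand_width_bits: int, mini_batch: int) -> int:
    """x_ram_size = operand_width * mini_batch, converted from bits to bytes."""
    return operand_width_bits * mini_batch // 8

assert x_ram_size_bytes(16, 64) == 128  # matches the 128-B x_ram in the letter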
B. Experimental Results

1) Synthesis Results: As shown in Table I, we are the first to implement a dedicated BN engine that supports the BN function in both CNN training and inference. The on-chip resource utilization and power of the three systolic array implementations and their surrounding RAMs are compared in Table I. Compared with the BN implementation, RBN reduces both the hardware resource requirements and the power consumption.

In addition, the power of the BP engine is 0.4 W, and the power of the unified buffer is 1.4 W. Because most of the computations in the backpropagation process (matrix multiplication) are done by the systolic array in this design, the BP engine consumes less power than in traditional designs. As shown in Fig. 3, due to bandwidth limitations, most of the energy is still consumed by data movement.


TABLE I. On-chip resource utilization and power consumption of the w/o-BN, BN, and RBN implementations.

Fig. 3. Energy breakdown of modules over layer types.

2) Performance: To compare the performance of BN and RBN in terms of delay, we test the layers containing BN from MobileNet, DenseNet, and ResNet on the two implementations and measure the number of clock cycles they require. As shown in Fig. 4, RBN performs better than BN in all three cases, because in the RBN computation, once all the accumulated sums in a mini batch are obtained, the mean and standard deviation are available at once. In BN, by contrast, it is necessary to wait for the mean to be calculated and then read the saved inputs from x_ram to calculate the variance; only after the variance is obtained can the standard deviation finally be computed.

Fig. 4. Normalized running time of BN and RBN implementations over part of the layers in MobileNet V1, DenseNet-121, and ResNet-18.

TABLE II. Comparison of training performance with related work.

As a result, the RBN implementation saves 10.0% of hardware resources, 10.1% of power consumption, and 4.6% of latency on average when compared with the BN implementation. Compared to a CPU (Core i7-8750H, 2.2 GHz, 45 W), our design achieves a maximum speedup of 17.9× on ResNet-18, a minimum speedup of 7.8× on MobileNet V1, and a 64× energy-efficiency improvement. The performance comparison between this letter and related work is shown in Table II. The work in [10] uses more (nearly 4×) logic to accelerate convolution, so in MobileNet and ResNet, where the BN layer computation does not exceed the convolution, this letter is not as good as theirs.
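The cycle-count argument above can be made concrete with a toy latency model: standard BN needs a second pass over x_ram to form the variance after the mean is known, while RBN finishes its statistics in the same pass that accumulates the sum and tracks the max/min. The numbers below are illustrative only (one read or accumulate per cycle, all other bookkeeping ignored); they are not measurements from the letter.

def bn_stats_cycles(m: int) -> int:
    """Toy model: pass 1 accumulates the sum (m cycles), pass 2 re-reads x_ram
    to accumulate (x - mu)^2 for the variance (m cycles), plus one square root."""
    return m + m + 1

def rbn_stats_cycles(m: int) -> int:
    """Toy model: a single pass accumulates the sum and updates max/min (m cycles);
    the range-based std needs no second pass and no square root over the data."""
    return m + 1

m = 64  # mini-batch size used in the letter's experiments
print(bn_stats_cycles(m), rbn_stats_cycles(m))  # 129 vs 65 cycles in this toy model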
V. CONCLUSION

Because the BN layer takes a large proportion of the total CNN training time and energy consumption, it is necessary to develop a more energy-efficient hardware solution to replace the software-based method. In this letter, we focused on the efficient implementation of BN in CNN training, since BN is the most significant nonconvolution layer and an indispensable part of training. We implement an 8.9-W accelerator for CNN training whose core unit is a systolic array with BN function, and the BN algorithm used is an improved algorithm, RBN. The design is implemented in RTL-level code on the FPGA VU440 platform. The experimental results show that RBN has more advantages than BN, as validated by lower hardware resource utilization, lower latency, and lower power consumption.

REFERENCES

[1] R. Banner, I. Hubara, E. Hoffer, and D. Soudry, "Scalable methods for 8-bit training of neural networks," in Proc. Neural Inf. Process. Syst., 2018, pp. 5145–5153.
[2] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1737–1746.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[4] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017. [Online]. Available: arXiv:1704.04861.
[5] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 2261–2269.
[6] W. Jung, D. Jung, B. Kim, S. Lee, W. Rhee, and J. H. Ahn, "Restructuring batch normalization to accelerate CNN training," 2018. [Online]. Available: arXiv:1807.01702.
[7] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., 2015, pp. 448–456.
[8] Z. Liu, Y. Dou, J. Jiang, Q. Wang, and P. Chow, "An FPGA-based processor for training convolutional neural networks," in Proc. Int. Conf. Field Program. Technol. (ICFPT), 2017, pp. 207–210.
[9] S. Lym, A. Behroozi, W. Wen, G. Li, Y. Kwon, and M. Erez, "Mini-batch serialization: CNN training with inter-layer data reuse," 2018. [Online]. Available: arXiv:1810.00307.
[10] C. Luo, M.-K. Sit, H. Fan, S. Liu, W. Luk, and C. Guo, "Towards efficient deep neural network training by FPGA-based batch-level parallelism," in Proc. 27th IEEE Int. Symp. Field Program. Custom Comput. Mach. (FCCM), 2019, pp. 45–52.
[11] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "SCALE-Sim: Systolic CNN accelerator simulator," 2018. [Online]. Available: arXiv:1811.02883v2.
[12] T. Sledevic, "Adaptation of convolution and batch normalization layer for CNN implementation on FPGA," in Proc. Open Conf. Elect. Electron. Inf. Sci. (eStream), 2019, pp. 1–4.
[13] D. Tran et al., "Simple, distributed, and accelerated probabilistic programming," in Proc. Neural Inf. Process. Syst., 2018, pp. 7598–7609.
[14] S. Wang et al., "PRTSM: Hardware data arrangement mechanisms for convolutional layer computation on the systolic array," in Proc. Netw. Parallel Comput., 2019, pp. 69–81.
[15] Z. Yang et al., "Systolic array based accelerator and algorithm mapping for deep learning algorithms," in Proc. Netw. Parallel Comput., 2018, pp. 153–158.

