We observe two trends which may help overcome these obstacles. The first is a series of recent papers in the machine learning community regarding very-low-precision CNNs. Networks with binary weights [6], or binary weights and activations [7, 21], have in certain cases demonstrated accuracy comparable to full-precision nets. Such binarized neural networks (BNNs) may be the key to efficient deep learning on FPGA. Binarization reduces storage and memory bandwidth requirements, and replaces FP operations with binary operations which can be performed very efficiently on the LUT-based FPGA fabric.

Concerning the cost and effort of FPGA implementation, we see a steady improvement in FPGA design automation tools over the past decade. High-level synthesis (HLS) tools such as Xilinx Vivado HLS [5] and LegUp [1] enable a user to write code in a high-level programming language, then algorithmically compile that code down to a register-transfer level (RTL) design specification. More recent tools such as Intel FPGA SDK for OpenCL [8] and Xilinx SDSoC [13] offer further automation features for generating the hardware-software interface and on-chip memory network. In the context of deep learning, these tools have the potential to critically reduce time-to-market on new accelerator designs and thus reduce the aforementioned innovation gap.

In this paper we present the design of a BNN accelerator for FPGAs. In order to take full advantage of the binarized values and operations, our design differs in multiple aspects from CNN accelerators in the literature. Our specific contributions are as follows:

• To the best of our knowledge, we are the first to study FPGA acceleration for very-low-precision CNNs. Compared to their full-precision counterparts, such networks are potentially a better fit for the LUT-based fabric and limited on-chip storage in modern FPGAs.

• We employ an HLS design methodology for productive development of our FPGA-based BNN accelerator. Existing HLS work has examined loop ordering, unrolling, and local buffering for CNNs [27]. Our HLS implementation leverages these optimizations, and further proposes novel BNN-specific hardware constructs to ensure full throughput and hardware utilization across the different input feature sizes.

• We implement our BNN classifier on a low-cost FPGA development board (ZedBoard) and show promising improvements over CPU and embedded GPU baselines as well as existing FPGA accelerators. Our source code is publicly available on the authors' websites.

The rest of this paper is organized as follows: Section 2 gives a primer on CNNs and BNNs; Section 3 describes our BNN accelerator design; Section 4 provides some details on our HLS code; Section 5 reports our experimental findings; Section 6 reviews previous work on FPGA-based CNN accelerators; and we conclude the paper in Section 7.

2. Preliminaries
In this section we briefly review the basic principles and terminology of CNNs, the differences between a CNN and a BNN, and the specific CIFAR-10 BNN model that our accelerator will target.

2.1 Convolutional Neural Network Primer
A CNN is a machine learning classifier that typically takes in a multi-channel image and produces the probabilities of that image belonging to each output class. A typical CNN consists of a pipeline of connected layers. Each layer takes as input a set of feature maps (fmaps), performs some computation on them, and produces a new set of fmaps to be fed into the next layer. The input fmaps of the first layer are the channels of the input image. Layers may require configuration values known as parameters, which must first be determined by training the CNN offline on pre-classified data. Once the parameters are finalized, the CNN can be deployed for inference — the classification of new data points. For most practical machine learning applications, the first-class concerns are the accuracy and execution time of online classification. This paper will thus focus on accelerating the inference task without compromising accuracy.

[Figure 1 (diagram): a CNN conv+pool stage maps M real fmaps through Convolution, Add Bias, Non-linearity, and Pool to N real fmaps; the BinaryNet BNN stage maps M binary fmaps through Convolution to N integer fmaps, then Pool, Batch Norm, and Binarize to N binary fmaps.]

Figure 1: Comparison of CNNs and BNNs — Left: the order of operations in a CNN for a conv and pool layer. Right: the (modified) order of operations in the BinaryNet BNN [7]. Pooling is performed early and a batch normalization precedes the binarization to minimize information loss. Biases have been removed from the BNN.

Below we describe three layer types which are found in most CNNs, including our CIFAR-10 BNN model.
Convolutional (conv) layers convolve each input fmap with a K × K weight filter. The conv results are summed, added with a bias, and passed through a non-linearity function (such as ReLU or sigmoid) to produce a single output fmap. In this paper, we assume conv layers pad the input fmaps at the borders to produce output fmaps of the same size. Equation (1) below shows the operation of a conv layer with M input fmaps x_1, ..., x_M, N output fmaps y_1, ..., y_N, and non-linearity f.

    y_n = f\left( \sum_{m=1}^{M} x_m \ast w_{n,m} + b_n \right)    (1)

The parameters of this conv layer are M × N × K × K weights and N biases.
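For readers who prefer code, the following C++ sketch is a plain software model of Equation (1) for one output pixel; the zero-padding policy, the centered window, and the ReLU non-linearity are assumptions made only for this illustration.

#include <algorithm>
#include <vector>

// Software model of Equation (1) for a single output pixel (r, c) of output fmap n.
// x: M input fmaps, each stored row-major as R*C floats.
// w: M weight filters for output n, each K*K floats.
float conv_pixel(const std::vector<std::vector<float>>& x,
                 const std::vector<std::vector<float>>& w,
                 float bias, int R, int C, int K, int r, int c) {
  float acc = bias;
  for (size_t m = 0; m < x.size(); m++) {           // sum over the M input fmaps
    for (int i = 0; i < K; i++) {
      for (int j = 0; j < K; j++) {
        int rr = r + i - K / 2, cc = c + j - K / 2;  // window centered on (r, c)
        if (rr < 0 || rr >= R || cc < 0 || cc >= C)
          continue;                                  // zero padding at the borders
        acc += x[m][rr * C + cc] * w[m][i * K + j];
      }
    }
  }
  return std::max(acc, 0.0f);                        // non-linearity f, here ReLU
}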
Pooling layers map each input fmap to an output fmap whose every pixel is the max/mean of a K × K window of input pixels. Unlike conv layers, the windows do not overlap, so the output fmaps are K times smaller in each dimension. Pooling layers are inserted throughout a CNN to gradually reduce the size of the intermediate feature maps.

Dense or fully-connected (FC) layers take an input vector of 1×1 feature maps (pixels) and perform a dot product with a weight vector. The result is added to a bias and passed through a non-linearity to produce a single 1×1 output. Equation (2) below shows the operation of an FC layer with M input pixels, N output pixels, and non-linearity f.

    y_n = f\left( \sum_{m=1}^{M} x_m w_{n,m} + b_n \right)    (2)

The parameters are M × N weights and N biases.
2.2 Binarized Neural Networks
A BNN is essentially a CNN whose weights and fmap pixels are binarized to -1 or +1; it can be seen as an extreme example of the quantized, reduced-precision CNN models commonly used for hardware acceleration. In this paper we focus on an architecture developed by Courbariaux et al. in [6] and later refined in [7]. The first paper binarizes only the weights while the follow-up binarizes both weights and fmaps. We focus on the latter version and refer to it as the BinaryNet architecture/model. This architecture achieved near state-of-the-art results on both the CIFAR-10 and SVHN datasets at the time of publication. Other more recent work on low-precision networks promises accuracy close to the state of the art on ImageNet [16, 28].

In the BinaryNet model, the weights and outputs of both conv and FC layers are binarized using the Sign function (i.e., positive values are set to +1 and negative values to -1). Figure 1 illustrates the flow of data through a conv and pooling layer in both a CNN and a BNN. For the CNN, the order of operations matches Equation (1) and the fmaps are real-valued at all times. In the BNN, the feature maps go from binary to integer (after convolution) until they are binarized again. Biases have been removed (see Section 3.1). Pooling in the BNN is always performed on the integer data.

The BNN also introduces a new layer type — batch normalization [11] layers reduce the information lost during binarization by linearly shifting and scaling the input distribution to have zero mean and unit variance. This reduces quantization error compared to an arbitrary input distribution.¹ The transformation is given in Equation (3) below,

    y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \, \gamma + \beta    (3)

where x and y are the input and output, respectively, µ and σ are statistics collected over the training set, γ and β are trained parameters, and ε is a small constant that avoids round-off problems. During inference, all parameters are fixed, so we need only be concerned with efficiently applying Equation (3) to each input fmap pixel. Each output fmap requires its own set of batch norm parameters.
The primary advantages of BNNs over their higher-precision counterparts are twofold:

1. The convolution operation in Equation (1) (which nominally requires a K × K element multiply-accumulate) can now be implemented as a bitwise XNOR between two K × K bit vectors and a popcount (see the sketch after this list). This is highly relevant to FPGA design, as these operations can be implemented very efficiently in the logic fabric.

2. Assuming comparable numbers of feature maps and FC layer units, binarizing weights and fmaps greatly reduces their memory size. This is again compelling for FPGAs, as existing FPGA accelerators are typically constrained in performance by a combination of on-chip storage space and off-chip memory bandwidth.
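To make advantage 1 concrete, here is a minimal C++ model of a 3×3 binary window dot product; it is only an illustration, and it assumes the common encoding of +1 as bit value 1 and -1 as bit value 0.

#include <bitset>
#include <cstdint>

// Dot product of two 9-element {-1,+1} vectors packed into the low 9 bits
// of a word. With +1 encoded as 1 and -1 as 0, each product is +1 exactly
// where the two bits match, i.e., where the XNOR is 1.
int binary_dot9(uint16_t x_bits, uint16_t w_bits) {
  uint16_t xnor = static_cast<uint16_t>(~(x_bits ^ w_bits)) & 0x1FF; // keep the 9 valid bits
  int matches = static_cast<int>(std::bitset<16>(xnor).count());     // popcount
  return 2 * matches - 9;                                            // (#matches) - (#mismatches)
}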
2.3 CIFAR-10 BNN Model
The CIFAR-10 dataset [14] contains sixty thousand 32×32 3-channel images consisting of photos of real-world vehicles and animals. The images come from 10 classes (airplane, truck, cat, etc.) and are divided into a training set of 50000 and a test set of 10000.

The BinaryNet architecture consists of six conv layers followed by three FC layers. All conv layers use 3×3 filters and edge padding, and all conv/FC layers apply batch norm before binarization. There is a 2×2 max pooling layer after the 2nd, 4th, and 6th conv layers. The first conv layer is different from the rest: its input is the image, which is floating-point, not binary; its weights are still binary. The architecture is summarized in Table 1; note that the fmaps get smaller deeper into the network and that the first two dense layers contain most of the weights.

¹ Batch normalization can also speed up training and regularize the activations in full-precision CNNs, but this is beyond the scope of this paper.
Layer    Input Fmaps   Output Fmaps   Output Dim   Output Bits   Weight Bits
Conv1          3            128            32          128K          3456
Conv2        128            128            32          128K          144K
Pool         128            128            16           32K             –
Conv3        128            256            16           64K          288K
Conv4        256            256            16           64K          576K
Pool         256            256             8           16K             –
Conv5        256            512             8           32K          1.1M
Conv6        512            512             8           32K          2.3M
Pool         512            512             4          8192             –
FC1         8192           1024             1          1024          8.0M
FC2         1024           1024             1          1024          1.0M
FC3         1024             10             1            10           10K
Total                                                               13.4M
  (Conv)                                                            4.36M
  (FC)                                                              9.01M

Table 1: Architecture of the BinaryNet CIFAR-10 BNN — The weight bits exclude batch norm parameters, whose total size after optimization (see Section 3.1) is 0.12M bits, less than 1% of the size of the weights.
Training of the CIFAR-10 BNN model was done using the open-source Python code provided by Courbariaux et al.², which uses the Theano and Lasagne deep learning frameworks. We reached 11.58% test error out-of-the-box, in line with their results. Their paper also presents more advanced training techniques such as stochastic binarization, which further reduce the error rate; we did not use them in this work. Different training schemes do not affect the inference pass or the compatibility of our accelerator.

² https://github.com/MatthieuCourbariaux/BinaryNet
3. FPGA Accelerator Design
In this section, we first outline how we optimize the BinaryNet model for hardware, then describe the design of our system and the specific compute units.
3.1 Hardware Optimized BNN Model
As with the design of conventional CNN accelerators, a key optimization we made to the BNN model is parameter quantization. While the weights are already binarized, the biases and batch norm parameters are real numbers. During bias quantization, we noticed that nearly every bias was much smaller than 1. Given that the inputs have magnitude 1, we tried setting the biases to zero and observed no effect on accuracy. We then retrained the network with biases removed from the model, and reached a test error of 11.32%. For the rest of the paper we use this as the baseline error rate.
A second optimization involved noting that the batch norm calculation (Equation (3)) is a linear transformation, and can thus be formulated as y = kx + h, where

    k = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \quad \text{and} \quad h = \beta - \frac{\mu \gamma}{\sqrt{\sigma^2 + \epsilon}}    (4)

This reduces the number of operations and cuts the number of stored parameters to two. Furthermore, the BNN always binarizes immediately after batch norm. Thus we do not need the magnitude of y, only the sign, allowing us to scale k and h by any multiplicative constant. We exploit this property during quantization by scaling each k and h to be within the representable range of our fixed-point implementation. Empirical testing showed that k and h can be quantized to 16 bits with negligible accuracy loss while being a good fit for power-of-2 word sizes. We also quantized the floating-point BNN inputs to 20-bit fixed point. Table 2 summarizes the impact of each algorithmic modification on test error. The HLS accelerator has the same accuracy as the C++ code.

3.2 Retraining for +1 Edge-Padding
One complication in the BinaryNet model is the interaction between binarization and edge padding. The model binarizes each activation to -1 or +1, but each input fmap is edge padded with zeros, meaning that a convolution can see up to 3 values: -1, 0, or +1. Thus the BinaryNet model actually requires some 2-bit operators (though the fmap data can still be stored in binary form). We managed to modify and retrain the BinaryNet model to pad with +1, eliminating the zeros and creating a truly binarized CNN. This +1 padded BNN achieves a test error of 11.82% in Python and 12.27% in C++/FPGA, only slightly worse than the original.

For our FPGA implementation we used the 0 padded BNN, as the resource savings of the +1 padded version were not particularly relevant for the target device.

Source     Model                  Padding   Test Error
From [7]   –                      0         11.40%
Python     Default                0         11.58%
Python     no-bias                0         11.32%
Python     no-bias                +1        11.82%
C++        no-bias, fixed-point   0         11.46%
C++        no-bias, fixed-point   +1        12.27%

Table 2: Accuracy of the BNN with various changes — no-bias refers to retraining after removing biases from all layers and fixed-point refers to quantization of the inputs and batch norm parameters.

3.3 System Architecture
Our system architecture, shown in Figure 2(a), consists of three compute units, data and weight buffers, a direct memory access (DMA) system for off-chip memory transfer, and an FSM controller. The three compute units work on different types of layers: the FP-Conv unit for the (non-binary) first conv layer, the Bin-Conv unit for the five binary conv layers, and the Bin-FC unit for the three binary FC layers. Of the three, the Bin-Conv and Bin-FC units must handle different numbers of input and output fmaps, and (in the case of Bin-Conv) different fmap sizes.
[Figure 2 (diagrams): (a) the CPU and off-chip memory connect through DMA to on-chip data buffers A and B, parameter buffers, and the FP-Conv, Bin-Conv, and Bin-FC compute units; (b) the Bin-Conv unit streams fin input words per cycle through BitSel units and convolvers into integer buffers, followed by pooling, batch norm, and binarize logic that produces fout output streams.]

Figure 2: Architectural diagrams of our BNN accelerator — (a) system-level block diagram showing the three compute units, buffers, and how the accelerator is connected to the CPU and off-chip memory hierarchy; (b) architecture of the Bin-Conv unit with input and output parallelization factors fin = 2 and fout = 3. The unit can stream in two words per cycle and produce three output fmaps per invocation.
The storage of intermediate data in our accelerator differs from most existing designs. In full-precision CNN accelerators, the size of a single set of fmaps between two layers typically exceeds the size of FPGA on-chip storage. This necessitates the continuous transfer of fmaps to and from off-chip RAM. However, as Table 1 shows, the size of the largest set of fmaps in our BNN is only 128K bits, which easily fits on-chip even in smaller FPGAs. Our design uses two in-out data buffers A and B of equal size. One layer reads from A and writes its outputs to B; then (without any off-chip data transfers) the next layer can read from B and write to A. Thus, off-chip memory transfers are only needed for the input image, the output prediction, and loading each layer's weights.
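A minimal sketch of this ping-pong scheme follows; the buffer word type, the depth (2048 words, per Section 4), and the run_layer interface are placeholders for illustration only.

#include <cstdint>

constexpr int BUF_WORDS = 2048;                 // enough for the largest fmap set (Section 4)
static uint64_t bufA[BUF_WORDS];
static uint64_t bufB[BUF_WORDS];

// Placeholder for dispatching one layer to FP-Conv, Bin-Conv, or Bin-FC.
void run_layer(int layer, const uint64_t* src, uint64_t* dst) {
  (void)layer; (void)src; (void)dst;            // stub body for the sketch
}

void run_network(int num_layers) {
  for (int l = 0; l < num_layers; l++) {
    // Even layers read A and write B; odd layers read B and write A,
    // so no fmaps ever leave the chip between layers.
    const uint64_t* src = (l % 2 == 0) ? bufA : bufB;
    uint64_t*       dst = (l % 2 == 0) ? bufB : bufA;
    run_layer(l, src, dst);
  }
}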
Unlike the fmaps, there is only enough memory on-chip to store a portion of a layer's weights. Multiple accelerator invocations may be needed for a layer; in each invocation we load in a new set of weights and produce a new set of fmaps. The next invocation produces the next set of fmaps, and so on, until all output fmaps have been generated and stored in the on-chip data buffer. Invoking the accelerator requires passing it arguments such as pointers to the weights, the layer type and size, the fmap size, and whether pooling should be applied. Inside the accelerator, the controller decodes these inputs and coordinates the other modules.
3.4 Compute Unit Architectures
In our accelerator, each compute unit must store binarized data to the on-chip RAMs at the end of its execution. As Figure 1 reveals, the first operation of a conv or FC layer transforms the binary inputs to integers; we make sure each unit also performs the subsequent batch norm, pooling, and binarization before writing data out to the buffers. One of our design goals is to limit the amount of integer-valued intermediate data buffered inside each compute unit.
FP-Conv — The fixed-point conv unit utilizes the well-known line buffer architecture for 2D convolutions. Because this unit only targets a single layer, we hardwire it to handle a 3-channel 32×32 input. While the input pixels are 20-bit fixed-point, the weights are binarized, so we can replace the multiplies in the conv operation with sign inversions. We fully parallelize across the three input channels: each cycle we stream in three input pixels, add them to three line buffers, and compute a 3×3×3 convolution. The result is put through batch norm and binarization to produce one output bit per cycle. Greater parallelism in this unit is achievable, but the first conv layer takes up a very small portion of the overall runtime, and we focused our efforts elsewhere.

Bin-Conv — The binary conv unit is the most critical component of the accelerator, as it is responsible for the five binary conv layers which take up the vast majority of the runtime. The unit must maintain high throughput and resource efficiency while handling different input widths at runtime; our design targets widths of 8, 16, or 32, and can support larger power-of-two widths with minor changes. To efficiently compute a convolution, multiple rows of input pixels need to be buffered for simultaneous access. However, a standard line buffer (i.e., from video processing applications) is unsuitable for this task for two reasons:

1. A line buffer must be sized for the largest input fmap; in this case it must have a width of 32. This not only causes buffer under-utilization when the fmap is 16 or 8 wide, it also leads to a loss of throughput, as we can only perform as many convolutions per cycle as the input width.

2. A line buffer is designed to shift one pixel per cycle to always store the most recent rows. However, with binarized inputs we have access to not one but many lines of new pixels each cycle (for instance, a 32-bit word can hold 4 lines from an 8×8 fmap). This radically changes how we should update the line buffer.

In full-precision CNN accelerators, the size problem can be addressed by tiling, where input fmaps are processed in tiles which are always the same size. This is unsuitable in a BNN since the fmaps are typically very small in terms of number of bits. One possible solution to the second problem is to reorganize the data so each bit in an input word comes from a different fmap, and assign each bit to a separate line buffer. However, this requires an exorbitant number of line buffers and is not area efficient.
[Figure 3 (diagram): the BitSel module slices a 32-bit input word and inserts the slices into the 3-row variable-width line buffer; (a) shows one 32-wide line per row producing one 32-wide output line, (b) shows four 8-wide lines per row producing four consecutive 8-wide output lines.]

Figure 3: Example usage of the variable-width line buffer — we show how a 32-bit input word is divided up by BitSel and inserted into the VWLB. The line buffer has height 3 and width 32. Sliding a 3×3 conv filter across the VWLB produces 32 output pixels, ignoring edge effects. (a) For a 32-wide input fmap, each row of the VWLB stores one line and applying the conv filter produces one 32-wide output line; (b) for an 8-wide input fmap, each row of the VWLB stores four lines and applying the conv filter produces four consecutive 8-wide output lines, given the mapping of input lines to banks shown.
To address the above issues, we introduce two new modules: the BitSel module and the variable-width line buffer (VWLB). Figure 2(b) shows the basic structure of the Bin-Conv unit, whose execution proceeds in two phases. In the first phase, input fmaps from the on-chip buffers are streamed in on the left side, through the BitSel modules, and into the VWLBs. The Convolver modules compute the partial conv sums and accumulate them in the integer buffers. The BitSel is responsible for reordering the input bits so that the Convolver logic can be agnostic of the fmap width. In Figure 2(b), fin is the input parallelization factor — the Bin-Conv unit accepts fin input words per cycle and the data buffers are partitioned to match this rate. fout is the output parallelization factor — each Convolver applies fout 3×3 conv filters per cycle to the data in the VWLB and generates partial sums for fout different fmaps. The first phase ends when all input fmaps in the current layer have been processed. At this point each integer buffer contains a finished conv map. In the second phase we compute max pooling, batch norm, and binarization to produce fout binary output fmaps. Note that max pooling and binarization are non-linear operations, so we cannot apply them to partially finished conv maps and accumulate afterwards.
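The second phase can be modeled in software roughly as follows; the 2×2 window interface and the tie-breaking of the sign are assumptions of this sketch, and k and h are the folded batch norm parameters of Section 3.1.

#include <algorithm>

// Pool a 2x2 window of integer conv sums, then batch-normalize and binarize.
// Returns the output bit, with +1 encoded as 1 and -1 as 0.
int pool_bnorm_binarize(int s00, int s01, int s10, int s11, float k, float h) {
  int m = std::max(std::max(s00, s01), std::max(s10, s11)); // 2x2 max pooling on integers
  // Batch norm folded as y = k*x + h (Equation (4)); only the sign of y
  // survives binarization, so no further arithmetic is needed.
  return (k * static_cast<float>(m) + h >= 0.0f) ? 1 : 0;
}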
Figure 3 explains the operation of the BitSel and VWLB in greater detail. The diagram assumes we have a word size of 32 bits and a 3×3 conv filter, which requires a VWLB with three rows and 32 elements per row. We demonstrate how the VWLB works for input fmap widths of 32 and 8, and ignore edge padding for the sake of simplicity.
1. For a 32-wide input, each word contains exactly one line. Each cycle, the VWLB shifts up and the new 32-bit line is written to the bottom row. We can then slide the 3×3 conv window across the VWLB to generate one 32-bit line of conv outputs.

2. For an 8-wide input, each word contains four lines. We split each VWLB row into four banks, and map each input line to one or more VWLB banks. The mapping is done in such a way that sliding the conv window across the VWLB produces four consecutive 8-bit output lines. Each cycle the VWLB shifts both up and to the left.

The BitSel is responsible for slicing the input word and mapping the slices to the row banks. Because the smallest input width is 8, each slice and VWLB bank is sized at 8 bits. For a 32-wide input, BitSel maps four contiguous 8-bit slices to the bottom row. For an 8-wide input, the mapping is more complex, but still highly regular and can be computed in hardware with just adds and shifts. Each pixel in the output lines in Figure 3 is an integer conv sum, and each sum is accumulated at a different location in the integer buffer.
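For the simple 32-wide case the slicing can be modeled as below; this is our own sketch under an assumed bit ordering (low-order bits first), and the trickier 8-wide bank mapping is deliberately omitted because the text only states that it is regular and uses adds and shifts.

#include <cstdint>

constexpr int BANK_BITS = 8;       // the smallest fmap width
constexpr int BANKS_PER_ROW = 4;   // a 32-element row split into four 8-bit banks
constexpr int VWLB_ROWS = 3;       // rows needed for a 3x3 conv filter

// Shift the VWLB up by one row and fill the bottom row with the four
// contiguous 8-bit slices of a 32-bit input word (the 32-wide case).
void bitsel_32wide(uint32_t word, uint8_t vwlb[VWLB_ROWS][BANKS_PER_ROW]) {
  for (int r = 0; r < VWLB_ROWS - 1; r++)
    for (int b = 0; b < BANKS_PER_ROW; b++)
      vwlb[r][b] = vwlb[r + 1][b];                             // older rows shift up
  for (int b = 0; b < BANKS_PER_ROW; b++)
    vwlb[VWLB_ROWS - 1][b] = (word >> (b * BANK_BITS)) & 0xFF; // new bottom row
}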
The BitSel and VWLB provide three primary advantages: (1) the VWLB achieves full hardware utilization regardless of input width, (2) a new input word can be buffered every cycle, and (3) the BitSel deals with the various input widths by itself, allowing the actual buffer and convolution logic to be fixed. Note that the VWLB used in our design differs from Figure 3 in a few details. First, we have neglected edge padding. The actual VWLB contains two additional elements per bank to hold horizontal pad bits; vertical padding is handled by inserting lines of zeros. Second, because the pad bits are 0 rather than +1 or -1, we must make each element in the VWLB two bits instead of one. The conv operation is performed between the 2-bit data and 1-bit weights, and can be implemented as sign inversion and accumulate.

Bin-FC — The binary FC unit is comparatively simple. Each cycle we read in fin data words and an equal number of weight words; fin here is the input parallelization factor just as in Bin-Conv. Because there is no edge padding in an FC layer, the computations can be truly binary. We perform a dot product between the data and weight words by applying a bitwise XOR operation and then summing the resulting bits with a popcount. Similar to the Bin-Conv unit, we accumulate the sum in an integer buffer and apply binarization after all inputs have been processed. Note that the FC layers are typically bound by the memory bandwidth of the off-chip connection, rather than by the throughput of the accelerator.
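A word-level software model of this dot product is sketched below; the 64-bit word size matches Section 4, and the +1→1 / -1→0 encoding is an assumption of the example.

#include <bitset>
#include <cstdint>

// {-1,+1} dot product of one 64-bit data word and one 64-bit weight word:
// matching bits contribute +1 and differing bits contribute -1.
int fc_partial_sum(uint64_t data, uint64_t weights) {
  int mismatches = static_cast<int>(std::bitset<64>(data ^ weights).count()); // XOR + popcount
  return 64 - 2 * mismatches;   // (#matches) - (#mismatches)
}

The partial sums from successive words are accumulated in the integer buffer and binarized once the whole input vector has been consumed, mirroring the description above.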
Data Buffers — To accommodate multiple reads per cycle, the data buffers are partitioned into fin banks, and feature maps are interleaved across the different banks. Figure 4 shows an example with fin = 2 and four words per fmap. The data words are read sequentially by address, so a compute unit always accesses fin consecutive fmaps in parallel.
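Under this interleaving, a word's bank and offset can be computed as in the sketch below (our own illustrative formula: fmaps are assigned to banks round-robin and each fmap's words are stored sequentially within its bank, consistent with Figure 4).

// Locate word w of feature map f when fmaps are interleaved across fin banks.
struct BufAddr { int bank; int offset; };

BufAddr locate(int f, int w, int fin, int words_per_fmap) {
  BufAddr a;
  a.bank   = f % fin;                          // e.g., fin = 2: fmaps 0,2,... in bank 0; 1,3,... in bank 1
  a.offset = (f / fin) * words_per_fmap + w;   // sequential words within the bank
  return a;
}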
[Figure 4 (diagram): fmaps 0 and 2 are stored in Bank 0 and fmaps 1 and 3 in Bank 1, with words 0-3 of each fmap laid out sequentially; Convolver 0 and Convolver 1 are each fed from one bank.]

Figure 4: Example of data buffer banking — The compute unit and memory system have fin = 2. Each fmap contains four words which are laid out sequentially. The fmaps are interleaved across banks, and both Bin-Conv and Bin-FC benefit from this banking.
4. HLS Accelerator Implementation
Figure 5 shows the HLS pseudocode for the front half of the Bin-Conv unit, and demonstrates a key difference between BNN and CNN hardware design. For a CNN, the code typically loops over an fmap processing one pixel at a time; key design decisions include loop ordering and unroll factors (see [27] for a good example). In our BNN accelerator, the basic atom of processing is not a pixel but a word. The example code is designed to sustain a throughput of one word per cycle over the entire input feature map set. Each fmap consists of words_per_fmap words, a number which differs between layers. As it processes the input set, the code updates the weights on each new fmap and accumulates the conv results in outbuf. We call BitSel and conv inside the loop to instantiate the BitSel units and conv logic as shown in Figure 2(b). To increase the number of input streams we can tile the loop and unroll the inner loop body.

VariableLineBuffer linebuf;
ConvWeights wts;
IntegerBuffer outbuf;

for (int i = 0; i < n_input_words; i++) {
#pragma HLS pipeline

  // read input word, update linebuffer
  WordType word = input_data[i];
  BitSel(linebuf, word, input_width);

  // update the weights each time we
  // begin to process a new fmap
  if (i % words_per_fmap == 0)
    wts = weights[i / words_per_fmap];

  // perform conv across linebuffer
  for (int c = 0; c < LINE_BUF_COLS; c++) {
#pragma HLS unroll
    outbuf[i % words_per_fmap][c] +=
        conv(c, linebuf, wts);
  }
}

Figure 5: HLS pseudocode for part of the Bin-Conv unit — the pseudocode implements a pipeline which reads and performs convolution on one input word each cycle. Many details are left out; the goal is to illustrate how our design can be expressed in high-level code.
A key design decision here is the input word size, which controls the level of parallelism across the pixels of an fmap. To guarantee correctness, words_per_fmap must be an integer greater than zero; this constrains the word size to at most the size of the smallest input fmap (8 × 8 = 64 bits in our case). The word size restriction is not a significant limiting factor in our design, as 64 is already a very large parallelization factor (it means we perform 64 convolutions per cycle), and there are other sources of parallelism to exploit in the BNN. We chose a word size of 64 bits for the data buffers and sized each data buffer A and B at 2048 words, which is just enough to store the largest set of fmaps in the BNN.
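For concreteness, the per-layer constants implied by this choice are easy to check; the snippet below is only a sanity check over the numbers in Table 1, not accelerator code.

constexpr int WORD_BITS = 64;

constexpr int words_per_fmap(int width) {   // square binary fmaps of size width x width
  return (width * width) / WORD_BITS;
}

static_assert(words_per_fmap(32) == 16, "a 32x32 binary fmap occupies 16 words");
static_assert(words_per_fmap(16) == 4,  "a 16x16 binary fmap occupies 4 words");
static_assert(words_per_fmap(8)  == 1,  "an 8x8 binary fmap is exactly one word");
// Largest fmap set in Table 1: 128 fmaps of 32x32 = 128K bits = 2048 words,
// matching the chosen depth of data buffers A and B.
static_assert(128 * words_per_fmap(32) == 2048, "data buffer depth");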
We also explored different values for fin and fout in Bin-Conv. It was observed that both have roughly similar effects on execution time, but increasing fout has a more severe effect on total area. fin controls the number of BitSels and VWLBs, while fout controls the number of pooling/batch norm units and integer buffers. In terms of logic, a BitSel and a pooling/batch norm unit are similar, but each VWLB contains 32 × 3 2-bit registers while each integer buffer contains 32 × 32 12-bit registers. Thus, all else being equal, it is better to increase fin. This result shows the importance of minimizing the storage of intermediate values and only committing binarized data to memory.

We use Xilinx SDSoC as the primary design tool for our BNN application. SDSoC takes as input a software program with certain functions marked as "hardware". It invokes Vivado HLS under the hood to synthesize the "hardware" portion into RTL. In addition, it automatically generates the data motion network and DMA necessary for memory transfer between CPU and FPGA based on the specified software-hardware partitioning. We selected a DMA engine built for contiguous memory since it has the highest throughput, and a neural network's data and weights can be laid out contiguously. We used directives to ensure that data is only transferred on the first and last accelerator invocation; weights are transferred on every invocation.
5. Experimental Results
We evaluate our design on a ZedBoard, which uses a low-cost Xilinx Zynq-7000 SoC containing an XC7Z020 FPGA alongside an ARM Cortex-A9 embedded processor. We make use of Xilinx SDSoC 2016.1 as the primary design tool, which leverages Vivado HLS and Vivado to perform the actual HLS compilation and FPGA implementation. We compared our design against two server-class computing platforms: an Intel Xeon E5-2640 multicore processor (CPU) and an NVIDIA Tesla K40 GPU (GPU). We also compared against an NVIDIA Jetson TK1 embedded GPU board (mGPU). As BNNs are a recent development, our baseline applications are not as well optimized as CNN baselines (where implementations can be found in frameworks such as Caffe). The CPU and GPU baselines are adapted from code provided in [7]. The code leverages Theano, and calls OpenBLAS for CPU and CUDA for GPU. However, it does not perform bitwise optimizations, since they are not natively supported in Theano, and instead uses floating-point values binarized to -1 and +1. For the baselines we used the BNN model with no biases and with k and h, and on the GPU we always used the largest batch size.
Power measurement is obtained via a power monitor. We measured 4.5W idle and 4.7W max power on the ZedBoard power supply line when running our BNN. This indicates that the dynamic power consumption of the FPGA is very low.

Our FPGA accelerator obtains 15.1x better performance and 11.6x better throughput per Watt over mGPU, which has a similar power envelope. Against the x86 processor, it achieves a 2.5x speedup. While the binary conv layers were faster, the FC layers were slower, which is unsurprising as the FC layers are bound by external memory bandwidth. Versus the GPU, the FPGA is 8.1x worse in performance. But as expected, it has much lower power consumption and better throughput per Watt.

To show that the FC layers are indeed limited by memory bandwidth, we created a design where the FC computations are removed but the memory transfers are kept. The new execution time of the FC layers is within 5% that of the original, demonstrating that there is not much to gain by further parallelizing them beyond the current design.

Table 4: Performance comparison — Conv1 is the first FP conv layer, Conv2-5 are the binary conv layers, FC1-3 are the FC layers. A – indicates a value we could not measure. Numbers with * are sourced from datasheets. The last row shows power efficiency in throughput per Watt.

                 Execution time per image (ms)
                 mGPU    CPU     GPU     FPGA
Conv1            –       0.68    0.01    1.13
Conv2-5          –       13.2    0.68    2.68
FC1-3            –       0.92    0.04    2.13
Total            90      14.8    0.73    5.94
Speedup          1.0x    6.1x    123x    15.1x
Power (Watt)     3.6     95*     235*    4.7
imgs/sec/Watt    3.09    0.71    5.83    35.8
CNNs have also made great strides toward achieving near state-of-the-art accuracy on the ImageNet dataset [16, 28].
Table 5: Comparison of our work against state-of-the-art FPGA accelerators — GOPS counts multiplies and adds per second. * indicates values approximated from charts.

                    [23]         [19]        [25]        Ours
Platform            Stratix-V    Zynq        Zynq        Zynq
                    GSD8         7Z045       7Z020       7Z020
Capacity (kLUTs)    695          218.6       53.2        53.2
Clock (MHz)         120          150         100         143
Power (W)           19.1         9.6         –           4.7
Precision           8-16b        16b         –           1-2b
GOPS (conv)         136.5        187.8       –           318.9
GOPS (all)          117.8        137.0       12.73       207.8
kLUTs               120*         182.6       43.2        46.9
DSPs                760*         780         208         3
GOPS/kLUT           0.98         0.75        0.29        4.43
GOPS/Watt           6.17         14.3        7.27        44.2
While it may not be completely fair to compare GOPS between a binarized and a conventional network, it is currently the de facto standard practice for hardware accelerator studies to compare reduced- and full-precision implementations that use different data types.
6. Related Work
Our paper owes much to the groundbreaking work on BNNs in the machine learning community [6, 7]. These papers contain some discussion of the advantages of BNNs over CNNs for hardware, but to the best of our knowledge we are the first to present a working FPGA implementation.

There have been many studies on the design of CNN accelerators for FPGA. Zhang et al. [27] describe how to optimize an HLS design by reordering and tiling loops, inserting the proper pragmas, and organizing external memory transfers. Ensuing publications have mostly eschewed HLS in favor of RTL designs. Qiu et al. [19] propose an architecture that computes conv and FC layers on the same hardware, as well as dynamic fixed-point quantization. Their paper demonstrates an area-efficient accelerator for AlexNet on the Xilinx ZC706 board.

A related line of research focuses on creating CNN design compilers which can generate optimized hardware for a family of models. These works typically use a set of RTL modules combined with a design space exploration tool to find the optimal architectural parameters. Rahman et al. [20] propose a scalable array-based CNN accelerator with heavy input reuse. Motamedi et al. [17] use a roofline model for performance to guide hardware generation. Wang et al. [26] propose DeepBurning, which targets a variety of CNN architectures and performs data layout optimization.

OpenCL frameworks for deep learning on FPGA have also been proposed. Suda et al. [23] use parameterized OpenCL alongside analytical models for performance and resource, enabling a genetic algorithm to search for the optimal configuration. Venieris and Bouganis [25] study the use of synchronous dataflow to capture CNN workloads and use graph partitioning to control resource consumption.

There is also a great deal of research work on ASIC CNN co-processors. Among the most well known is the DianNao line of architectures [2]. The Eyeriss paper [3] contains a comprehensive study of popular dataflows for spatial architectures and derives an optimal one.

Our approach differs from existing work in two major ways: (1) we are the first to study BNNs for FPGA acceleration; (2) we make use of a C-based HLS methodology and propose design constructs to maximize throughput on different layers. Existing CNN accelerators on FPGA are not well-equipped to handle BNNs due to significant differences in compute and storage requirements and layer organization. Our final design differs greatly from previous work.

7. Conclusions and Future Work
We are the first to implement an accelerator for binarized neural networks on FPGA. BNNs feature potentially reduced storage requirements and binary arithmetic operations, making them well suited to the FPGA fabric. However, these characteristics also render CNN design constructs such as input tiles and line buffers ineffective. We introduce new design constructs such as a variable-width line buffer to address these challenges, creating an accelerator radically different from existing work. We leverage modern HLS tools to write our design in productive, high-level code, and our accelerator outperforms existing work in raw throughput, throughput per area, and throughput per Watt.

Future BNN work should focus on both algorithmic and architectural improvements. On the algorithmic side we would like to explore techniques to reduce model size. On the architectural side, one action item is to implement a low-precision network for ImageNet, which would involve a much larger and more complicated accelerator design.

Acknowledgements
This research was supported in part by DARPA Award HR0011-16-C-0037, a DARPA Young Faculty Award, NSF Awards #1337240, #1453378, #1512937, and a research gift from Xilinx, Inc. The Tesla K40 GPU used for this research was donated by the NVIDIA Corporation.
References
[1] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski, S. D. Brown, and J. H. Anderson. LegUp: An Open-Source High-Level Synthesis Tool for FPGA-Based Processor/Accelerator Systems. ACM Trans. on Embedded Computing Systems (TECS), 13(2):24, 2013.
[2] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar 2014.
[3] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. Int'l Symp. on Computer Architecture (ISCA), Jun 2016.
[4] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro. Deep Learning with COTS HPC Systems. Int'l Conf. on Machine Learning (ICML), pages 1337–1345, Jun 2013.
[5] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), Apr 2011.
[6] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations. Advances in Neural Information Processing Systems (NIPS), pages 3123–3131, 2015.
[7] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv e-print, arXiv:1602.02830, Feb 2016.
[8] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and D. P. Singh. From OpenCL to High-Performance Hardware on FPGAs. Int'l Conf. on Field Programmable Logic and Applications (FPL), pages 531–534, Aug 2012.
[9] M. Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software available from tensorflow.org.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv e-print, arXiv:1512.03385, Dec 2015.
[11] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv e-print, arXiv:1502.03167, Mar 2015.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv e-print, arXiv:1408.5093, 2014.
[13] V. Kathail, J. Hwang, W. Sun, Y. Chobe, T. Shui, and J. Carrillo. SDSoC: A Higher-level Programming Environment for Zynq SoC and Ultrascale+ MPSoC. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 4–4, Feb 2016.
[14] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images, 2009. Master's Thesis, Department of Computer Science, University of Toronto.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[16] F. Li and B. Liu. Ternary Weight Networks. arXiv e-print, arXiv:1605.04711, May 2016.
[17] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi. Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks. Asia and South Pacific Design Automation Conf. (ASP-DAC), pages 575–580, Jan 2016.
[18] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. Chung. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. Microsoft Research, Feb 2015.
[19] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 26–35, Feb 2016.
[20] A. Rahman, J. Lee, and K. Choi. Efficient FPGA Acceleration of Convolutional Neural Networks using Logical-3D Compute Array. Design, Automation, and Test in Europe (DATE), pages 1393–1398, Apr 2016.
[21] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. European Conference on Computer Vision (ECCV), Oct 2016. arXiv:1603.05279.
[22] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv e-print, arXiv:1409.1556, Apr 2015.
[23] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao. Throughput-Optimal OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 16–25, Feb 2016.
[24] Theano Development Team. Theano: A Python Framework for Fast Computation of Mathematical Expressions. arXiv e-print, arXiv:1605.02688, May 2016.
[25] S. I. Venieris and C.-S. Bouganis. fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs. IEEE Symp. on Field Programmable Custom Computing Machines (FCCM), May 2016.
[26] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li. DeepBurning: Automatic Generation of FPGA-based Learning Accelerators for the Neural Network Family. Design Automation Conf. (DAC), page 110, Jun 2016.
[27] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 161–170, Feb 2015.
[28] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv e-print, arXiv:1606.06160, Jul 2016.