We observe two trends which may help overcome these obstacles. The first is a series of recent papers in the machine learning community regarding very-low-precision CNNs. Networks with binary weights [6], or binary weights and activations [7, 21], have in certain cases demonstrated accuracy comparable to full-precision nets. Such binarized neural networks (BNNs) may be the key to efficient deep learning on FPGA. Binarization reduces storage and memory bandwidth requirements, and replaces FP operations with binary operations which can be performed very efficiently on the LUT-based FPGA fabric.

Concerning the cost and effort of FPGA implementation, we see a steady improvement in FPGA design automation tools over the past decade. High-level synthesis (HLS) tools such as Xilinx Vivado HLS [5] and LegUp [1] enable a user to write code in a high-level programming language, then algorithmically compile that code down to a register-transfer level (RTL) design specification. More recent tools such as Intel FPGA SDK for OpenCL [8] and Xilinx SDSoC [13] offer further automation features for generating the hardware-software interface and on-chip memory network. In the context of deep learning, these tools have the potential to critically reduce time-to-market on new accelerator designs and thus reduce the aforementioned innovation gap.

In this paper we present the design of a BNN accelerator for FPGAs. In order to take full advantage of the binarized values and operations, our design differs in multiple aspects from CNN accelerators in the literature. Our specific contributions are as follows:

• To the best of our knowledge, we are the first to study FPGA acceleration for very-low-precision CNNs. Compared to their full-precision counterparts, such networks are potentially a better fit for the LUT-based fabric and limited on-chip storage in modern FPGAs.

• We employ an HLS design methodology for productive development of our FPGA-based BNN accelerator. Existing HLS work has examined loop ordering, unrolling, and local buffering for CNNs [27]. Our HLS implementation leverages these optimizations, and further proposes novel BNN-specific hardware constructs to ensure full throughput and hardware utilization across the different input feature sizes.

• We implement our BNN classifier on a low-cost FPGA development board (ZedBoard) and show promising improvements over CPU and embedded GPU baselines as well as existing FPGA accelerators. Our source code is publicly available on the authors' websites.

The rest of this paper is organized as follows: Section 2 gives a primer on CNNs and BNNs; Section 3 describes our BNN accelerator design; Section 4 provides some details on our HLS code; Section 5 reports our experimental findings; Section 6 reviews previous work on FPGA-based CNN accelerators; and we conclude the paper in Section 7.

2. Preliminaries
In this section we briefly review the basic principles and terminology of CNNs, the differences between a CNN and a BNN, and the specific CIFAR-10 BNN model that our accelerator will target.

2.1 Convolutional Neural Network Primer
A CNN is a machine learning classifier that typically takes in a multi-channel image and produces the probabilities of that image belonging to each output class. A typical CNN consists of a pipeline of connected layers. Each layer takes as input a set of feature maps (fmaps), performs some computation on them, and produces a new set of fmaps to be fed into the next layer. The input fmaps of the first layer are the channels of the input image. Layers may require configuration values known as parameters, which must first be determined by training the CNN offline on pre-classified data. Once the parameters are finalized, the CNN can be deployed for inference — the classification of new data points. For most practical machine learning applications, the first-class concerns are the accuracy and execution time of online classification. This paper will thus focus on accelerating the inference task without compromising accuracy.

[Figure 1 (diagram): a CNN conv+pool stage maps M real fmaps through Convolution, Add Bias, Non-linearity, and Pool to N real fmaps; the BinaryNet BNN stage maps M binary fmaps through Convolution to N integer fmaps, then Pool, Batch Norm, and Binarize to N binary fmaps.]

Figure 1: Comparison of CNNs and BNNs — Left: the order of operations in a CNN for a conv and pool layer. Right: the (modified) order of operations in the BinaryNet BNN [7]. Pooling is performed early and a batch normalization precedes the binarization to minimize information loss. Biases have been removed from the BNN.

Below we describe three layer types which are found in most CNNs, including our CIFAR-10 BNN model.
Convolutional (conv) layers convolve each input fmap with a K × K weight filter. The conv results are summed, added with a bias, and passed through a non-linearity function (such as ReLU or sigmoid) to produce a single output fmap. In this paper, we assume conv layers pad the input fmaps at the borders to produce output fmaps of the same size. Equation (1) below shows the operation of a conv layer with M input fmaps x_1, ..., x_M, N output fmaps y_1, ..., y_N, and non-linearity f.

    y_n = f\left( \sum_{m=1}^{M} x_m \ast w_{n,m} + b_n \right)    (1)

The parameters of this conv layer are M × N × K × K weights and N biases.
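For readers who prefer code, the following C++ sketch is a plain software model of Equation (1) for one output pixel; the zero-padding policy, the centered window, and the ReLU non-linearity are assumptions made only for this illustration.

#include <algorithm>
#include <vector>

// Software model of Equation (1) for a single output pixel (r, c) of output fmap n.
// x: M input fmaps, each stored row-major as R*C floats.
// w: M weight filters for output n, each K*K floats.
float conv_pixel(const std::vector<std::vector<float>>& x,
                 const std::vector<std::vector<float>>& w,
                 float bias, int R, int C, int K, int r, int c) {
  float acc = bias;
  for (size_t m = 0; m < x.size(); m++) {           // sum over the M input fmaps
    for (int i = 0; i < K; i++) {
      for (int j = 0; j < K; j++) {
        int rr = r + i - K / 2, cc = c + j - K / 2;  // window centered on (r, c)
        if (rr < 0 || rr >= R || cc < 0 || cc >= C)
          continue;                                  // zero padding at the borders
        acc += x[m][rr * C + cc] * w[m][i * K + j];
      }
    }
  }
  return std::max(acc, 0.0f);                        // non-linearity f, here ReLU
}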
Pooling layers map each input fmap to an output fmap whose every pixel is the max/mean of a K × K window of input pixels. Unlike conv layers, the windows do not overlap, so the output fmaps are K times smaller in each dimension. Pooling layers are inserted throughout a CNN to gradually reduce the size of the intermediate feature maps.

Dense or fully-connected (FC) layers take an input vector of 1×1 feature maps (pixels) and perform a dot product with a weight vector. The result is added to a bias and passed through a non-linearity to produce a single 1×1 output. Equation (2) below shows the operation of an FC layer with M input pixels, N output pixels, and non-linearity f.

    y_n = f\left( \sum_{m=1}^{M} x_m w_{n,m} + b_n \right)    (2)

The parameters are M × N weights and N biases.
2.2 Binarized Neural Networks
A BNN is essentially a CNN whose weights and fmap pixels are binarized to -1 or +1; it can be seen as an extreme example of the quantized, reduced-precision CNN models commonly used for hardware acceleration. In this paper we focus on an architecture developed by Courbariaux et al. in [6] and later refined in [7]. The first paper binarizes only the weights while the follow-up binarizes both weights and fmaps. We focus on the latter version and refer to it as the BinaryNet architecture/model. This architecture achieved near state-of-the-art results on both the CIFAR-10 and SVHN datasets at the time of publication. Other more recent work on low-precision networks promises accuracy close to the state of the art on ImageNet [16, 28].

In the BinaryNet model, the weights and outputs of both conv and FC layers are binarized using the Sign function (i.e., positive values are set to +1 and negative values to -1). Figure 1 illustrates the flow of data through a conv and pooling layer in both a CNN and a BNN. For the CNN, the order of operations matches Equation (1) and the fmaps are real-valued at all times. In the BNN, the feature maps go from binary to integer (after convolution) until they are binarized again. Biases have been removed (see Section 3.1). Pooling in the BNN is always performed on the integer data.

The BNN also introduces a new layer type — batch normalization [11] layers reduce the information lost during binarization by linearly shifting and scaling the input distribution to have zero mean and unit variance. This reduces quantization error compared to an arbitrary input distribution.¹ The transformation is given in Equation (3) below,

    y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \, \gamma + \beta    (3)

where x and y are the input and output, respectively, µ and σ are statistics collected over the training set, γ and β are trained parameters, and ε is a small constant that avoids round-off problems. During inference, all parameters are fixed, so we need only be concerned with efficiently applying Equation (3) to each input fmap pixel. Each output fmap requires its own set of batch norm parameters.
The primary advantages of BNNs over their higher-precision counterparts are twofold:

1. The convolution operation in Equation (1) (which nominally requires a K × K element multiply-accumulate) can now be implemented as a bitwise XNOR between two K × K bit vectors and a popcount (see the sketch after this list). This is highly relevant to FPGA design, as these operations can be implemented very efficiently in the logic fabric.

2. Assuming comparable numbers of feature maps and FC layer units, binarizing weights and fmaps greatly reduces their memory size. This is again compelling for FPGAs, as existing FPGA accelerators are typically constrained in performance by a combination of on-chip storage space and off-chip memory bandwidth.
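To make advantage 1 concrete, here is a minimal C++ model of a 3×3 binary window dot product; it is only an illustration, and it assumes the common encoding of +1 as bit value 1 and -1 as bit value 0.

#include <bitset>
#include <cstdint>

// Dot product of two 9-element {-1,+1} vectors packed into the low 9 bits
// of a word. With +1 encoded as 1 and -1 as 0, each product is +1 exactly
// where the two bits match, i.e., where the XNOR is 1.
int binary_dot9(uint16_t x_bits, uint16_t w_bits) {
  uint16_t xnor = static_cast<uint16_t>(~(x_bits ^ w_bits)) & 0x1FF; // keep the 9 valid bits
  int matches = static_cast<int>(std::bitset<16>(xnor).count());     // popcount
  return 2 * matches - 9;                                            // (#matches) - (#mismatches)
}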
2.3 CIFAR-10 BNN Model
The CIFAR-10 dataset [14] contains sixty thousand 32×32 3-channel images consisting of photos of real-world vehicles and animals. The images come from 10 classes (airplane, truck, cat, etc.) and are divided into a training set of 50000 and a test set of 10000.

The BinaryNet architecture consists of six conv layers followed by three FC layers. All conv layers use 3×3 filters and edge padding, and all conv/FC layers apply batch norm before binarization. There is a 2×2 max pooling layer after the 2nd, 4th, and 6th conv layers. The first conv layer is different from the rest: its input is the image, which is floating-point, not binary; its weights are still binary. The architecture is summarized in Table 1; note that the fmaps get smaller deeper into the network and that the first two dense layers contain most of the weights.

¹ Batch normalization can also speed up training and regularize the activations in full-precision CNNs, but this is beyond the scope of this paper.
Layer    Input Fmaps   Output Fmaps   Output Dim   Output Bits   Weight Bits
Conv1          3            128            32          128K          3456
Conv2        128            128            32          128K          144K
Pool         128            128            16           32K             –
Conv3        128            256            16           64K          288K
Conv4        256            256            16           64K          576K
Pool         256            256             8           16K             –
Conv5        256            512             8           32K          1.1M
Conv6        512            512             8           32K          2.3M
Pool         512            512             4          8192             –
FC1         8192           1024             1          1024          8.0M
FC2         1024           1024             1          1024          1.0M
FC3         1024             10             1            10           10K
Total                                                               13.4M
  (Conv)                                                            4.36M
  (FC)                                                              9.01M

Table 1: Architecture of the BinaryNet CIFAR-10 BNN — The weight bits exclude batch norm parameters, whose total size after optimization (see Section 3.1) is 0.12M bits, less than 1% of the size of the weights.
Training of the CIFAR-10 BNN model was done using the open-source Python code provided by Courbariaux et al.², which uses the Theano and Lasagne deep learning frameworks. We reached 11.58% test error out-of-the-box, in line with their results. Their paper also presents more advanced training techniques such as stochastic binarization, which further reduce the error rate; we did not use them in this work. Different training schemes do not affect the inference pass or the compatibility of our accelerator.

² https://github.com/MatthieuCourbariaux/BinaryNet
3. FPGA Accelerator Design
In this section, we first outline how we optimize the BinaryNet model for hardware, then describe the design of our system and the specific compute units.
3.1 Hardware Optimized BNN Model
As with the design of conventional CNN accelerators, a key optimization we made to the BNN model is parameter quantization. While the weights are already binarized, the biases and batch norm parameters are real numbers. During bias quantization, we noticed that nearly every bias was much smaller than 1. Given that the inputs have magnitude 1, we tried setting the biases to zero and observed no effect on accuracy. We then retrained the network with biases removed from the model, and reached a test error of 11.32%. For the rest of the paper we use this as the baseline error rate.
A second optimization involved noting that the batch norm calculation (Equation (3)) is a linear transformation, and can thus be formulated as y = kx + h, where

    k = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \quad \text{and} \quad h = \beta - \frac{\mu \gamma}{\sqrt{\sigma^2 + \epsilon}}    (4)

This reduces the number of operations and cuts the number of stored parameters to two. Furthermore, the BNN always binarizes immediately after batch norm. Thus we do not need the magnitude of y, only the sign, allowing us to scale k and h by any multiplicative constant. We exploit this property during quantization by scaling each k and h to be within the representable range of our fixed-point implementation. Empirical testing showed that k and h can be quantized to 16 bits with negligible accuracy loss while being a good fit for power-of-2 word sizes. We also quantized the floating-point BNN inputs to 20-bit fixed point. Table 2 summarizes the impact of each algorithmic modification on test error. The HLS accelerator has the same accuracy as the C++ code.

3.2 Retraining for +1 Edge-Padding
One complication in the BinaryNet model is the interaction between binarization and edge padding. The model binarizes each activation to -1 or +1, but each input fmap is edge padded with zeros, meaning that a convolution can see up to 3 values: -1, 0, or +1. Thus the BinaryNet model actually requires some 2-bit operators (though the fmap data can still be stored in binary form). We managed to modify and retrain the BinaryNet model to pad with +1, eliminating the zeros and creating a truly binarized CNN. This +1 padded BNN achieves a test error of 11.82% in Python and 12.27% in C++/FPGA, only slightly worse than the original.

For our FPGA implementation we used the 0 padded BNN, as the resource savings of the +1 padded version were not particularly relevant for the target device.

Source     Model                  Padding   Test Error
From [7]   –                      0         11.40%
Python     Default                0         11.58%
Python     no-bias                0         11.32%
Python     no-bias                +1        11.82%
C++        no-bias, fixed-point   0         11.46%
C++        no-bias, fixed-point   +1        12.27%

Table 2: Accuracy of the BNN with various changes — no-bias refers to retraining after removing biases from all layers and fixed-point refers to quantization of the inputs and batch norm parameters.

3.3 System Architecture
Our system architecture, shown in Figure 2(a), consists of three compute units, data and weight buffers, a direct memory access (DMA) system for off-chip memory transfer, and an FSM controller. The three compute units work on different types of layers: the FP-Conv unit for the (non-binary) first conv layer, the Bin-Conv unit for the five binary conv layers, and the Bin-FC unit for the three binary FC layers. Of the three, the Bin-Conv and Bin-FC units must handle different numbers of input and output fmaps, and (in the case of Bin-Conv) different fmap sizes.
[Figure 2 (diagrams): (a) the CPU and off-chip memory connect through DMA to on-chip data buffers A and B, parameter buffers, and the FP-Conv, Bin-Conv, and Bin-FC compute units; (b) the Bin-Conv unit streams fin input words per cycle through BitSel units and convolvers into integer buffers, followed by pooling, batch norm, and binarize logic that produces fout output streams.]

Figure 2: Architectural diagrams of our BNN accelerator — (a) system-level block diagram showing the three compute units, buffers, and how the accelerator is connected to the CPU and off-chip memory hierarchy; (b) architecture of the Bin-Conv unit with input and output parallelization factors fin = 2 and fout = 3. The unit can stream in two words per cycle and produce three output fmaps per invocation.
The storage of intermediate data in our accelerator differs from most existing designs. In full-precision CNN accelerators, the size of a single set of fmaps between two layers typically exceeds the size of FPGA on-chip storage. This necessitates the continuous transfer of fmaps to and from off-chip RAM. However, as Table 1 shows, the size of the largest set of fmaps in our BNN is only 128K bits, which easily fits on-chip even in smaller FPGAs. Our design uses two in-out data buffers A and B of equal size. One layer reads from A and writes its outputs to B; then (without any off-chip data transfers) the next layer can read from B and write to A. Thus, off-chip memory transfers are only needed for the input image, the output prediction, and loading each layer's weights.
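A minimal sketch of this ping-pong scheme follows; the buffer word type, the depth (2048 words, per Section 4), and the run_layer interface are placeholders for illustration only.

#include <cstdint>

constexpr int BUF_WORDS = 2048;                 // enough for the largest fmap set (Section 4)
static uint64_t bufA[BUF_WORDS];
static uint64_t bufB[BUF_WORDS];

// Placeholder for dispatching one layer to FP-Conv, Bin-Conv, or Bin-FC.
void run_layer(int layer, const uint64_t* src, uint64_t* dst) {
  (void)layer; (void)src; (void)dst;            // stub body for the sketch
}

void run_network(int num_layers) {
  for (int l = 0; l < num_layers; l++) {
    // Even layers read A and write B; odd layers read B and write A,
    // so no fmaps ever leave the chip between layers.
    const uint64_t* src = (l % 2 == 0) ? bufA : bufB;
    uint64_t*       dst = (l % 2 == 0) ? bufB : bufA;
    run_layer(l, src, dst);
  }
}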
Unlike the fmaps, there is only enough memory on-chip to store a portion of a layer's weights. Multiple accelerator invocations may be needed for a layer; in each invocation we load in a new set of weights and produce a new set of fmaps. The next invocation produces the next set of fmaps, and so on, until all output fmaps have been generated and stored in the on-chip data buffer. Invoking the accelerator requires passing it arguments such as pointers to the weights, the layer type and size, the fmap size, and whether pooling should be applied. Inside the accelerator, the controller decodes these inputs and coordinates the other modules.
3.4 Compute Unit Architectures
In our accelerator, each compute unit must store binarized data to the on-chip RAMs at the end of its execution. As Figure 1 reveals, the first operation of a conv or FC layer transforms the binary inputs to integers; we make sure each unit also performs the subsequent batch norm, pooling, and binarization before writing data out to the buffers. One of our design goals is to limit the amount of integer-valued intermediate data buffered inside each compute unit.
FP-Conv — The fixed-point conv unit utilizes the well-known line buffer architecture for 2D convolutions. Because this unit only targets a single layer, we hardwire it to handle a 3-channel 32×32 input. While the input pixels are 20-bit fixed-point, the weights are binarized, so we can replace the multiplies in the conv operation with sign inversions. We fully parallelize across the three input channels: each cycle we stream in three input pixels, add them to three line buffers, and compute a 3×3×3 convolution. The result is put through batch norm and binarization to produce one output bit per cycle. Greater parallelism in this unit is achievable, but the first conv layer takes up a very small portion of the overall runtime, and we focused our efforts elsewhere.

Bin-Conv — The binary conv unit is the most critical component of the accelerator, as it is responsible for the five binary conv layers which take up the vast majority of the runtime. The unit must maintain high throughput and resource efficiency while handling different input widths at runtime; our design targets widths of 8, 16, or 32, and can support larger power-of-two widths with minor changes. To efficiently compute a convolution, multiple rows of input pixels need to be buffered for simultaneous access. However, a standard line buffer (i.e., from video processing applications) is unsuitable for this task for two reasons:

1. A line buffer must be sized for the largest input fmap; in this case it must have a width of 32. This not only causes buffer under-utilization when the fmap is 16 or 8 wide, it also leads to a loss of throughput, as we can only perform as many convolutions per cycle as the input width.

2. A line buffer is designed to shift one pixel per cycle to always store the most recent rows. However, with binarized inputs we have access to not one but many lines of new pixels each cycle (for instance, a 32-bit word can hold 4 lines from an 8×8 fmap). This radically changes how we should update the line buffer.

In full-precision CNN accelerators, the size problem can be addressed by tiling, where input fmaps are processed in tiles which are always the same size. This is unsuitable in a BNN since the fmaps are typically very small in terms of number of bits. One possible solution to the second problem is to reorganize the data so each bit in an input word comes from a different fmap, and assign each bit to a separate line buffer. However, this requires an exorbitant number of line buffers and is not area efficient.
[Figure 3 (diagram): the BitSel module slices a 32-bit input word and inserts the slices into the 3-row variable-width line buffer; (a) shows one 32-wide line per row producing one 32-wide output line, (b) shows four 8-wide lines per row producing four consecutive 8-wide output lines.]

Figure 3: Example usage of the variable-width line buffer — we show how a 32-bit input word is divided up by BitSel and inserted into the VWLB. The line buffer has height 3 and width 32. Sliding a 3×3 conv filter across the VWLB produces 32 output pixels, ignoring edge effects. (a) For a 32-wide input fmap, each row of the VWLB stores one line and applying the conv filter produces one 32-wide output line; (b) for an 8-wide input fmap, each row of the VWLB stores four lines and applying the conv filter produces four consecutive 8-wide output lines, given the mapping of input lines to banks shown.
To address the above issues, we introduce two new modules: the BitSel module and the variable-width line buffer (VWLB). Figure 2(b) shows the basic structure of the Bin-Conv unit, whose execution proceeds in two phases. In the first phase, input fmaps from the on-chip buffers are streamed in on the left side, through the BitSel modules, and into the VWLBs. The Convolver modules compute the partial conv sums and accumulate them in the integer buffers. The BitSel is responsible for reordering the input bits so that the Convolver logic can be agnostic of the fmap width. In Figure 2(b), fin is the input parallelization factor — the Bin-Conv unit accepts fin input words per cycle and the data buffers are partitioned to match this rate. fout is the output parallelization factor — each Convolver applies fout 3×3 conv filters per cycle to the data in the VWLB and generates partial sums for fout different fmaps. The first phase ends when all input fmaps in the current layer have been processed. At this point each integer buffer contains a finished conv map. In the second phase we compute max pooling, batch norm, and binarization to produce fout binary output fmaps. Note that max pooling and binarization are non-linear operations, so we cannot apply them to partially finished conv maps and accumulate afterwards.
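The second phase can be modeled in software roughly as follows; the 2×2 window interface and the tie-breaking of the sign are assumptions of this sketch, and k and h are the folded batch norm parameters of Section 3.1.

#include <algorithm>

// Pool a 2x2 window of integer conv sums, then batch-normalize and binarize.
// Returns the output bit, with +1 encoded as 1 and -1 as 0.
int pool_bnorm_binarize(int s00, int s01, int s10, int s11, float k, float h) {
  int m = std::max(std::max(s00, s01), std::max(s10, s11)); // 2x2 max pooling on integers
  // Batch norm folded as y = k*x + h (Equation (4)); only the sign of y
  // survives binarization, so no further arithmetic is needed.
  return (k * static_cast<float>(m) + h >= 0.0f) ? 1 : 0;
}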
Figure 3 explains the operation of the BitSel and VWLB in greater detail. The diagram assumes we have a word size of 32 bits and a 3×3 conv filter, which requires a VWLB with three rows and 32 elements per row. We demonstrate how the VWLB works for input fmap widths of 32 and 8, and ignore edge padding for the sake of simplicity.
1. For a 32-wide input, each word contains exactly one line. Each cycle, the VWLB shifts up and the new 32-bit line is written to the bottom row. We can then slide the 3×3 conv window across the VWLB to generate one 32-bit line of conv outputs.

2. For an 8-wide input, each word contains four lines. We split each VWLB row into four banks, and map each input line to one or more VWLB banks. The mapping is done in such a way that sliding the conv window across the VWLB produces four consecutive 8-bit output lines. Each cycle the VWLB shifts both up and to the left.

The BitSel is responsible for slicing the input word and mapping the slices to the row banks. Because the smallest input width is 8, each slice and VWLB bank is sized at 8 bits. For a 32-wide input, BitSel maps four contiguous 8-bit slices to the bottom row. For an 8-wide input, the mapping is more complex, but still highly regular and can be computed in hardware with just adds and shifts. Each pixel in the output lines in Figure 3 is an integer conv sum, and each sum is accumulated at a different location in the integer buffer.
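For the simple 32-wide case the slicing can be modeled as below; this is our own sketch under an assumed bit ordering (low-order bits first), and the trickier 8-wide bank mapping is deliberately omitted because the text only states that it is regular and uses adds and shifts.

#include <cstdint>

constexpr int BANK_BITS = 8;       // the smallest fmap width
constexpr int BANKS_PER_ROW = 4;   // a 32-element row split into four 8-bit banks
constexpr int VWLB_ROWS = 3;       // rows needed for a 3x3 conv filter

// Shift the VWLB up by one row and fill the bottom row with the four
// contiguous 8-bit slices of a 32-bit input word (the 32-wide case).
void bitsel_32wide(uint32_t word, uint8_t vwlb[VWLB_ROWS][BANKS_PER_ROW]) {
  for (int r = 0; r < VWLB_ROWS - 1; r++)
    for (int b = 0; b < BANKS_PER_ROW; b++)
      vwlb[r][b] = vwlb[r + 1][b];                             // older rows shift up
  for (int b = 0; b < BANKS_PER_ROW; b++)
    vwlb[VWLB_ROWS - 1][b] = (word >> (b * BANK_BITS)) & 0xFF; // new bottom row
}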
The BitSel and VWLB provide three primary advantages: (1) the VWLB achieves full hardware utilization regardless of input width, (2) a new input word can be buffered every cycle, and (3) the BitSel deals with the various input widths by itself, allowing the actual buffer and convolution logic to be fixed. Note that the VWLB used in our design differs from Figure 3 in a few details. First, we have neglected edge padding. The actual VWLB contains two additional elements per bank to hold horizontal pad bits; vertical padding is handled by inserting lines of zeros. Second, because the pad bits are 0 rather than +1 or -1, we must make each element in the VWLB two bits instead of one. The conv operation is performed between the 2-bit data and 1-bit weights, and can be implemented as sign inversion and accumulate.

Bin-FC — The binary FC unit is comparatively simple. Each cycle we read in fin data words and an equal number of weight words; fin here is the input parallelization factor just as in Bin-Conv. Because there is no edge padding in an FC layer, the computations can be truly binary. We perform a dot product between the data and weight words by applying a bitwise XOR operation and then summing the resulting bits with a popcount. Similar to the Bin-Conv unit, we accumulate the sum in an integer buffer and apply binarization after all inputs have been processed. Note that the FC layers are typically bound by the memory bandwidth of the off-chip connection, rather than by the throughput of the accelerator.
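A word-level software model of this dot product is sketched below; the 64-bit word size matches Section 4, and the +1→1 / -1→0 encoding is an assumption of the example.

#include <bitset>
#include <cstdint>

// {-1,+1} dot product of one 64-bit data word and one 64-bit weight word:
// matching bits contribute +1 and differing bits contribute -1.
int fc_partial_sum(uint64_t data, uint64_t weights) {
  int mismatches = static_cast<int>(std::bitset<64>(data ^ weights).count()); // XOR + popcount
  return 64 - 2 * mismatches;   // (#matches) - (#mismatches)
}

The partial sums from successive words are accumulated in the integer buffer and binarized once the whole input vector has been consumed, mirroring the description above.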
Data Buffers — To accommodate multiple reads per cycle, the data buffers are partitioned into fin banks, and feature maps are interleaved across the different banks. Figure 4 shows an example with fin = 2 and four words per fmap. The data words are read sequentially by address, so a compute unit always accesses fin consecutive fmaps in parallel.
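Under this interleaving, a word's bank and offset can be computed as in the sketch below (our own illustrative formula: fmaps are assigned to banks round-robin and each fmap's words are stored sequentially within its bank, consistent with Figure 4).

// Locate word w of feature map f when fmaps are interleaved across fin banks.
struct BufAddr { int bank; int offset; };

BufAddr locate(int f, int w, int fin, int words_per_fmap) {
  BufAddr a;
  a.bank   = f % fin;                          // e.g., fin = 2: fmaps 0,2,... in bank 0; 1,3,... in bank 1
  a.offset = (f / fin) * words_per_fmap + w;   // sequential words within the bank
  return a;
}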
[Figure 4 (diagram): fmaps 0 and 2 are stored in Bank 0 and fmaps 1 and 3 in Bank 1, with words 0-3 of each fmap laid out sequentially; Convolver 0 and Convolver 1 are each fed from one bank.]

Figure 4: Example of data buffer banking — The compute unit and memory system have fin = 2. Each fmap contains four words which are laid out sequentially. The fmaps are interleaved across banks, and both Bin-Conv and Bin-FC benefit from this banking.
4. HLS Accelerator Implementation
Figure 5 shows the HLS pseudocode for the front half of the Bin-Conv unit, and demonstrates a key difference between BNN and CNN hardware design. For a CNN, the code typically loops over an fmap processing one pixel at a time; key design decisions include loop ordering and unroll factors (see [27] for a good example). In our BNN accelerator, the basic atom of processing is not a pixel but a word. The example code is designed to sustain a throughput of one word per cycle over the entire input feature map set. Each fmap consists of words_per_fmap words, a number which differs between layers. As it processes the input set, the code updates the weights on each new fmap and accumulates the conv results in outbuf. We call BitSel and conv inside the loop to instantiate the BitSel units and conv logic as shown in Figure 2(b). To increase the number of input streams we can tile the loop and unroll the inner loop body.

VariableLineBuffer linebuf;
ConvWeights wts;
IntegerBuffer outbuf;

for (int i = 0; i < n_input_words; i++) {
#pragma HLS pipeline

  // read input word, update linebuffer
  WordType word = input_data[i];
  BitSel(linebuf, word, input_width);

  // update the weights each time we
  // begin to process a new fmap
  if (i % words_per_fmap == 0)
    wts = weights[i / words_per_fmap];

  // perform conv across linebuffer
  for (int c = 0; c < LINE_BUF_COLS; c++) {
#pragma HLS unroll
    outbuf[i % words_per_fmap][c] +=
        conv(c, linebuf, wts);
  }
}

Figure 5: HLS pseudocode for part of the Bin-Conv unit — the pseudocode implements a pipeline which reads and performs convolution on one input word each cycle. Many details are left out; the goal is to illustrate how our design can be expressed in high-level code.
A key design decision here is the input word size, which controls the level of parallelism across the pixels of an fmap. To guarantee correctness, words_per_fmap must be an integer greater than zero; this constrains the word size to at most the size of the smallest input fmap (8 × 8 = 64 bits in our case). The word size restriction is not a significant limiting factor in our design, as 64 is already a very large parallelization factor (it means we perform 64 convolutions per cycle), and there are other sources of parallelism to exploit in the BNN. We chose a word size of 64 bits for the data buffers and sized each data buffer A and B at 2048 words, which is just enough to store the largest set of fmaps in the BNN.
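For concreteness, the per-layer constants implied by this choice are easy to check; the snippet below is only a sanity check over the numbers in Table 1, not accelerator code.

constexpr int WORD_BITS = 64;

constexpr int words_per_fmap(int width) {   // square binary fmaps of size width x width
  return (width * width) / WORD_BITS;
}

static_assert(words_per_fmap(32) == 16, "a 32x32 binary fmap occupies 16 words");
static_assert(words_per_fmap(16) == 4,  "a 16x16 binary fmap occupies 4 words");
static_assert(words_per_fmap(8)  == 1,  "an 8x8 binary fmap is exactly one word");
// Largest fmap set in Table 1: 128 fmaps of 32x32 = 128K bits = 2048 words,
// matching the chosen depth of data buffers A and B.
static_assert(128 * words_per_fmap(32) == 2048, "data buffer depth");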
We also explored different values for fin and fout in Bin-Conv. It was observed that both have roughly similar effects on execution time, but increasing fout has a more severe effect on total area. fin controls the number of BitSels and VWLBs, while fout controls the number of pooling/batch norm units and integer buffers. In terms of logic, a BitSel and a pooling/batch norm unit are similar, but each VWLB contains 32 × 3 2-bit registers while each integer buffer contains 32 × 32 12-bit registers. Thus, all else being equal, it is better to increase fin. This result shows the importance of minimizing the storage of intermediate values and only committing binarized data to memory.

We use Xilinx SDSoC as the primary design tool for our BNN application. SDSoC takes as input a software program with certain functions marked as "hardware". It invokes Vivado HLS under the hood to synthesize the "hardware" portion into RTL. In addition, it automatically generates the data motion network and DMA necessary for memory transfer between CPU and FPGA based on the specified software-hardware partitioning. We selected a DMA engine built for contiguous memory since it has the highest throughput, and a neural network's data and weights can be laid out contiguously. We used directives to ensure that data is only transferred on the first and last accelerator invocation; weights are transferred on every invocation.
5. Experimental Results
We evaluate our design on a ZedBoard, which uses a low-cost Xilinx Zynq-7000 SoC containing an XC7Z020 FPGA alongside an ARM Cortex-A9 embedded processor. We make use of Xilinx SDSoC 2016.1 as the primary design tool, which leverages Vivado HLS and Vivado to perform the actual HLS compilation and FPGA implementation. We compared our design against two server-class computing platforms: an Intel Xeon E5-2640 multicore processor (CPU) and an NVIDIA Tesla K40 GPU (GPU). We also compared against an NVIDIA Jetson TK1 embedded GPU board (mGPU). As BNNs are a recent development, our baseline applications are not as well optimized as CNN baselines (where implementations can be found in frameworks such as Caffe). The CPU and GPU baselines are adapted from code provided in [7]. The code leverages Theano, and calls OpenBLAS for CPU and CUDA for GPU. However, it does not perform bitwise optimizations, since they are not natively supported in Theano, and instead uses floating-point values binarized to -1 and +1. For the baselines we used the BNN model with no biases and with k and h, and on the GPU we always used the largest batch size.
Power measurement is obtained via a power monitor. We measured 4.5W idle and 4.7W max power on the ZedBoard power supply line when running our BNN. This indicates that the dynamic power consumption of the FPGA is very low.

Our FPGA accelerator obtains 15.1x better performance and 11.6x better throughput per Watt over mGPU, which has a similar power envelope. Against the x86 processor, it achieves a 2.5x speedup. While the binary conv layers were faster, the FC layers were slower, which is unsurprising as the FC layers are bound by external memory bandwidth. Versus the GPU, the FPGA is 8.1x worse in performance. But as expected, it has much lower power consumption and better throughput per Watt.

To show that the FC layers are indeed limited by memory bandwidth, we created a design where the FC computations are removed but the memory transfers are kept. The new execution time of the FC layers is within 5% that of the original, demonstrating that there is not much to gain by further parallelizing them beyond the current design.

Table 4: Performance comparison — Conv1 is the first FP conv layer, Conv2-5 are the binary conv layers, FC1-3 are the FC layers. A – indicates a value we could not measure. Numbers with * are sourced from datasheets. The last row shows power efficiency in throughput per Watt.

                 Execution time per image (ms)
                 mGPU    CPU     GPU     FPGA
Conv1            –       0.68    0.01    1.13
Conv2-5          –       13.2    0.68    2.68
FC1-3            –       0.92    0.04    2.13
Total            90      14.8    0.73    5.94
Speedup          1.0x    6.1x    123x    15.1x
Power (Watt)     3.6     95*     235*    4.7
imgs/sec/Watt    3.09    0.71    5.83    35.8
CNNs have also made great strides toward achieving near state-of-the-art accuracy on the ImageNet dataset [16, 28].
Table 5: Comparison of our work against state-of-the-art FPGA accelerators — GOPS counts multiplies and adds per second. * indicates values approximated from charts.

                    [23]         [19]        [25]        Ours
Platform            Stratix-V    Zynq        Zynq        Zynq
                    GSD8         7Z045       7Z020       7Z020
Capacity (kLUTs)    695          218.6       53.2        53.2
Clock (MHz)         120          150         100         143
Power (W)           19.1         9.6         –           4.7
Precision           8-16b        16b         –           1-2b
GOPS (conv)         136.5        187.8       –           318.9
GOPS (all)          117.8        137.0       12.73       207.8
kLUTs               120*         182.6       43.2        46.9
DSPs                760*         780         208         3
GOPS/kLUT           0.98         0.75        0.29        4.43
GOPS/Watt           6.17         14.3        7.27        44.2
While it may not be completely fair to compare GOPS between a binarized and a conventional network, it is currently the de facto standard practice for hardware accelerator studies to compare reduced- and full-precision implementations that use different data types.
6. Related Work
Our paper owes much to the groundbreaking work on BNNs in the machine learning community [6, 7]. These papers contain some discussion of the advantages of BNNs over CNNs for hardware, but to the best of our knowledge we are the first to present a working FPGA implementation.

There have been many studies on the design of CNN accelerators for FPGA. Zhang et al. [27] describe how to optimize an HLS design by reordering and tiling loops, inserting the proper pragmas, and organizing external memory transfers. Ensuing publications have mostly eschewed HLS in favor of RTL designs. Qiu et al. [19] propose an architecture that computes conv and FC layers on the same hardware, as well as dynamic fixed-point quantization. Their paper demonstrates an area-efficient accelerator for AlexNet on the Xilinx ZC706 board.

A related line of research focuses on creating CNN design compilers which can generate optimized hardware for a family of models. These works typically use a set of RTL modules combined with a design space exploration tool to find the optimal architectural parameters. Rahman et al. [20] propose a scalable array-based CNN accelerator with heavy input reuse. Motamedi et al. [17] use a roofline model for performance to guide hardware generation. Wang et al. [26] propose DeepBurning, which targets a variety of CNN architectures and performs data layout optimization.

OpenCL frameworks for deep learning on FPGA have also been proposed. Suda et al. [23] use parameterized OpenCL alongside analytical models for performance and resource, enabling a genetic algorithm to search for the optimal configuration. Venieris and Bouganis [25] study the use of synchronous dataflow to capture CNN workloads and use graph partitioning to control resource consumption.

There is also a great deal of research work on ASIC CNN co-processors. Among the most well known is the DianNao line of architectures [2]. The Eyeriss paper [3] contains a comprehensive study of popular dataflows for spatial architectures and derives an optimal one.

Our approach differs from existing work in two major ways: (1) we are the first to study BNNs for FPGA acceleration; (2) we make use of a C-based HLS methodology and propose design constructs to maximize throughput on different layers. Existing CNN accelerators on FPGA are not well-equipped to handle BNNs due to significant differences in compute and storage requirements and layer organization. Our final design differs greatly from previous work.

7. Conclusions and Future Work
We are the first to implement an accelerator for binarized neural networks on FPGA. BNNs feature potentially reduced storage requirements and binary arithmetic operations, making them well suited to the FPGA fabric. However, these characteristics also render CNN design constructs such as input tiles and line buffers ineffective. We introduce new design constructs such as a variable-width line buffer to address these challenges, creating an accelerator radically different from existing work. We leverage modern HLS tools to write our design in productive, high-level code, and our accelerator outperforms existing work in raw throughput, throughput per area, and throughput per Watt.

Future BNN work should focus on both algorithmic and architectural improvements. On the algorithmic side we would like to explore techniques to reduce model size. On the architectural side, one action item is to implement a low-precision network for ImageNet, which would involve a much larger and more complicated accelerator design.

Acknowledgements
This research was supported in part by DARPA Award HR0011-16-C-0037, a DARPA Young Faculty Award, NSF Awards #1337240, #1453378, #1512937, and a research gift from Xilinx, Inc. The Tesla K40 GPU used for this research was donated by the NVIDIA Corporation.
References
[1] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski, S. D. Brown, and J. H. Anderson. LegUp: An Open-Source High-Level Synthesis Tool for FPGA-Based Processor/Accelerator Systems. ACM Trans. on Embedded Computing Systems (TECS), 13(2):24, 2013.
[2] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar 2014.
[3] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. Int'l Symp. on Computer Architecture (ISCA), Jun 2016.
[4] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro. Deep Learning with COTS HPC Systems. Int'l Conf. on Machine Learning (ICML), pages 1337–1345, Jun 2013.
[5] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. High-Level Synthesis for FPGAs: From Prototyping to Deployment. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), Apr 2011.
[6] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations. Advances in Neural Information Processing Systems (NIPS), pages 3123–3131, 2015.
[7] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv e-print, arXiv:1602.02830, Feb 2016.
[8] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and D. P. Singh. From OpenCL to High-Performance Hardware on FPGAs. Int'l Conf. on Field Programmable Logic and Applications (FPL), pages 531–534, Aug 2012.
[9] M. Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software available from tensorflow.org.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv e-print, arXiv:1512.03385, Dec 2015.
[11] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv e-print, arXiv:1502.03167, Mar 2015.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv e-print, arXiv:1408.5093, 2014.
[13] V. Kathail, J. Hwang, W. Sun, Y. Chobe, T. Shui, and J. Carrillo. SDSoC: A Higher-level Programming Environment for Zynq SoC and Ultrascale+ MPSoC. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 4–4, Feb 2016.
[14] A. Krizhevsky and G. Hinton. Learning Multiple Layers of Features from Tiny Images, 2009. Master's Thesis, Department of Computer Science, University of Toronto.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[16] F. Li and B. Liu. Ternary Weight Networks. arXiv e-print, arXiv:1605.04711, May 2016.
[17] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi. Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks. Asia and South Pacific Design Automation Conf. (ASP-DAC), pages 575–580, Jan 2016.
[18] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. Chung. Accelerating Deep Convolutional Neural Networks Using Specialized Hardware. Microsoft Research, Feb 2015.
[19] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 26–35, Feb 2016.
[20] A. Rahman, J. Lee, and K. Choi. Efficient FPGA Acceleration of Convolutional Neural Networks using Logical-3D Compute Array. Design, Automation, and Test in Europe (DATE), pages 1393–1398, Apr 2016.
[21] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. European Conference on Computer Vision (ECCV), Oct 2016. arXiv:1603.05279.
[22] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv e-print, arXiv:1409.1556, Apr 2015.
[23] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao. Throughput-Optimal OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 16–25, Feb 2016.
[24] Theano Development Team. Theano: A Python Framework for Fast Computation of Mathematical Expressions. arXiv e-print, arXiv:1605.02688, May 2016.
[25] S. I. Venieris and C.-S. Bouganis. fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs. IEEE Symp. on Field Programmable Custom Computing Machines (FCCM), May 2016.
[26] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li. DeepBurning: Automatic Generation of FPGA-based Learning Accelerators for the Neural Network Family. Design Automation Conf. (DAC), page 110, Jun 2016.
[27] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 161–170, Feb 2015.
[28] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv e-print, arXiv:1606.06160, Jul 2016.