Accelerating Binarized Neural Networks:

Comparison of FPGA, CPU, GPU, and ASIC


Eriko Nurvitadhi, David Sheffield, Jaewoong Sim, Asit Mishra, Ganesh Venkatesh and Debbie Marr
Accelerator Architecture Lab, Intel Corporation

Abstract— Deep neural networks (DNNs) are widely used in data analytics, since they deliver state-of-the-art accuracies. Binarized neural networks (BNNs) are a recently proposed optimized variant of DNNs. BNNs constrain network weights and/or neuron values to either +1 or -1, each representable in a single bit. This leads to dramatic algorithmic efficiency improvements, due to the reduction in memory and computational demands. This paper evaluates the opportunity to further improve the execution efficiency of BNNs through hardware acceleration. We first propose a BNN hardware accelerator design. Then, we implement the proposed accelerator on an Arria 10 FPGA as well as a 14-nm ASIC, and compare them against optimized software on a Xeon server CPU, an Nvidia Titan X server GPU, and an Nvidia TX1 mobile GPU. Our evaluation shows that the FPGA provides superior efficiency over CPU and GPU. Even though CPU and GPU offer high peak theoretical performance, they are not as efficiently utilized, since BNNs rely on binarized bit-level operations that are better suited for custom hardware. Finally, even though the ASIC is still more efficient, the FPGA can provide orders of magnitude efficiency improvements over software, without having to lock into a fixed ASIC solution.

Keywords— Deep learning, binarized neural networks, FPGA, CPU, GPU, ASIC, data analytics, hardware accelerator.

I. INTRODUCTION

The proliferation of Internet technologies has led to an abundance of rapidly growing digital data, from sources such as social media, blogs, Internet-of-Things (IoT) applications, etc. Data analytics extracts knowledge from such data, often by using machine learning (ML) algorithms. In particular, deep neural networks (DNNs) have been widely adopted, as they show state-of-the-art accuracies for various analytics classification tasks (e.g., computer vision, speech, etc.).

With advances in DNNs, there is a trend towards deeper networks that consequently carry more network parameters and increased model size. For example, AlexNet [8] contains 60M parameters, which demands a storage size of 240MB when stored as 32-bit numbers.

Larger DNN models are challenging to execute efficiently. Especially in fully connected layers, where there is no data reuse, processing a larger model that does not fit in on-chip RAMs leads to off-chip DRAM accesses. Such accesses are very energy inefficient compared to on-chip operations (e.g., for 45nm CMOS [9], a 32-bit DRAM access requires 172x more energy than a floating point multiply). Moreover, performance becomes limited by the bandwidth available to access the model from DRAM. Batching multiple inputs together can help improve data re-use, but in practice only small batch sizes are tolerable due to real-time latency requirements in analytics servers [3]. For IoT platforms, real-time requirements will be even more stringent, and batching may not be feasible at all.

Binarized Neural Networks (BNNs) [1][2] have very recently been proposed to address the aforementioned challenge. A BNN offers a far more compact representation of network weights and neuron values than a normal DNN by constraining each value to either +1 or -1. As such, storage needs are dramatically reduced, since each weight can be stored in a single bit (i.e., +1 stored as 1, and -1 as 0). Furthermore, multiply operations can be replaced by bit-wise operations, thereby reducing computational demand as well. So far, BNNs have been shown to offer comparable accuracies to full-precision DNNs for some known datasets (e.g., CIFAR10), and they are actively being studied to improve accuracies for more datasets (e.g., ImageNet). However, while prior works [1][2] have offered in-depth algorithm studies and analyses of BNNs, we are not aware of any that has proposed a hardware accelerator for BNNs.

Neural network analytics workloads are deployed in a wide range of settings, from high-end servers in data centers for cloud-scale analytics to mobile platforms for Internet-of-Things (IoT) applications. In all cases, there is a strong need for extreme energy efficiency in addition to high performance. To this end, both cloud servers and IoT platforms have become heterogeneous in recent years, integrating hardware accelerators alongside general purpose CPUs to deliver significant execution efficiency for computations offloaded to these accelerators, while maintaining generality to execute the rest of the workloads. FPGAs, GPUs, and ASICs are the well-known accelerators available in the market today. In particular, FPGAs have become more widely adopted in cloud servers as well as IoT platforms. Leading technology companies are pushing towards integrating FPGAs into data centers (e.g., Intel Xeon+FPGA, Microsoft Catapult). There are also IoT platforms (e.g., the Altera SoC FPGA family) integrating embedded processor(s) and an FPGA in a single package.

This paper investigates the opportunities for accelerating BNNs. We make the following contributions. First, we propose a hardware accelerator architecture for BNNs. Second, we explore software enhancements for BNNs (e.g., replacing full-precision with binary operations) for CPU and GPU. Third, we evaluate our accelerators on a state-of-the-art Altera Arria 10 FPGA and a 14nm ASIC, and compare them against optimized software on a cloud server with an Intel Xeon CPU and an Nvidia Titan X GPU, and on an IoT platform with a mobile Nvidia TX1 GPU. We show that software optimized for BNNs delivers significant performance improvements over standard DNNs. Moreover, we show that hardware accelerators offer further order-of-magnitude efficiency improvements over optimized BNN software implementations on CPU and GPU, since they take better advantage of BNN bitwise data formats and operations.


The rest of the paper is organized as follows. Section II gives background on ML analytics and BNNs. Section III presents the proposed BNN accelerator. Section IV details the BNN software optimizations on CPU and GPU. Section V presents our evaluation results. Finally, Sections VI and VII offer related work and concluding remarks, respectively.

II. BACKGROUND

A. Machine Learning for Data Analytics

Classification vs. Training. Many data analytics workloads rely on machine learning (ML) algorithms. A typical ML setup for data analytics consists of two phases. First, during the training phase, a known set of data samples is fed into an ML algorithm, which then creates a model with predictive power. Then, in the classification phase, this model is used by the ML algorithm to make predictions for any new given data samples. This paper focuses on binarized neural networks (BNNs) for the classification phase.

Batched Classification. In the classification phase, a popular optimization is to process a batch of multiple input samples together to improve data reuse and throughput. However, batching increases processing latency, since a batch of outputs is produced at a time instead of a single output at a time. Moreover, batching can increase implementation complexity, due to the need to group incoming requests into batches and schedule them properly for processing. In practice, it can be impractical to use large batch sizes. E.g., in the commercial analytics service based on neural networks in [3], ~90% of the time there are only up to 4 inputs that can be grouped together (batch size of 4), with a maximum of 10 inputs (batch size of 10). This is due to the need to meet stringent processing latency constraints. This paper considers normal as well as batch mode in our evaluations.

Fig. 1. In binarized neural networks, the matrix x vector operation to compute each network layer can be replaced by xnor and bit counting, because weights and neurons are constrained to either +1 or -1, each representable in 1 bit.

B. Binarized Neural Networks (BNNs)

In a deep neural network, a fully connected layer performs the following computation:

    vo = f(W · vi + b)                                   (1)

where vi is a vector of input neurons, W is a matrix of the network weights, b is the bias, and vo is the vector of output neurons for the layer. f is the activation function, such as the Rectified Linear Unit (ReLU). Optionally, f may also include normalization (e.g., batch normalization [10]) prior to applying the activation function. Often, b can be merged into vi. Figure 1(a) shows an example DNN layer computation as a W matrix x vi vector operation.

There has been a recent trend towards deeper networks with more parameters, since such networks can provide better accuracies. As such, the sizes of W, vi, and vo have become noticeably large. For example, one of the fully connected layers in AlexNet [8] and VGG [11] uses a 4K x 4K weight matrix (W). When each weight is represented as a 32-bit number, storing the W matrix would require 64MB of storage. In practice, processing such a model efficiently is very challenging, since it does not fit in the on-chip RAMs of a typical system. Hence, some or most of the model has to reside in DRAM memory, which is power consuming and has much lower bandwidth than on-chip RAMs, thereby imposing performance constraints. As stated earlier, DRAM accesses are also significantly more energy consuming than on-chip operations.

Binarized neural networks (BNNs) have the potential to address this issue. BNNs [1][2] have been proposed recently to improve the efficiency of standard neural networks. In a BNN, each network weight and neuron value is constrained to one of only two possible values, +1 or -1. As such, it can be represented using a single bit. Therefore, BNNs require significantly less storage than standard DNNs. In our previous example of a 4K x 4K weight matrix, instead of needing 64MB of storage with a 32-bit number representation, a binarized weight matrix requires only 2MB of storage (i.e., 32x less). BNNs also improve computation efficiency, as discussed next.

There are three types of computations in a BNN.

Binarized Matrix x Full-Precision Vector. In the first layer, the input neurons represent the input sample data. Thus, they cannot be binarized. So, in this case, vi is still represented using full 32-bit floating or fixed point. Each weight in a binarized weight matrix (Wb), however, is a 1-bit value. Thus, the computation for the first layer is a multiplication of 32-bit vi against binarized Wb. This operation can be done efficiently by adjusting the sign bit of vi against the 1-bit weight of Wb: if they are of the same sign, the output keeps the sign bit; otherwise, the output has the opposite sign.

Binarized Matrix x Binarized Vector. Since the activation function in a BNN [1][2] produces a +1 or -1 value, the neurons (vi and vo) after the first BNN layer are representable as 1-bit values. As such, the computation multiplies a binarized vector of input neurons (vib) against a binarized weight matrix Wb. Such an operation can be done using xnor and a variant of a population count (pcnt), thereby eliminating the need for full-precision operations. Figure 1(b) illustrates how a matrix x vector operation of +1 and -1 values can be binarized and computed using xnor and pcnt.

Normalization and Activation Function. Lastly, normalization and the activation function are applied to finalize the output neurons. It has been recommended [2] to use batch normalization [10] with BNNs, which involves applying several constant parameters obtained from the training phase (i.e., γ, β). For the activation function, ReLU is very commonly used, and is also used in BNNs [2]. As such, this paper uses ReLU.
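To make the xnor/pcnt mapping of Figure 1(b) concrete, the following minimal C sketch (our illustration only; the helper names are not from the paper or the BNN software) packs 32 values from {+1, -1} into a 32-bit word with +1 stored as 1 and -1 as 0, and then computes a 32-element dot product with a single xnor and population count, using the identity dot = 2 x popcount(xnor(wb, xb)) - 32:

    #include <stdint.h>

    /* Pack 32 values from {+1,-1} into one 32-bit word (+1 -> 1, -1 -> 0). */
    static uint32_t pack32(const int8_t *v) {
        uint32_t w = 0;
        for (int i = 0; i < 32; i++)
            if (v[i] > 0) w |= 1u << i;
        return w;
    }

    /* Dot product of two packed {+1,-1} vectors: matching bits contribute +1,
       differing bits contribute -1, so dot = #matches - #mismatches
                                            = 2*popcount(xnor) - 32.          */
    static int dot32(uint32_t wb, uint32_t xb) {
        return 2 * __builtin_popcount(~(wb ^ xb)) - 32;
    }

A full binarized matrix x vector product is then one such xnor/popcount step per 32-weight chunk per output neuron; __builtin_popcount is the GCC/Clang population-count builtin, standing in here for the pcnt variant discussed above.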

III. HARDWARE ACCELERATOR

We propose a hardware accelerator architecture for BNNs. It supports all the operations needed to process arbitrary BNNs, and it is especially designed to realize the efficiency benefits of BNNs. It contains a scalable number of processing elements (PEs), along with many distributed on-chip RAMs. The network parameters (e.g., binarized weights, normalization constants) are kept in these on-chip RAMs and supplied to the many PEs performing the computation in parallel. As a result, the many on-chip RAMs deliver sufficient bandwidth to the PEs to achieve high throughput at extreme efficiency. This section first describes the architecture of the proposed accelerator. Then, it details implementations of this architecture on an Altera Arria 10 FPGA as well as a 14nm ASIC.

A. Accelerator Architecture

Architecture Details. The high-level architecture of the proposed BNN accelerator is shown in Figure 2(a). The architecture consists of a number of processing elements (PEs). It can be scaled up (or down) by adding more (or fewer) PEs. Each PE works on computing either a single full-precision neuron value or multiple binarized values in a packed format. The PEs are connected to on-chip RAM buffers, which are used to keep the input and output neuron values, as well as temporary values, for the BNN layers being processed. The data management unit (DMU) handles the movement of data in and out of the accelerator. It brings in the input neuron values and writes out the final output neuron values. It also loads network parameters into the internal PE RAMs.

Fig. 2. The proposed accelerator for binarized neural networks (BNNs).

The PE internal design is shown in Figure 2(b). It consists of a local RAM that keeps the network weights. Each weight is 1 bit. In our PE design, we pack 32 weights into a 32-bit value for efficient processing. The RAM also keeps initialization (e.g., 0, b, b-μ) and batch normalization (i.e., β, γ) parameters. A PE also contains a multiplier unit (MUL), an adder unit (ADD), an accumulator register (ACC), and an AF/I2F unit. To cover all the BNN operations discussed earlier, the PE supports both full-precision and binarized operations. However, since binarized operations are more performance critical, and only a few BNN operations rely on full precision, we chose to evaluate the more efficient fixed point for full-precision support in this paper (i.e., instead of floating point).

The MUL unit supports both full-precision fixed point (FMUL) and binarized multiplication (BMUL) operations. The datapath to support BMUL is shown in Figure 2(c). It consists of an xnor unit, as well as a set of look-up tables and adders to perform the specialized population count needed for BMUL. The PE ADD unit is a full-precision adder, used either to accumulate the integer BMUL output or full-precision results from the first-layer computation or batch normalization.

The AF/I2F unit applies transformations to the accumulated value prior to writing it to the output RAM buffer. These transformations include applying the activation function (we use ReLU in this study) and converting integer to fixed point.
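As a software-level illustration of the two MUL modes just described (our sketch and naming, not the accelerator RTL), the first-layer FMUL/FADD path amounts to a conditional sign flip of the fixed-point input followed by an accumulate, since the weight is only +1 or -1; the binarized BMUL path corresponds to the xnor/popcount sketch at the end of Section II:

    #include <stdint.h>

    /* First-layer multiply-accumulate: one fixed-point input neuron against
       one 1-bit weight (+1 stored as 1, -1 as 0). A weight of +1 keeps vi,
       a weight of -1 negates it; the product is then accumulated, standing
       in for the PE's full-precision ADD unit. */
    static int32_t fmul_fadd(int32_t acc, int32_t vi_fixed, uint32_t w_bit) {
        int32_t prod = (w_bit & 1u) ? vi_fixed : -vi_fixed;
        return acc + prod;
    }

In the PE, the same adder that accumulates these first-layer products also accumulates the integer BMUL outputs for the binarized layers.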

Accelerator Operations. The proposed accelerator supports all the operations needed to process BNNs. Figure 3 illustrates the sequence of PE operations when processing a BNN layer. They work as follows.

Fig. 3. Sequence of operations that a processing element takes to process a BNN layer. (a) Load initial ACC constant. (b) Multiplication of input neurons against weights. (c)(d) Normalization and activation function.

First, an initialization parameter is loaded into the ACC register. The initialization parameter is the constant offset to be applied to the output neuron values. In a typical setup where a BNN layer includes a bias node and utilizes batch normalization at its output [10], the offset would be b-μ. This parameter can be adjusted for other BNN variants. For example, if batch normalization is not used, then this could be set to the bias parameter b. Further, if bias is not used, this could be set to 0.

Second, the input neuron values to the layer are multiplied against the network weights. For the first BNN layer, the input neurons are fixed point. Hence, a PE multiply-accumulates a single neuron value with a single weight at a time (i.e., FMUL and FADD). For the other layers (hidden and output layers), the input neuron values and the weights are binarized (a single bit each). Therefore, the PE can multiply-accumulate a set of packed weights and neuron values at a time (i.e., BMUL and integer ADD). In our study, we pack 32 weights and neuron values together into 32-bit chunks, so a PE can perform 32 binarized multiply-accumulates at a time. This improves efficiency and speeds up computation: relative to a 32-bit representation of weights and neurons, this means a 32x speedup in multiply-accumulate computation. The accumulated results are then written to PE temporary buffers. If this is the first BNN layer, no data transformation is needed, and the AF/I2F unit is set to simply pass the result through to be written out. For the other layers, the accumulated result is an integer, and the AF/I2F unit is set to convert it into fixed point (I2F operation).

To produce the final output neurons for the layer, the ACC is loaded with the batch normalization parameter β (Figure 3(c)). Then the accumulated result that was written out to the temporary buffer is read back into the PE. It is multiplied against the other batch normalization parameter γ and accumulated with the β that was loaded into the ACC earlier. The updated ACC value is then fed into the AF/I2F unit, where the activation function (AF) is applied to produce the final neuron output. The final output is written back to the PE buffers (Figure 3(d)).
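Putting the steps of Figure 3 together, a simplified C model of one PE producing one binarized-layer output neuron might look as follows (a sketch under our own naming, not the RTL; plain floats stand in for the PE's fixed-point arithmetic, and ReLU is used as elsewhere in this paper):

    #include <stdint.h>

    /* wb/xb: packed 32-bit chunks of binarized weights and input neurons.
       init = b - mu; gamma/beta = batch normalization parameters.        */
    static float pe_layer_output(const uint32_t *wb, const uint32_t *xb,
                                 int n_chunks, float init,
                                 float gamma, float beta) {
        /* (a) load the initialization constant into ACC */
        float acc = init;

        /* (b) BMUL + ADD: 32 packed multiply-accumulates per step */
        for (int c = 0; c < n_chunks; c++)
            acc += 2 * (int)__builtin_popcount(~(wb[c] ^ xb[c])) - 32;
        float temp = acc;        /* written out to the PE temporary buffer */

        /* (c) reload ACC with beta, read temp back, scale by gamma, add */
        acc = beta + gamma * temp;

        /* (d) activation function (ReLU here) gives the final output neuron */
        return acc > 0.0f ? acc : 0.0f;
    }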
B. Implementations on FPGA and ASIC

For evaluation, we developed a Verilog RTL implementation of the BNN accelerator detailed in the previous sub-section. We used the BNN software from [2] as a functional reference. The RTL is parameterizable to facilitate design space exploration. For example, a parameter can be set to output an RTL instance with an arbitrary number of PEs, which we use to scale various design instances up and down for our study. From this parameterized Verilog RTL, we map our accelerator architecture onto FPGA and ASIC, which we describe in further detail below.

FPGA. FPGA technologies have advanced rapidly. There are increasing numbers of on-chip RAMs, hard DSPs for arithmetic operations, and reconfigurable fabric resources in newer FPGAs. As such, FPGAs have the potential to offer very efficient BNN accelerator implementations. The compact binarized weights for interesting problem sizes can fit in the many distributed on-chip FPGA RAMs, which deliver an abundance of on-chip bandwidth to the reconfigurable fabric and DSPs to perform high-throughput computation on packed binarized neuron and weight values.

This paper targets a high-end Altera Arria 10 FPGA, which contains ~6MB of on-chip RAMs (i.e., 2713 M20K resources) and 1518 hard DSP units. Note that while Arria 10 is the latest Altera family available today, the next-generation Stratix 10 family is slated for release soon. Stratix 10 will offer up to ~28MB of on-chip RAMs, ~5K DSPs, and higher frequency. Thus, we expect a dramatic increase in FPGA performance in the near future when Stratix 10 becomes available.

In our evaluation, we start by using our parameterized Verilog RTL to produce a small design instance (e.g., a few PEs). Then, we increase the design parameters to scale up, until we can no longer fit the design onto the FPGA. This largest design is used to represent a high-performance design for server applications. Additionally, we also study a smaller-scale design for IoT applications.

We use Altera Quartus Prime to do our synthesis and mapping to the FPGA. To calculate power estimates for the FPGA, we use Altera's PowerPlay Early Power Estimator tool [13]. We check to ensure that the RTL is written such that the tool infers the appropriate FPGA resources; e.g., on-chip PE RAMs are mapped to M20Ks, and the full-precision multiplier units are mapped to DSPs.

The largest design we can fit on our target Arria 10 FPGA contains 1024 PEs and ~4MB of on-chip RAMs. We also chose another smaller-scale design to study, which contains 64 PEs. The specifications for these designs (FPGA64, FPGA1024) are shown in Figure 4(a). In FPGA1024, while we are able to utilize all the DSPs in the Arria 10, we are not able to use all the on-chip RAMs (M20Ks) due to routing constraints.

Fig. 4. FPGA and ASIC accelerators under study. (b) shows the ASIC64 design placed and routed on 14nm technology. Each color is a 16-PE tile.

In Figure 4(a), we also report peak throughput in tera-operations per second (TOP/sec). This represents 1-bit multiply and accumulation operations on network weights and neurons. It is calculated as follows. As an example, the FPGA1024 design contains 1024 PEs, and each PE performs a 32-bit packed weight calculation in parallel in a pipelined fashion to retire 32 new results each cycle. So, at a 150MHz frequency, the peak throughput is 1024 PEs x (32 bits packed x (1 multiply + 1 accumulate)/PE) x 150M operations per second. This results in 9.8 TOP/sec. Such a high peak throughput is feasible due to the significant efficiency benefit of binarization.

ASIC. For ASIC evaluation, we study design instances with 64 and 256 PEs. These designs are synthesized using an Intel 14nm ASIC flow, from which the area and power estimates are obtained. Both designs meet the target frequency of 1 GHz. Memory elements are modeled using CACTI. The summary of both implementations is provided in Figure 4(a). Figure 4(b) shows a place-and-routed 64-PE design (i.e., ASIC64). In the figure, each of the four tiles in the design is highlighted with a different color, where each tile contains 16 PEs.

In ASIC64, since the design runs at 1GHz, the 64-PE design can deliver a peak throughput of 64 PEs x (32 bits packed x (1 multiply + 1 accumulate)/PE) x 1G operations per second, which results in 4 TOP/sec.

Scaled accordingly, the 256-PE design can deliver a peak throughput of 16 TOP/sec. In both designs, the on-chip RAMs account for a non-trivial portion of the total chip power and area.

IV. SOFTWARE ON CPU AND GPUS

To evaluate the effectiveness of the proposed hardware accelerator architecture, we compare the FPGA and ASIC implementations against a variety of optimized software implementations on CPU and GPU platforms. For all the platforms, we evaluate optimized software implementations of baseline SGEMV for standard neural networks as well as binarized GEMV for BNNs.

A. Baseline SGEMV/SGEMM on CPU/GPUs

For CPU evaluation, we use a high-performance 2.3 GHz Intel Xeon E5-2699v3 server (i.e., Haswell-EP). It has 90 MB of aggregate LLC and 36 physical cores. For baseline SGEMV, we enabled MKL and OpenMP, ensuring that the software takes advantage of multi-threaded execution across the 36 physical cores. Runtime and power measurements are done using performance counters.

For GPU evaluation, we use a high-performance Nvidia Titan X GPU, as well as the Nvidia mobile GPU (mGPU) on the TX1 embedded development platform. For baseline SGEMV on GPU, we use the cuBLAS libraries. We measure power using the nvidia-smi utility on the Nvidia Titan X. Since the TX1 does not provide such a facility, we measured power using a Kill-A-Watt power meter. We ran the software in a loop until the wall power measurement stabilized. To get the best performance on the TX1, we forced all clocks to run at maximum speed (i.e., ~1 GHz), as the default clock management scheme provided sub-optimal performance (i.e., it ran at ~70MHz).

B. Binarized GEMV/GEMM on CPU

For binarized GEMV, our Haswell-EP platform has built-in instructions for population count, exposed through the SSE4a extension to the x86 ISA. These instructions are popcnt for 32-bit operands and popcntl for 64-bit operands. While included in the SSE4a set of instruction extensions, they are not SIMD instructions and only execute on scalar register values. On the Haswell microarchitecture, a population count instruction can be initiated every cycle, yielding 64 "binary ops" per cycle. In contrast, a well-tuned single precision implementation of matrix multiply using AVX2 FMA instructions can retire at most 32 flops per cycle, or half the throughput of the population-count-based binary operation. Therefore, a tuned binary matrix multiply implementation has a performance roofline of 2x over a tuned single precision implementation of matrix multiply.

As binary matrix multiply is not included in standard BLAS packages, we wrote our own implementation (shown in Figure 5). Our implementation uses an outer level of cache blocking and an inner level of register blocking in order to achieve compute-bound performance. The outer block is sized to fit in the 256 kB L2 cache of our Haswell CPU. In the code listing shown in Figure 5, we explicitly copy and transpose the outer cache block into the 2D array "bt" in order to achieve better memory locality and increase the cache hit rate.

The inner block is sized to fit in CPU registers ("ct_xx" in the code listing). We experimentally determined that a 4x2 register block yields the highest performance on our platform, as larger register block sizes incur spilling, while smaller block sizes do not have enough register reuse. Finally, we use OpenMP to parallelize across CPU cores.

    /* Binarized GEMM, C = A x B; each word of A and B packs 32 binarized values. */
    #pragma omp parallel for
    for (int i = 0; i < n; i += fBlkI)
     for (int j = 0; j < m; j += fBlkJ)
      for (int k = 0; k < _k; k += fBlkK) {
       /* copy and transpose the outer cache block of B into bt (sized for L2) */
       for (int jj = 0; jj < fBlkJ; jj++)
        for (int kk = 0; kk < fBlkK; kk++)
         bt[jj][kk] = B[(k + kk)*m + j + jj];
       /* 4x2 register-blocked inner kernel */
       for (int ii = 0; ii < fBlkI; ii += fBlkII)
        for (int jj = 0; jj < fBlkJ; jj += fBlkJJ)
         for (int kk = 0; kk < fBlkK; kk += fBlkKK) {
          ct_00 = C[(i+ii+0)*m+j+jj+0]; ct_01 = C[(i+ii+0)*m+j+jj+1];
          ct_10 = C[(i+ii+1)*m+j+jj+0]; ct_11 = C[(i+ii+1)*m+j+jj+1];
          ct_20 = C[(i+ii+2)*m+j+jj+0]; ct_21 = C[(i+ii+2)*m+j+jj+1];
          ct_30 = C[(i+ii+3)*m+j+jj+0]; ct_31 = C[(i+ii+3)*m+j+jj+1];
          /* binary ops: xor the packed words, then population count */
          for (int kkk = 0; kkk < fBlkKK; kkk++) {
           b0 = bt[jj+0][kk+kkk]; b1 = bt[jj+1][kk+kkk];
           ct_00 += popcnt(A[(i+ii+0)*_k + k+kk+kkk] ^ b0);
           ct_01 += popcnt(A[(i+ii+0)*_k + k+kk+kkk] ^ b1);
           ct_10 += popcnt(A[(i+ii+1)*_k + k+kk+kkk] ^ b0);
           ct_11 += popcnt(A[(i+ii+1)*_k + k+kk+kkk] ^ b1);
           ct_20 += popcnt(A[(i+ii+2)*_k + k+kk+kkk] ^ b0);
           ct_21 += popcnt(A[(i+ii+2)*_k + k+kk+kkk] ^ b1);
           ct_30 += popcnt(A[(i+ii+3)*_k + k+kk+kkk] ^ b0);
           ct_31 += popcnt(A[(i+ii+3)*_k + k+kk+kkk] ^ b1);
          }
          /* write the register block back to C */
          C[(i+ii+0)*m+j+jj+0] = ct_00; C[(i+ii+0)*m+j+jj+1] = ct_01;
          C[(i+ii+1)*m+j+jj+0] = ct_10; C[(i+ii+1)*m+j+jj+1] = ct_11;
          C[(i+ii+2)*m+j+jj+0] = ct_20; C[(i+ii+2)*m+j+jj+1] = ct_21;
          C[(i+ii+3)*m+j+jj+0] = ct_30; C[(i+ii+3)*m+j+jj+1] = ct_31;
         }
      }

Fig. 5. CPU implementation of binarized matrix multiply (C = A x B).

C. Binarized GEMV/GEMM on GPUs

We evaluate a binary matrix multiply kernel (xnor_gemm) from BinaryNet [2]. The CUDA implementation uses shared memory blocking to reduce the number of accesses to global memory. For a matrix multiplication C = A x B, each thread block loads sub-matrices of A and B from global memory into shared memory. Then, each thread in a thread block computes one element of the sub-matrix of C using xnor and __popc() operations. The evaluated xnor_gemm kernel is similar to the blocked version of matrix multiply in the CUDA Programming Guide, except for the code that computes the product C.

The population count operation is natively supported on Nvidia GPU devices via the __popc() (for 32-bit operands) and __popcll() (for 64-bit operands) intrinsic functions. These are used directly in the CUDA kernel, and the CUDA compiler maps __popc() to a single instruction and __popcll() to a few instructions.

On our evaluated GTX Titan X platform, 32 32-bit population count operations can be issued every cycle per Streaming Multiprocessor (SM), yielding 1024 "binary ops" per cycle. As the GTX Titan X can issue up to 128 32-bit floating-point operations every cycle per SM, the performance roofline of "binary ops" over FP32 operations is 4x.
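For the non-batched case, binarized GEMV follows the same idea as the GEMM kernel in Figure 5 with a single input vector; a minimal sketch (ours, not the code used for the measurements in this paper) using the 64-bit population-count builtin is shown below. Unlike the benchmark kernel in Figure 5, which accumulates raw popcounts of xor-ed words, this version maps the counts back to the signed +1/-1 dot product:

    #include <stdint.h>

    /* y = A * x for a binarized m x k matrix A (row-major, 64 weights per
       64-bit word, k a multiple of 64) and a packed binarized vector x.   */
    void bgemv(const uint64_t *A, const uint64_t *x, int32_t *y, int m, int k) {
        int words = k / 64;
        #pragma omp parallel for
        for (int i = 0; i < m; i++) {
            int32_t matches = 0;
            for (int w = 0; w < words; w++)
                matches += (int32_t)__builtin_popcountll(~(A[(int64_t)i * words + w] ^ x[w]));
            y[i] = 2 * matches - k;   /* #matches - #mismatches */
        }
    }

When the compiler is allowed to use the popcnt instruction (e.g., with -mpopcnt or -march=native on GCC/Clang), each inner-loop iteration covers 64 binary multiply-accumulates.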

V. EVALUATION

We studied a set of neural network layer configurations that are used by popular networks, such as AlexNet [8], VGG [11], and NeuralTalk (NT) [12]; see Table I. We focus on the fully connected layers, which contain most of the weights in the network and are the most challenging due to the large model size. As stated in the introduction, larger models in fully-connected layers are challenging to execute efficiently, since they do not fit on-chip, necessitating off-chip DRAM accesses that are very energy inefficient and impose a performance limit set by the DRAM bandwidth available to access these models.

TABLE I. NEURAL NETWORK LAYER CONFIGURATIONS UNDER STUDY.

    Name         Outputs   Inputs   Binarized model size (MB)
    Alex/VGG 7   4096      4096     2.00
    Alex/VGG 8   1000      4096     0.49
    NT-We        600       4096     0.29
    NT-Wd        8791      600      0.63
    NT-LSTM      2400      1201     0.34

(The binarized model size is Outputs x Inputs bits; e.g., 4096 x 4096 bits = 2.00MB for Alex/VGG 7.)

For FPGA, we only evaluate layers that can fit in the ~4MB of RAMs that our FPGA design could use. We evaluate performance and performance/watt. For the high-performance platforms (Xeon CPU, Titan X GPU), we also evaluate batched execution with a batch size of 10, as suggested in [3]. For the non-binarized software evaluation, the batched experiments call CPU or GPU SGEMM kernels, while the non-batched experiments call SGEMV kernels.

The evaluation results are shown in Figures 6, 7, and 8. Figures 6 and 7 show performance and performance/watt, relative to non-batched baseline CPU software. Figure 8 depicts the fraction of peak performance that is achievable, indicating platform utilization; e.g., 50% means only half of the peak performance available in the platform was achievable during our experiments.

A. CPU versus GPU

In normal (no batching) mode, the CPU performs comparably well to the GPU, as shown in Figure 6. On average, the non-batched CPU has ~90% better performance than the non-batched GPU. Among the five network layer configurations, the GPU performs almost comparably to the CPU only for Alex/VGG 7, where the number of outputs is equal to the number of inputs (i.e., the weight matrix is square). In the other cases, the GPU is always noticeably inferior to the CPU.

Fig. 6. Performance relative to baseline software on CPU. I.e., above 1 means speedup, while less than 1 means slowdown.

Fig. 7. Performance/Watt relative to baseline software on CPU.

Fig. 8. Achieved performance relative to peak. E.g., 50% means only half of peak performance is realized.

Even though the GPU has a much higher peak performance, it is extremely underutilized (i.e., ~1% utilization on average, as shown in Figure 8). The CPU is also underutilized (~6%), but not as much as the GPU. The low utilization is due to the challenge of extracting fine-grained parallelism out of the weight matrices. Without batching, there is only a single set of inputs (i.e., a vector) being multiplied against the weight matrix. Thus, there is limited data re-use. Unless the platform can extract sufficient fine-grained parallelism from this single matrix x vector operation to utilize the available platform resources, it is inevitable that the platform will suffer from underutilization.

For CPU and GPU, when scaling up to multiple software threads, if there is only a small amount of data to process, the threading overhead can end up being dominant.

For the mobile GPU, as Figure 6 shows, its performance is much worse than that of a server CPU (i.e., ~40x worse on average). The mobile GPU also suffers from extreme underutilization (~1% on average, as shown in Figure 8), as in the case of the high-performance GPU. However, the mobile GPU has much lower peak performance.

Consequently, the CPU achieves a better overall performance/watt than both the high-performance GPU and the mobile GPU, as depicted in Figure 7. As such, for non-batched neural networks, CPUs can be a better overall solution than GPUs, delivering comparable performance while achieving better energy efficiency.

B. Impact of Batching Multiple Inputs/Outputs

Batching improves performance as well as utilization for both the CPU and GPUs. This is because batching enables more data reuse, since there are multiple input vectors (forming an input matrix) to be multiplied against the weight matrix.

As shown in Figure 6, batching improves performance by ~80% for the CPU and ~5.8x for the GPU. Accordingly, performance/watt improves by a similar degree, as shown in Figure 7.

Batching improves CPU utilization by almost 2x (from 6% to 10%) and GPU utilization by 7x (from 1% to 7%). Even though batching leads to noticeable improvements in utilization, at 10% utilization for the CPU and 7% for the GPU, these platforms are overall still underutilized.

Furthermore, as explained in Section II, batching increases latency. So, if possible, a solution that improves performance without necessitating batched operation would be preferable.

C. Impact of Binarization

Binarization has the potential to deliver significant performance improvements, since it reduces storage requirements as well as computational demands. For the CPU and GPU, smaller models are more cacheable and can be kept on-chip. Further, the binarized GEMV operation requires less computation than SGEMV, as discussed earlier.

Indeed, our results in Figure 6 show that binarized CPU software has 5x better performance than the baseline CPU. For the GPU, binarization improves performance by ~11x.

Binarization leads to larger speedups than batching. For example, while batching delivers an 80% performance boost for the CPU, binarization offers a 5x improvement, which is 6x better than batching. The GPU shows a similar trend. Moreover, binarized operations can be batched as well; further speedups can be achieved by combining batching and binarization.

Hence, one can choose to do binarization only, which delivers improvements better than batching while meeting low latency requirements. Or, one can combine binarization with batching to achieve better throughput, if latency constraints are not as stringent.

Accordingly, as shown in Figures 7 and 8, binarization leads to improvements in performance/watt as well as utilization.

D. Hardware Acceleration

Beyond software optimizations, both the FPGA and ASIC accelerators deliver even further improvements in performance and performance/watt. As shown in Figure 6, our FPGA and ASIC accelerators deliver one to two orders of magnitude speedups over the baseline CPU. The high-performance FPGA1024 design delivers almost 50x performance improvement over the baseline CPU.

These large performance speedups from the accelerators are due to the custom hardware design for BNNs, which consists of PEs that are well integrated with distributed on-chip RAMs to deliver neural network parameters to the PEs at a sufficiently high bandwidth to keep the PEs well utilized. The PE is also equipped with native support for binarized operations.

Indeed, as shown in Figure 8, our accelerators achieve significantly higher utilization (i.e., ~75%) than the software implementations on the CPU and GPU. As such, even though our accelerators have lower peak performance than the high-performance GPU, they are able to utilize most of it, resulting in significant performance improvements over the GPU.

As depicted in Figure 7, the energy efficiency improvements achieved by the accelerators are even better. The ASIC implementations offer four orders of magnitude improvement over the CPU baseline, while the FPGA offers three orders of magnitude.

E. FPGA versus ASIC

The general rule of thumb is that an FPGA will be about an order of magnitude less efficient than an ASIC. However, modern FPGAs contain "hardened" resources, such as DSPs for arithmetic operations and M20Ks (in Altera FPGAs) for on-chip RAMs. When an FPGA design is implemented such that it uses these hard blocks, the efficiency gap between FPGA and ASIC can be reduced. This is the case for our BNN accelerators, which heavily use M20K on-chip RAMs and DSPs for arithmetic operations.

Both the FPGA64 and ASIC64 designs adopt the same microarchitecture (i.e., number of PEs and RAMs); hence they provide a direct comparison between FPGA and ASIC. Between these two designs, ASIC64 has ~4.5x higher performance than FPGA64, since it runs at a higher frequency.

In terms of energy efficiency (i.e., performance/watt), ASIC64 is ~11x better than FPGA64.

However, the Arria 10 FPGA is fabricated on a 20nm TSMC process technology, while the ASIC is on a 14nm Intel technology. Normalizing for this process technology difference, the FPGA/ASIC efficiency gap in this case is estimated to be less than ~8x, which is lower than the abovementioned rule of thumb. We think this smaller ASIC/FPGA gap is due to the fact that our BNN accelerator takes heavy advantage of the hard FPGA blocks (M20Ks for on-chip RAMs, hard DSPs for multiply/add).

For the larger-scale high-performance designs (FPGA1024, ASIC256), the large Arria 10 FPGA allowed us to implement a 1024-PE design, but at a lower frequency than the ASIC (150MHz vs. 1GHz). Thus, while the FPGA has more PEs, it runs slower, resulting in worse performance than the ASIC256 design.

F. Opportunities for FPGAs

The upcoming Altera Stratix 10 FPGA will offer even more M20K and DSP hard blocks. Therefore, we can expect to deploy designs with an even larger number of PEs on the Stratix 10 when it becomes available.

Furthermore, Stratix 10 has the new HyperFlex technology to deliver higher operating frequencies through retiming. Since our BNN accelerator does not have tight data dependencies and is amenable to retiming, we expect that our accelerator can take advantage of the Stratix 10 support for higher frequency.

The aforesaid trends highlight the tremendous opportunities for FPGAs. Unlike a fixed ASIC design, FPGAs can be reconfigured for other uses as well as for newer, improved versions of an accelerator. Thus, if the FPGA-to-ASIC efficiency gap narrows, there is a stronger case to adopt FPGA solutions.

VI. RELATED WORK

To the best of our knowledge, we are the first to propose a hardware accelerator for BNNs. The original BNN paper [1] focused on the BNN algorithm. It describes the benefits of BNNs through algorithmic complexity analysis. A more recent BNN work (BinaryNet [2]) shows an evaluation of binarized GEMM on GPU using xnor and population count. In contrast, this paper proposes a hardware accelerator architecture for BNNs, and offers a comprehensive comparative evaluation across various interesting problem sizes, on FPGA, ASIC, server CPU, server GPU, and mobile GPU.

Aside from BNNs, there are a myriad of existing accelerators for Deep Learning (DL), targeting both FPGAs (e.g., [6]) and ASICs (e.g., [7]). However, none of them target BNNs. BNNs are unique, since they represent each network weight using a single bit, which requires a proper acceleration strategy to take full advantage of such bit-level representations. There are also existing studies on machine learning accelerators (e.g., [14][15]), which, unlike this work, target non-DL algorithms.

Multiplication of a dense matrix against a dense vector (GEMV) is a well-known construct that is part of the standard BLAS library. There are existing studies (e.g., [4]) that evaluate BLAS on CPUs, GPUs, and FPGAs. Unlike prior work, this paper focuses on a binarized GEMV. Moreover, this work offers a comparison with an ASIC, while others only consider CPU, GPU, and/or FPGA. And, this paper targets more modern platforms. Finally, a recent study [16] evaluates neural network (NN) implementations on CPU, GPU, FPGA, and ASIC. However, it focuses on recurrent NNs, not BNNs.

VII. CONCLUSION

Binarized neural networks offer significant algorithmic efficiency improvements over standard full-precision networks. This paper proposed a hardware accelerator architecture for BNNs, which delivers superior performance while consuming energy efficiently. We evaluated our accelerator targeting an Arria 10 FPGA and a 14nm ASIC. We compared these accelerator instances against optimized software on a high-performance multi-core CPU and GPU for cloud servers, as well as on a mobile GPU suitable for IoT. Our evaluation results show that the proposed accelerator can deliver orders of magnitude improvements in performance and performance/watt over well-optimized software on CPU and GPU. Lastly, while the FPGA is less efficient than the ASIC, the FPGA-ASIC gap may be reduced for designs that heavily utilize hard blocks (DSP, M20K), such as our BNN accelerator. Hence, the FPGA offers an attractive solution, which delivers superior efficiency improvements over software, without having to lock into a fixed ASIC solution.

REFERENCES

[1] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," Neural Information Processing Systems (NIPS), 2015.
[2] M. Courbariaux, I. Hubara, D. Soudry, et al., "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv:1602.02830 [cs.LG].
[3] D. Amodei, R. Anubhai, E. Battenberg, et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," arXiv:1512.02595 [cs.CL].
[4] S. Kestur, J. D. Davis, and O. Williams, "BLAS comparison on FPGA, CPU and GPU," ISVLSI, 2010.
[5] D. Mukonoki, T. Imamura, and D. Takahashi, "Fast implementation of general matrix-vector multiplication (GEMV) on Kepler GPUs," Euromicro International Conference on Parallel, Distributed, and Network-based Processing, 2015.
[6] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for convolutional networks," Field Programmable Logic and Applications (FPL), 2009.
[7] T. Chen, Z. Du, N. Sun, et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
[8] A. Krizhevsky, et al., "ImageNet classification with deep convolutional neural networks," NIPS, 2012.
[9] M. Horowitz, Energy table for 45nm process, Stanford VLSI wiki.
[10] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167 [cs.LG].
[11] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[12] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," arXiv:1412.2306, 2014.
[13] Altera's PowerPlay Early Power Estimators (EPE) and Power Analyzer. URL: https://www.altera.com/support/support-resources/operation-and-testing/power/pow-powerplay.tablet.html
[14] E. Nurvitadhi, A. Mishra, and D. Marr, "A sparse matrix vector multiply accelerator for support vector machine," CASES, 2015.
[15] E. Nurvitadhi, A. Mishra, Y. Wang, G. Venkatesh, and D. Marr, "Hardware accelerator for analytics of sparse data," DATE, 2016.
[16] E. Nurvitadhi, et al., "Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC," FPL, 2016.

