
EBSP: Evolving Bit Sparsity Patterns for Hardware-Friendly Inference of Quantized Deep Neural Networks


Fangxin Liu1,2 , Wenbo Zhao1 , Zongwu Wang1 , Yongbiao Chen1 , Zhezhi He1 , Naifeng Jing1 ,
Xiaoyao Liang1 and Li Jiang1,2,3 *
1 Shanghai Jiao Tong University, 2 Shanghai Qi Zhi Institute
3 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

ABSTRACT

Model compression has been extensively investigated for supporting efficient neural network inference on edge-computing platforms, owing to the huge model sizes and computation demands of modern DNNs. Recent research embraces joint-way compression across multiple techniques for extreme compression. However, most joint-way methods adopt a naive solution that applies two approaches sequentially, which can be sub-optimal, as it lacks a systematic way to incorporate them.

This paper proposes the integration of aggressive joint-way compression into hardware design, namely EBSP. It is motivated by three observations: 1) quantization simplifies hardware implementations; 2) the bit distribution of quantized weights can be viewed as an independent trainable variable; and 3) exploiting bit sparsity in the quantized network has the potential to achieve better performance. To this end, this paper introduces bit sparsity patterns to construct highly expressive yet inherently regular bit distributions in the quantized network. We further incorporate our sparsity constraint into training to evolve the inherent bit distributions toward the bit sparsity pattern. Moreover, the structure of the introduced bit sparsity pattern permits a minimal hardware implementation under competitive classification accuracy. Specifically, a quantized network constrained by the bit sparsity pattern can be processed using LUTs with the fewest entries instead of multipliers, in minimally modified computational hardware. Our experiments show that, compared to Eyeriss, BitFusion, WAX, and OLAccel, EBSP with less than 0.8% accuracy loss can achieve 87.3%, 79.7%, 75.2% and 58.9% energy reduction and 93.8%, 83.7%, 72.7% and 49.5% performance gain on average, respectively.

1 INTRODUCTION

Deep neural network (DNN) models have been designed to solve real-world problems and have achieved significant success in many application domains [16, 22]. However, due to the ever-increasing model size of DNNs, the memory and computation overhead have increased dramatically, making deployment on embedded and edge devices difficult. For instance, the latest DNN models [7, 13] contain trillions of parameters and require GFLOPs (giga floating-point operations) of computation in a single inference, making on-device inference a challenging task.

To address this challenge, resource-constrained edge computing platforms require two crucial supports. The first is specialized hardware acceleration for DNN inference. For example, EIE [12] exploits sparse neural network models but requires additional hardware overhead to represent the sparse data format. Laconic [23] optimizes the design of multipliers based on Booth coding, which converts inputs into a sequence of signed terms and sequentially multiplies and accumulates them.

The second is model compression, such as network sparsification [11] and network quantization [14], which are widely used for inference. Specifically, sparsification makes the network sparse, quantization reduces network precision, and both reduce the required memory bandwidth. Among them, non-uniform quantization methods, represented by deep compression [7], apply k-means to cluster the weights, and the quantized values are denoted as indexes. Meanwhile, power-of-2 based quantization (e.g., SP2 [3], AFP [17]) maps the weight values to the exponential space and then simplifies multiplications into shift operations.

In spite of these significant advances, we identify three challenges for existing compression techniques: 1) quantization methods focus on improving the compression rate of ultra-low-precision DNN models, resulting in significant accuracy losses, for example, >5% under binary and >2% under ternary quantization [7]; 2) sparsification methods can generally achieve higher performance with greater compression than quantization methods, but their major drawbacks are the additional indexing overhead for addressing non-zero elements and irregular access/execution patterns [11]; 3) sparsification and ultra-low-precision quantization methods always introduce ancillary overheads in the circuit or architecture design, which remain complex and implementation-unfriendly [11, 12].

This paper focuses on DNN compression techniques, which have become imperative to DNN hardware acceleration, especially on FPGA and ASIC platforms [7]. We revisit the quantization process from a new angle of bit-level sparsity: reducing the precision of an operand can be seen as forcing one or more bits of the operand to zero, where a less significant bit is more likely to be zero. Alternatively, quantization can be viewed as increasing bit-level sparsity within the operand. By considering the distribution of bits in the model parameters, we propose a co-design method in which hardware-friendly sparsity patterns are formed under low-cost constraints and enjoy the benefits of quantized DNNs with slight hardware modification. Our contributions can be outlined as follows:
• We propose a soft quantization scheme that evolves inherently regular bit-sparsity patterns during training and achieves an accuracy that matches high bit-width quantization.
• We design a Look-Up Table (LUT) based approach for accelerating DNN inference, which effectively incorporates the bit sparsity pattern and quantization for performance gains while reducing the energy consumption of multiplication.
• We describe the minimum required hardware modifications and the execution flow needed to support such a scheme effectively.

2 BACKGROUND AND MOTIVATION

2.1 Network Quantization

Quantization algorithms compress the network by reducing the number of bits required for weights and activations. Representative quantization methods can be categorized into two classes:

1) Uniform quantization, one of the most widely used quantization schemes, includes binary, ternary, and fixed-point quantization [14]. Binary and ternary quantization use extremely low bit-widths to represent DNN models, that is, 1 bit (-1, +1) for binary quantization and 2 bits (-1, 0, +1) for ternary quantization. Although binary and ternary quantization can significantly reduce the operand precision and simplify the hardware implementation, they introduce a non-negligible accuracy loss (e.g., >5% accuracy drop for binary). In contrast, as represented by INT8 [14] and LSQ [8], fixed-point quantization schemes use a modest bit-width to achieve accuracy comparable to the original model. Weights and activations are quantized to the nearest integer up to a scaling factor that is shared by all the weights or activations in the same layer:

w^ = α · clip(round(w/α), −2^(m−1)+1, 2^(m−1)−1),   (1)

where clip(x, min, max) clamps the value x into the range [min, max] and w^ is the quantized value of w under m-bit fixed-point quantization.
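For illustration, the following is a minimal Python sketch of the m-bit fixed-point quantization in Eq. (1); the function name and the per-tensor choice of the scaling factor α are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def uniform_quantize(w: torch.Tensor, m: int, alpha: float) -> torch.Tensor:
    """Eq. (1): m-bit fixed-point quantization with a shared scaling factor alpha."""
    qmin, qmax = -(2 ** (m - 1)) + 1, (2 ** (m - 1)) - 1
    # Round to the nearest integer level, clamp to the symmetric range, rescale.
    return alpha * torch.clamp(torch.round(w / alpha), qmin, qmax)

# Example: quantize a weight tensor to 8 bits with one common per-tensor scale choice.
w = torch.randn(64, 64)
alpha = w.abs().max().item() / (2 ** 7 - 1)
w_hat = uniform_quantize(w, m=8, alpha=alpha)
```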
2) Non-uniform quantization, which exploits the distribution of weights and activations. One approach is to cluster weights into several groups [7, 11]. However, such methods do not bring computation benefits, since the cluster centers are still stored as floating-point numbers. Others are power-of-2 based methods that quantize weights to powers of two up to a scaling factor [3, 17]. These methods exploit the fact that weights and activations have a denser distribution near zero, so expensive multiplications can be replaced by cheap shift operations. Although power-of-2 quantization can simplify the hardware implementation by eliminating multiplication, it cannot improve accuracy by increasing the bit-width the way uniform quantization does. This is because the interval between quantization levels grows exponentially, so a larger bit-width yields finer resolution near the mean while the tails (weights with large values) remain coarse.

2.2 DNN Accelerators for Network Sparsification and Quantization

General network sparsification and quantization are widely used for inference. Specifically, sparsification reduces the number of operands, and quantization reduces the bit-width of the data flowing through a neural network model. SnaPEA [1] exploits activation sparsity to shorten the computation time of the convolution operation, where the data format is still a 16-bit fixed-point number. WAX [10] uses a deeply distributed memory hierarchy, which allows data to be moved with small overhead, and quantizes operands from 32 bits to 8 bits. BitFusion [24] proposes to execute quantized neural network models. OLAccel [20] is a mixed-precision accelerator that utilizes quantization with 4-bit and 16-bit MACs. These designs mainly cater to general single-way network sparsification or quantization. Consequently, their architectures struggle to reach peak performance because it is difficult to find a perfectly fitted compression method that incorporates sparsity into quantization.

Our work shares conceptual similarities with prior works on quantization; however, it distinguishes itself by deriving the proposed bit sparsity pattern to minimize the hardware implementation. Such a sparsity pattern is evolved through training to yield regular bit distributions, rather than imposing an overall bit-width constraint by quantizing the model. As a result, EBSP can deliver competitive accuracy on par with high-precision quantization, while the enforced sparsity constraint keeps the hardware implementation overhead at a minimum.

3 ALGORITHM FOR QUANTIZATION WITH BIT SPARSITY PATTERN

In this section, we propose a novel quantization scheme combined with the hardware-friendly bit sparsity pattern, which enjoys multiplication-free operation for DNN inference while incurring negligible inference accuracy loss.

3.1 Coupling Quantization with Hardware

Much of the success of low bit-width quantization (e.g., HAQ [7], Deep Compression [11], and power-of-2 quantization [3]) can be largely attributed to introducing ancillary overheads, such as indexes. Hence, it is not trivial to quantize neural networks with low bit-width for both inputs and weights without the aid of indexes or an accuracy loss during inference. In this work, EBSP aims at eliminating multiplication operations in (quantized) DNN inference models, as low-bit-width quantization schemes do, and at the same time is designed to address the non-negligible accuracy loss of low bit-width quantization. The proposed hardware-friendly quantization scheme incorporating the bit sparsity pattern can be considered a variant of non-uniform quantization, which makes it possible to merge the quantization and sparsification constraints together.

Since multiplication is the most widely used compute operation in DNN workloads, we attempt to design a novel hardware-friendly quantization method that takes full advantage of LUTs to replace multipliers, considering that LUTs can be reconfigured to support different bit-widths and that reduced operand precision enables fewer LUT entries [9, 21]. This approach minimally increases the memory area by introducing only the hardware that assists in combining LUT entries to realize multiplications.

The LUT-based scheme inevitably faces the problem of an excessive number of entries needed to cover all possible combinations of weights and activations, and the number of LUT entries plays a significant role in determining the system performance. For example, to compute a multiplication with INT8 quantization in one cycle, 65,536 (2^8 × 2^8 combinations) entries are needed in the LUT. If the partial sum results are saved with single precision (16-bit), 128 MB of memory is needed to store them, which makes the design very impractical. In order to further improve the efficiency of the LUT-based scheme, we carefully optimize the LUT overhead in conjunction with the quantization method.
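The entry count grows multiplicatively with the operand bit-widths, which is what makes a naive LUT impractical. The short sketch below only illustrates that arithmetic; the 3-bit/2-bit mantissa widths used for the EBSP-style entry are the example discussed later in Section 4.1, not a fixed design point.

```python
def lut_entries(bits_a: int, bits_w: int) -> int:
    """Entries needed to tabulate every activation/weight product in one cycle."""
    return 2 ** (bits_a + bits_w)

print(lut_entries(8, 8))   # naive INT8 x INT8 table: 65,536 entries
print(lut_entries(3, 2))   # mantissa-only table (3-bit x 2-bit): 32 entries
```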
Inspired by the fact that only the bits starting from the most significant '1' of a number take part in the computation, while the leading '0' bits are redundant, we introduce an innovative co-design paradigm of compression and hardware that enables fine-grained bit exploitation at low cost while enjoying the benefits of quantized DNNs. This method quantizes weights and activations into adaptive floating-point numbers. Specifically, each layer shares a shift factor k, and each value has its own exponent part e and mantissa part m. In such a case, each value can be represented as

x = 2^k · (2^e · (1 + m · 2^(−n_man))) = 2^k · (2^e · M),   (2)

where n_man is the bit-width of the mantissa m. Given a weight w and an activation a, their quantized multiplication can be written as

a · w = (2^(k_a) · 2^(e_a) · M_a) × (2^(k_w) · 2^(e_w) · M_w) = 2^(k_a + k_w) · (2^(e_a + e_w) · (M_a · M_w)).   (3)

From Eq. (3), we see that we only need to implement the multiplication of M_a and M_w. The result can then be derived after an element-wise shift by e_a + e_w and a layer-wise shift by k_a + k_w. Thus, we use LUTs to realize the product M_a · M_w, with 2^(n_man,a + n_man,w) entries, far fewer than required for the whole w and a. We then introduce the evolving sparsity pattern in quantization to further decrease the bit-widths of M_a and M_w, which leads to yet further computational benefits.
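A minimal numerical sketch of the decomposition in Eq. (2) and the LUT-based multiplication in Eq. (3) follows. The helper names, the simplified mantissa rounding, and the choice of test values (picked so that both operands are exactly representable) are illustrative assumptions; the LUT simply tabulates every M_a · M_w product.

```python
import math

def to_adaptive_fp(x: float, k: int, n_man: int):
    """Decompose x > 0 as 2^k * 2^e * (1 + m / 2^n_man), returning (e, m)."""
    y = x / (2 ** k)
    e = math.floor(math.log2(y))                      # exponent: most significant '1'
    m = round((y / (2 ** e) - 1.0) * (2 ** n_man))    # n_man-bit mantissa (simplified rounding)
    m = min(m, 2 ** n_man - 1)                        # sketch: ignore the carry into e
    return e, m

def lut_multiply(a: float, w: float, k_a: int = 0, k_w: int = 0,
                 n_man_a: int = 3, n_man_w: int = 2) -> float:
    """Eq. (3): add exponents, look up the mantissa product, then shift."""
    e_a, m_a = to_adaptive_fp(a, k_a, n_man_a)
    e_w, m_w = to_adaptive_fp(w, k_w, n_man_w)
    # 2^(n_man_a + n_man_w) pre-computed mantissa products (32 entries for 3 + 2 bits).
    lut = {(i, j): (1 + i / 2 ** n_man_a) * (1 + j / 2 ** n_man_w)
           for i in range(2 ** n_man_a) for j in range(2 ** n_man_w)}
    return 2 ** (k_a + k_w) * 2 ** (e_a + e_w) * lut[(m_a, m_w)]

# 5.0 = 2^2 * 1.25 and 0.75 = 2^-1 * 1.5 are exactly representable here,
# so the LUT-based product matches the exact result: 3.75.
print(lut_multiply(5.0, 0.75))
```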

3.2 Evolving Sparsity Pattern in Quantization ABuf. WBuf.


A W
We propose a novel integrated training method that evolves sparsity Weight
Instruction Fetch Module

Buffer PE PE PE
patterns by enforcing the constraint on the bit distribution (called LOD LOD
Accumulation Unit
Central Controller

a w
bit sparsity constraint) by performing soft quantization in each Activation
PE PE PE nexp
+ n exp
DRAM

Buffer
training iteration. We divide the training process into three phases
sequentially according to the execution order: masking, forward SFU Ma Mw
MUX
MUX

PE PE PE LUT
passing, and backward passing, respectively. During the masking
Output
phase, the weight matrix is quantized to the target bit-width and Buffer PE PE PE shifter
Z
imposes bit sparsity constrain in the bits within the target bit-width. Accumulation Unit
As shown in Fig. 1, bit sparsity constraint is that a maximum of Figure 2: Block diagram of the PE implementation in EBSP.
𝑠 = 3 consecutive ‘1’s exists in the bits within the weights at a
given bit-width. Next, the forward pass is performed using the
quantized weight matrix with the bit sparsity constraint, which can 4 ARCHITECTURE FOR QUANTIZATION
be formulated as Eq. (2) and Eq. (3) WITH BIT SPARSITY PATTERN
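The masking step can be pictured on an integer bit pattern as in the sketch below. Treating the weight as an unsigned bit string and the pattern length s as a parameter is a simplification of the scheme described above (the actual method operates on the adaptive floating-point representation during training).

```python
def bit_sparsity_mask(w_bits: int, s: int = 3) -> int:
    """Keep at most s bits starting at the most significant '1'; zero the rest."""
    if w_bits == 0:
        return 0
    e = w_bits.bit_length() - 1                   # exponent: position of the leading '1'
    mask = ((1 << s) - 1) << max(e - s + 1, 0)    # s consecutive positions from the MSB down
    return w_bits & mask

# Example: 0b1011011 -> 0b1010000 with s = 3 (only the top 3 bit positions survive).
print(bin(bit_sparsity_mask(0b1011011, s=3)))
```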
In the backpropagation phase, we aim at training the weights toward the quantization sparsity pattern while maintaining network accuracy. Therefore, we add a regularization term

(λ/2) · Σ (w − w_q)^2 · 2^(−e)   (4)

to the loss to decay the weights toward their quantized values. The gradient is then calculated as

∂L'/∂w = ∂L/∂w + λ(w − w_q) · 2^(−e).   (5)

For large weights, whose exponent e is large, the gradient is still dominated by the first term, ensuring that the weight is updated in the direction that reduces the loss and increases accuracy. On the other hand, for small weights, whose exponent e is small, the second term accounts for a larger part of the gradient, guiding them toward the quantized value and reducing the quantization error.

In addition, we formulate the cost C(N_q, n_exp) in a bit manner: a multiplication in EBSP needs 2^(n_man,a + n_man,w) LUT entries, which also determines the bit-width of the adders (the more LUT entries, the more bits the adders require). Therefore, we combine the cost function Eq. (6) with accuracy as the overall objective of training a DNN model based on ADMM-regularized optimization [19]:

C = 2^(n_man,a + n_man,w)   (6)

L = L(W_q^(1:L)) + α · Σ_{l=1}^{L} C_l   (7)

where L(W_q^(1:L)) is the original entropy loss evaluated with the quantized weights W_q, L denotes the number of layers in the DNN model, and α is a hyperparameter controlling the balance between the entropy and cost terms. As a result, our evolving bit sparsity pattern for incorporating sparsity into quantization offers a viable path toward efficient hardware inference, as will be discussed in Section 4.1.
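As a rough PyTorch-style sketch of the objective in Eqs. (4)–(7): the regularizer pulls each weight toward its bit-sparse quantized value with a strength scaled by 2^(−e), and a cost term penalizes large mantissa widths. The layer dictionary fields (weight, w_quant, exponent, n_man_a, n_man_w) are assumed bookkeeping for illustration, not the paper's implementation.

```python
import torch

def ebsp_loss(task_loss, layers, lam=1e-4, alpha=1e-6):
    """Overall objective: task loss + decay toward quantized values (Eq. 4) + LUT cost (Eqs. 6-7)."""
    reg, cost = 0.0, 0.0
    for layer in layers:
        w = layer["weight"]        # trainable full-precision weights
        w_q = layer["w_quant"]     # bit-sparse quantized weights from the masking phase
        e = layer["exponent"]      # per-weight exponent e from Eq. (2)
        reg = reg + 0.5 * lam * torch.sum((w - w_q) ** 2 * 2.0 ** (-e))   # Eq. (4)
        cost = cost + 2 ** (layer["n_man_a"] + layer["n_man_w"])          # Eq. (6)
    return task_loss + reg + alpha * cost                                 # Eq. (7)
```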
Figure 2: Block diagram of the PE implementation in EBSP. (Steering logic: leading-one detectors (LOD) and multiplexers on the activation/weight operands from the ABuf./WBuf.; arithmetic logic: an exponent adder, a small LUT for M_a·M_w, and a shifter, together with the accumulation units, SFU, on-chip buffers, DRAM interface, instruction fetch module, and central controller.)

4 ARCHITECTURE FOR QUANTIZATION WITH BIT SPARSITY PATTERN

In this section, we discuss the efficient LUT-based PE implementation, the organization of EBSP, and the execution flow that supports DNN workloads in the proposed EBSP architecture.
Figure 3: The data computation for convolution. (a) An example of the execution pipeline of the EBSP-based computation step (Cycles 0-8), including the initialization, computation, and writeback phases; the blue blocks indicate steering logic, while the green blocks indicate arithmetic logic. (b) Illustration of the EBSP modules that carry out the computation: Step 1 (quantization), Step 2 (computation optimization), and Step 3 (activation generation) across the input/weight/output buffers, PE array, SFU, and accumulator.

4.1 Non-Multiplication Engine (NME)

In this section, we implement the non-multiplication engine that efficiently quantizes the given weights and uses a LUT instead of a multiplier. As shown in Fig. 2, the processing element (PE) is designed with two components: steering logic and arithmetic logic. In detail, the steering logic is composed of a leading-one detector (LOD) that dynamically locates the most significant '1' bit and a multiplexer that extracts the significant digits to send to the LUT. In contrast, based on the bit sparsity pattern, the arithmetic logic is composed of a LUT with few entries, an adder, and a shifter (e.g., a barrel shifter) that together implement the multiplication of Eq. (3). Take 5-bit quantization as an example: 1-bit sign, 2-bit exponent, and 2-bit mantissa for the weight, and 2-bit exponent and 3-bit mantissa for the activation (0-bit sign, since the ReLU activation function is used). First, the 2-bit addition e_a + e_w is realized by the adder; meanwhile, EBSP uses the LUT to realize M_a · M_w. Finally, the output of the LUT passes through the shifter, which shifts it by the e_a + e_w bits produced by the adder. In this way, we achieve a circuit-implementation-friendly multiplication. In addition, our method has better adaptability, since the LUT entries only fix the bit-width of the mantissa, and different exponent bit-widths can share the same LUT. In this example, only 2^(2+3) = 32 LUT entries are needed to store the pre-calculated constants, which is quite attractive for implementation. Another benefit of our architecture is that it does not restrict designers from using their preferred design as the smaller core multiplier: if a more efficient multiplier is designed in the future, the arithmetic logic can be replaced. The much smaller arithmetic logic justifies the addition of the steering logic, which provides significant overall power and area savings compared with an exact multiplier.
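The following behavioral sketch mirrors the 5-bit PE example above (3-bit activation mantissa, 2-bit weight mantissa, hence a 32-entry LUT): a leading-one detector supplies the exponent, the LUT supplies the mantissa product, and a shift completes the multiplication. The integer test values are chosen to be exactly representable and non-zero, and the field widths are illustrative assumptions rather than the exact RTL behavior.

```python
# Behavioral model of the NME PE datapath: LOD -> exponent add -> LUT -> shift.
N_A, N_W = 3, 2   # mantissa bit-widths for activation and weight
LUT = [[(2 ** N_A + ma) * (2 ** N_W + mw) for mw in range(2 ** N_W)]
       for ma in range(2 ** N_A)]          # 2^(3+2) = 32 pre-computed products

def lod(x: int) -> int:
    """Leading-one detector: bit position of the most significant '1'."""
    return x.bit_length() - 1

def nme_multiply(a: int, w: int) -> int:
    """Steering logic (LOD + mantissa extract) feeding the LUT, adder, and shifter."""
    e_a, e_w = lod(a), lod(w)
    m_a = (a >> (e_a - N_A)) & (2 ** N_A - 1)   # 3 bits just below the leading one
    m_w = (w >> (e_w - N_W)) & (2 ** N_W - 1)   # 2 bits just below the leading one
    # The adder sums the exponents; the shifter scales the LUT output (Eq. (3)).
    return LUT[m_a][m_w] << (e_a + e_w - N_A - N_W)

# 40 = 2^5 * 1.25 and 6 = 2^2 * 1.5 are exactly representable, so the result is exact.
print(nme_multiply(40, 6))   # 240
```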
4.2 Execution Flow with NME

With our NME, we accelerate the quantized NN through minor hardware modifications that significantly boost hardware utilization and inference performance, as explained in Fig. 3:

Step 1: Data Preparation (Quantization). At the beginning of an EBSP run, the quantized weights are loaded into an on-chip buffer, i.e., the weight buffer shown in Fig. 3(b). We use the EBSP-quantized data throughout the execution flow, where each quantized value consists of an exponent and a mantissa, as expressed in Eq. (3). Therefore, the activations of each layer are also in the same format. To use these results as input activations and multiply them with the low-precision weights, we first quantize them using the quantizer and then store them in the input buffer.

Step 2: Computation Optimization. As shown in Fig. 3(b), after quantization, the low-precision inputs and weights are stored in the input buffer and the weight buffer. We then perform efficient MAC computations using the PE array. Fig. 3(a) gives a detailed illustration of the EBSP-based computation steps with an example of a matrix multiplication with one superposition of adjacent exponents. In the initialization phase, the parameter matrix is loaded (Cycle 0) and the operands involved in the calculation are fetched (Cycle 1). Since the initialization is performed only once at the beginning, the overhead of fetching operands is small, and only the number of computation cycles is proportional to the number of multiplications. In the next six cycles (the computation phase), three shifts, three additions and two accumulations are performed to produce one element of the output matrix. In Cycle 2, the steering logic locates the most significant '1' and, combined with our quantization method, obtains the power and the signal indicating whether superposition is needed. In Cycle 3, the addition is performed, since the operands of the multiplication have been simplified to powers of two. In Cycle 4, a left-shift operation is performed on the result of the LUT match according to the adder's output. In Cycle 5, the psum is fed into the adder tree for accumulation. The subsequent operands then repeat Cycles 2-4, and the output is written back in Cycle 8. This execution pipeline continues until the matrix multiplication is completed.

Step 3: Generate Activation. The results of the PE array are accumulated with the partial sums and sent to the Special Function Unit (SFU) to calculate the final output. The SFU performs activation functions (e.g., ReLU), pooling functions (e.g., average pooling and max pooling), and other operations.
Figure 4: Comparison of DNN accuracy and percentage of bit-width for different networks. (Stacked bars show the percentage of 4-, 5-, 6-, 8-, and 16-bit operands used by Eyeriss, BitFusion, WAX, OLAccel, and EBSP; the overlaid line shows Top-1 accuracy for AlexNet, ResNet18, ResNet50, VGGNet16, and MobileNet-v2 on CIFAR-10 and ImageNet.)

It is worth noting that other quantization methods, such as INT8 quantization [14], are also applicable to our EBSP scheme, which will be discussed in Section 5.2.

5 EXPERIMENTS

5.1 Experimental Methodology

1) Validation of NN Accuracy. We evaluate the quantization framework on the CIFAR-10 [15] and ImageNet [6] datasets with widely used CNNs for image classification, including AlexNet [16], VGG-16 [7], ResNet-18, ResNet-50 [13], and MobileNet-v2 [22]. The EBSP algorithm is implemented in PyTorch. We use pre-trained network models from the PyTorch Model Zoo as the basis for the EBSP algorithm, with network structures derived from the models in torchvision [18]. We reconstruct the convolutional layers so that the quantization process for the weights and activations is inserted before the built-in convolution function call. The experiments run in a CUDA 10.2 environment on a Tesla V100 GPU.
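A hedged sketch of how such a reconstructed convolutional layer might look in PyTorch is given below; the wrapper class and the quantize_ebsp placeholder are assumptions for illustration, since the paper does not list its module code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize_ebsp(x: torch.Tensor) -> torch.Tensor:
    """Placeholder for EBSP quantization (adaptive floating point + bit sparsity mask)."""
    return x  # a real implementation would return the bit-sparse quantized tensor

class QuantConv2d(nn.Conv2d):
    """Conv layer that quantizes weights and activations before the built-in conv call."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = quantize_ebsp(self.weight)
        x_q = quantize_ebsp(x)
        return F.conv2d(x_q, w_q, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Example usage: replace a pretrained model's conv layers with the quantized variant.
layer = QuantConv2d(3, 64, kernel_size=3, padding=1)
out = layer(torch.randn(1, 3, 32, 32))
```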
Table 1: Configurations of different accelerators under a 45-nm standard-cell library.

              Eyeriss [4]   BitFusion [24]   WAX [10]      OLAccel [20]   EBSP
Bit-width     16-bit        4-bit            8-bit         4&16-bit       6-bit (3)†
Data Format   Integer       Integer          Fixed-point   Integer        Integer
# PEs         224           3168             102           2499           4818
Area (mm²)    0.32          0.32             0.32          0.32           0.32

† The value in parentheses denotes the length of the bit sparsity pattern, which determines the number of LUT entries.
2) Modeling the Accelerator Architecture. Since the Eyeriss architecture [4] is widely used as a baseline in many accelerators [10, 23] and only the relevant results are provided in those works, we also choose it as a baseline. Meanwhile, for a fair comparison, we use the same global buffer capacity (5 MB) and memory bandwidth for all these accelerators, estimated with CACTI [2] to satisfy our design goals. In addition, we use a 45 nm technology library and Synopsys Design Compiler [5] to study the area and energy of the PE units that we designed with various bit-widths and data formats. Tab. 1 shows the configuration of all accelerators in our experiments. Under a 500 MHz PE frequency, we verify that the required memory bandwidth is much smaller than the typical memory bandwidth provided by DDR3; thus, we can sustain non-blocking convolution. Meanwhile, the EBSP architecture can be extended to a larger number of PEs under the same area budget.

5.2 Experimental Results

1) Accuracy, Performance and Energy Consumption. Fig. 4 first shows the DNN accuracy for the five networks on the CIFAR-10 and ImageNet datasets. Taking ResNet-50 as an example, our EBSP shows nearly no accuracy loss on CIFAR-10 compared to Eyeriss (full INT16), BitFusion (full INT8) and WAX (full fixed-point 8-bit), and a 2.2% accuracy improvement over OLAccel. For ImageNet, our EBSP shows a 0.31% accuracy loss compared to Eyeriss, WAX and BitFusion, and a significant 4.32% accuracy improvement over OLAccel. Fig. 4 also shows the percentage of bit-widths used in the computation for these designs. Both EBSP and OLAccel take full advantage of low-bit quantization, where the computation in our EBSP is done with LUTs and realized by the bit sparsity pattern. However, OLAccel performs quantization with bit-widths fixed across all layers and datasets. In contrast, EBSP evolves the bit sparsity pattern in quantization and takes into account the cost of LUTs and the network accuracy during training, so that the number of bits used for weight and activation values can vary from layer to layer. As demonstrated on the up-to-date compact MobileNet-v2, which has a much smaller parameter size, our EBSP is more robust for preserving DNN accuracy (0.78% loss) with 7% 4-bit, 51% 5-bit and 42% 6-bit operands than OLAccel (5.4% loss) with 85% 4-bit operands.

Figure 5: Energy breakdown of various networks. (Energy is decomposed into DRAM, global buffer, and core for Eyeriss, BitFusion, WAX, OLAccel, and EBSP on ResNet-18, ResNet-50, AlexNet, VGG-16, and MobileNet-V2.)

Fig. 5 shows the energy consumption of the various accelerators for the five networks, decomposed into DRAM, global buffer, and processing cores (Core). For example, on ResNet-50, compared to Eyeriss, BitFusion, WAX and OLAccel, EBSP consumes 87.3%, 79.7%, 75.2% and 58.9% less energy, respectively. The energy reduction of EBSP against the other accelerators is mainly due to the simplification and reduced precision of the PEs, and the resultant narrower bit-width of the data transferred between DRAM and the global buffer.

Figure 6: Performance of various DNNs. (Execution cycles normalized to Eyeriss for Eyeriss, BitFusion, WAX, OLAccel, and EBSP on ResNet-18, ResNet-50, AlexNet, VGG-16, and MobileNet-V2.)

Fig. 6 shows the total execution cycles of the different accelerators for the five networks, all normalized to Eyeriss.
Still taking ResNet-50 as an example, compared to Eyeriss, EBSP achieves nearly 93% performance improvement, because EBSP mostly uses simplified PEs to realize the MACs of the DNNs. Moreover, since only EBSP considers fine-grained bit sparsity in quantization, it achieves the highest throughput (93.87% performance improvement over Eyeriss) among these designs. Compared to OLAccel, which uses 16 bits for the first and last layers and 4 bits for the remaining layers, EBSP achieves a 49% performance improvement on average thanks to the larger number of PEs under the same area budget (see Table 1) in our scheme. Note that, for benchmark tests like ImageNet, EBSP delivers a much larger speedup with only a 0.42% accuracy loss, whereas OLAccel achieves its speedup at the cost of about 5% accuracy loss.

Figure 7: Analysis of the pattern length. m-bit (n) denotes that the value is quantized to m bits with a bit sparsity pattern of length n. (The plot reports accuracy loss and energy efficiency for 4-, 5-, and 6-bit quantization with pattern lengths 2-4.)
2) Design Space Exploration. Fig. 7 studies the impact of the pattern length on ImageNet. The plot reports the average accuracy loss and energy across the five networks. Note that EBSP has the capability of tuning the length of the bit sparsity pattern to sustain the same accuracy levels as Eyeriss while gaining notable energy efficiency. Specifically, EBSP incurs an average 0.31% accuracy loss compared to Eyeriss, BitFusion and WAX, whereas OLAccel incurs a 4.3% accuracy loss. In terms of energy efficiency, EBSP with different pattern lengths reduces the energy consumption by an average of 87% compared to Eyeriss. Moreover, a longer pattern length leads to higher DNN accuracy, but it increases the energy consumption. From this plot, we find that for quantized DNNs with a bit-width of 5 bits and a pattern length of 3, EBSP achieves an optimal point (0.13% accuracy loss with 97.3% energy reduction over Eyeriss) on ImageNet.

Figure 8: Analysis of the bit sparsity pattern on INT8. (Energy efficiency, normalized to Eyeriss, and accuracy loss for 8-bit quantization with pattern lengths 2-7, compared against BitFusion with direct INT8.)

Fig. 8 shows the impact of the bit sparsity pattern on INT8 for ResNet-50. We sweep the pattern length from 2 to 7. Generally, a shorter pattern length means fewer LUT entries and allows for lower energy consumption, but it degrades the network accuracy. The energy efficiency is normalized to Eyeriss. We find that for INT8 with a pattern length of 7, the energy efficiency is inferior to that of the direct INT8 scheme (i.e., BitFusion), which is caused by the excessive number of LUT entries. However, as the pattern length decreases, the energy efficiency increases. We see that bit sparsity patterns of length 4 and 5 are the most beneficial for energy saving with almost no accuracy loss (0.53% loss for length 4 and 0.27% loss for length 5).

6 CONCLUSIONS

Sparsity and quantization are appealing tools for resource-efficient DNN design. However, it is challenging to incorporate sparsity into quantization and convert it into practical benefits. We have introduced a novel methodology that forms bit sparsity patterns in quantization-aware training and reaps the full advantages of sparsity and quantization while better preserving DNN accuracy. The proposed quantization with the bit sparsity pattern allows the design of LUT-based PEs that replace multiplication in minimally modified existing hardware platforms, thus delivering remarkable performance benefits. Our evaluation shows that the proposed EBSP scheme outperforms other similar schemes in performance, energy, and accuracy.

ACKNOWLEDGMENTS

This work was partially supported by the National Natural Science Foundation of China (NSFC) (Grant No. 61834006 and 62102257) and the National Key Research and Development Program of China (2018YFB1403400). We thank the Wu Wen Jun Honorary Doctoral Scholarship, AI Institute, Shanghai Jiao Tong University. Li Jiang is the corresponding author ([email protected]).

REFERENCES
[1] Vahideh Akhlaghi et al. 2018. SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks. In ISCA.
[2] Rajeev Balasubramonian et al. 2017. CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories. TACO (2017).
[3] Sung-En Chang et al. 2021. Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework. In HPCA. IEEE.
[4] Yu-Hsin Chen et al. 2016. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. JSSC (2016).
[5] Synopsys Design Compiler. 2019. [Online]. Available: https://www.synopsys.com/support/training/rtlsynthesis/design-compiler-rtl-synthesis.html.
[6] Jia Deng et al. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR. IEEE, 248-255.
[7] Lei Deng et al. 2020. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey. Proc. IEEE 108, 4 (2020), 485-532.
[8] Steven K. Esser et al. 2020. Learned Step Size Quantization. In ICLR.
[9] Michael Gautschi, Michael Schaffner, et al. 2016. A 65nm CMOS 6.4-to-29.2 pJ/FLOP @ 0.8V Shared Logarithmic Floating Point Unit for Acceleration of Nonlinear Function Kernels in a Tightly Coupled Processor Cluster. In ISSCC. IEEE.
[10] Sumanth Gudaparthi et al. 2019. Wire-Aware Architecture and Dataflow for CNN Accelerators. In MICRO.
[11] Song Han et al. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR.
[12] Song Han et al. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. ACM SIGARCH Computer Architecture News (2016).
[13] Kaiming He et al. 2016. Deep Residual Learning for Image Recognition. In CVPR.
[14] Benoit Jacob et al. 2018. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In CVPR.
[15] Alex Krizhevsky et al. 2009. Learning Multiple Layers of Features from Tiny Images.
[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM (2017).
[17] Fangxin Liu et al. 2021. Improving Neural Network Efficiency via Post-Training Quantization with Adaptive Floating-Point. In ICCV.
[18] Sébastien Marcel and Yann Rodriguez. 2010. Torchvision: The Machine-Vision Package of Torch. In MM.
[19] Wei Niu et al. 2020. PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-Based Weight Pruning. In ASPLOS.
[20] Eunhyeok Park et al. 2018. Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation. In ISCA.
[21] Akshay Krishna Ramanathan et al. 2020. Look-Up Table Based Energy Efficient Processing in Cache Support for Neural Network Acceleration. In MICRO. IEEE.
[22] Mark Sandler, Andrew Howard, Menglong Zhu, et al. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In CVPR. 4510-4520.
[23] Sayeh Sharify et al. 2019. Laconic Deep Learning Inference Acceleration. In ISCA (Phoenix, Arizona) (ISCA '19). 304-317.
[24] Hardik Sharma et al. 2018. Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks. In ISCA.