• We propose a soft quantization scheme that evolves inherently regular bit-sparsity patterns during training and achieves an accuracy that matches high bit-width quantization.
• We design a Look-Up Table (LUT) based approach for accelerating DNN inference, which effectively incorporates the bit sparsity pattern and quantization for performance gains while reducing the energy consumption of multiplication.
• We describe the minimum required modifications and the execution flow on the hardware platform to support such a scheme effectively.
2 BACKGROUND AND MOTIVATION

2.1 Network Quantization
Quantization algorithms compress a network by reducing the number of bits required for its weights and activations. Representative quantization methods can be categorized into two classes:

1) Uniform quantization, one of the most widely used quantization schemes, includes binary, ternary, and fixed-point quantization [14]. Binary and ternary quantization use extremely low bit-widths to represent DNN models, that is, 1-bit (−1, +1) for binary and 2-bit (−1, 0, +1) for ternary quantization. Although binary and ternary quantization can significantly reduce the operand precision and simplify the hardware implementation, they introduce a non-negligible accuracy loss (e.g., a > 5% accuracy drop for binary). In contrast, as represented by INT8 [14] and LSQ [8], fixed-point quantization schemes use a modest bit-width to achieve accuracy comparable to the original model. Weights and activations are quantized to the nearest integer up to a scaling factor that is shared by all the weights or activations in the same layer:

    ŵ = α · clip(round(w/α), −2^{m−1} + 1, 2^{m−1} − 1),    (1)

where clip(x, min, max) clamps the value x into the range [min, max] and ŵ is the quantized value of w under m-bit fixed-point quantization.
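To make Eq. (1) concrete, the following is a minimal PyTorch-style sketch of per-layer uniform quantization. The function name and the way the scale α is supplied (here, a fixed argument) are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def uniform_quantize(w: torch.Tensor, alpha: float, m: int) -> torch.Tensor:
    """Eq. (1): m-bit fixed-point quantization with a shared scale alpha."""
    qmin, qmax = -2 ** (m - 1) + 1, 2 ** (m - 1) - 1
    # Round to the nearest grid point, then clamp to the m-bit range.
    q = torch.clamp(torch.round(w / alpha), qmin, qmax)
    return alpha * q

# Example: quantize a layer's weights to 8 bits with a scale of 0.05.
w = torch.randn(64, 64)
w_hat = uniform_quantize(w, alpha=0.05, m=8)
```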
2) Non-uniform quantization, which exploits the distribution of weights and activations. One approach is to cluster weights into several groups [7, 11]. However, such methods do not bring computational benefits, since the cluster centers are still stored as floating-point numbers. Others are power-of-2 based methods that quantize weights to powers of 2 up to a scaling factor [3, 17]. These methods exploit the fact that weights and activations have a denser distribution near zero, so the expensive multiplications can be replaced by cheap shift operations. Although power-of-2 quantization can simplify the hardware implementation by eliminating multiplication, it cannot improve accuracy by increasing the bit-width the way uniform quantization does. This is because the interval between quantization levels grows exponentially: increasing the bit-width yields finer resolution near the mean, while the tails (weights with large values) still remain coarse.
2.2 DNN Accelerators for Network Sparsification and Quantization
General network sparsification and quantization are widely used for inference. To be specific, sparsification reduces the number of operands, and quantization reduces the bit-width of the data flowing through a neural network model. SnaPEA [1] exploits activation sparsity to shorten the computation time of the convolution operation, while the data format remains a 16-bit fixed-point number. WAX [10] uses a deeply distributed memory hierarchy, which lets data be moved with small overhead, and quantizes operands from 32-bit to 8-bit. BitFusion [24] proposes a bit-level composable architecture to execute quantized neural network models. OLAccel [20] is a mixed-precision accelerator that combines quantization with 4-bit and 16-bit MACs. These designs mainly cater to general single-way network sparsification or quantization. Consequently, their architectures struggle to reach peak performance because it is difficult to find a compression method that perfectly incorporates sparsity into quantization.

Our work shares conceptual similarities with prior works on quantization; however, it distinguishes itself by deriving the proposed bit sparsity pattern to minimize the hardware implementation. Such a sparsity pattern is evolved through training to yield regular bit distributions, rather than imposing an overall bit-width constraint by quantizing the model. As a result, EBSP can deliver a competitive accuracy on par with high-precision quantization, while the enforced sparsity constraint keeps the hardware implementation overhead at a minimum level.
3 ALGORITHM FOR QUANTIZATION WITH BIT SPARSITY PATTERN
In this section, we propose a novel quantization scheme combined with a hardware-friendly bit sparsity pattern, which enables multiplication-free DNN inference while incurring negligible inference accuracy loss.

3.1 Coupling Quantization with Hardware
Much of the success of quantization with low bit-width (e.g., HAQ [7], Deep Compression [11], and power-of-2 quantization [3]) can be attributed to introducing ancillary overheads such as indexes. Hence, it is not trivial to quantize neural networks to low bit-widths for both inputs and weights without the aid of indexes or an accuracy loss during inference. In this work, EBSP aims at eliminating multiplication operations in (quantized) DNN inference models, as low bit-width quantization schemes do, and at the same time it is designed to address the non-negligible accuracy loss of quantization with low bit-width. The proposed hardware-friendly quantization scheme incorporating the bit sparsity pattern can be considered a variant of non-uniform quantization, which makes it possible to merge the quantization and sparsification constraints together.

Since multiplication is the most widely used compute operation in DNN workloads, we attempt to design a novel hardware-friendly quantization method that takes full advantage of LUTs to replace multipliers, considering that LUTs can be reconfigured to support different bit-widths and that reduced operand precision enables fewer LUT entries [9, 21]. This approach minimally increases the memory area by introducing only the hardware that assists in combining LUT entries to realize multiplications.

A LUT-based scheme inevitably faces the problem that an excessive number of entries is required to cover all possible combinations of weights and activations, and the number of LUT entries plays a significant role in determining the system performance. For example, to compute a multiplication with INT8 quantization in one cycle, 65,536 (2^8 × 2^8 combinations) entries are needed in the LUT. If the partial-sum results are saved with 16-bit precision, 128 MB of memory is needed to store them, which makes the design very impractical. In order to further improve the efficiency of the LUT-based scheme, we carefully optimize the LUT overhead in conjunction with the quantization method.
Inspired by the fact that only the bits starting from the most significant '1' of a number take part in the computation, while the leading '0' bits are redundant, we introduce a co-design paradigm of compression and hardware that exploits fine-grained bits at low cost while enjoying the benefits of quantized DNNs. The method quantizes weights and activations into adaptive floating-point numbers. Specifically, each layer shares a shift factor k, and each value has its own exponent part e and mantissa part m. In this case, each value can be represented as

    x = 2^{k} · 2^{e} · (1 + m · 2^{−n_man}) = 2^{k} · 2^{e} · M,    (2)

where n_man is the bit-width of the mantissa m. Given a weight w and an activation a, their quantized multiplication can be written as

    a · w = (2^{k_a} · 2^{e_a} · M_a) × (2^{k_w} · 2^{e_w} · M_w)
          = 2^{k_a + k_w} · (2^{e_a + e_w} · (M_a · M_w)).    (3)

From Eq. (3), we see that only the multiplication of M_a and M_w needs to be implemented; the result is then derived after an element-wise shift by e_a + e_w and a layer-wise shift by k_a + k_w. Thus, we use LUTs to realize the product M_a · M_w with 2^{n_man^a + n_man^w} entries, far fewer than would be needed for the whole w and a. We then introduce the evolving sparsity pattern in quantization to further decrease the bit-widths of M_a and M_w, which leads to yet further computational benefits.
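The sketch below illustrates how a value can be decomposed as in Eq. (2) and how two such values are multiplied through a small mantissa LUT plus shifts as in Eq. (3). The helper names (decompose, lut_multiply) and the exact rounding are illustrative assumptions; positive inputs are assumed and zero handling is omitted.

```python
import math

def decompose(x: float, k: int, n_man: int):
    """Eq. (2): x ~= 2^k * 2^e * M, with M = 1 + m*2^-n_man and m an n_man-bit code."""
    y = x / (2 ** k)
    e = math.floor(math.log2(y))                         # position of the leading '1'
    m = min(round((y / 2 ** e - 1) * 2 ** n_man), 2 ** n_man - 1)
    return e, m

def lut_multiply(a, w, k_a, k_w, n_man_a, n_man_w):
    """Eq. (3): multiply via a 2^(n_man_a + n_man_w)-entry mantissa LUT plus shifts."""
    e_a, m_a = decompose(a, k_a, n_man_a)
    e_w, m_w = decompose(w, k_w, n_man_w)
    # LUT over all mantissa code pairs; in hardware it is precomputed once and shared.
    lut = {(i, j): (1 + i / 2 ** n_man_a) * (1 + j / 2 ** n_man_w)
           for i in range(2 ** n_man_a) for j in range(2 ** n_man_w)}
    return 2 ** (k_a + k_w) * 2 ** (e_a + e_w) * lut[(m_a, m_w)]

# 3-bit and 2-bit mantissas -> only 2^(3+2) = 32 LUT entries.
print(lut_multiply(6.0, 3.0, k_a=0, k_w=0, n_man_a=3, n_man_w=2))  # -> 18.0 (= 6 * 3)
```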
Figure 1: Quantization with bit sparsity pattern and weight update in training. (The figure shows the trainable soft-quantized weight matrix, the bit mask that allows '1's only in s = 3 consecutive bit positions (p = 5 ... 1), and the forward/backward passes over the quantized weight matrix with the bit sparsity pattern.)

EBSP evolves bit sparsity patterns by enforcing a constraint on the bit distribution (called the bit sparsity constraint) through soft quantization in each training iteration. We divide the training process into three phases according to their execution order: masking, forward passing, and backward passing. During the masking phase, the weight matrix is quantized to the target bit-width and the bit sparsity constraint is imposed on the bits within the target bit-width. As shown in Fig. 1, the bit sparsity constraint is that at most s = 3 consecutive '1's exist in the bits of a weight at a given bit-width. Next, the forward pass is performed using the quantized weight matrix with the bit sparsity constraint, which can be formulated as Eq. (2) and Eq. (3).

In the masking phase, the position of the most significant bit is located and denoted as the exponent part e of the quantized weight. Then, a bit mask is generated that covers at most s '1' bits starting from the most significant bit, i.e., the positions p = e to e + s − 1 (p from 5 to 1 in Fig. 1). The quantized weight is then generated by the bit-wise multiplication of the original weight and the mask. In the forward phase, the original weight matrices are quantized prior to passing through the masking, and the layer computations are carried out with the bit sparsity pattern of the quantized weight matrices.
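As an illustration of the masking step, the sketch below keeps at most s consecutive bits starting from the most significant '1' of an integer-coded weight magnitude and zeroes the rest. It is a simplified, assumption-laden rendering (unsigned magnitudes, hypothetical function name), not the training code itself.

```python
def bit_sparsity_mask(w_int: int, s: int = 3) -> int:
    """Keep at most s consecutive bits starting at the MSB; zero all lower bits."""
    if w_int == 0:
        return 0
    e = w_int.bit_length() - 1          # exponent: index of the most significant '1'
    low = max(e - s + 1, 0)             # lowest bit position still allowed to be '1'
    mask = ((1 << (e - low + 1)) - 1) << low
    return w_int & mask                 # bit-wise multiplication of weight and mask

# 0b1011011 (91) with s = 3 keeps only bits 6..4 -> 0b1010000 (80).
print(bin(bit_sparsity_mask(0b1011011, s=3)))
```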
In the backpropagation phase, we aim at training the weights towards the quantization sparsity pattern while maintaining network accuracy. Therefore, we add a regularization term

    (λ/2) · Σ (w − w_q)^2 · 2^{−e}    (4)

to the loss to decay the weights toward their quantized values. The gradient is then calculated as

    ∂L′/∂w = ∂L/∂w + λ · (w − w_q) · 2^{−e}.    (5)

For large weights, whose exponent e is large, the gradient is still dominated by the first term, which ensures that the weight is updated in the direction that reduces the loss and increases accuracy. On the other hand, for small weights, whose exponent e is small, the second term accounts for a larger part of the gradient, guiding them towards the quantized value and reducing the quantization error.
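A minimal PyTorch sketch of the regularizer in Eqs. (4)-(5) is shown below; because w_q and e are treated as constants, autograd produces exactly the extra gradient term λ·(w − w_q)·2^−e. The helper names and the way w_q and e are obtained (quantize_with_pattern) are assumptions for illustration.

```python
import torch

def pattern_regularizer(w: torch.Tensor, w_q: torch.Tensor,
                        e: torch.Tensor, lam: float) -> torch.Tensor:
    """Eq. (4): (lam/2) * sum((w - w_q)^2 * 2^-e), with w_q and e held constant."""
    return 0.5 * lam * ((w - w_q.detach()) ** 2 * torch.pow(2.0, -e.detach())).sum()

# Inside the training loop (task_loss is the usual loss on the soft-quantized
# forward pass; quantize_with_pattern stands for the masking step above):
# w_q, e = quantize_with_pattern(w, s=3)
# loss = task_loss + pattern_regularizer(w, w_q, e, lam=1e-4)
# loss.backward()   # adds lam * (w - w_q) * 2^-e to dL/dw, as in Eq. (5)
```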
In addition, we formulate the cost C(N_q, n_exp) at the bit level: the multiplication in EBSP needs 2^{n_man^a + n_man^w} LUT entries, which also determines the bit-width of the adders (the more LUT entries, the larger the number of adder bits). Therefore, we combine the cost function in Eq. (6) with accuracy as the overall objective for training a DNN model based on ADMM-regularized optimization [19]:

    C = 2^{n_man^a + n_man^w},    (6)

    L = L(W_q^{1:L}) + α · Σ_{l=1}^{L} C_l,    (7)

where L(W_q^{1:L}) is the original entropy loss evaluated with the quantized weights W_q, L denotes the number of layers in the DNN model, and α is a hyperparameter controlling the balance between the entropy and cost terms. As a result, our evolving bit sparsity pattern for incorporating sparsity in quantization offers a viable path towards efficient hardware inference, as will be discussed in Section 4.1.
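The overall objective in Eq. (7) can be sketched as follows. The names (lut_cost, overall_objective, layer_widths) are hypothetical, and the ADMM-regularized optimization of [19] is not reproduced here; this only shows the assumed shape of the objective.

```python
import torch

def lut_cost(n_man_a: int, n_man_w: int) -> float:
    """Eq. (6): number of LUT entries implied by the mantissa bit-widths."""
    return float(2 ** (n_man_a + n_man_w))

def overall_objective(task_loss: torch.Tensor, layer_widths, alpha: float) -> torch.Tensor:
    """Eq. (7): entropy loss on quantized weights plus the summed per-layer LUT cost."""
    cost = sum(lut_cost(na, nw) for (na, nw) in layer_widths)
    return task_loss + alpha * cost

# Example: three layers with (activation, weight) mantissa widths.
# objective = overall_objective(ce_loss, [(3, 2), (3, 2), (2, 2)], alpha=1e-6)
```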
Figure 2: Block diagram of the PE implementation in EBSP. (Recoverable labels: activation, weight, and output buffers, DRAM, a central controller, the SFU, and a PE array with accumulation units; within each PE, leading-one detectors (LOD), MUXes, an n_exp + n_exp exponent adder, the M_a·M_w LUT, and a shifter.)

4 ARCHITECTURE FOR QUANTIZATION WITH BIT SPARSITY PATTERN
In this section, we discuss the efficient LUT-based PE implementation, the organization of EBSP, and its execution-flow support for DNN workloads in the proposed EBSP architecture.

[Figure 3, caption lost in extraction: (a) a cycle-level example (Cycles 0-8) of multiplying an activation row (3, 4, 12) with a weight column (6, 3, 1), split into initialization (load parameters, fetch operands), computation (locate the leading '1', add the exponents, match the LUT, shift, send the psum to the adder tree), and write-back phases; (b) the overall dataflow through the input buffer, PE array, accumulator, SFU (quantizer), and output buffer.]

4.1 Non-Multiplication Engine (NME)
In this section, we implement the non-multiplication engine that efficiently quantizes the given weights and uses a LUT instead of a multiplier. As shown in Fig. 2, the processing element (PE) is designed with two components: steering logic and arithmetic logic. The steering logic is composed of a leading-one detector (LOD) that dynamically locates the most significant '1' bit and a multiplexer that extracts the significant digits to send to the LUT. The arithmetic logic, built around the bit sparsity pattern, is composed of a LUT with few entries, an adder, and a shifter (e.g., a barrel shifter), which together implement the multiplication of Eq. (3). Take 5-bit quantization as an example: 1-bit sign, 2-bit exponent, and 2-bit mantissa for the weight, and 2-bit exponent and 3-bit mantissa for the activation (no sign bit, since the ReLU activation function is used). First, e_a + e_w, a 2-bit addition, is realized by the adder; meanwhile, EBSP uses the LUT to realize M_a·M_w. Finally, the LUT output passes through the shifter, which shifts it by the e_a + e_w bits produced by the adder. In this way, the multiplication becomes friendly to circuit implementation. In addition, our method has better adaptability, since the LUT entries only fix the bit-width of the mantissa, and different exponent bit-widths can share the same LUT. In this example, only 2^{2+3} = 32 LUT entries are needed to store the pre-calculated constants, which is quite attractive for implementation. Another benefit of our architecture is that it does not restrict designers from using their preferred design as the smaller core multiplier: if a more efficient multiplier is designed in the future, the arithmetic logic can simply be replaced. The much smaller arithmetic logic justifies the addition of the steering logic, which provides significant overall power and area savings compared with an exact multiplier.
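To make the 5-bit example above concrete, the bit-level sketch below models the PE datapath: a 32-entry mantissa LUT indexed by the 2-bit weight and 3-bit activation mantissa codes, a small exponent adder, and a shift, with no multiplier in the MAC path. The fixed-point scaling of the LUT contents and the function name are assumptions; RTL details such as psum width are not modeled.

```python
# Datapath model for the 5-bit example: weight = sign / 2-bit exp / 2-bit mantissa,
# activation = 2-bit exp / 3-bit mantissa (no sign, ReLU inputs).
N_MAN_W, N_MAN_A = 2, 3
FRAC = N_MAN_W + N_MAN_A  # fixed-point fraction bits of the LUT contents

# 2^(2+3) = 32 pre-calculated entries: (1 + mw/4) * (1 + ma/8), stored as integers.
LUT = [[round((1 + mw / 2**N_MAN_W) * (1 + ma / 2**N_MAN_A) * 2**FRAC)
        for ma in range(2**N_MAN_A)] for mw in range(2**N_MAN_W)]

def pe_multiply(sign_w: int, e_w: int, m_w: int, e_a: int, m_a: int) -> float:
    """One EBSP PE operation: exponent add + LUT match + shift (no multiplier)."""
    shift = e_w + e_a                    # 2-bit exponent adder
    mant = LUT[m_w][m_a]                 # LUT lookup of M_w * M_a
    product = mant << shift              # barrel shift by e_a + e_w
    value = product / 2**FRAC            # interpret the fixed point (checking only)
    return -value if sign_w else value

# weight 3 = +2^1 * 1.5 (e_w=1, m_w=2); activation 6 = 2^2 * 1.5 (e_a=2, m_a=4)
print(pe_multiply(0, 1, 2, 2, 4))  # -> 18.0
```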
4.2 Execution Flow with NME
With our NME, we accelerate the quantized NN through minor hardware modifications, significantly boosting hardware utilization and inference performance, as explained in Fig. 3:

Step 1: Data Preparation (Quantization). At the beginning of an EBSP run, the quantized weights are loaded into an on-chip buffer, i.e., the weight buffer shown in Fig. 3 (b). We use EBSP-quantized data throughout the execution flow, where the quantized data consist of an exponent and a mantissa, as expressed in Eq. (3). Therefore, the activations of each layer will also be in the same format. To use these results as input activations and multiply them with the low-precision weights, we first quantize them using the quantizer and then store them in the input buffer.
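The quantizer in Step 1 can be sketched as below: it converts a ReLU output tensor into the shared-shift exponent/mantissa format of Eq. (2) so that it can be stored in the input buffer. The handling of zeros, the choice of the layer shift factor k, and the function name are assumptions made for illustration.

```python
import torch

def quantize_activation(x: torch.Tensor, k: int, n_man: int):
    """Convert non-negative activations to (exponent e, mantissa code m) per Eq. (2)."""
    y = (x / 2 ** k).clamp(min=2 ** -20)          # avoid log2(0); true zeros need a flag
    e = torch.floor(torch.log2(y))                # element-wise exponent
    m = torch.round((y / 2 ** e - 1) * 2 ** n_man).clamp(0, 2 ** n_man - 1)
    x_hat = 2 ** k * 2 ** e * (1 + m / 2 ** n_man)  # dequantized value, for reference
    return e.to(torch.int8), m.to(torch.int8), x_hat

# e, m, x_hat = quantize_activation(torch.relu(features), k=0, n_man=3)
```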
Step 2: Computation Optimization. As shown in Fig. 3 (b), after quantization the low-precision inputs and weights are stored in the input buffer and the weight buffer, and we then perform efficient MAC computations using the PE array. Fig. 3 (a) gives a detailed illustration of the EBSP-based computation steps with an example: a matrix multiplication with one superposition of adjacent exponents. In the initialization phase, the parameter matrices are loaded (Cycle 0) and the operands involved in the calculation are fetched (Cycle 1). Since initialization is performed only once at the beginning, fetching operands causes only a small overhead, and the number of computation cycles is proportional to the number of multiplications. In the next six cycles (the computation phase), three shifts, three additions, and two accumulations are performed to produce one element of the output matrix. In Cycle 2, the steering logic locates the most significant '1' and, combined with our quantization method, obtains the power and the signal indicating whether superposition is needed. In Cycle 3, the addition is performed, since the operands of the multiplication have been simplified to powers of 2. In Cycle 4, a left-shift operation is performed on the result of matching the LUT, according to the adder's output. In Cycle 5, the psum is fed into the adder tree for accumulation. The subsequent operands then repeat Cycles 2-4, and the output is written back in Cycle 8. This execution pipeline continues until the matrix multiplication is completed.

Step 3: Generating Activations. The results of the PE array are accumulated with the partial sums and sent to the Special Function Unit (SFU) to calculate the final output. The SFU performs activation functions (e.g., ReLU), pooling functions (e.g., average pooling and max pooling), or other operations.
It is worth noting that other quantization methods, such as INT8 quantization [14], are also applicable to our EBSP scheme, which will be discussed in Section 5.2.

5 EXPERIMENTS

5.1 Experimental Methodology
1) Validation of NN Accuracy. We evaluate the quantization framework on the CIFAR-10 [15] and ImageNet [6] datasets with widely used CNNs for image classification, including AlexNet [16], VGG-16 [7], ResNet-18, ResNet-50 [13], and MobileNet-v2 [22]. The EBSP algorithm is implemented in PyTorch. We use pre-trained models from the PyTorch Model Zoo as the basis for the EBSP algorithm, and the network structures are derived from the models in torchvision [18]. We reconstruct the convolutional layers so that the quantization process for the weights and activations is inserted before the built-in convolution function call. The experiments run in a CUDA 10.2 environment with a Tesla V100 GPU.
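As a rough illustration of how the convolutional layers are reconstructed, the wrapper below quantizes weights and activations before the built-in convolution call. Here ebsp_quantize is only a stand-in for the actual EBSP quantizer, and the class name and placement of the quantizers are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ebsp_quantize(x: torch.Tensor) -> torch.Tensor:
    """Placeholder for the EBSP quantizer (bit-sparsity-pattern quantization)."""
    return x  # the real quantizer returns the pattern-constrained low-precision tensor

class EBSPConv2d(nn.Conv2d):
    """Conv layer that quantizes weights and activations before the built-in conv call."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = ebsp_quantize(self.weight)
        x_q = ebsp_quantize(x)
        return F.conv2d(x_q, w_q, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# layer = EBSPConv2d(64, 128, kernel_size=3, padding=1)
```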
Table 1: Configurations of different accelerators under a 45-nm standard-cell library.

|             | Eyeriss [4] | BitFusion [24] | WAX [10]    | OLAccel [20] | EBSP       |
|-------------|-------------|----------------|-------------|--------------|------------|
| Bit-width   | 16-bit      | 4-bit          | 8-bit       | 4 & 16-bit   | 6-bit (3)† |
| Data format | Integer     | Integer        | Fixed-point | Integer      | Integer    |
| # PEs       | 224         | 3,168          | 102         | 2,499        | 4,818      |
| Area (mm²)  | 0.32        | 0.32           | 0.32        | 0.32         | 0.32       |

† (3) denotes the length of the bit sparsity pattern, which determines the number of LUT entries.

Figure 4: Comparison of DNN accuracy and percentage of bit-width for different networks. (Top-1 accuracy and the percentage of 4-/5-/6-/8-/16-bit computation for Eyeriss, BitFusion, WAX, OLAccel, and EBSP on AlexNet, ResNet-18, ResNet-50, VGGNet-16, and MobileNet-v2 over CIFAR-10 and ImageNet.)
As shown in Fig. 4, our EBSP exhibits nearly no accuracy loss for CIFAR-10 compared to Eyeriss (full INT16), BitFusion (full INT8), and WAX (full 8-bit fixed-point), and a 2.2% accuracy improvement over OLAccel. For ImageNet, EBSP shows a 0.31% accuracy loss compared to Eyeriss, WAX, and BitFusion, and a significant 4.32% accuracy improvement over OLAccel. Fig. 4 also shows the percentage of bit-widths used in the computation for these designs. Both EBSP and OLAccel take full advantage of low-bit quantization, but the computation in EBSP is done with LUTs and realized by the bit sparsity pattern. OLAccel, however, uses a fixed quantization configuration across all layers and datasets. In contrast, EBSP evolves the bit sparsity pattern during quantization and takes the LUT cost and network accuracy into account in training, so the number of bits used for weight and activation values can vary from layer to layer. As demonstrated on the compact, up-to-date MobileNet-v2, which has a much smaller parameter size, EBSP is more robust at preserving DNN accuracy (0.78% loss, with 7% of the computation at 4-bit, 51% at 5-bit, and 42% at 6-bit) than OLAccel (5.4% loss with 85% at 4-bit).

[Figure residue, caption lost in extraction: a normalized comparison of Eyeriss, BitFusion, WAX, OLAccel, and EBSP on ResNet-18, ResNet-50, AlexNet, VGG-16, and MobileNet-V2, broken down into CORE, Global Buffer, and DRAM.]
EBSP delivers nearly 93% performance improvement because it mostly uses simplified PEs to realize the MACs for DNNs. Since only EBSP considers fine-grained bit sparsity in the quantization, it achieves the highest throughput among these designs (a 93.87% performance improvement compared with Eyeriss). Compared to OLAccel, which uses 16-bit for the first and last layers and 4-bit for the remaining layers, EBSP achieves a 49% average performance improvement thanks to the larger number of PEs under the same area budget (see Table 1). Note that, for benchmarks like ImageNet, EBSP delivers a much larger speedup with only a 0.42% accuracy loss, whereas OLAccel pays about a 5% accuracy loss.

Figure 7: Analysis of the pattern length. m-bit (n) denotes that the value is quantized to m bits with a bit sparsity pattern of length n. (Configurations range from 4-bit (2) to 6-bit (4); energy is normalized to Eyeriss.)

2) Design Space Exploration. Fig. 7 studies the impact of the pattern length on ImageNet; the plot reports the average accuracy loss and energy across the five networks. Note that EBSP is able to tune the length of the bit sparsity pattern to sustain the same accuracy level as Eyeriss while gaining notable energy efficiency. Specifically, EBSP incurs an average 0.31% accuracy loss compared to Eyeriss, BitFusion, and WAX, whereas OLAccel suffers a 4.3% accuracy loss. In terms of energy efficiency, EBSP with different pattern lengths reduces energy consumption by an average of 87% compared to Eyeriss. Moreover, a longer pattern length leads to higher DNN accuracy but increases the energy consumption. From this plot, we find that with a bit-width of 5 bits and a pattern length of 3, EBSP achieves an optimal point (0.13% accuracy loss with a 97.3% energy reduction over Eyeriss) on ImageNet.
Fig. 8 shows the impact of the bit sparsity pattern on INT8 for ResNet-50, sweeping the pattern length from 2 to 7. Generally, a shorter pattern length means fewer LUT entries and allows lower energy consumption, but it degrades the network accuracy. As in Fig. 7, we normalize the energy efficiency to Eyeriss. We find that for INT8 with a pattern length of 7, the energy efficiency is inferior to that of the direct INT8 scheme (i.e., BitFusion), which is caused by the excessive number of LUT entries. However, as the pattern length decreases, the energy efficiency increases. We see that bit sparsity patterns of length 4 and 5 are the most beneficial for energy saving with almost no accuracy loss (0.53% loss for length 4 and 0.27% loss for length 5).

Figure 8: Analysis of the bit sparsity pattern on INT8. (Energy and accuracy loss for 8-bit quantization with pattern lengths 2 to 7, with BitFusion's plain 8-bit design as the energy reference.)

6 CONCLUSIONS
Sparsity and quantization are appealing tools for resource-efficient DNN design. However, it is challenging to incorporate sparsity into quantization and convert it into practical benefits. We have introduced a novel methodology that forms bit sparsity patterns in quantization-aware training and reaps the full advantages of sparsity and quantization while preserving better DNN accuracy. The proposed quantization with the bit sparsity pattern allows LUT-based PEs to replace multiplication with minimal modifications to existing hardware platforms, thus delivering remarkable performance benefits. Our evaluation shows that the proposed EBSP scheme outperforms the evaluated state-of-the-art accelerator designs.

ACKNOWLEDGMENTS
This work was partially supported by the National Natural Science Foundation of China (NSFC) (Grant No. 61834006 and 62102257) and the National Key Research and Development Program of China (2018YFB1403400). We thank the Wu Wen Jun Honorary Doctoral Scholarship, AI Institute, Shanghai Jiao Tong University. Li Jiang is the corresponding author ([email protected]).

REFERENCES
[1] Vahideh Akhlaghi et al. 2018. SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks. In ISCA.
[2] Rajeev Balasubramonian et al. 2017. CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories. TACO (2017).
[3] Sung-En Chang et al. 2021. Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework. In HPCA. IEEE.
[4] Yu-Hsin Chen et al. 2016. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. JSSC (2016).
[5] Synopsys Design Compiler. 2019. [Online]. Available: https://www.synopsys.com/support/training/rtlsynthesis/design-compiler-rtl-synthesis.html.
[6] Jia Deng et al. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR. IEEE, 248–255.
[7] Lei Deng et al. 2020. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey. Proc. IEEE 108, 4 (2020), 485–532.
[8] Steven K. Esser et al. 2020. Learned Step Size Quantization. In ICLR.
[9] Michael Gautschi, Michael Schaffner, et al. 2016. A 65nm CMOS 6.4-to-29.2 pJ/FLOP@0.8V Shared Logarithmic Floating Point Unit for Acceleration of Nonlinear Function Kernels in a Tightly Coupled Processor Cluster. In ISSCC. IEEE.
[10] Sumanth Gudaparthi et al. 2019. Wire-Aware Architecture and Dataflow for CNN Accelerators. In MICRO.
[11] Song Han et al. 2016. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR.
[12] Song Han et al. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. ACM SIGARCH Computer Architecture News (2016).
[13] Kaiming He et al. 2016. Deep Residual Learning for Image Recognition. In CVPR.
[14] Benoit Jacob et al. 2018. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In CVPR.
[15] Alex Krizhevsky et al. 2009. Learning Multiple Layers of Features from Tiny Images.
[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM (2017).
[17] Fangxin Liu et al. 2021. Improving Neural Network Efficiency via Post-Training Quantization with Adaptive Floating-Point. In ICCV.
[18] Sébastien Marcel and Yann Rodriguez. 2010. Torchvision: The Machine-Vision Package of Torch. In MM.
[19] Wei Niu et al. 2020. PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-Based Weight Pruning. In ASPLOS.
[20] Eunhyeok Park et al. 2018. Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation. In ISCA.
[21] Akshay Krishna Ramanathan et al. 2020. Look-Up Table Based Energy Efficient Processing in Cache Support for Neural Network Acceleration. In MICRO. IEEE.
[22] Mark Sandler, Andrew Howard, Menglong Zhu, et al. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In CVPR. 4510–4520.
[23] Sayeh Sharify et al. 2019. Laconic Deep Learning Inference Acceleration. In ISCA (Phoenix, Arizona) (ISCA '19). 304–317.
[24] Hardik Sharma et al. 2018. Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network. In ISCA.