EncodingNet: A Novel Encoding-Based MAC Design for Efficient Neural Network Acceleration

Figure 1: (a) Structure of the systolic array according to [5]. (b) Structure of a MAC unit.

In1 (Activation)   In2 (Weight)   Trad. Enc. b3 b2 b1 b0   New Enc. b4 b3 b2 b1 b0   Value v
10                 01             1110                     10111                     -2
11                 10             0010                     01011                      2
11                 11             0001                     00001                      1
11                 00             0000                     11111                      0
11                 01             1111                     10101                     -1
00                 10             0000                     11111                      0
00                 11             0000                     11111                      0
00                 00             0000                     11111                      0
00                 01             0000                     11111                      0
01                 10             1110                     11011                     -2
01                 11             1111                     11001                     -1
01                 00             0000                     11111                      0
01                 01             0001                     11101                      1

v = Σ_{i=0}^{M-1} s_i × b_i, where s_i is the position weight.
Trad.: M = 4, s_3 = -8, s_2 = 4, s_1 = 2, s_0 = 1.
New: M = 5, s_4 = -4, s_3 = 2, s_2 = 2, s_1 = -1, s_0 = 1.

Figure 2: (a) Truth tables of multipliers with the traditional encoding and a new encoding. (b) The traditional 2-bit signed multiplier. (c) The multiplier with a new encoding.

the original multiplier outputs. Therefore, the logic complexity of this mapping becomes much lower due to this projection of the outputs onto wide bits.
• The wide bits at the outputs of the encoding-based multipliers carry individual position weights, which are trained for specific neural networks to enhance inference accuracy. The wide bits and the corresponding position weights are used to calculate the outputs of neurons by bit-wise weighted accumulation in a MAC array. These outputs at neurons are in the original formats specified by the neural networks with either uniform or non-uniform quantization, so that the proposed design is compatible with existing computing systems.

• Since the critical paths in the encoding-based MAC design become much shorter, the pipelining stages in the MAC array with these simplified circuits can be reduced significantly, which can be exploited to reduce the area and power consumption of the MAC array.

The rest of the paper is structured as follows. Section 2 explains the motivation of this work. Section 3 elaborates the details of the proposed encoding-based MAC design. Experimental results are presented in Section 4 and conclusions are drawn in Section 5.

2 Motivation
In DNNs, there are massive numbers of MAC operations. Existing digital hardware platforms use many parallel MAC units, e.g., 65,536 in the systolic array of TPU v1 [5], to accelerate DNNs. The structure of this systolic array is sketched in Fig. 1(a), while the internal structure of a MAC unit is shown in Fig. 1(b). In the systolic array, weights are preloaded and activations are streamed as inputs. The partial sum of a multiplication is propagated along a column to calculate the multiplication result of an input vector and a weight vector. Between rows and columns there are flip-flops. Therefore, the activations are shifted to match the propagation of the partial sums at the MAC units.

In such a MAC unit, the inputs of the multiplier are represented in two's complement to express integer values. The circuit of the multiplier is defined by the truth table which enumerates all the input combinations. For example, Fig. 2(a) shows the truth table of a multiplier with 2-bit signed inputs In1 and In2. The column Trad. Enc. shows the output bit sequences in the two's complement format corresponding to the decimal numbers in the last column of Fig. 2(a). From this truth table, the logic circuit for this multiplier can be synthesized as shown in Fig. 2(b). As the bit width of the input operands increases, the number of rows in the truth table of a multiplier increases exponentially. Since the synthesized circuit must realize all the rows in the truth table exactly, the circuit becomes complicated quickly. For example, an 8-bit signed multiplier can contain 417 combinational logic gates. Though approximate computing [16] can be applied to reduce the logic complexity of multipliers, this technique still uses the two's complement format to represent the multiplication results and does not take advantage of the full potential of MAC units.

The circuit of a multiplier maps the input combinations to the output combinations. In the traditional design, the bit sequences representing the output combinations of a multiplier are predefined in the two's complement format according to the multiplication function, as shown in the Trad. Enc. column in Fig. 2(a). However, if these bit sequences can be adjusted, the new truth table can lead to a multiplier circuit with a lower logic complexity. For example, the New Enc. column in Fig. 2(a) shows another assignment of bit sequences to represent the same output values of the multiplier, where the bit width has been increased from 4 to 5. Since the number of bits at the output of the multiplier has been increased, different bit sequences can represent the same integer value. For example, both 00111 and 01011 in the New Enc. column in Fig. 2(a) represent the same decimal value 2. From this new bit sequence assignment, a much simpler circuit can be generated, as illustrated in Fig. 2(c).

The bit sequence assignment in Fig. 2(a) is called an encoding. The original encoding of the multiplier shown in the column Trad. Enc. is only one of the possible encodings representing the values at the output of the multiplier. Since various encodings lead to different truth tables for the multiplier, they also result in different circuit complexity after logic synthesis. Therefore, exploring the encoding can be an effective technique to obtain a more efficient circuit implementation for the multiplier.
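As a concrete check of this observation, the following minimal Python sketch (our illustration, not part of the paper's design flow) decodes the New Enc. bit sequences of Fig. 2(a) with the position weights s_4 = -4, s_3 = 2, s_2 = 2, s_1 = -1, s_0 = 1 and verifies that every listed row reproduces the exact product of the 2-bit signed operands:

```python
# Sketch: decode bit sequences with position weights and check that they
# reproduce the products of a 2-bit signed multiplier (Fig. 2(a)).
S_NEW = [-4, 2, 2, -1, 1]          # position weights for b4 down to b0

# (In1, In2, New Enc. bits) for the rows listed in Fig. 2(a).
TRUTH_TABLE = [
    ("10", "01", "10111"), ("11", "10", "01011"), ("11", "11", "00001"),
    ("11", "00", "11111"), ("11", "01", "10101"), ("00", "10", "11111"),
    ("00", "11", "11111"), ("00", "00", "11111"), ("00", "01", "11111"),
    ("01", "10", "11011"), ("01", "11", "11001"), ("01", "00", "11111"),
    ("01", "01", "11101"),
]

def signed2(bits: str) -> int:
    """Two's complement value of a 2-bit operand, in [-2, 1]."""
    v = int(bits, 2)
    return v - 4 if v >= 2 else v

def decode(bits: str, weights) -> int:
    """Value represented by a bit sequence: v = sum_i s_i * b_i."""
    return sum(w * int(b) for w, b in zip(weights, bits))

for in1, in2, enc in TRUTH_TABLE:
    assert decode(enc, S_NEW) == signed2(in1) * signed2(in2)
print("all listed rows decode to the exact products")
```

The same check passes for the traditional encoding with M = 4 and the power-of-two weights -8, 4, 2, 1.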
Figure 3: Two samples of logic mapping from input bits to output bits of a multiplier. The position weights evaluated in each sample are shown at the outputs. The resulting RMSE of each sample is illustrated at the bottom. (a) A circuit sample with a large RMSE. (b) A circuit sample with a small RMSE.

Figure 4: A column in a MAC array consists of encoding-based multipliers and the circuit implementing the addition function, which consists of a bit-wise accumulator and a decoder.

3 Encoding-based MAC Design

To identify a new encoding that simplifies the MAC circuits, two challenges should be addressed. First, the number of encodings is huge, up to 2^(M+16) for an 8-bit multiplier with an M-bit output. For each encoding, a circuit should be generated, which leads to a very long search time. Second, the identified encoding should not make the accumulation of the partial sums generated by the redesigned multipliers complicated.
For example, the new encoding shown in the New Enc. column in Fig. 2(a) also defines the input bit combinations of an adder in a MAC operation. However, it is not an easy task to synthesize an efficient circuit for an adder with an arbitrary input encoding.

To allow a simple implementation of accumulation, we impose an additional constraint that the bits in a bit sequence have position weights. For example, we assign position weights s_0, s_1, s_2, s_3, s_4 to the bit sequences in the New Enc. column of Fig. 2(a). Accordingly, a bit sequence b_4 b_3 b_2 b_1 b_0 represents the number Σ_{i=0}^{4} s_i × b_i. These position weights are adjustable for different neural networks to maintain inference accuracy. Compared with the traditional two's complement number system, where the position weights are fixed to power-of-two values, the adjustable position weights provide more flexibility for the implementation of the multipliers and adders.
3.1 Encoding-based Multiplier Design

To determine the encoding and the logic design of a multiplier with a given bit width, e.g., 8 bits, we use only single-level logic, as illustrated in Fig. 2(c). This can decrease the critical path of the circuit effectively while reducing the area. Under this assumption, an output bit of the multiplier is driven by a single logic gate, which takes operand bits of the multiplier as its inputs. In our design, we consider the single-level logic gates SET, IN, NAND2, NAND3, AND2, OR2, NOT, and XOR3, where the SET gate always outputs a high signal '1' to allow a constant bias in the result to approximate the original multiplication function. The IN gate connects an input signal to an output signal without any logic gate on the connection.
Even with the assumption of single-level logic, the search space to generate the logic for the multiplier can still be large, because for every output bit the gate type to drive it should be selected and the inputs of such a gate should be selected from the input bits of the multiplier. To address this issue, we randomly sample the gate types and the connections from the input bits to create circuit samples. In each sample, we obtain a candidate of the circuit for the multiplier. Fig. 3 illustrates two circuit samples, where the bit width of the multiplication results is set to 48 bits. With such a sample, we can generate the output bit sequence for every bit combination of the operands of the multiplier. In other words, we can create a truth table similar to Fig. 2(a) from such a sampled circuit. Each row in this truth table corresponds to an exact value determined by its output bit sequence, as illustrated in the Value column in Fig. 2(a).
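This sampling step can be sketched as follows (a simplified Python illustration under our own assumptions, shown for a 2-bit multiplier with M = 5 instead of the 8-bit, 48-bit configuration used in the paper):

```python
import itertools
import random

# Single-level gate set from above: name -> (arity, Boolean function).
GATES = {
    "SET":   (0, lambda: 1),                 # constant '1' for a bias term
    "IN":    (1, lambda a: a),               # direct connection, no gate
    "NOT":   (1, lambda a: 1 - a),
    "AND2":  (2, lambda a, b: a & b),
    "OR2":   (2, lambda a, b: a | b),
    "NAND2": (2, lambda a, b: 1 - (a & b)),
    "NAND3": (3, lambda a, b, c: 1 - (a & b & c)),
    "XOR3":  (3, lambda a, b, c: a ^ b ^ c),
}

def sample_circuit(n_in: int, n_out: int, rng: random.Random):
    """One circuit sample: every output bit is driven by one randomly
    chosen gate whose inputs are randomly selected operand bits."""
    circuit = []
    for _ in range(n_out):
        name = rng.choice(sorted(GATES))
        arity = GATES[name][0]
        circuit.append((name, [rng.randrange(n_in) for _ in range(arity)]))
    return circuit

def truth_table(circuit, n_in: int):
    """Output bit sequence for every combination of the operand bits."""
    return [[GATES[g][1](*(bits[i] for i in idx)) for g, idx in circuit]
            for bits in itertools.product((0, 1), repeat=n_in)]

rng = random.Random(0)
circuit = sample_circuit(n_in=4, n_out=5, rng=rng)  # two 2-bit operands
B = truth_table(circuit, n_in=4)                    # 16 rows of 5 bits
```

Fitting position weights to each such sample and ranking the samples by the resulting error then yields the multiplier design, as described next.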
In a sampled circuit, assume that the bit sequence with M bits of the kth row in the truth table is expressed as b^k = b^k_{M-1} ... b^k_0, e.g., b_0 = 10111 in the New Enc. column in Fig. 2(a). The value this bit sequence represents can be calculated as Σ_{j=0}^{M-1} b^k_j × s_j, where s_0, ..., s_{M-1} are the position weights whose exact values will be determined later. The difference between the value this bit sequence approximates and the original value v^k is then |Σ_{j=0}^{M-1} b^k_j × s_j - v^k|.

When all the rows in the truth table of a sampled circuit are considered together, we can determine the position weights by minimizing the root mean square error (RMSE) with which the bit sequences approximate the original values of the multiplication results, as

    s = arg min_s ‖Bs − v‖_2        (1)

where B is the matrix of bit sequences derived from a sampled circuit, corresponding to all the rows in the truth table, s is the vector of all the position weights, and v is the vector of all the original values of the multiplier, e.g., the Value column in Fig. 2(a).

After the position weights for a sampled circuit are determined as described above, we can also obtain the RMSE for each sampled circuit. We execute the sampling process up to 10^4 times and track the trend of the RMSE with the increasing number of samples. When the RMSE becomes stable, we stop the sampling process, and the circuit with the minimum RMSE is returned as the circuit design for the multiplier.
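Equation (1) is an ordinary least-squares problem and can be solved in closed form. As a self-contained numerical check (our own sketch, not the authors' code), fitting position weights to the New Enc. rows of Fig. 2(a) recovers exactly s = (-4, 2, 2, -1, 1) with an RMSE of zero, since that encoding is exact:

```python
import numpy as np

# New Enc. bit sequences of Fig. 2(a) (matrix B) and the products they
# must represent (vector v), one entry per row of the truth table.
rows = ["10111", "01011", "00001", "11111", "10101", "11111", "11111",
        "11111", "11111", "11011", "11001", "11111", "11101"]
B = np.array([[int(b) for b in r] for r in rows], dtype=float)
v = np.array([-2, 2, 1, 0, -1, 0, 0, 0, 0, -2, -1, 0, 1], dtype=float)

# Solve s = argmin_s ||B s - v||_2, i.e., Eq. (1).
s, *_ = np.linalg.lstsq(B, v, rcond=None)
rmse = np.sqrt(np.mean((B @ s - v) ** 2))
print(s.round(6), rmse)   # -> [-4.  2.  2. -1.  1.], RMSE ~ 0
```

For a randomly sampled circuit the residual is generally nonzero, and this RMSE is exactly the quantity used to rank the samples.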
The sampling process described above is based on the assumption that the bit width M at the output of the new multiplier is given. To determine the minimum bit width at the output of the new multiplier, a binary search algorithm is used. Initially, the minimum and the maximum bit width are set to 16 and 128 for an 8-bit multiplier, respectively. Afterwards, the middle bit width, 72, is used to execute the sampling process above. The sampled circuit candidate producing the best approximation is returned and the corresponding RMSE can be evaluated. This RMSE is compared with a target RMSE, which is determined by exhaustively evaluating various RMSEs with respect to the inference accuracy of neural networks; the one that can maintain the inference accuracy is selected. If the RMSE of the returned circuit candidate is larger than the target RMSE, the minimum bit width is updated to the middle bit width in the next iteration, and vice versa. The search algorithm terminates when the distance between the minimum and maximum bit width is equal to 0 or 1.
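The search loop itself is a standard binary search; a compact sketch (with best_rmse_for_width standing in for the sampling procedure above, a hypothetical callable) is:

```python
def search_min_bit_width(target_rmse, best_rmse_for_width,
                         lo: int = 16, hi: int = 128) -> int:
    """Binary search for the smallest output bit width whose best sampled
    circuit reaches the target RMSE (bounds as for the 8-bit case)."""
    while hi - lo > 1:                 # stop when the distance is 0 or 1
        mid = (lo + hi) // 2           # e.g., 72 in the first iteration
        if best_rmse_for_width(mid) > target_rmse:
            lo = mid                   # too inaccurate: needs more bits
        else:
            hi = mid                   # accurate enough: try fewer bits
    return hi
```

With the initial bounds 16 and 128, the first probed width is (16 + 128) // 2 = 72, as stated above.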
Table 1: Power and area of proposed vs. traditional MAC arrays

Size of       Bit-Wid. of Product    Power (W)                      Area (mm^2)
Syst. Arr.    Trad.    Prop.         Trad.    Prop.    Red.         Trad.     Prop.    Red.
32×32         16       48            0.181    0.163     9.94%       0.239     0.172    28.03%
48×48         16       48            0.380    0.259    31.84%       0.513     0.268    47.76%
64×64         16       48            0.652    0.404    38.07%       0.891     0.416    53.36%
128×128       16       48            2.464    1.050    57.38%       3.433     1.043    69.61%
256×256       16       48            9.572    2.854    70.18%       13.473    2.744    79.63%

3.2 Adder Design with Encoding

Since the bit sequences at the output of the multiplier do not follow the two's complement number system, we also need to define a new structure to implement the addition function. For the general case of accumulating the M-bit outputs of N multipliers, the sum can be expressed as Σ_{i=1}^{N} Σ_{j=0}^{M-1} s_j × b^i_j = Σ_{j=0}^{M-1} s_j × Σ_{i=1}^{N} b^i_j, where b^i_j is the jth bit of the output of the ith multiplier.
Accordingly, the circuit to implement this sum can be designed as illustrated in Fig. 4. In this implementation, the corresponding bits of the multipliers are accumulated first, and the position weights are multiplied with these accumulation results only once, at the bottom of each column in a MAC array. The results of these multiplication operations are added by an adder tree to generate the data in the two's complement format for further functions, e.g., activation and batch normalization. We call the multipliers for applying position weights together with the adders for generating two's complement numbers a decoder.
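The reordering of the two sums is the key saving: the costly weighting by s_j happens only M times per column instead of N × M times. A small numpy sketch (our illustration) confirms the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 5                          # N multiplier outputs of M bits each
bits = rng.integers(0, 2, (N, M))    # bits[i][j]: jth bit of multiplier i
s = np.array([-4, 2, 2, -1, 1])      # position weights (New Enc. example)

# Naive order: decode every multiplier output, then add the N values.
naive_sum = (bits @ s).sum()

# Column hardware: accumulate each bit position over the N multipliers
# first, then apply the position weights once in the decoder.
bit_counts = bits.sum(axis=0)        # sum_i b_j^i for every position j
decoder_sum = bit_counts @ s         # sum_j s_j * (sum_i b_j^i)

assert naive_sum == decoder_sum
```

In hardware, the per-bit accumulations are narrow adders, and only the decoder at the bottom of each column applies the position weights and produces two's complement results.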
3.3 Design and Application of MAC Array

We deploy the encoding-based multipliers and adders to construct a MAC array with a size of N × N to execute MAC operations in DNNs efficiently, one column of which is illustrated in Fig. 4. At the inputs of each encoding-based multiplier, flip-flops are inserted to allow the reuse of inputs, similar to the traditional systolic array. Different from the traditional systolic array, where each MAC unit has an individual adder, the adders appear only at the bottom of each column and perform the bit-wise weighted accumulation of that column. Another difference is that there are no flip-flops for storing the multiplication results of the multipliers, due to the shorter critical paths inside the encoding-based multipliers.

To execute MAC operations with the encoding-based MAC array, the weights of a neural network are first loaded into the flip-flops of each multiplier. Activations are streamed as inputs. Since there are no flip-flops for storing the intermediate multiplication results in each multiplier, activations are not required to be shifted as in the traditional systolic array. Activations belonging to the inputs of a neuron can enter each column simultaneously, and the results of multiplication and bit-wise accumulation are obtained after each clock cycle.
To enhance the inference accuracy of neural networks executed on the encoding-based MAC array, we further fine-tune the adjustable position weights for specific neural networks. These position weights are initially set to the values determined by minimizing the RMSE with which the bit sequences approximate the original multiplication results. In fine-tuning, the straight-through estimator (STE) [25] is used to propagate gradients through the encoding-based multipliers.
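A minimal PyTorch sketch of this fine-tuning step (our own simplified illustration, not the authors' released code; the shapes and the precomputed bit tensor are assumptions) could look as follows:

```python
import torch

class EncodingMAC(torch.autograd.Function):
    """Forward: bit-wise weighted sum of encoding-based multiplier outputs.
    Backward: straight-through estimator (STE) [25] for the
    non-differentiable bit-level mapping."""

    @staticmethod
    def forward(ctx, x, w, bits, s):
        # bits: (K, M) float tensor of 0/1 output bits obtained by table
        # lookup for K quantized operand pairs taken from x and w.
        ctx.save_for_backward(x, w, bits)
        return bits @ s                  # decode with position weights s

    @staticmethod
    def backward(ctx, grad_out):
        x, w, bits = ctx.saved_tensors
        # s is linear in the output and receives its exact gradient;
        # x and w receive straight-through gradients of an ideal product.
        return grad_out * w, grad_out * x, None, bits.t() @ grad_out
```

Here s is the trainable position-weight vector that the optimizer updates with the network-specific learning rates reported in Section 4.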
The encoding-based MAC array has slightly better throughput and latency than the traditional systolic array of the same size, while achieving a much lower area cost and power consumption. Assume that a weight matrix W with a size of N × N has been loaded into the encoding-based array and into the traditional MAC array, and denote the clock period as T. To finish a computation with an input matrix I_0 with a size of N × N, the latencies of the encoding-based array and the traditional MAC array are (2N − 1)T and (3N − 2)T, respectively. To evaluate the throughput, we assume that m input matrices with sizes of N × N need to be processed by the MAC arrays. The processing times per input matrix of the encoding-based array and the traditional MAC array are then [(2N − 1) + N(m − 1)] × T / m and [(3N − 2) + N(m − 1)] × T / m, respectively. The proposed design exhibits a higher performance, and the throughputs of the two designs become nearly the same as m becomes large.
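For concreteness, evaluating these expressions for N = 256 at a 1 GHz clock (our own numeric sketch):

```python
N, T = 256, 1e-9    # array size and clock period (1 GHz -> T = 1 ns)

def time_per_matrix(first_latency_cycles: int, m: int) -> float:
    """Average processing time per input matrix for m pipelined
    matrices: [(first latency) + N*(m - 1)] * T / m."""
    return (first_latency_cycles + N * (m - 1)) * T / m

for m in (1, 10, 1000):
    enc = time_per_matrix(2 * N - 1, m)     # encoding-based array
    trad = time_per_matrix(3 * N - 2, m)    # traditional systolic array
    print(f"m={m}: encoding {enc:.3e} s, traditional {trad:.3e} s")
# Both values approach N*T = 2.56e-7 s as m grows, matching the text.
```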
The proposed new encoding technique is not limited to simplifying the traditional multiplier in the MAC array. For example, it can process the truth table of multiplication with non-uniform quantization directly, without requiring the conversion of the non-uniform encoding into the two's complement format as required in the traditional design. In such a case, the final hardware design becomes specific to neural networks with the corresponding non-uniform quantization, but the hardware cost can be reduced even further in such application-specific computing scenarios. Since the inputs of the multipliers and the final output of the MAC operations are in the original formats defined by the neural networks, this new design is also compatible with the existing training and inference frameworks of neural networks.

4 Experimental Results

To verify the proposed encoding-based MAC array, we synthesized the encoding-based MAC circuits with the NanGate 15 nm cell library [27]. These MAC circuits approximate the results of uniformly quantized 8-bit MAC units. Such MAC circuits were then used to construct a MAC array similar to Fig. 4. In synthesizing the circuits, the clock frequency was set to 1 GHz. Power and area analysis of this hardware was conducted with Design Compiler from Synopsys. Such a MAC array can efficiently execute neural networks while maintaining a high inference accuracy. To verify this, we tested the accuracy of three neural networks together with the corresponding datasets, ResNet18-Cifar10, ResNet20-Cifar100 and ResNet50-ImageNet, using the new MAC design. These neural networks were initially trained using PyTorch, and pretrained weights were loaded from Torchvision [28] and public repositories on GitHub [29, 30]. The learning rates of the position weights in the novel encoding for fine-tuning ResNet18, ResNet20, and ResNet50 were set to 1e-3, 1e-3, and 1e-8, respectively.
Table 2: Inference accuracy of neural networks executed on the proposed MAC array

Figure 5: The relationship between the bit width of product in the encoding-based multipliers and the inference accuracy of neural networks executed on the proposed MAC array and the power consumption as well as area. (a) Bit width vs. accuracy. (b) Bit width vs. power and area.

Figure 6: The relationship between bit width, number of samples and RMSEs in the binary search algorithm. (a) Bit width vs. RMSEs. (b) Number of samples vs. RMSEs.

Table 1 shows the comparison between the proposed MAC array and the traditional systolic array in power consumption and area cost. Different sizes were used to verify the advantages of the hardware platform, as shown in the first column. The second and the third columns are the bit width of the product, i.e., the multiplication result, in the multipliers of the traditional systolic array and the bit width of the encoding-based multipliers in the proposed MAC design, respectively. The latter is determined by the search algorithm in Section 3.1. Although the bit width of the encoding-based multipliers is larger than that of the traditional ones, the power consumption and the area cost of the MAC array exhibit a significant advantage, because the logic to generate these intermediate bits is much simpler compared with the traditional multiplication. This advantage can be clearly seen from the last six columns in Table 1, where the columns Trad. and Prop. show the power consumption and area of the traditional MAC design and the proposed design, respectively, and the columns Red. show the ratios of reduction. Besides, with decreasing size of the MAC array, the reduction of power consumption and area becomes smaller. This phenomenon results from the fact that the bit-wise accumulators and decoders at the bottom of the columns incur additional area cost and thus power consumption. When the MAC array has a small size, this incurred area and power cost contributes much to the total cost.

The proposed new MAC design can execute neural networks with high inference accuracy while consuming less power. Table 2 demonstrates the inference accuracy of neural networks with three different quantization settings, namely, 8-bit uniform quantization, 4-bit non-uniform quantization for all layers [26], and 4-bit non-uniform quantization with the first and last layers using 8-bit uniform quantization [26]. For the 8-bit uniform quantization, the corresponding weights and input activations can be processed directly with the MAC array. For the two settings of non-uniform quantization, the non-uniform levels are first converted to the closest levels in 8-bit uniform quantization, and then such 8-bit uniform quantization levels are used for the MAC operations.

The second column of Table 2 is the inference accuracy evaluated with 32-bit floating-point weights and input activations at software level. The inference accuracy of neural networks with 8-bit uniform quantization and 4-bit non-uniform quantizations at software level is shown in the third, sixth and ninth columns. According to these columns, it is clear that 8-bit uniform quantization can nearly maintain the inference accuracy of floating-point data. The first setting of 4-bit non-uniform quantization for all layers leads to a relatively large accuracy degradation due to the limited data representation, while the second setting, including 8-bit quantization in the first and the last layers, can achieve a better accuracy. The inference accuracy of neural networks executed on the proposed MAC design is shown in the fourth, seventh and tenth columns in Table 2. Due to the approximation in the proposed encoding technique, there is a slight accuracy loss compared with that at software level, as shown in the fifth, eighth and eleventh columns.

To determine the minimum bit width of the product in the encoding-based multipliers, a binary search algorithm is applied. The results of this search are illustrated in Fig. 5. Fig. 5(a) shows the relationship between the bit widths and the inference accuracy of neural networks. Fig. 5(b) shows the relationship between the bit widths and the power consumption as well as the area cost. According to Fig. 5(a), the inference accuracy becomes higher with increasing bit width of the product and stays stable around 48 bits. Accordingly, 48 bits were used for the output bit width of the product. In Fig. 5(b), power consumption and area cost increase very slowly when the bit width of the product grows larger, and there is no linear relationship between the bit width and the power consumption as well as the area, due to the randomness in the search algorithm and the logic simplification in synthesis.

In the binary search algorithm to determine the minimum bit width of the product, a target RMSE was used as guidance, which was set to the one that can maintain the inference accuracy of neural networks. The results of the search process are illustrated in Fig. 6(a). According to this figure, after several search iterations, the RMSE becomes stable, and the minimum bit width that can achieve an RMSE smaller than the target RMSE, which is 48 bits, was selected. For a specified bit width, a given number of encoding samples, 10^4, were used to generate the encoding-based multipliers and then determine their RMSEs. To verify this, different numbers of samples were used to evaluate the RMSEs, as shown in Fig. 6(b).
Figure 7: The relationship between the bit width of product and the inference accuracy of neural networks executed on the task-specific hardware platforms and the power consumption as well as area. Gen.-Pur. indicates the encoding-based MAC array aiming to execute various neural networks. (a) Bit width vs. inference accuracy. (b)(c)(d) Bit width vs. power and area for different neural networks.
According to Fig. 6(b), with the increasing number of samples, the RMSE is reduced and becomes stable when the sample number reaches 10^4.

The proposed encoding technique can benefit task-specific hardware platforms even more in the reduction of power consumption and area than the general-purpose encoding-based MAC array, because the truth table of the multiplier after non-uniform quantization can be used directly to search for a new efficient multiplier design. The conversion of the non-uniform quantization into the 8-bit two's complement encoding required in the traditional computing system can thus be avoided. To verify this advantage, we first trained a specific neural network with 4-bit non-uniform quantization in all layers. Afterwards, we applied the proposed binary search algorithm to determine the minimum bit width of the product for the encoding-based multipliers. The multipliers were then used to construct a MAC array with a size of 256 × 256 that is designed specifically to execute a given neural network.
The results are illustrated in Fig. 7. Fig. 7(a) shows that in the search of the bit width for a specific neural network, e.g., ResNet18, the inference accuracy of this neural network improves as the bit width increases. When the bit width is around 31 bits, the inference accuracy becomes stable for ResNet18. This bit width is much smaller than the bit width for the 8-bit multiplier, which requires 48 bits to represent more information in the computation results. The relationship between the bit width and the power consumption as well as the area for ResNet18 is shown in Fig. 7(b). It can be observed that the power consumption and area of the task-specific design are smaller than those of the 8-bit MAC design. Similar results for ResNet20 and ResNet50 are shown in Fig. 7(c)(d).

5 Conclusion

In this paper, we propose a novel digital MAC design based on encoding. With this technique, the complex logic in traditional multipliers can be replaced with single-level logic to reduce the critical path and area significantly. The position weights allow a bit-wise weighted accumulation to calculate the outputs of neurons, and the pipelining stages between rows of multipliers can be reduced to lower the area cost further. With this new design, the area and power consumption of the MAC array can be reduced by up to 79.63% and 70.18%, respectively, compared with the traditional design, while the inference accuracy is still maintained. Future work will study the tradeoff between the bit width at the output of the multipliers and the complexity of the single-level as well as multi-level logic implementation in the simplified multipliers.
References

[1] https://openai.com/blog/chatgpt.
[2] T. Brown et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1877–1901.
[3] D. Patterson et al., "The carbon footprint of machine learning training will plateau, then shrink," Computer, vol. 55, no. 7, pp. 18–28, 2022.
[4] https://www.eia.gov/tools/faqs/faq.php?id=97&t=3.
[5] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in International Symposium on Computer Architecture (ISCA), 2017.
[6] Y. Chen, T. Chen, Z. Xu, N. Sun, and O. Temam, "DianNao family: Energy-efficient hardware accelerators for machine learning," Communications of the ACM, vol. 59, no. 11, pp. 105–112, 2016.
[7] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits (JSSC), vol. 52, no. 1, pp. 127–138, 2017.
[8] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in International Conference on Learning Representations (ICLR), 2016.
[9] T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang, "Pruning and quantization for deep neural network acceleration: A survey," Neurocomputing, vol. 461, pp. 370–403, 2021.
[10] M. Jiang, J. Wang, A. Eldebiky, X. Yin, C. Zhuo, I.-C. Lin, and G. L. Zhang, "Class-aware pruning for efficient neural networks," in Design, Automation and Test in Europe (DATE), 2024.
[11] R. Petri, G. L. Zhang, Y. Chen, U. Schlichtmann, and B. Li, "PowerPruning: Selecting weights and activations for power-efficient neural network acceleration," in Design Automation Conference (DAC), 2023.
[12] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," Neural Information Processing Systems (NeurIPS), 2014.
[13] Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, "Dynamic neural networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 44, pp. 7436–7456, 2021.
[14] J. Wang, B. Li, and G. L. Zhang, "Early-exit with class exclusion for efficient inference of neural networks," in International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2024.
[15] M. Wistuba, A. Rawat, and T. Pedapati, "A survey on neural architecture search," arXiv, 2019.
[16] G. Armeniakos, G. Zervakis, D. J. Soudris, and J. Henkel, "Hardware approximate techniques for deep neural network accelerators: A survey," ACM Computing Surveys, vol. 55, pp. 1–36, 2022.
[17] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, "A survey of quantization methods for efficient neural network inference," arXiv, 2021.
[18] W. Sun, G. L. Zhang, H. Gu, B. Li, and U. Schlichtmann, "Class-based quantization for neural networks," in Design, Automation and Test in Europe (DATE), 2023.
[19] N. D. Gundi, T. Shabanian, P. Basu, P. Pandey, S. Roy, K. Chakraborty, and Z. Zhang, "EFFORT: Enhancing energy efficiency and error resilience of a near-threshold tensor processing unit," in Asia and South Pacific Design Automation Conference (ASP-DAC), 2020, pp. 241–246.
[20] P. Pandey, N. D. Gundi, K. Chakraborty, and S. Roy, "UPTPU: Improving energy efficiency of a tensor processing unit through underutilization-based power-gating," in Design Automation Conference (DAC), 2021, pp. 325–330.
[21] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI," in International Solid-State Circuits Conference (ISSCC), 2017, pp. 246–247.
[22] J. Nunez-Yanez, "Energy proportional neural network inference with adaptive voltage and frequency scaling," IEEE Transactions on Computers (TC), vol. 68, no. 5, pp. 676–687, 2019.
[23] D. Miyashita, E. H. Lee, and B. Murmann, "Convolutional neural networks using logarithmic data representation," arXiv, 2016.
[24] M. Valueva, N. Nagornov, P. Lyakhov, G. Valuev, and N. Chervyakov, "Application of the residue number system to reduce hardware costs of the convolutional neural network implementation," Mathematics and Computers in Simulation, vol. 177, pp. 232–243, 2020.
[25] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv, 2013.
[26] M. Cho, K. Alizadeh-Vahid, S. Adya, and M. Rastegari, "DKM: Differentiable k-means clustering layer for neural network compression," in International Conference on Learning Representations (ICLR), 2021.
[27] "15nm Open-Cell Library and 45nm FreePDK," https://si2.org/open-cell-library/.
[28] "Pretrained ImageNet models," https://pytorch.org/vision/stable/models.html.
[29] "Pretrained Cifar10 models," https://github.com/huyvnphan/PyTorch_CIFAR10.
[30] "Pretrained Cifar100 models," https://github.com/weiaicunzai/pytorch-cifar100.