EncodingNet: A Novel Encoding-Based MAC Design for Efficient Neural Network Acceleration

Figure 1: (a) Structure of the systolic array according to [5]. (b) Structure of a MAC unit.

In1 (Activation)   In2 (Weight)   Trad. Enc. b3 b2 b1 b0   New Enc. b4 b3 b2 b1 b0   Value v
10                 01             1110                     10111                     -2
11                 10             0010                     01011                      2
11                 11             0001                     00001                      1
11                 00             0000                     11111                      0
11                 01             1111                     10101                     -1
00                 10             0000                     11111                      0
00                 11             0000                     11111                      0
00                 00             0000                     11111                      0
00                 01             0000                     11111                      0
01                 10             1110                     11011                     -2
01                 11             1111                     11001                     -1
01                 00             0000                     11111                      0
01                 01             0001                     11101                      1

v = Σ_{i=0}^{M-1} s_i × b_i, where s_i is the position weight.
Trad.: M = 4, s_3 = -8, s_2 = 4, s_1 = 2, s_0 = 1.
New: M = 5, s_4 = -4, s_3 = 2, s_2 = 2, s_1 = -1, s_0 = 1.

Figure 2: (a) Truth tables of multipliers with the traditional encoding and a new encoding. (b) The traditional 2-bit signed multiplier. (c) The multiplier with a new encoding.

the original multiplier outputs. Therefore, the logic complexity of this mapping becomes much lower due to this projection of the outputs onto wide bits.
• The wide bits at the outputs of the encoding-based multipliers carry individual position weights, which are trained for specific neural networks to enhance inference accuracy. The wide bits and the corresponding position weights are used to calculate the outputs of neurons by bit-wise weighted accumulation in a MAC array. These outputs at neurons are in the original formats specified by the neural networks with either uniform or non-uniform quantization, so that the proposed design is compatible with existing computing systems.

• Since the critical paths in the encoding-based MAC design become much shorter, the pipelining stages in the MAC array with these simplified circuits can be reduced significantly, which can be exploited to reduce the area and power consumption of the MAC array.

The rest of the paper is structured as follows. Section 2 explains the motivation of this work. Section 3 elaborates the details of the proposed encoding-based MAC design. Experimental results are presented in Section 4 and conclusions are drawn in Section 5.

2 Motivation
In DNNs, there are massive numbers of MAC operations. Existing digital hardware platforms use many parallel MAC units, e.g., 65,536 in the systolic array of TPU v1 [5], to accelerate DNNs. The structure of this systolic array is sketched in Fig. 1(a), while the internal structure of a MAC unit is shown in Fig. 1(b). In the systolic array, weights are preloaded and activations are streamed as inputs. The partial sum of a multiplication is propagated along a column to calculate the multiplication result of an input vector and a weight vector. Between rows and columns there are flip-flops. Therefore, the activations are shifted to match the propagation of the partial sums at the MAC units.

In such a MAC unit, the inputs of the multiplier are represented in two's complement to express integer values. The circuit of the multiplier is defined by the truth table which enumerates all the input combinations. For example, Fig. 2(a) shows the truth table of a multiplier with 2-bit signed inputs In1 and In2. The column Trad. Enc. shows the output bit sequences in the two's complement format corresponding to the decimal numbers in the last column of Fig. 2(a). From this truth table, the logic circuit for this multiplier can be synthesized as shown in Fig. 2(b). As the bit width of the input operands increases, the number of rows in the truth table of a multiplier increases exponentially. Since the synthesized circuit must realize all the rows in the truth table exactly, the circuit becomes complicated quickly. For example, an 8-bit signed multiplier can contain 417 combinational logic gates. Though approximate computing [16] can be applied to reduce the logic complexity of multipliers, this technique still uses the two's complement format to represent the multiplication results and does not take advantage of the full potential of MAC units.

The circuit of a multiplier maps the input combinations to the output combinations. In the traditional design, the bit sequences representing the output combinations of a multiplier are predefined in the two's complement format according to the multiplication function, as shown in the Trad. Enc. column in Fig. 2(a). However, if these bit sequences can be adjusted, the new truth table can lead to a multiplier circuit with a lower logic complexity. For example, the New Enc. column in Fig. 2(a) shows another assignment of bit sequences to represent the same output values of the multiplier, where the bit width has been increased from 4 to 5. Since the number of bits at the output of the multiplier has been increased, different bit sequences can represent the same integer value. For example, both 00111 and 01011 in the New Enc. column in Fig. 2(a) represent the same decimal value 2. From this new bit sequence assignment, a much simpler circuit can be generated, as illustrated in Fig. 2(c).

The bit sequence assignment in Fig. 2(a) is called an encoding. The original encoding of the multiplier shown in the column Trad. Enc. is only one of the possible encodings representing the values at the output of the multiplier. Since various encodings lead to different truth tables for the multiplier, they also result in different circuit complexity after logic synthesis. Therefore, exploring the encoding can be an effective technique to obtain a more efficient circuit implementation for the multiplier.
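As a concrete check of this observation, the following minimal Python sketch (our illustration, not part of the paper's design flow) decodes the New Enc. bit sequences of Fig. 2(a) with the position weights s_4 = -4, s_3 = 2, s_2 = 2, s_1 = -1, s_0 = 1 and verifies that every listed row reproduces the exact product of the 2-bit signed operands:

```python
# Sketch: decode bit sequences with position weights and check that they
# reproduce the products of a 2-bit signed multiplier (Fig. 2(a)).
S_NEW = [-4, 2, 2, -1, 1]          # position weights for b4 down to b0

# (In1, In2, New Enc. bits) for the rows listed in Fig. 2(a).
TRUTH_TABLE = [
    ("10", "01", "10111"), ("11", "10", "01011"), ("11", "11", "00001"),
    ("11", "00", "11111"), ("11", "01", "10101"), ("00", "10", "11111"),
    ("00", "11", "11111"), ("00", "00", "11111"), ("00", "01", "11111"),
    ("01", "10", "11011"), ("01", "11", "11001"), ("01", "00", "11111"),
    ("01", "01", "11101"),
]

def signed2(bits: str) -> int:
    """Two's complement value of a 2-bit operand, in [-2, 1]."""
    v = int(bits, 2)
    return v - 4 if v >= 2 else v

def decode(bits: str, weights) -> int:
    """Value represented by a bit sequence: v = sum_i s_i * b_i."""
    return sum(w * int(b) for w, b in zip(weights, bits))

for in1, in2, enc in TRUTH_TABLE:
    assert decode(enc, S_NEW) == signed2(in1) * signed2(in2)
print("all listed rows decode to the exact products")
```

The same check passes for the traditional encoding with M = 4 and the power-of-two weights -8, 4, 2, 1.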
Figure 3: Two samples of logic mapping from input bits to output bits of a multiplier. The position weights evaluated in each sample are shown at the outputs. The resulting RMSE of each sample is illustrated at the bottom. (a) A circuit sample with a large RMSE. (b) A circuit sample with a small RMSE.

Figure 4: A column in a MAC array consists of encoding-based multipliers and the circuit implementing the addition function, which consists of a bit-wise accumulator and a decoder.

3 Encoding-based MAC Design

To identify a new encoding that simplifies the MAC circuits, two challenges should be addressed. First, the number of encodings is huge, up to 2^(M+16) for an 8-bit multiplier with an M-bit output. For each encoding, a circuit should be generated, which leads to a very long search time. Second, the identified encoding should not make the accumulation of the partial sums generated by the redesigned multipliers complicated.
For example, the new encoding shown in the New Enc. column in Fig. 2(a) also defines the input bit combinations of an adder in a MAC operation. However, it is not an easy task to synthesize an efficient circuit for an adder with an arbitrary input encoding.

To allow a simple implementation of accumulation, we impose an additional constraint that the bits in a bit sequence have position weights. For example, we assign position weights s_0, s_1, s_2, s_3, s_4 to the bit sequences in the New Enc. column of Fig. 2(a). Accordingly, a bit sequence b_4 b_3 b_2 b_1 b_0 represents the number Σ_{i=0}^{4} s_i × b_i. These position weights are adjustable for different neural networks to maintain inference accuracy. Compared with the traditional two's complement number system, where the position weights are fixed to power-of-two values, the adjustable position weights provide more flexibility for the implementation of the multipliers and adders.
3.1 Encoding-based Multiplier Design

To determine the encoding and the logic design of a multiplier with a given bit width, e.g., 8 bits, we use only single-level logic, as illustrated in Fig. 2(c). This can decrease the critical path of the circuit effectively while reducing the area. Under this assumption, an output bit of the multiplier is driven by a single logic gate, which takes operand bits of the multiplier as its inputs. In our design, we consider the single-level logic gates SET, IN, NAND2, NAND3, AND2, OR2, NOT, and XOR3, where the SET gate always outputs a high signal '1' to allow a constant bias in the result to approximate the original multiplication function. The IN gate connects an input signal to an output signal without any logic gate on the connection.
Even with the assumption of single-level logic, the search space to generate the logic for the multiplier can still be large, because for every output bit the gate type to drive it should be selected and the inputs of such a gate should be selected from the input bits of the multiplier. To address this issue, we randomly sample the gate types and the connections from the input bits to create circuit samples. In each sample, we obtain a candidate of the circuit for the multiplier. Fig. 3 illustrates two circuit samples, where the bit width of the multiplication results is set to 48 bits. With such a sample, we can generate the output bit sequence for every bit combination of the operands of the multiplier. In other words, we can create a truth table similar to Fig. 2(a) from such a sampled circuit. Each row in this truth table corresponds to an exact value determined by its output bit sequence, as illustrated in the Value column in Fig. 2(a).
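This sampling step can be sketched as follows (a simplified Python illustration under our own assumptions, shown for a 2-bit multiplier with M = 5 instead of the 8-bit, 48-bit configuration used in the paper):

```python
import itertools
import random

# Single-level gate set from above: name -> (arity, Boolean function).
GATES = {
    "SET":   (0, lambda: 1),                 # constant '1' for a bias term
    "IN":    (1, lambda a: a),               # direct connection, no gate
    "NOT":   (1, lambda a: 1 - a),
    "AND2":  (2, lambda a, b: a & b),
    "OR2":   (2, lambda a, b: a | b),
    "NAND2": (2, lambda a, b: 1 - (a & b)),
    "NAND3": (3, lambda a, b, c: 1 - (a & b & c)),
    "XOR3":  (3, lambda a, b, c: a ^ b ^ c),
}

def sample_circuit(n_in: int, n_out: int, rng: random.Random):
    """One circuit sample: every output bit is driven by one randomly
    chosen gate whose inputs are randomly selected operand bits."""
    circuit = []
    for _ in range(n_out):
        name = rng.choice(sorted(GATES))
        arity = GATES[name][0]
        circuit.append((name, [rng.randrange(n_in) for _ in range(arity)]))
    return circuit

def truth_table(circuit, n_in: int):
    """Output bit sequence for every combination of the operand bits."""
    return [[GATES[g][1](*(bits[i] for i in idx)) for g, idx in circuit]
            for bits in itertools.product((0, 1), repeat=n_in)]

rng = random.Random(0)
circuit = sample_circuit(n_in=4, n_out=5, rng=rng)  # two 2-bit operands
B = truth_table(circuit, n_in=4)                    # 16 rows of 5 bits
```

Fitting position weights to each such sample and ranking the samples by the resulting error then yields the multiplier design, as described next.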
In a sampled circuit, assume that the bit sequence with M bits of the kth row in the truth table is expressed as b^k = b^k_{M-1} ... b^k_0, e.g., b_0 = 10111 in the New Enc. column in Fig. 2(a). The value this bit sequence represents can be calculated as Σ_{j=0}^{M-1} b^k_j × s_j, where s_0, ..., s_{M-1} are the position weights whose exact values will be determined later. The difference between the value this bit sequence approximates and the original value v^k is then |Σ_{j=0}^{M-1} b^k_j × s_j - v^k|.

When all the rows in the truth table of a sampled circuit are considered together, we can determine the position weights by minimizing the root mean square error (RMSE) with which the bit sequences approximate the original values of the multiplication results, as

    s = arg min_s ‖Bs − v‖_2        (1)

where B is the matrix of bit sequences derived from a sampled circuit, corresponding to all the rows in the truth table, s is the vector of all the position weights, and v is the vector of all the original values of the multiplier, e.g., the Value column in Fig. 2(a).

After the position weights for a sampled circuit are determined as described above, we can also obtain the RMSE for each sampled circuit. We execute the sampling process up to 10^4 times and track the trend of the RMSE with the increasing number of samples. When the RMSE becomes stable, we stop the sampling process, and the circuit with the minimum RMSE is returned as the circuit design for the multiplier.
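Equation (1) is an ordinary least-squares problem and can be solved in closed form. As a self-contained numerical check (our own sketch, not the authors' code), fitting position weights to the New Enc. rows of Fig. 2(a) recovers exactly s = (-4, 2, 2, -1, 1) with an RMSE of zero, since that encoding is exact:

```python
import numpy as np

# New Enc. bit sequences of Fig. 2(a) (matrix B) and the products they
# must represent (vector v), one entry per row of the truth table.
rows = ["10111", "01011", "00001", "11111", "10101", "11111", "11111",
        "11111", "11111", "11011", "11001", "11111", "11101"]
B = np.array([[int(b) for b in r] for r in rows], dtype=float)
v = np.array([-2, 2, 1, 0, -1, 0, 0, 0, 0, -2, -1, 0, 1], dtype=float)

# Solve s = argmin_s ||B s - v||_2, i.e., Eq. (1).
s, *_ = np.linalg.lstsq(B, v, rcond=None)
rmse = np.sqrt(np.mean((B @ s - v) ** 2))
print(s.round(6), rmse)   # -> [-4.  2.  2. -1.  1.], RMSE ~ 0
```

For a randomly sampled circuit the residual is generally nonzero, and this RMSE is exactly the quantity used to rank the samples.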
The sampling process described above is based on the assumption that the bit width M at the output of the new multiplier is given. To determine the minimum bit width at the output of the new multiplier, a binary search algorithm is used. Initially, the minimum and the maximum bit width are set to 16 and 128 for an 8-bit multiplier, respectively. Afterwards, the middle bit width, 72, is used to execute the sampling process above. The sampled circuit candidate producing the best approximation is returned and the corresponding RMSE can be evaluated. This RMSE is compared with a target RMSE, which is determined by exhaustively evaluating various RMSEs with respect to the inference accuracy of neural networks; the one that can maintain the inference accuracy is selected. If the RMSE of the returned circuit candidate is larger than the target RMSE, the minimum bit width is updated to the middle bit width in the next iteration, and vice versa. The search algorithm terminates when the distance between the minimum and maximum bit width is equal to 0 or 1.
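The search loop itself is a standard binary search; a compact sketch (with best_rmse_for_width standing in for the sampling procedure above, a hypothetical callable) is:

```python
def search_min_bit_width(target_rmse, best_rmse_for_width,
                         lo: int = 16, hi: int = 128) -> int:
    """Binary search for the smallest output bit width whose best sampled
    circuit reaches the target RMSE (bounds as for the 8-bit case)."""
    while hi - lo > 1:                 # stop when the distance is 0 or 1
        mid = (lo + hi) // 2           # e.g., 72 in the first iteration
        if best_rmse_for_width(mid) > target_rmse:
            lo = mid                   # too inaccurate: needs more bits
        else:
            hi = mid                   # accurate enough: try fewer bits
    return hi
```

With the initial bounds 16 and 128, the first probed width is (16 + 128) // 2 = 72, as stated above.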
Table 1: Power and area of proposed vs. traditional MAC arrays

Size of       Bit-Wid. of Product    Power (W)                      Area (mm^2)
Syst. Arr.    Trad.    Prop.         Trad.    Prop.    Red.         Trad.     Prop.    Red.
32×32         16       48            0.181    0.163     9.94%       0.239     0.172    28.03%
48×48         16       48            0.380    0.259    31.84%       0.513     0.268    47.76%
64×64         16       48            0.652    0.404    38.07%       0.891     0.416    53.36%
128×128       16       48            2.464    1.050    57.38%       3.433     1.043    69.61%
256×256       16       48            9.572    2.854    70.18%       13.473    2.744    79.63%

3.2 Adder Design with Encoding

Since the bit sequences at the output of the multiplier do not follow the two's complement number system, we also need to define a new structure to implement the addition function. For the general case of accumulating the M-bit outputs of N multipliers, the sum can be expressed as Σ_{i=1}^{N} Σ_{j=0}^{M-1} s_j × b^i_j = Σ_{j=0}^{M-1} s_j × Σ_{i=1}^{N} b^i_j, where b^i_j is the jth bit of the output of the ith multiplier.
Accordingly, the circuit to implement this sum can be designed as illustrated in Fig. 4. In this implementation, the corresponding bits of the multipliers are accumulated first, and the position weights are multiplied with these accumulation results only once, at the bottom of each column in a MAC array. The results of these multiplication operations are added by an adder tree to generate the data in the two's complement format for further functions, e.g., activation and batch normalization. We call the multipliers for applying position weights together with the adders for generating two's complement numbers a decoder.
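The reordering of the two sums is the key saving: the costly weighting by s_j happens only M times per column instead of N × M times. A small numpy sketch (our illustration) confirms the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 5                          # N multiplier outputs of M bits each
bits = rng.integers(0, 2, (N, M))    # bits[i][j]: jth bit of multiplier i
s = np.array([-4, 2, 2, -1, 1])      # position weights (New Enc. example)

# Naive order: decode every multiplier output, then add the N values.
naive_sum = (bits @ s).sum()

# Column hardware: accumulate each bit position over the N multipliers
# first, then apply the position weights once in the decoder.
bit_counts = bits.sum(axis=0)        # sum_i b_j^i for every position j
decoder_sum = bit_counts @ s         # sum_j s_j * (sum_i b_j^i)

assert naive_sum == decoder_sum
```

In hardware, the per-bit accumulations are narrow adders, and only the decoder at the bottom of each column applies the position weights and produces two's complement results.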
3.3 Design and Application of MAC Array

We deploy the encoding-based multipliers and adders to construct a MAC array with a size of N × N to execute MAC operations in DNNs efficiently, one column of which is illustrated in Fig. 4. At the inputs of each encoding-based multiplier, flip-flops are inserted to allow the reuse of inputs, similar to the traditional systolic array. Different from the traditional systolic array, where each MAC unit has an individual adder, the adders appear only at the bottom of each column and perform the bit-wise weighted accumulation of that column. Another difference is that there are no flip-flops for storing the multiplication results of the multipliers, due to the shorter critical paths inside the encoding-based multipliers.

To execute MAC operations with the encoding-based MAC array, the weights of a neural network are first loaded into the flip-flops of each multiplier. Activations are streamed as inputs. Since there are no flip-flops for storing the intermediate multiplication results in each multiplier, activations are not required to be shifted as in the traditional systolic array. Activations belonging to the inputs of a neuron can enter each column simultaneously, and the results of multiplication and bit-wise accumulation are obtained after each clock cycle.
To enhance the inference accuracy of neural networks executed on the encoding-based MAC array, we further fine-tune the adjustable position weights for specific neural networks. These position weights are initially set to the values determined by minimizing the RMSE with which the bit sequences approximate the original multiplication results. In fine-tuning, the straight-through estimator (STE) [25] is used to propagate gradients through the encoding-based multipliers.
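A minimal PyTorch sketch of this fine-tuning step (our own simplified illustration, not the authors' released code; the shapes and the precomputed bit tensor are assumptions) could look as follows:

```python
import torch

class EncodingMAC(torch.autograd.Function):
    """Forward: bit-wise weighted sum of encoding-based multiplier outputs.
    Backward: straight-through estimator (STE) [25] for the
    non-differentiable bit-level mapping."""

    @staticmethod
    def forward(ctx, x, w, bits, s):
        # bits: (K, M) float tensor of 0/1 output bits obtained by table
        # lookup for K quantized operand pairs taken from x and w.
        ctx.save_for_backward(x, w, bits)
        return bits @ s                  # decode with position weights s

    @staticmethod
    def backward(ctx, grad_out):
        x, w, bits = ctx.saved_tensors
        # s is linear in the output and receives its exact gradient;
        # x and w receive straight-through gradients of an ideal product.
        return grad_out * w, grad_out * x, None, bits.t() @ grad_out
```

Here s is the trainable position-weight vector that the optimizer updates with the network-specific learning rates reported in Section 4.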
The encoding-based MAC array has slightly better throughput and latency than the traditional systolic array of the same size, while achieving a much lower area cost and power consumption. Assume that a weight matrix W with a size of N × N has been loaded into the encoding-based array and into the traditional MAC array, and denote the clock period as T. To finish a computation with an input matrix I_0 with a size of N × N, the latencies of the encoding-based array and the traditional MAC array are (2N − 1)T and (3N − 2)T, respectively. To evaluate the throughput, we assume that m input matrices with sizes of N × N need to be processed by the MAC arrays. The processing times per input matrix of the encoding-based array and the traditional MAC array are then [(2N − 1) + N(m − 1)] × T / m and [(3N − 2) + N(m − 1)] × T / m, respectively. The proposed design exhibits a higher performance, and the throughputs of the two designs become nearly the same as m becomes large.
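For concreteness, evaluating these expressions for N = 256 at a 1 GHz clock (our own numeric sketch):

```python
N, T = 256, 1e-9    # array size and clock period (1 GHz -> T = 1 ns)

def time_per_matrix(first_latency_cycles: int, m: int) -> float:
    """Average processing time per input matrix for m pipelined
    matrices: [(first latency) + N*(m - 1)] * T / m."""
    return (first_latency_cycles + N * (m - 1)) * T / m

for m in (1, 10, 1000):
    enc = time_per_matrix(2 * N - 1, m)     # encoding-based array
    trad = time_per_matrix(3 * N - 2, m)    # traditional systolic array
    print(f"m={m}: encoding {enc:.3e} s, traditional {trad:.3e} s")
# Both values approach N*T = 2.56e-7 s as m grows, matching the text.
```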
The proposed new encoding technique is not limited to simplifying the traditional multiplier in the MAC array. For example, it can process the truth table of multiplication with non-uniform quantization directly, without requiring the conversion of the non-uniform encoding into the two's complement format as required in the traditional design. In such a case, the final hardware design becomes specific to neural networks with the corresponding non-uniform quantization, but the hardware cost can be reduced even further in such application-specific computing scenarios. Since the inputs of the multipliers and the final output of the MAC operations are in the original formats defined by the neural networks, this new design is also compatible with the existing training and inference frameworks of neural networks.

4 Experimental Results

To verify the proposed encoding-based MAC array, we synthesized the encoding-based MAC circuits with the NanGate 15 nm cell library [27]. These MAC circuits approximate the results of uniformly quantized 8-bit MAC units. Such MAC circuits were then used to construct a MAC array similar to Fig. 4. In synthesizing the circuits, the clock frequency was set to 1 GHz. Power and area analysis of this hardware was conducted with Design Compiler from Synopsys. Such a MAC array can efficiently execute neural networks while maintaining a high inference accuracy. To verify this, we tested the accuracy of three neural networks together with the corresponding datasets, ResNet18-Cifar10, ResNet20-Cifar100 and ResNet50-ImageNet, using the new MAC design. These neural networks were initially trained using PyTorch, and pretrained weights were loaded from Torchvision [28] and public repositories on GitHub [29, 30]. The learning rates of the position weights in the novel encoding for fine-tuning ResNet18, ResNet20, and ResNet50 were set to 1e-3, 1e-3, and 1e-8, respectively.
Table 2: Inference accuracy of neural networks executed on the proposed MAC array

Figure 5: The relationship between the bit width of product in the encoding-based multipliers and the inference accuracy of neural networks executed on the proposed MAC array and the power consumption as well as area. (a) Bit width vs. accuracy. (b) Bit width vs. power and area.

Figure 6: The relationship between bit width, number of samples and RMSEs in the binary search algorithm. (a) Bit width vs. RMSEs. (b) Number of samples vs. RMSEs.

Table 1 shows the comparison between the proposed MAC array and the traditional systolic array in power consumption and area cost. Different sizes were used to verify the advantages of the hardware platform, as shown in the first column. The second and the third columns are the bit width of the product, i.e., the multiplication result, in the multipliers of the traditional systolic array and the bit width of the encoding-based multipliers in the proposed MAC design, respectively. The latter is determined by the search algorithm in Section 3.1. Although the bit width of the encoding-based multipliers is larger than that of the traditional ones, the power consumption and the area cost of the MAC array exhibit a significant advantage, because the logic to generate these intermediate bits is much simpler compared with the traditional multiplication. This advantage can be clearly seen from the last six columns in Table 1, where the columns Trad. and Prop. show the power consumption and area of the traditional MAC design and the proposed design, respectively, and the columns Red. show the ratios of reduction. Besides, with decreasing size of the MAC array, the reduction of power consumption and area becomes smaller. This phenomenon results from the fact that the bit-wise accumulators and decoders at the bottom of the columns incur additional area cost and thus power consumption. When the MAC array has a small size, this incurred area and power cost contributes much to the total cost.

The proposed new MAC design can execute neural networks with high inference accuracy while consuming less power. Table 2 demonstrates the inference accuracy of neural networks with three different quantization settings, namely, 8-bit uniform quantization, 4-bit non-uniform quantization for all layers [26], and 4-bit non-uniform quantization with the first and last layers using 8-bit uniform quantization [26]. For the 8-bit uniform quantization, the corresponding weights and input activations can be processed directly with the MAC array. For the two settings of non-uniform quantization, the non-uniform levels are first converted to the closest levels in 8-bit uniform quantization, and then such 8-bit uniform quantization levels are used for the MAC operations.

The second column of Table 2 is the inference accuracy evaluated with 32-bit floating-point weights and input activations at software level. The inference accuracy of neural networks with 8-bit uniform quantization and 4-bit non-uniform quantizations at software level is shown in the third, sixth and ninth columns. According to these columns, it is clear that 8-bit uniform quantization can nearly maintain the inference accuracy of floating-point data. The first setting of 4-bit non-uniform quantization for all layers leads to a relatively large accuracy degradation due to the limited data representation, while the second setting, including 8-bit quantization in the first and the last layers, can achieve a better accuracy. The inference accuracy of neural networks executed on the proposed MAC design is shown in the fourth, seventh and tenth columns in Table 2. Due to the approximation in the proposed encoding technique, there is a slight accuracy loss compared with that at software level, as shown in the fifth, eighth and eleventh columns.

To determine the minimum bit width of the product in the encoding-based multipliers, a binary search algorithm is applied. The results of this search are illustrated in Fig. 5. Fig. 5(a) shows the relationship between the bit widths and the inference accuracy of neural networks. Fig. 5(b) shows the relationship between the bit widths and the power consumption as well as the area cost. According to Fig. 5(a), the inference accuracy becomes higher with increasing bit width of the product and stays stable around 48 bits. Accordingly, 48 bits were used for the output bit width of the product. In Fig. 5(b), power consumption and area cost increase very slowly when the bit width of the product grows larger, and there is no linear relationship between the bit width and the power consumption as well as the area, due to the randomness in the search algorithm and the logic simplification in synthesis.

In the binary search algorithm to determine the minimum bit width of the product, a target RMSE was used as guidance, which was set to the one that can maintain the inference accuracy of neural networks. The results of the search process are illustrated in Fig. 6(a). According to this figure, after several search iterations, the RMSE becomes stable, and the minimum bit width that can achieve an RMSE smaller than the target RMSE, which is 48 bits, was selected. For a specified bit width, a given number of encoding samples, 10^4, were used to generate the encoding-based multipliers and then determine their RMSEs. To verify this, different numbers of samples were used to evaluate the RMSEs, as shown in Fig. 6(b).
Figure 7: The relationship between the bit width of product and the inference accuracy of neural networks executed on the task-specific hardware platforms and the power consumption as well as area. Gen.-Pur. indicates the encoding-based MAC array aiming to execute various neural networks. (a) Bit width vs. inference accuracy. (b)(c)(d) Bit width vs. power and area for different neural networks.
According to Fig. 6(b), with the increasing number of samples, the RMSE is reduced and becomes stable when the sample number reaches 10^4.

The proposed encoding technique can benefit task-specific hardware platforms even more in the reduction of power consumption and area than the general-purpose encoding-based MAC array, because the truth table of the multiplier after non-uniform quantization can be used directly to search for a new efficient multiplier design. The conversion of the non-uniform quantization into the 8-bit two's complement encoding required in the traditional computing system can thus be avoided. To verify this advantage, we first trained a specific neural network with 4-bit non-uniform quantization in all layers. Afterwards, we applied the proposed binary search algorithm to determine the minimum bit width of the product for the encoding-based multipliers. The multipliers were then used to construct a MAC array with a size of 256 × 256 that is designed specifically to execute a given neural network.
The results are illustrated in Fig. 7. Fig. 7(a) shows that in the search of the bit width for a specific neural network, e.g., ResNet18, the inference accuracy of this neural network improves as the bit width increases. When the bit width is around 31 bits, the inference accuracy becomes stable for ResNet18. This bit width is much smaller than the bit width for the 8-bit multiplier, which requires 48 bits to represent more information in the computation results. The relationship between the bit width and the power consumption as well as the area for ResNet18 is shown in Fig. 7(b). It can be observed that the power consumption and area of the task-specific design are smaller than those of the 8-bit MAC design. Similar results for ResNet20 and ResNet50 are shown in Fig. 7(c)(d).

5 Conclusion

In this paper, we propose a novel digital MAC design based on encoding. With this technique, the complex logic in traditional multipliers can be replaced with single-level logic to reduce the critical path and area significantly. The position weights allow a bit-wise weighted accumulation to calculate the outputs of neurons, and the pipelining stages between rows of multipliers can be reduced to lower the area cost further. With this new design, the area and power consumption of the MAC array can be reduced by up to 79.63% and 70.18%, respectively, compared with the traditional design, while the inference accuracy is still maintained. Future work will study the tradeoff between the bit width at the output of the multipliers and the complexity of the single-level as well as multi-level logic implementation in the simplified multipliers.
References

[1] https://openai.com/blog/chatgpt.
[2] T. Brown et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1877–1901.
[3] D. Patterson et al., "The carbon footprint of machine learning training will plateau, then shrink," Computer, vol. 55, no. 7, pp. 18–28, 2022.
[4] https://www.eia.gov/tools/faqs/faq.php?id=97&t=3.
[5] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in International Symposium on Computer Architecture (ISCA), 2017.
[6] Y. Chen, T. Chen, Z. Xu, N. Sun, and O. Temam, "DianNao family: Energy-efficient hardware accelerators for machine learning," Communications of the ACM, vol. 59, no. 11, pp. 105–112, 2016.
[7] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits (JSSC), vol. 52, no. 1, pp. 127–138, 2017.
[8] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in International Conference on Learning Representations (ICLR), 2016.
[9] T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang, "Pruning and quantization for deep neural network acceleration: A survey," Neurocomputing, vol. 461, pp. 370–403, 2021.
[10] M. Jiang, J. Wang, A. Eldebiky, X. Yin, C. Zhuo, I.-C. Lin, and G. L. Zhang, "Class-aware pruning for efficient neural networks," in Design, Automation and Test in Europe (DATE), 2024.
[11] R. Petri, G. L. Zhang, Y. Chen, U. Schlichtmann, and B. Li, "PowerPruning: Selecting weights and activations for power-efficient neural network acceleration," in Design Automation Conference (DAC), 2023.
[12] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," Neural Information Processing Systems (NeurIPS), 2014.
[13] Y. Han, G. Huang, S. Song, L. Yang, H. Wang, and Y. Wang, "Dynamic neural networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 44, pp. 7436–7456, 2021.
[14] J. Wang, B. Li, and G. L. Zhang, "Early-exit with class exclusion for efficient inference of neural networks," in International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2024.
[15] M. Wistuba, A. Rawat, and T. Pedapati, "A survey on neural architecture search," arXiv, 2019.
[16] G. Armeniakos, G. Zervakis, D. J. Soudris, and J. Henkel, "Hardware approximate techniques for deep neural network accelerators: A survey," ACM Computing Surveys, vol. 55, pp. 1–36, 2022.
[17] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, "A survey of quantization methods for efficient neural network inference," arXiv, 2021.
[18] W. Sun, G. L. Zhang, H. Gu, B. Li, and U. Schlichtmann, "Class-based quantization for neural networks," in Design, Automation and Test in Europe (DATE), 2023.
[19] N. D. Gundi, T. Shabanian, P. Basu, P. Pandey, S. Roy, K. Chakraborty, and Z. Zhang, "EFFORT: Enhancing energy efficiency and error resilience of a near-threshold tensor processing unit," in Asia and South Pacific Design Automation Conference (ASP-DAC), 2020, pp. 241–246.
[20] P. Pandey, N. D. Gundi, K. Chakraborty, and S. Roy, "UPTPU: Improving energy efficiency of a tensor processing unit through underutilization-based power-gating," in Design Automation Conference (DAC), 2021, pp. 325–330.
[21] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI," in International Solid-State Circuits Conference (ISSCC), 2017, pp. 246–247.
[22] J. Nunez-Yanez, "Energy proportional neural network inference with adaptive voltage and frequency scaling," IEEE Transactions on Computers (TC), vol. 68, no. 5, pp. 676–687, 2019.
[23] D. Miyashita, E. H. Lee, and B. Murmann, "Convolutional neural networks using logarithmic data representation," arXiv, 2016.
[24] M. Valueva, N. Nagornov, P. Lyakhov, G. Valuev, and N. Chervyakov, "Application of the residue number system to reduce hardware costs of the convolutional neural network implementation," Mathematics and Computers in Simulation, vol. 177, pp. 232–243, 2020.
[25] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv, 2013.
[26] M. Cho, K. Alizadeh-Vahid, S. Adya, and M. Rastegari, "DKM: Differentiable k-means clustering layer for neural network compression," in International Conference on Learning Representations (ICLR), 2021.
[27] "15nm Open-Cell Library and 45nm FreePDK," https://si2.org/open-cell-library/.
[28] "Pretrained ImageNet models," https://pytorch.org/vision/stable/models.html.
[29] "Pretrained Cifar10 models," https://github.com/huyvnphan/PyTorch_CIFAR10.
[30] "Pretrained Cifar100 models," https://github.com/weiaicunzai/pytorch-cifar100.