Weight-Oriented Approximation for Energy-Efficient Neural Network Inference Accelerators
Abstract— Current research in the area of Neural Networks (NN) has resulted in performance advancements for a variety of complex problems. Especially, embedded system applications rely more and more on the utilization of convolutional NNs to provide services such as image/audio classification and object detection. The core arithmetic computation performed during NN inference is the multiply-accumulate (MAC) operation. In order to meet tighter and tighter throughput constraints, NN accelerators integrate thousands of MAC units, resulting in a significant increase in power consumption. Approximate computing is established as a design alternative to improve the efficiency of computing systems by trading computational accuracy for high energy savings. In this work, we bring approximate computing principles and NN inference together by designing NN-specific approximate multipliers that feature multiple accuracy levels at run-time. We propose a time-efficient automated framework for mapping the NN weights to the accuracy levels of the approximate reconfigurable accelerator. The proposed weight-oriented approximation mapping is able to satisfy tight accuracy loss thresholds, while significantly reducing energy consumption without any need for intensive NN retraining. Our approach is evaluated against several NNs demonstrating that it delivers high energy savings (17.8% on average) with a minimal loss in inference accuracy (0.5%).

Index Terms— Approximate computing, neural network inference, low-power, reconfigurable approximate multipliers.

I. INTRODUCTION

WITH the recent and rapid advancements in the area of artificial intelligence, machine learning has become the driving force both in general purpose and embedded computing domains. Especially, embedded system applications rely more and more on the integration of Neural Networks (NNs) in order to provide more sophisticated services and enhance user experience. As embedded devices are generally characterized by limited computing capabilities and they are also energy constrained, custom hardware accelerators prevail as a solution to the accuracy-throughput trade-off. The core arithmetic operation performed by NNs during inference is the multiply-accumulate (MAC) operation. Particularly, the convolution and fully connected layers of NNs perform millions of multiplications and additions [1]. However, as state-of-the-art NNs are becoming deeper and more complex, such hardware accelerators integrate thousands or even more MAC units [2] in order to keep up with the required throughput. For example, the cloud-oriented tensor processing unit (TPU) integrates 64K MACs [3], while the embedded-oriented Samsung NPU uses 1K MACs [2] and Google's Edge TPU comprises 4K MACs [4]. However, this vast number of MAC units results in a significant increase in energy consumption, greatly affecting the integration in energy-constrained embedded devices.

According to the principle of approximate computing, modern systems can trade off computation accuracy in order to reduce both execution time and power consumption [5]–[20]. A wide variety of modern applications, such as digital signal processing, image processing, video analytics, and wireless communications, support this principle and are good candidates for approximation, belonging to the Recognition, Mining, and Synthesis (RMS) application class [5].

Particularly, previous research works [20], [21] revealed that NNs feature increased error resilience. At the implementation level, approximate computing is performed with the design of approximate circuits targeting additions [6]–[9] and multiplications [9]–[12]. Such approximate circuits consume significantly less energy at the cost of reduced output quality. Since different applications have different error tolerance, the design of customized approximate circuits per application is challenging [13]–[15]. Most approximate circuits are generated having fixed approximation and they require re-design for different applications [6]–[8], [10]–[12]. To accelerate the whole process, automation frameworks for designing approximate circuits have been proposed in the past [16]–[19]. However, these methods also apply fixed approximation and the generated circuits do not support input-adaptive run-time reconfigurability.

Manuscript received May 8, 2020; revised July 24, 2020; accepted August 16, 2020. Date of publication September 4, 2020; date of current version December 1, 2020. This work was partially funded by the German Research Foundation (DFG) through the Project Approximate Computing aCROss the System Stack (ACCROSS). This article was recommended by Associate Editor W. Liu. (Zois-Gerasimos Tasoulas and Georgios Zervakis contributed equally to this work.) (Corresponding author: Zois-Gerasimos Tasoulas.)
Zois-Gerasimos Tasoulas and Iraklis Anagnostopoulos are with the Department of Electrical and Computer Engineering, Southern Illinois University, Carbondale, IL 62901 USA (e-mail: [email protected]; [email protected]).
Georgios Zervakis, Hussam Amrouch, and Jörg Henkel are with the Chair for Embedded Systems, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2020.3019460
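The MAC operation referenced above is the building block of both convolution and fully connected layers. As a minimal illustration (plain Python, independent of any particular accelerator), one convolution output element reduces to a bias plus an accumulated chain of weight-activation products:

```python
# Minimal sketch: one convolution output element as a chain of
# multiply-accumulate (MAC) operations, Y = B + sum_j(W_j * A_j).
def mac_convolution(weights, activations, bias):
    acc = bias
    for w, a in zip(weights, activations):
        acc += w * a  # one MAC operation per weight/activation pair
    return acc

y = mac_convolution([2, -1, 3], [4, 5, 6], bias=1)  # 1 + 8 - 5 + 18 = 22
```

A hardware MAC array performs exactly these multiply-then-accumulate steps in parallel, which is why the multiplier is the natural target for approximation.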
1549-8328 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
TASOULAS et al.: WEIGHT-ORIENTED APPROXIMATION FOR ENERGY-EFFICIENT NN INFERENCE ACCELERATORS 4671
Given the inherent error resilience of NNs [20] and the advantages of approximate computing, the design of approximate multipliers and inference accelerators has gained significant research interest in the past few years. Design methodologies focus on the utilization of approximate arithmetic units (adders and/or multipliers) or memories in order to reduce energy consumption. However, the utilization of approximate components in MAC arrays during NN inference is challenging, as modern NNs are very deep, approximation error is input dependent [15], and layers differ in error resilience [22]. For example, ResNet-50 [1] employs approximately 120M multiplications, making an exhaustive search to identify which multiplications should be approximated in order to keep accuracy high impossible. Previous approaches analyse the impact of error in NNs [22], use fixed approximation multipliers [20], [21], or follow a layer-based approximation, in which they try to find the right approximation level for each convolution layer separately [23]. However, such methods require NN retraining or a layer optimization phase, which can be time-consuming for deep NNs. Additionally, all the aforementioned approaches consider fixed error in the utilized approximate circuits without any support for input-adaptive run-time management.

In this article, we present a bottom-up design methodology for enabling adaptive and run-time approximation for NN inference accelerators. Our goal is to support weight-oriented fine-grain approximation in order to reduce energy consumption while controlling the accuracy loss and without requiring NN retraining. The contributions of the article are as follows:

• NN-oriented reconfigurable approximate multipliers: We design a convolution-specific approximate reconfigurable multiplier. Its design is driven by the error variance instead of the mean relative or absolute error that existing methods use. Building on prior art, we design LVRM (Low-Variance Reconfigurable Multiplier), an approximate reconfigurable multiplier that supports exact operation (LVRM0) and two approximate modes (LVRM1 and LVRM2).

• A time-efficient methodology for mapping the different approximation modes based on the weight values of the NN: We present a weight-oriented methodology that, given an inference accelerator comprising approximate reconfigurable multipliers, decides which approximation mode (i.e., LVRM0, LVRM1, or LVRM2) will be used for each weight value of each layer of the NN, such that the final accuracy of the NN during inference satisfies a user-provided error threshold and the energy consumption is minimized. Our method offers fine-grain optimization compared to existing coarse-grain layer-wise or fixed approximation approaches.

• No NN retraining is required: Given a trained NN, the proposed framework selects an approximate mode for each weight (i.e., maps the approximate accelerator to the NN) without requiring time-consuming retraining (i.e., mapping the NN to the approximate accelerator). Moreover, we present how the biases can be used to compensate the error induced by the approximate multiplications, thus achieving a zero-cost error correction (no additional hardware or retraining).

The developed methodology was evaluated, using four datasets [24]–[26], on the Mobilenet-v2 [27], several ResNet [1], and VGG [28] NNs. Experimental results showed an average 17.8% energy reduction with only 0.5% accuracy loss. Our weight-oriented approach outperforms layer-wise and fixed approximation state-of-the-art methods [20], [23], [29].

II. RELATED WORK

In order to achieve accuracy reconfiguration during NN inference, the authors in [23] proposed a heterogeneous architecture built upon several static approximate multipliers [30]. Specifically, they apply a layer-wise approximation at run-time and power-gate any approximate multipliers that are not used. However, this approach requires a heterogeneous architecture design and weight tuning, and it also has a high area overhead, resulting in throughput loss due to the underutilized hardware. In [29], Simulated Annealing is used to produce approximate reconfigurable multipliers for NN inference by combining gate-level pruning [18] and wire-by-switch replacement [31]. Nevertheless, similar to [23], the generated approximate multipliers are optimized for the Mean Relative Error (MRE) metric and they apply only layer-wise approximation, limiting the potential benefits. In [32], approximate reconfigurable circuits are generated using wire-by-switch replacement and by identifying closed logic island regions. The authors in [33] used reconfigurable bloom filters in order to support approximate layer-based pattern matching. In this article, we follow a more fine-grain approach by deciding the approximation level based on the weight values of the NN, while the approximate multipliers are optimized for low dispersion, which is a more suitable metric for NN inference, as presented in Section III-A. Previous methodologies have also tried to control the accuracy of the approximations at run-time by enabling reconfiguration [31], [34]–[38]. Particularly, the methods in [34]–[36] apply power gating to achieve reconfiguration. Considering that thousands of MACs are integrated in NN accelerators, such a fine-grained power-gating approach is inefficient. Reference [31] splits the addition into small sub-adders, and multiplexers select between the exact and predicted carries, inducing considerable delay overhead. In [37], Cartesian genetic programming and clock gating are used to generate approximate multipliers that feature significant area overhead and high error value, thus being unsuitable for deep NN inference. In [38], a synthesis framework for approximate reconfigurable circuits is proposed, but it targets delay and not power efficiency. In this article, the proposed multipliers support an exact and two approximation levels and, along with the heuristic algorithm for weight mapping, we are able to control the introduced error for any NN, regardless of its size. In [39], the authors presented a NN accelerator which integrates approximate multipliers along with a compensation module in order to reduce energy consumption. Similarly, the authors in [21] analyzed the impact of error in NNs by utilizing approximate multipliers in different convolutional layers. However, the latter two approaches considered the LeNet NN, which is shallow compared to current state-of-the-art architectures, and the developed multipliers offer a single level of approximation, thus not being flexible enough for deeper NNs.
4672 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 67, NO. 12, DECEMBER 2020
Fig. 3. Abstract overview of a systolic MAC array similar to the one used in Google TPU [3].

ε(W, A) = W × A − (W × A)|approximate,    (2)

and the approximate convolution output Y' is given by:

Y' = B + Σ_{j=1}^{k} (W_j × A_j − ε(W_j, A_j))
   = B − Σ_{j=1}^{k} ε(W_j, A_j) + Σ_{j=1}^{k} W_j × A_j.    (3)

The error value of an approximate multiplier can be viewed as a random variable defined by its mean value μ(ε) and its variance Var(ε) [40]. Hence, if the error of the approximate multiplier is systematic, its mean value μ(ε) can be compensated by a constant correction term [40]. During the inference phase, for each filter, the weights are constant while the activations flow through the accelerator (MAC array) to perform the convolution [3]. We define μ(ε_W) and Var(ε_W) as the mean error and the variance of performing the multiplication W × A, ∀A, i.e., multiplying a fixed weight W with any input A. Therefore, using the bias value we can compensate the error induced by the employed approximate multiplication. To achieve this, the bias B is replaced by B' as Equation 4 shows:

B' = B + Σ_{j=1}^{k} μ(ε_{W_j}).    (4)

Hence, using the bias update, Equation 3 is written as:

Y' = B' − Σ_{j=1}^{k} ε(W_j, A_j) + Σ_{j=1}^{k} W_j × A_j
   = B + Σ_{j=1}^{k} μ(ε_{W_j}) − Σ_{j=1}^{k} ε(W_j, A_j) + Σ_{j=1}^{k} W_j × A_j.    (5)

Thus, the error ε_Y of the output Y is given by:

ε_Y = Y − Y'
    = B + Σ_{j=1}^{k} W_j × A_j − B − Σ_{j=1}^{k} μ(ε_{W_j}) + Σ_{j=1}^{k} ε(W_j, A_j) − Σ_{j=1}^{k} W_j × A_j
    = Σ_{j=1}^{k} ε(W_j, A_j) − Σ_{j=1}^{k} μ(ε_{W_j}).    (6)

Hence, the mean value μ(ε_Y) of ε_Y, ∀A_j, is given by:

μ(ε_Y) = μ( Σ_{j=1}^{k} ε(W_j, A_j) − Σ_{j=1}^{k} μ(ε_{W_j}) )
       = Σ_{j=1}^{k} μ(ε(W_j, A_j)) − Σ_{j=1}^{k} μ(ε_{W_j})
       = Σ_{j=1}^{k} μ(ε_{W_j}) − Σ_{j=1}^{k} μ(ε_{W_j}) = 0,    (7)

and the variance Var(ε_Y) of ε_Y, ∀A_j, is given by:

Var(ε_Y) = Var( Σ_{j=1}^{k} ε(W_j, A_j) − Σ_{j=1}^{k} μ(ε_{W_j}) )
         = Σ_{j=1}^{k} Var(ε(W_j, A_j)) + Var( Σ_{j=1}^{k} μ(ε_{W_j}) )
         = Σ_{j=1}^{k} Var(ε_{W_j}).    (8)

Note that in Equation 8, the error terms ε(W_j, A_j) are independent variables (i.e., the error of one multiplication does not depend on the error of another multiplication) and thus their covariance is zero, while the second Var term vanishes since Σ_{j=1}^{k} μ(ε_{W_j}) is a constant. As a result, when using our error correction through bias update (Equation 4), the error value ε_Y of the convolution output Y features:

μ(ε_Y) = 0  and  Var(ε_Y) = Σ_{j=1}^{k} Var(ε_{W_j}).    (9)

Hence, leveraging our bias update, the error value of the output Y is defined only by Var(ε_W), ∀W. Therefore, the accuracy of the convolution is only subject to the error variance of the approximate multiplier (Var(ε_W), ∀W). In other words, approximate multipliers with low dispersion will barely impact the output quality. Qualitatively, we use the mean error μ(ε_W) (i.e., the expected error value) of each approximate multiplier to correct the generated error. In that way, if the variance Var(ε_W) is small, the multiplication errors are concentrated close to the mean value, boosting the efficiency of our correction.

As Equations 1-9 demonstrate, the proposed error correction efficiently nullifies the mean error of the convolution. Additionally, by combining it with approximate multipliers of low dispersion (low variance), we can deliver very high accuracy at the convolution level. What is important to note is that our error correction requires only updating the biases after quantization. Therefore, it comes at zero cost, since it does not require any retraining or additional hardware.
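The derivation above can be checked numerically. The sketch below uses a toy approximate multiplier (truncation of the two product LSBs, an assumption for illustration only, not the paper's LVRM): folding Σ μ(ε_W) into the bias, as in Equation 4, drives the mean convolution error to zero, leaving only the variance term of Equation 9:

```python
import random

# Toy approximate multiplier (illustrative only): truncate the two LSBs of
# the product. Its error eps(W, A) = W*A - approx(W, A) is >= 0 and systematic.
def approx_mul(w, a):
    return (w * a) & ~0x3

def mean_error(w, activations):
    """mu(eps_W): mean error of multiplying a fixed weight w with any input A."""
    return sum(w * a - approx_mul(w, a) for a in activations) / len(activations)

acts_domain = range(256)            # all 8-bit activations
weights = [3, 7, 11, 13]            # fixed filter weights
bias = 5
mu = {w: mean_error(w, acts_domain) for w in weights}
bias_corrected = bias + sum(mu.values())   # Equation 4: B' = B + sum mu(eps_W)

random.seed(0)
errors = []
for _ in range(10_000):
    acts = [random.randrange(256) for _ in weights]
    exact = bias + sum(w * a for w, a in zip(weights, acts))
    approx = bias_corrected + sum(approx_mul(w, a) for w, a in zip(weights, acts))
    errors.append(exact - approx)  # eps_Y per Equation 6

mean_err = sum(errors) / len(errors)
print(round(mean_err, 3))  # close to 0 thanks to the bias update
```

Without the bias update the mean error would equal Σ μ(ε_W) (a few units here); with it, only the zero-mean dispersion of Equation 9 remains.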
2) Low-Variance Reconfigurable Multiplier Design: Considering the above analysis, we utilize the widely used approximation technique wire-by-switch replacement [29], [31], [32] in order to generate approximate reconfigurable multipliers that can replace the exact ones of a NN inference accelerator. The wire-by-switch approximation technique replaces a wire by a switch that selects between the wire's value and an approximate one. When the switch is off, the output of the switch equals the wire's value and exact results are obtained. When the switch is on, the output of the switch equals the approximate value (e.g., '0' or '1'). This technique is widely used to boost performance by limiting the carry propagation and/or to decrease the power consumption by limiting the circuit's switching activity [29], [31], [32]. A control signal is used to turn the switches on/off and select between exact and approximate execution. Therefore, by using different control signals to enable different switches, we can achieve multiple varying accuracy levels [29]. In our work, we consider three accuracy levels, while a 2-bit input control signal is required to select the desired accuracy level (i.e., enable the respective switches of each level). When the control signal is set to 2'b00, all the switches are off and exact computations are performed. When it is 2'b01, the switches of the first approximate level are turned on. Similarly, when it is 2'b11, the switches of the second approximate level are turned on. To apply wire-by-switch replacement we need to identify which wires will be replaced by switches, as well as the respective approximate value ('0' or '1') that will be used in the replacement. In [29], Simulated Annealing combined with a high-level power estimator is used to identify the wires to be approximated. In [32], the authors aim to identify logic isolation islands, i.e., closed logic cones, and apply wire-by-switch replacement. However, such heuristics do not guarantee power optimality. In our work, since we only need to generate an approximate reconfigurable 8-bit multiplier (i.e., a small circuit), we employ an exhaustive design space exploration to ensure power optimality. In addition, [29] utilizes the mean relative error (MRE) as a quality function to generate approximate reconfigurable multipliers. For each accuracy level i, with error constraint E_i, [29] identifies the approximations that satisfy:

average( |ε(W, A)| / (W × A) ) ≤ E_i.    (10)

Nevertheless, since the relative error is not an additive metric, MRE is not a representative error metric when targeting a convolution operation, as the latter requires the summation of a large number of products. Therefore, we replace the quality function of [29] with Equation 11:

∀i, W:  Var(ε_W) ≤ V_i
        p(ε_W > 0) ≥ 0.80 ∨ p(ε_W > 0) = 0,    (11)

where V_i is the variance constraint at each accuracy level i and p(ε_W > 0) is the probability that the multiplication W × A, ∀A, produces an erroneous output. By setting p(ε_W > 0) ≥ 0.80, we aim at generating an approximate multiplier with a high error rate (i.e., systematic error) in order to exploit Equation 4. Note that, in contrast to [29], we specify that the variance constraint must be satisfied ∀W. We consider three accuracy levels with V_i bounds 0, 8², and 20², respectively, where 0 refers to exact operation. Due to such tight variance constraints, the obtained solutions featured moderate energy reduction. Hence, we relax the variance constraint Var(ε_W) ≤ V_i and replace it by:

p( Var(ε_W) ≤ V_i ) ≥ 0.90.    (12)

In other words, we specify that at least 90% of the weights have to satisfy the variance constraint. Then, at run-time, the 10% of the weights that violate the variance constraint may be mapped to exact operation. The variance constraints (8² and 20²) were selected after we evaluated all the Pareto-optimal multipliers in [30] and examined how their error variance impacts the final accuracy of different NNs (see Section IV). Additionally, they provide a good balance between error variance and energy gain (as shown later). The constraint 8² is very tight, delivering high accuracy, but also leaves limited room for approximations. The 20² constraint is more relaxed and thus, although it results in a lower accuracy, it allows for more approximation and delivers higher energy gains. Finally, considering that the 8-bit multiplier is a small circuit, supporting more accuracy levels would limit the delivered energy savings, since the latter highly depend on the circuit characteristics (e.g., size and activity) and the number of induced switches [29].

Using the proposed quality function and the 8-bit exact multiplier of [30] as our baseline circuit, we generate an approximate reconfigurable multiplier with three accuracy levels, named LVRM, that satisfies the aforementioned variance constraints. Our exhaustive design space exploration to generate LVRM operates as follows:
i) We synthesize the exact multiplier at its critical path delay using Synopsys DesignCompiler targeting the 7nm FinFET library [41] and obtain its gate-level netlist.
ii) Since we target low error variance, we extract from the netlist only the wires that affect the eight least significant bits of the multiplier.
iii) We generate all the possible configurations where each of the wires extracted in the previous step is either not modified (exact), or replaced by '0' or '1'.
iv) We simulate all the configurations of the previous step to calculate their output error. To speed up the error evaluation, we use a Verilog-to-C converter [29] to run all the simulations at C-level. All the possible input combinations (65,536 in total) are considered in the simulations.
v) For each accuracy level, we keep only the configurations that satisfy the respective error bound (variance bound) of the quality function (Equations 11-12).
vi) Using the obtained configurations of the previous step, we apply wire-by-switch replacement and synthesize the generated netlists at the critical path delay of the exact multiplier. After synthesis, we obtain the corresponding approximate reconfigurable multipliers. To obtain an efficient control circuitry, we follow the approach of [29] and constrain the configurations selected for the first accuracy level to be a subset of the configurations selected
Fig. 4. The (a) error variance and (b) energy reduction of LVRM at the two approximate modes (8² and 20² variance constraints) with respect to the weight values (8-bit).
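The quality check applied in steps iv-v (Equations 11-12) can be sketched in a few lines. For illustration, the candidate "configuration" below is a simple k-LSB truncation of the product rather than an actual wire-by-switch netlist; the check itself, per-weight error variance over all 256 activations, accepted if at least 90% of the weights meet the bound V_i, follows Equation 12:

```python
# Sketch of the Equation 11-12 check. The candidate approximation here is a
# k-LSB product truncation (a stand-in for a real wire-by-switch netlist).
def truncated_mul(w, a, k=3):
    return (w * a) >> k << k

def var_of_error(w, approx, acts=range(256)):
    """Var(eps_W): variance of w*A - approx(w, A) over all 8-bit inputs A."""
    errs = [w * a - approx(w, a) for a in acts]
    mu = sum(errs) / len(errs)
    return sum((e - mu) ** 2 for e in errs) / len(errs)

def satisfies_relaxed_constraint(approx, v_bound, weights=range(256), p=0.90):
    """Equation 12: at least a fraction p of weights must satisfy Var <= V_i."""
    ws = list(weights)
    ok = sum(1 for w in ws if var_of_error(w, approx) <= v_bound)
    return ok / len(ws) >= p

accepted = satisfies_relaxed_constraint(truncated_mul, v_bound=20 ** 2)
```

In the real flow this predicate filters the enumerated wire configurations per accuracy level; at run-time, the few weights violating Var(ε_W) ≤ V_i can simply be mapped to the exact mode LVRM0.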
Fig. 8. Energy savings comparison of different methods that utilize approximate multipliers for the CIFAR-10 dataset.
Fig. 9. Energy savings comparison of different methods that utilize approximate multipliers for the CIFAR-100 dataset.
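The per-layer energy model used in the comparisons of this section (Section V-B2: multiplications per accuracy level times the average per-multiplication energy, against an all-exact baseline) can be sketched as follows; the energy values are made-up placeholders, not the paper's measured numbers:

```python
# Sketch of the Section V-B2 energy model. Per-multiplication energies are
# illustrative placeholders (arbitrary units), not measured values.
AVG_ENERGY = {"LVRM0": 1.00, "LVRM1": 0.80, "LVRM2": 0.65}
EXACT_ENERGY = 1.00

def layer_energy(mult_counts):
    """mult_counts: {mode: number of multiplications executed in that mode}."""
    return sum(AVG_ENERGY[mode] * n for mode, n in mult_counts.items())

def energy_savings(mult_counts):
    baseline = EXACT_ENERGY * sum(mult_counts.values())  # all-exact reference
    return 1.0 - layer_energy(mult_counts) / baseline

counts = {"LVRM0": 2_000, "LVRM1": 5_000, "LVRM2": 3_000}
savings = energy_savings(counts)  # fraction saved vs. the exact multiplier
```

With these placeholder numbers the example layer saves 20.5%; the reported gains below are computed the same way but with post-synthesis energies per accuracy level.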
proposed approximate reconfigurable accelerator. Therefore, for the fairness of the evaluation, we consider a homogeneous architecture for ALWANN [23] and thus, the same approximate multiplier type is used in all the convolution layers. Nevertheless, we consider all the 32 Pareto-optimal approximate multipliers of [30]. Finally, note that for our analysis, we also evaluated the novel alphabet multipliers proposed in [20]. However, they require NN retraining. For the fairness of the comparisons we did not perform retraining and thus, alphabet [20] delivered poor accuracy results, below the examined thresholds. Hence, it is not included in our analysis.

B. Experimental Setup

1) Accuracy Evaluation: Initially, all the examined NNs were trained on the aforementioned datasets using the Tensorflow machine learning library. Then, the NNs were frozen and quantized to 8-bit. In order to evaluate the accuracy of the examined approximate methods, i.e., M2-M6, all the approximate multipliers were described in C. Note that all the evaluated approximate multipliers (LVRM, [29], [30]) apply functional approximation and thus, they can be seamlessly represented in C. Next, we overrode the convolution layers in Tensorflow and replaced the exact multiplications with the respective approximate ones [23]. Finally, for each quantized model and approximate method, we executed the inference phase and captured the delivered accuracy.

2) Energy Evaluation: In terms of energy savings, we evaluated the energy gains originating from the multiplication operations of the convolution layers. The energy required for the multiplications of a layer is estimated by the number of multiplications in that layer multiplied by the average energy of the used multiplier [23], [29]. As the baseline energy consumption, we consider the energy consumed when using the exact multiplier. For the approximate reconfigurable multipliers (i.e., LVRM and [29]) we estimate the total energy consumption by the sum of the average energy of each accuracy level multiplied by the number of multiplications performed in the respective accuracy level. To capture the average energy consumption of the examined approximate multipliers (i.e., LVRM, [29], [30]), the same procedure as in Section III-A is followed. All the multipliers are described in Verilog RTL, synthesized at the critical path delay of the exact multiplier [30] (i.e., same performance) using DesignCompiler and the compile_ultra command, and mapped to the 7nm technology library [41]. Post-synthesis timing simulations are performed using QuestaSim and 500,000 randomly generated inputs. The obtained switching activity is fed to PrimeTime to perform the power analysis.

C. Results

Figures 8-11 depict the results of our experimental evaluation in terms of energy savings for the multiplication operations for all the examined NN models and approximate methods. In our framework, we selected three values for the accuracy drop threshold {0.5%, 1.0%, 2.0%} and we compare our energy savings against the other methods whose solutions also satisfied the same thresholds. These values were selected since we target high inference accuracy. Note that for M2 [23], in each case, we present the highest energy reduction achieved by an approximate multiplier of [30]. Hence, the reported energy gains are not delivered using a single approximate multiplier.

Figure 8 compares different configurations for all the selected NNs on the CIFAR-10 dataset. As an overall observation, we see that the proposed methodology always produces configurations with the highest energy gain, with an average gain of 17.7%, within the examined accuracy loss margins. Specifically for the ResNet-20, the proposed framework achieves up to 17.5% energy gain for an accuracy loss of only 1.7%, while this gain increases to 18.6% for
Fig. 10. Energy savings comparison of different methods that utilize approximate multipliers for the GTSRB dataset.
Fig. 11. Energy savings comparison of different methods that utilize approximate multipliers for the LISA dataset.
the ResNet-32. Similarly, for the ResNet-44 and ResNet-56, the proposed approach finds solutions that achieve the highest energy gains (16.1% and 17% on average, respectively), while respecting the accuracy drop thresholds. For the Mobilenet-v2 (Figure 8(e)), the proposed framework achieves up to 19.3% energy gain. It is noteworthy that, for this NN on CIFAR-10, all methods that utilize the LVRM approximate multipliers have increased energy savings. Finally, for VGG-11 and VGG-13 the corresponding energy savings are 17.4% and 18.1%. The configuration with the closest average energy gain, for all the examined accuracy loss thresholds and NNs, comes from the M4 method, with an average gain of 14.5%. This shows that our error correction mechanism utilizing the bias lets us map more weights to approximate multipliers. Interestingly, M2 [23] and M3 [29] have the smallest energy gains. M3 follows a layer-wise approach and uses a reconfigurable multiplier optimized for MRE. Thus, only a small number of layers could be mapped to an approximate mode without violating the desired accuracy loss threshold. Similarly, M2 applies fixed approximation, and thus, only the multipliers of [30] that feature almost negligible MRE satisfy the accuracy threshold, limiting the energy gains. These results come in compliance with our evaluation in Figure 1.

Figure 9 compares different configurations for all the selected NNs on the CIFAR-100 dataset. As an overall observation, we see again that the proposed methodology always produces configurations with the highest energy gain, with an average gain of 17.7%, within the examined accuracy loss margins. Specifically, the maximum energy gain achieved by our method (M6) was 19.1%, recorded for the Mobilenet-v2, while respecting the accuracy drop thresholds. Additionally, the behavior of all NNs is similar as in the previous dataset. Specifically, all methods that utilize the LVRM approximate multipliers have increased energy savings, with M4 being the second best, with an average energy gain of 13.9% with regards to the average energy savings for all NNs and accuracy thresholds. Interestingly, M5 again cannot achieve significant energy savings for ResNet-56 under the 0.5% accuracy threshold (Figure 9(d)). This happens because, even though the fine-grain weight mapping allows for better exploration of approximate modes, the bias-based error correction that we propose is important for such a small accuracy drop threshold. In this dataset as well, M2 [23] and M3 [29] again have the smallest energy gains, verifying that our choice to follow a fine-grain weight mapping combined with a low-variance approximate reconfigurable multiplier further reduces the energy consumption.

Figure 10 compares different configurations for all the selected NNs on the GTSRB dataset. The proposed methodology still remains the best, with an average energy gain of 16.6%. For this dataset, M4 is again the second best choice, with an average gain for all NNs of 15.6%. This behavior verifies our strategy for error correction using the bias. Again, M2 [23] and M3 [29] have the smallest energy gains.

Last, Figure 11 presents the results for the LISA dataset. The behavior of all approaches is similar to the other three datasets, with our method achieving the highest energy gains of 20.2% on average. Interestingly, M5 does not perform very well for ResNet-56 for the 0.5% and 1.0% accuracy drop thresholds (Figure 11(d)). This happens because the accuracy drop threshold is small and, since M5 does not utilize the proposed error correction with bias method, the weight-mapping solution is limited in how many weights can be mapped to approximate modes, even though a fine-grain approach is followed. This is also verified by the behavior of M4, which remains the second best with an average gain, for all NNs, of 19%.

In Figures 8-11 we evaluate our method (M6) against 7 NNs trained for 4 different datasets. Although the obtained
Authorized licensed use limited to: Tallinn University of Technology. Downloaded on December 09,2024 at 13:13:55 UTC from IEEE Xplore. Restrictions apply.
TASOULAS et al.: WEIGHT-ORIENTED APPROXIMATION FOR ENERGY-EFFICIENT NN INFERENCE ACCELERATORS 4681
4682 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 67, NO. 12, DECEMBER 2020
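The multicore parallelization of the exploration discussed in this section — each thread exploring configurations for a fraction of the layers, with the explorations for different accuracy thresholds running concurrently — can be sketched as follows. This is a minimal illustration only: `explore_layer` and the task decomposition are hypothetical stand-ins, not the framework's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def explore_layer(layer_id, threshold):
    """Placeholder for the per-layer exploration step: search the
    approximation modes of one layer under one accuracy-loss
    threshold. (Hypothetical; the real search is not shown here.)"""
    return (layer_id, threshold, f"config-{layer_id}-{threshold}")

def parallel_exploration(num_layers, thresholds, workers=8):
    """Explore all (layer, threshold) combinations concurrently.
    Layers are independent of each other, and so are the accuracy
    thresholds, so the search is embarrassingly parallel."""
    tasks = [(l, t) for t in thresholds for l in range(num_layers)]
    layer_ids, ts = zip(*tasks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(explore_layer, layer_ids, ts)
    # Group the per-layer configurations by accuracy threshold.
    per_threshold = {t: {} for t in thresholds}
    for layer_id, t, cfg in results:
        per_threshold[t][layer_id] = cfg
    return per_threshold
```

Under this decomposition, adding more accuracy thresholds does not lengthen the critical path on a machine with enough cores, which matches the reported behavior that the significance metric is computed once per NN-dataset pair and the per-threshold explorations run side by side.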
Gold 6138 processor, which runs at 2.00 GHz base frequency and features 40 cores and 80 threads. Additionally, the system memory is 128 GB. The proposed methodology can be efficiently parallelized on multicore systems by dividing the steps of the exploration among multiple threads. Each thread can explore configurations for a fraction of the total layers. Furthermore, the significance metric is determined only once per NN-dataset combination, irrespective of the number of thresholds that we want to satisfy. The exploration for each accuracy threshold that we need to meet can take place concurrently on multicore systems, thus yielding very quick results. As shown in Figure 12, the proposed framework required up to 2h (for ResNet-56). Execution time of this scale can be considered negligible compared to the longer time required by other methods. As reported in [23], it required 7.5 days (for ResNet-50), 90× more than our method. Additionally, regarding ResNet-56, RETSINA [29] required 4.5 hours and the retraining procedure as in [20], [21] took 11 hours, thus making our method 2.25× and 5.5× faster, respectively.

V. CONCLUSION

In this article, we presented a bottom-up design methodology for enabling adaptive and run-time approximation for NN accelerators (e.g., MAC arrays) during inference. The proposed framework utilizes NN-oriented low-variance approximate multipliers, which support multiple approximation modes, in order to employ a weight-oriented fine-grain weight mapping. As a result, we are able to significantly reduce energy consumption while satisfying tight accuracy requirements during inference. Experimental results on multiple NNs and different datasets show that the proposed weight-oriented approximation method, exploiting our reconfigurable multipliers, always achieves the lowest energy consumption compared to other state-of-the-art approaches that apply fixed or layer-wise approximation.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[2] J. Song et al., "An 11.5TOPS/W 1024-MAC butterfly structure dual-core sparsity-aware neural processing unit in 8nm flagship mobile SoC," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 130–132.
[3] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. Annu. Int. Symp. Comput. Archit., 2017, pp. 1–12.
[4] S. Cass, "Taking AI to the edge: Google's TPU now comes in a maker-friendly package," IEEE Spectr., vol. 56, no. 5, pp. 16–17, May 2019.
[5] J. Han and M. Orshansky, "Approximate computing: An emerging paradigm for energy-efficient design," in Proc. 18th IEEE Eur. Test Symp. (ETS), May 2013, pp. 1–6.
[6] F. Ebrahimi-Azandaryani, O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Block-based carry speculative approximate adder for energy-efficient applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 1, pp. 137–141, Jan. 2020.
[7] D. Esposito, D. De Caro, E. Napoli, N. Petra, and A. G. M. Strollo, "Variable latency speculative Han-Carlson adder," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 5, pp. 1353–1361, May 2015.
[8] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximate adders," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013.
[9] H. Jiang, C. Liu, L. Liu, F. Lombardi, and J. Han, "A review, classification, and comparative evaluation of approximate arithmetic circuits," ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 4, pp. 1–34, Aug. 2017.
[10] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, "Design-efficient approximate multiplication circuits through partial product perforation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 10, pp. 3105–3117, Oct. 2016.
[11] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, "Approximate multipliers based on new approximate compressors," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4169–4182, Dec. 2018.
[12] W. Liu, J. Xu, D. Wang, C. Wang, P. Montuschi, and F. Lombardi, "Design and evaluation of approximate logarithmic multipliers for low power error-tolerant applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 9, pp. 2856–2868, Sep. 2018.
[13] M. Pashaeifar, M. Kamal, A. Afzali-Kusha, and M. Pedram, "A theoretical framework for quality estimation and optimization of DSP applications using low-power approximate adders," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 1, pp. 327–340, Jan. 2019.
[14] L. B. Soares, M. M. A. da Rosa, C. M. Diniz, E. A. C. da Costa, and S. Bampi, "Design methodology to explore hybrid approximate adders for energy-efficient image and video processing accelerators," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 6, pp. 2137–2150, Jun. 2019.
[15] A. Raha and V. Raghunathan, "Towards full-system energy-accuracy tradeoffs: A case study of an approximate smart camera system," in Proc. 54th Annu. Design Autom. Conf., Jun. 2017, pp. 1–6.
[16] G. Zervakis, K. Koliogeorgi, D. Anagnostos, N. Zompakis, and K. Siozios, "VADER: Voltage-driven netlist pruning for cross-layer approximate arithmetic circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 6, pp. 1460–1464, Jun. 2019.
[17] I. Scarabottolo, G. Ansaloni, G. A. Constantinides, and L. Pozzi, "Partition and propagate: An error derivation algorithm for the design of approximate circuits," in Proc. 56th Annu. Design Autom. Conf., Jun. 2019, pp. 1–6.
[18] J. Schlachter, V. Camus, K. V. Palem, and C. Enz, "Design and applications of approximate circuits by gate-level pruning," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 5, pp. 1694–1702, May 2017.
[19] G. Zervakis, S. Xydis, D. Soudris, and K. Pekmestzi, "Multi-level approximate accelerator synthesis under voltage island constraints," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 66, no. 4, pp. 607–611, Apr. 2019.
[20] S. S. Sarwar, S. Venkataramani, A. Ankit, A. Raghunathan, and K. Roy, "Energy-efficient neural computing with approximate multipliers," ACM J. Emerg. Technol. Comput. Syst., vol. 14, no. 2, pp. 1–23, Jul. 2018.
[21] V. Mrazek, S. S. Sarwar, L. Sekanina, Z. Vasicek, and K. Roy, "Design of power-efficient approximate multipliers for approximate artificial neural networks," in Proc. 35th Int. Conf. Comput.-Aided Design, Nov. 2016, pp. 1–7.
[22] M. A. Hanif, R. Hafiz, and M. Shafique, "Error resilience analysis for systematically employing approximate computing in convolutional neural networks," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2018, pp. 913–916.
[23] V. Mrazek, Z. Vasicek, L. Sekanina, M. A. Hanif, and M. Shafique, "ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2019, pp. 1–8.
[24] A. Krizhevsky, "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, ON, Canada, Tech. Rep. 4, 2009.
[25] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition," Neural Netw., vol. 32, pp. 323–332, Aug. 2012.
[26] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund, "Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4, pp. 1484–1497, Dec. 2012.
[27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[29] G. Zervakis, H. Amrouch, and J. Henkel, "Design automation of approximate circuits with runtime reconfigurable accuracy," IEEE Access, vol. 8, pp. 53522–53538, 2020.
[30] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, "EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 258–261.
[31] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu, "On reconfiguration-oriented approximate adder design and its application," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2013, pp. 48–54.
[32] S. Jain, S. Venkataramani, and A. Raghunathan, "Approximation through logic isolation for the design of quality configurable circuits," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), 2016, pp. 612–617.
[33] X. Jiao, V. Akhlaghi, Y. Jiang, and R. K. Gupta, "Energy-efficient neural networks using approximate computation reuse," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2018, pp. 1223–1228.
[34] A. Raha, H. Jayakumar, and V. Raghunathan, "Input-based dynamic reconfiguration of approximate arithmetic units for video encoding," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 3, pp. 846–857, Mar. 2016.
[35] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "RAP-CLA: A reconfigurable approximate carry look-ahead adder," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 8, pp. 1089–1093, Aug. 2018.
[36] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Dual-quality 4:2 compressors for utilizing in dynamic accuracy configurable multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 4, pp. 1352–1361, Apr. 2017.
[37] V. Mrazek, Z. Vasicek, and L. Sekanina, "Design of quality-configurable approximate multipliers suitable for dynamic environment," in Proc. NASA/ESA Conf. Adapt. Hardw. Syst. (AHS), Aug. 2018, pp. 264–271.
[38] B. Boroujerdian, H. Amrouch, J. Henkel, and A. Gerstlauer, "Trading off temperature guardbands via adaptive approximations," in Proc. IEEE 36th Int. Conf. Comput. Design (ICCD), Oct. 2018, pp. 202–209.
[39] M. A. Hanif, F. Khalid, and M. Shafique, "CANN: Curable approximations for high-performance deep neural network accelerators," in Proc. 56th Annu. Design Autom. Conf., Jun. 2019, pp. 1–6.
[40] C. Li, W. Luo, S. S. Sapatnekar, and J. Hu, "Joint precision optimization and high level synthesis for approximate computing," in Proc. 52nd Annu. Design Autom. Conf. (DAC), 2015, pp. 1–6.
[41] L. T. Clark et al., "ASAP7: A 7-nm finFET predictive process design kit," Microelectron. J., vol. 53, pp. 105–115, Jul. 2016.
[42] A. Renda, J. Frankle, and M. Carbin, "Comparing rewinding and fine-tuning in neural network pruning," 2020, arXiv:2003.02389. [Online]. Available: http://arxiv.org/abs/2003.02389
[43] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1135–1143.
[44] M. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," 2017, arXiv:1710.01878. [Online]. Available: http://arxiv.org/abs/1710.01878
[45] P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi, "Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 11, pp. 5784–5789, Nov. 2018.
[46] Keras. Keras CIFAR-10 ResNet. Accessed: Jul. 24, 2020. [Online]. Available: https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py
[47] PyTorch. VGG-NETS. Accessed: Jul. 24, 2020. [Online]. Available: https://pytorch.org/hub/pytorch_vision_vgg/
[48] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "SCALE-sim: Systolic CNN accelerator simulator," 2018, arXiv:1811.02883. [Online]. Available: http://arxiv.org/abs/1811.02883

Zois-Gerasimos Tasoulas received the Diploma in electrical and computer engineering from the National Technical University of Athens, Greece, in 2016. He is currently pursuing the Ph.D. degree with the Electrical, Computer, and Biomedical Engineering Department, Southern Illinois University Carbondale, IL, USA. His research interests include concurrent data structures, performance of many-core systems, system reliability, and GPGPU resource management. He was awarded a Doctoral Graduate Fellowship from Southern Illinois University Carbondale for the 2019–2020 academic year.

Georgios Zervakis received the Diploma and Ph.D. degrees from the Department of Electrical and Computer Engineering (ECE), National Technical University of Athens (NTUA), Greece, in 2012 and 2018, respectively. He worked as a primary Researcher in several EU-funded projects as a member of the Institute of Communication and Computer Systems (ICCS), Athens, Greece. He is currently a Research Group Leader of the Chair for Embedded Systems (CES) with the Karlsruhe Institute of Technology (KIT), Germany. His research interests include approximate computing, low-power design, design automation, and integration of hardware acceleration in the cloud.

Iraklis Anagnostopoulos (Member, IEEE) received the Ph.D. degree from the Microprocessors and Digital Systems Laboratory, National Technical University of Athens. He is currently an Assistant Professor with the Electrical and Computer Engineering Department, Southern Illinois University Carbondale. He is the Director of the Embedded Systems Software Laboratory, which works on run-time resource management of modern and heterogeneous embedded many-core architectures, and he is also affiliated with the Center for Embedded Systems. His research interests lie in the area of constrained application mapping for many-core systems, design and exploration of heterogeneous platforms, resource contention minimization, and power-aware design of embedded systems.

Hussam Amrouch (Member, IEEE) received the Ph.D. degree (summa cum laude) from KIT in 2015. He is currently a Junior Professor heading the Chair of Semiconductor Test and Reliability (STAR) within the Computer Science, Electrical Engineering Faculty, University of Stuttgart, as well as a Research Group Leader with the Karlsruhe Institute of Technology (KIT), Germany. His main research interests are design for reliability and testing from device physics to systems, machine learning, security, approximate computing, and emerging technologies with a special focus on ferroelectric devices. He holds seven HiPEAC Paper Awards and three best paper nominations at top EDA conferences: DAC'16, DAC'17, and DATE'17 for his work on reliability. He also serves as an Associate Editor for Integration, the VLSI Journal. He has served in the Technical Program Committees of many major EDA conferences, such as DAC, ASP-DAC, and ICCAD, and as a reviewer in many top journals, such as T-ED, TCAS-I, TVLSI, TCAD, and TC. He has around 85 publications in multidisciplinary research areas across the entire computing stack, from semiconductor physics to circuit design all the way up to computer-aided design and computer architecture.

Jörg Henkel (Fellow, IEEE) received the Diploma degree and the Ph.D. degree (summa cum laude) from the Technical University of Braunschweig. He was a Research Staff Member with NEC Laboratories, Princeton, NJ, USA. He is currently the Chair Professor of embedded systems with the Karlsruhe Institute of Technology. His research work is focused on co-design for embedded hardware/software systems with respect to power, thermal, and reliability aspects. He has received six best paper awards throughout his career from, among others, ICCAD, ESWeek, and DATE. For two consecutive terms, he served as the Editor-in-Chief of the ACM Transactions on Embedded Computing Systems. He is also the Editor-in-Chief of IEEE Design&Test Magazine. He is/has been an Associate Editor of major ACM and IEEE Transactions. He has led several conferences as a General Chair, including ICCAD and ESWeek, and serves as a Steering Committee Chair/Member of leading conferences and journals for embedded and cyber-physical systems. He coordinates the DFG program SPP 1500 Dependable Embedded Systems and is a Site Coordinator of the DFG TR89 Collaborative Research Center on Invasive Computing. He is the Chairman of the IEEE Computer Society, Germany Chapter.