

Weight-Oriented Approximation for Energy-Efficient Neural Network Inference Accelerators

Zois-Gerasimos Tasoulas, Georgios Zervakis, Iraklis Anagnostopoulos, Member, IEEE, Hussam Amrouch, Member, IEEE, and Jörg Henkel, Fellow, IEEE

Abstract—Current research in the area of Neural Networks (NNs) has resulted in performance advancements for a variety of complex problems. Especially, embedded system applications rely more and more on the utilization of convolutional NNs to provide services such as image/audio classification and object detection. The core arithmetic computation performed during NN inference is the multiply-accumulate (MAC) operation. In order to meet tighter and tighter throughput constraints, NN accelerators integrate thousands of MAC units, resulting in a significant increase in power consumption. Approximate computing is established as a design alternative to improve the efficiency of computing systems by trading computational accuracy for high energy savings. In this work, we bring approximate computing principles and NN inference together by designing NN-specific approximate multipliers that feature multiple accuracy levels at run-time. We propose a time-efficient automated framework for mapping the NN weights to the accuracy levels of the approximate reconfigurable accelerator. The proposed weight-oriented approximation mapping is able to satisfy tight accuracy loss thresholds, while significantly reducing energy consumption without any need for intensive NN retraining. Our approach is evaluated against several NNs, demonstrating that it delivers high energy savings (17.8% on average) with a minimal loss in inference accuracy (0.5%).

Index Terms—Approximate computing, neural network inference, low-power, reconfigurable approximate multipliers.

Manuscript received May 8, 2020; revised July 24, 2020; accepted August 16, 2020. Date of publication September 4, 2020; date of current version December 1, 2020. This work was partially funded by the German Research Foundation (DFG) through the Project Approximate Computing aCROss the System Stack (ACCROSS). This article was recommended by Associate Editor W. Liu. (Zois-Gerasimos Tasoulas and Georgios Zervakis contributed equally to this work.) (Corresponding author: Zois-Gerasimos Tasoulas.)
Zois-Gerasimos Tasoulas and Iraklis Anagnostopoulos are with the Department of Electrical and Computer Engineering, Southern Illinois University, Carbondale, IL 62901 USA (e-mail: [email protected]; [email protected]).
Georgios Zervakis, Hussam Amrouch, and Jörg Henkel are with the Chair for Embedded Systems, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2020.3019460

I. INTRODUCTION

WITH the recent and rapid advancements in the area of artificial intelligence, machine learning has become the driving force both in general-purpose and embedded computing domains. Especially, embedded system applications rely more and more on the integration of Neural Networks (NNs) in order to provide more sophisticated services and enhance user experience. As embedded devices are generally characterized by limited computing capabilities and are also energy constrained, custom hardware accelerators prevail as a solution to the accuracy-throughput trade-off. The core arithmetic operation performed by NNs during inference is the multiply-accumulate (MAC) operation. Particularly, the convolution and fully connected layers of NNs perform millions of multiplications and additions [1]. However, as state-of-the-art NNs are becoming deeper and more complex, such hardware accelerators integrate thousands or even more MAC units [2] in order to keep up with the required throughput. For example, the cloud-oriented tensor processing unit (TPU) integrates 64K MACs [3], while the embedded-oriented Samsung NPU uses 1K MACs [2] and Google's Edge TPU comprises 4K MACs [4]. However, this vast number of MAC units results in a significant increase in energy consumption, greatly affecting the integration in energy-constrained embedded devices.

According to the principle of approximate computing, modern systems can trade off computation accuracy in order to reduce both execution time and power consumption [5]–[20]. A wide variety of modern applications, such as digital signal processing, image processing, video analytics, and wireless communications, support this principle and are good candidates for approximation, belonging to the Recognition, Mining, and Synthesis (RMS) application class [5]. Particularly, previous research works [20], [21] revealed that NNs feature increased error resilience. At the implementation level, approximate computing is performed through the design of approximate circuits targeting additions [6]–[9] and multiplications [9]–[12]. Such approximate circuits consume significantly less energy at the cost of reduced output quality. Since different applications have different error tolerance, the design of customized approximate circuits per application is challenging [13]–[15]. Most approximate circuits are generated having fixed approximation and they require re-design for different applications [6]–[8], [10]–[12]. To accelerate the whole process, automation frameworks for designing approximate circuits have been proposed in the past [16]–[19]. However, these methods also apply fixed approximation and the generated circuits do not support input-adaptive run-time reconfigurability.


Given the inherent error resilience of NNs [20] and the advantages of approximate computing, the design of approximate multipliers and inference accelerators has gained significant research interest in the past few years. Design methodologies focus on the utilization of approximate arithmetic units (adders and/or multipliers) or memories in order to reduce energy consumption. However, the utilization of approximate components in MAC arrays during NN inference is challenging, as modern NNs are very deep, the approximation error is input dependent [15], and layers differ in error resilience [22]. For example, ResNet-50 [1] employs approximately 120M multiplications, making an exhaustive search to identify which multiplications should be approximated, while keeping accuracy high, impossible. Previous approaches analyze the impact of error in NNs [22], use fixed approximation multipliers [20], [21], or follow a layer-based approximation, in which they try to find the right approximation level for different convolution layers separately [23]. However, such methods require NN retraining or a layer optimization phase, which can be time-consuming for deep NNs. Additionally, all the aforementioned approaches consider fixed error in the utilized approximate circuits without any support for input-adaptive run-time management.

In this article, we present a bottom-up design methodology for enabling adaptive and run-time approximation for NN inference accelerators. Our goal is to support weight-oriented fine-grain approximation in order to reduce energy consumption while controlling the accuracy loss and without requiring NN retraining. The contributions of this article are threefold:
• NN-oriented reconfigurable approximate multipliers: We design a convolution-specific approximate reconfigurable multiplier. Its design is based on the error variance instead of the mean relative or absolute error that existing methods follow. Building on prior art, we design LVRM (Low-Variance Reconfigurable Multiplier), an approximate reconfigurable multiplier that supports exact operation (LVRM0) and two approximate modes (LVRM1 and LVRM2).
• A time-efficient methodology for mapping the different approximation modes based on the weight values of the NN: We present a weight-oriented methodology that, considering an inference accelerator comprising approximate reconfigurable multipliers, decides which approximation mode (i.e., LVRM0, LVRM1, or LVRM2) will be used for each weight value of each layer of the NN, such that the final accuracy of the NN during inference satisfies a user-provided error threshold and the energy consumption is minimized. Our method offers fine-grain optimization compared to existing coarse-grain layer-wise or fixed approximation approaches.
• No NN retraining is required: Given a trained NN, the proposed framework selects an approximate mode for each weight (i.e., it maps the approximate accelerator to the NN) without requiring time-consuming retraining (i.e., mapping the NN to the approximate accelerator). Moreover, we present how the biases can be used to compensate the error induced by the approximate multiplications, thus achieving a zero-cost error correction (no additional hardware or retraining).

The developed methodology was evaluated, using four datasets [24]–[26], on Mobilenet-v2 [27], several ResNet [1], and VGG [28] NNs. Experimental results showed an average 17.8% energy reduction with only 0.5% accuracy loss. Our weight-oriented approach outperforms layer-wise and fixed-approximation state-of-the-art methods [20], [23], [29].

II. RELATED WORK

In order to achieve accuracy reconfiguration during NN inference, the authors in [23] proposed a heterogeneous architecture built upon several static approximate multipliers [30]. Specifically, they apply a layer-wise approximation at run-time and they power-gate any approximate multipliers that are not used. However, this approach requires a heterogeneous architecture design and weight tuning, and it also has a high area overhead, resulting in throughput loss due to the underutilized hardware. In [29], Simulated Annealing is used to produce approximate reconfigurable multipliers for NN inference by combining gate-level pruning [18] and wire-by-switch replacement [31]. Nevertheless, similar to [23], the generated approximate multipliers are optimized for the Mean Relative Error (MRE) metric and they apply only layer-wise approximation, limiting the potential benefits. In [32], approximate reconfigurable circuits are generated using wire-by-switch replacement and by identifying closed logic island regions. The authors in [33] used reconfigurable bloom filters in order to support approximate layer-based pattern matching. In this article, we follow a more fine-grain approach by deciding the approximation level based on the weight values of the NN, while the approximate multipliers are optimized for low dispersion, which is a more suitable metric for NN inference, as presented in Section III-A. Previous methodologies have also tried to control the accuracy of the approximations at run-time by enabling reconfiguration [31], [34]–[38]. Particularly, the methods in [34]–[36] apply power gating to achieve reconfiguration. Considering that thousands of MACs are integrated in NN accelerators, such a fine-grained power-gating approach is inefficient. Reference [31] splits the addition into small sub-adders, and multiplexers select between the exact and predicted carries, inducing considerable delay overhead. In [37], Cartesian genetic programming and clock gating are used to generate approximate multipliers that feature significant area overhead and high error value, thus being unsuitable for deep NN inference. In [38], a synthesis framework for approximate reconfigurable circuits is proposed, but it targets delay and not power efficiency. In this article, the proposed multipliers support an exact and two approximation levels and, along with the heuristic algorithm for weight mapping, we are able to control the introduced error for any NN, regardless of its size. In [39], the authors presented an NN accelerator which integrates approximate multipliers along with a compensation module in order to reduce energy consumption. Similarly, the authors in [21] analyzed the impact of error in NNs by utilizing approximate multipliers in different convolutional layers. However, the latter two approaches considered the LeNet NN, which is shallow compared to current state-of-the-art architectures, and the developed multipliers offer a single level of approximation, thus not being flexible enough for deeper NNs.


Moreover, the error compensation proposed in [39] requires the addition of an extra accumulation row in the MAC array, thus increasing its size as well as its computational latency. Additionally, the method in [21] requires retraining after performing the approximation in order to help the NN adapt to the changes. The authors in [20] proposed approximate multipliers based on the concept of computation sharing to reduce energy consumption. Nevertheless, the introduced concept of Multiplier-less Artificial Neuron also requires network retraining in order to correct the approximation-induced accuracy loss.

The differentiator and novelty of the presented work, compared to previous approaches, is three-fold: (1) it proposes NN-oriented accuracy-reconfigurable multipliers, rather than generic fixed approximate multipliers, thus being able to adapt to varying accuracy requirements; (2) it follows a fine-grain weight-oriented approach by deciding the approximation level based on the weight values of the NN, rather than layer-based; and (3) it does not require any NN retraining, due to the performed fine-grained weight-to-approximation mapping.

Fig. 1. Evaluating the accuracy of NN inference for CIFAR-10 using approximate multipliers as the number of convolution layers increases. The accuracy attained by the approximate multipliers is normalized with respect to the accuracy achieved by the exact multiplier. The evaluated approximate multipliers 125K, 14VP, and CK5 are obtained from [30] and 8-bit quantization is considered.

III. PROPOSED METHODOLOGY

As aforementioned, the accuracy during NN inference is highly input dependent and, as NNs become deeper, the error induced by approximate multiplications has more impact on the inference accuracy. Particularly, for deep NNs, static approximate multipliers fail to meet tight accuracy constraints. In Figure 1, three power-error Pareto-optimal approximate multipliers from the state-of-the-art library EvoApproxLib [30] (i.e., 125K, 14VP, and CK5) are considered. These multipliers are selected as they feature very small error values. Specifically, the MRE of 125K, 14VP, and CK5 is only 0.02%, 0.14%, and 0.6%, respectively. In Figure 1, all the multiplications, in all the convolution layers, are performed using the respective approximate multiplier. As shown in Figure 1, for the small ResNet-8 network [1], all the multipliers achieve about the same inference accuracy as the exact multiplier. The 125K approximate multiplier, which features almost zero MRE and only 10% energy savings, retains very high accuracy for all the examined ResNets. On the other hand, as the size of ResNet increases (i.e., ResNet-20, ResNet-32, and ResNet-56), the inference accuracy drops very fast when we utilize the 14VP and CK5. For example, for ResNet-20 the accuracy achieved by the 14VP and CK5 is 12% and 29% smaller compared to the accuracy achieved by the exact multiplier. Hence, NN-oriented reconfigurable approximate multipliers are needed in order to optimize energy consumption during inference, while keeping accuracy loss below certain thresholds.

Figure 2 depicts an overview of our proposed methodology. Initially, we focus on the design of NN-specific approximate reconfigurable multipliers (Section III-A) in order to enable adaptation to the varying accuracy requirements. Compared to previous approaches, which follow generic approximate designs based on the Mean Relative or Absolute Error (MRE or MAE) [9], [30], we generate an approximate reconfigurable multiplier, named LVRM (Low-Variance Reconfigurable Multiplier), which supports three accuracy levels based on a variance constraint. Once the design of LVRM is completed, we perform a weight-to-approximation-mode mapping (Section III-B). Specifically, we consider an inference accelerator similar to the Google TPU [3], as depicted in Figure 3, that employs a systolic MAC array, and we replace the exact multipliers with LVRM. However, note that our methodology is not bound to a specific accelerator architecture. Already trained NNs are quantized to 8-bit fixed point (both weights and activations) to enable their execution on the considered accelerator. Then, we override the exact multiplication with a C-description of LVRM and we extract the significance of each layer. Based on an accuracy drop threshold, we perform a fine-grain weight mapping, rather than a layer-based one, in order to extract the final run-time configurations.

Fig. 2. Overview of the proposed methodology.

Fig. 3. Abstract overview of a systolic MAC array similar to the one used in the Google TPU [3].

A. NN-Oriented Approximate Reconfigurable Multipliers

1) Convolution Error Analysis: The basic operation performed in a convolution layer is given by Equation 1:

$$Y = B + \sum_{j=1}^{k} W_j \times A_j, \tag{1}$$

where $W_j$ are the weights, $A_j$ the input activations, and $B$ the bias of the neuron. Hence, if the multiplication $W_j \times A_j$ is performed by an approximate multiplier, then, considering the size of the filter and the depth of the input, the error of $Y$ can seamlessly become very significant. Assuming an approximate multiplier and denoting ε(W, A) the multiplication error of a multiplication W × A, ∀W, A, then ε(W, A) is given by:

$$\epsilon(W, A) = W \times A - (W \times A)\big|_{\mathrm{approximate}}, \tag{2}$$

and the approximate convolution output $Y'$ is given by:

$$Y' = B + \sum_{j=1}^{k} \big( W_j \times A_j - \epsilon(W_j, A_j) \big) = B - \sum_{j=1}^{k} \epsilon(W_j, A_j) + \sum_{j=1}^{k} W_j \times A_j. \tag{3}$$

The error value ε of an approximate multiplier can be viewed as a random variable defined by its mean value μ(ε) and its variance Var(ε) [40]. Hence, if the error of the approximate multiplier is systematic, its mean value μ(ε) can be compensated by a constant correction term [40]. During the inference phase, for each filter, the weights are constant while the activations flow through the accelerator (MAC array) to perform the convolution [3]. We define μ_ε(W) and Var_ε(W) as the mean error and the variance of performing the multiplication W × A, ∀A, i.e., multiplying a fixed weight W with any input A. Therefore, using the bias value we can compensate the error induced by the employed approximate multiplication. To achieve this, the bias B is replaced by B′ as Equation 4 shows:

$$B' = B + \sum_{j=1}^{k} \mu_\epsilon(W_j). \tag{4}$$

Hence, using the bias update, Equation 3 is written as:

$$Y' = B' - \sum_{j=1}^{k} \epsilon(W_j, A_j) + \sum_{j=1}^{k} W_j \times A_j = B + \sum_{j=1}^{k} \mu_\epsilon(W_j) - \sum_{j=1}^{k} \epsilon(W_j, A_j) + \sum_{j=1}^{k} W_j \times A_j. \tag{5}$$

Thus, the error $\epsilon_Y$ of the output $Y$ is given by:

$$\epsilon_Y = Y - Y' = B + \sum_{j=1}^{k} W_j \times A_j - B - \sum_{j=1}^{k} \mu_\epsilon(W_j) + \sum_{j=1}^{k} \epsilon(W_j, A_j) - \sum_{j=1}^{k} W_j \times A_j = \sum_{j=1}^{k} \epsilon(W_j, A_j) - \sum_{j=1}^{k} \mu_\epsilon(W_j). \tag{6}$$

Therefore, the mean value $\mu(\epsilon_Y)$ of $\epsilon_Y$, ∀$A_j$, is calculated by:

$$\mu(\epsilon_Y) = \mu\Big( \sum_{j=1}^{k} \epsilon(W_j, A_j) - \sum_{j=1}^{k} \mu_\epsilon(W_j) \Big) = \sum_{j=1}^{k} \mu\big(\epsilon(W_j, A_j)\big) - \sum_{j=1}^{k} \mu_\epsilon(W_j) = \sum_{j=1}^{k} \mu_\epsilon(W_j) - \sum_{j=1}^{k} \mu_\epsilon(W_j) = 0, \tag{7}$$

and the variance $\mathrm{Var}(\epsilon_Y)$ of $\epsilon_Y$, ∀$A_j$, is given by:

$$\mathrm{Var}(\epsilon_Y) = \mathrm{Var}\Big( \sum_{j=1}^{k} \epsilon(W_j, A_j) - \sum_{j=1}^{k} \mu_\epsilon(W_j) \Big) = \sum_{j=1}^{k} \mathrm{Var}\big(\epsilon(W_j, A_j)\big) + \underbrace{\mathrm{Var}\Big( \sum_{j=1}^{k} \mu_\epsilon(W_j) \Big)}_{=\,0} = \sum_{j=1}^{k} \mathrm{Var}_\epsilon(W_j). \tag{8}$$

Note that in Equation 8 the $W_j$ are independent variables (i.e., the error of one multiplication does not depend on the error of another multiplication) and thus their covariance is zero. As a result, when using our error correction through bias update (Equation 4), the error value $\epsilon_Y$ of the convolution output $Y$ features:

$$\mu(\epsilon_Y) = 0 \quad \text{and} \quad \mathrm{Var}(\epsilon_Y) = \sum_{j=1}^{k} \mathrm{Var}_\epsilon(W_j). \tag{9}$$

Hence, leveraging our bias update, the error value of the output Y is defined only by Var_ε(W), ∀W. Therefore, the accuracy of the convolution is only subject to the error variance of the approximate multiplier (Var_ε(W), ∀W). In other words, approximate multipliers with low dispersion will merely impact the output quality. Qualitatively, we use the mean error μ_ε(W) (i.e., expected error value) of each approximate multiplier to correct the generated error. In that way, if the variance (Var_ε(W)) is small, the multiplication errors are concentrated close to the mean value, boosting the efficiency of our correction.

As Equations 1-9 demonstrate, the proposed error correction efficiently nullifies the mean error of the convolution. Additionally, combining it with approximate multipliers of low dispersion (low variance), we can deliver very high accuracy at the convolution level. What is important to note is that our error correction only requires updating the biases after quantization. Therefore, it comes at zero cost, since it does not require any retraining or additional hardware.
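To make the bias update concrete, the following minimal Python sketch precomputes μ_ε(W) for every 8-bit weight by exhaustively sweeping the activations and then applies Equation 4 to one filter. The function approx_mul is a hypothetical stand-in for the C-level model of the approximate multiplier; it is not one of the paper's artifacts.

```python
import numpy as np

def mean_error_table(approx_mul, bits=8):
    """mu_eps(W): mean over all activations A of W*A - approx_mul(W, A).
    An exhaustive sweep is cheap for 8-bit operands (256 x 256 products)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return {w: np.mean([w * a - approx_mul(w, a) for a in range(lo, hi + 1)])
            for w in range(lo, hi + 1)}

def corrected_bias(bias, filter_weights, mu_table):
    """Equation 4: B' = B + sum_j mu_eps(W_j), over every weight of the filter."""
    return bias + sum(mu_table[int(w)] for w in np.ravel(filter_weights))
```

Because the weights of a filter are fixed after quantization, this sum is computed once offline; nothing changes in the accelerator datapath.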


2) Low-Variance Reconfigurable Multiplier Design: Considering the above analysis, we utilize the widely used approximation technique of wire-by-switch replacement [29], [31], [32] in order to generate approximate reconfigurable multipliers that can replace the exact ones of an NN inference accelerator. The wire-by-switch approximation technique replaces a wire by a switch that selects between the wire's value and an approximate one. When the switch is off, the output of the switch equals the wire's value and exact results are obtained. When the switch is on, the output of the switch equals the approximate value (e.g., '0' or '1'). This technique is widely used to boost performance by limiting the carry propagation and/or to decrease the power consumption by limiting the circuit's switching activity [29], [31], [32]. A control signal is used to turn the switches on/off and select between exact and approximate execution. Therefore, by using different control signals to enable different switches, we can achieve multiple varying accuracy levels [29]. In our work, we consider three accuracy levels, while a 2-bit input control signal is required to select the desired accuracy level (i.e., enable the respective switches of each level). When the control signal is set to 2'b00, all the switches are off and exact computations are performed. When it is 2'b01, the switches of the first approximate level are turned on. Similarly, when it is 2'b11, the switches of the second approximate level are turned on. To apply wire-by-switch replacement we need to identify which wires will be replaced by switches as well as the respective approximate value ('0' or '1') that will be used in the replacement. In [29], Simulated Annealing combined with a high-level power estimator is used to identify the wires to be approximated. In [32], the authors aim to identify logic isolation islands, i.e., closed logic cones, and apply wire-by-switch replacement. However, such heuristics do not guarantee power optimality. In our work, since we only need to generate an approximate reconfigurable 8-bit multiplier (i.e., a small circuit), we employ an exhaustive design space exploration to ensure power optimality. In addition, [29] utilizes the mean relative error (MRE) as a quality function to generate approximate reconfigurable multipliers. For each accuracy level i, with error constraint E_i, [29] identifies the approximations that satisfy:

$$\mathrm{average}\left|\frac{\epsilon(W, A)}{W \times A}\right| \leq E_i. \tag{10}$$

Nevertheless, since the relative error is not an additive metric, MRE is not a representative error metric when targeting a convolution operation, as the latter requires the summation of a large number of products. Therefore, we replace the quality function of [29] with Equation 11:

$$\forall i, W: \quad \mathrm{Var}_\epsilon(W) \leq V_i \;\wedge\; \big( p(\epsilon_W > 0) \geq 0.80 \,\vee\, p(\epsilon_W > 0) = 0 \big), \tag{11}$$

where V_i is the variance constraint at each accuracy level i and p(ε_W > 0) is the probability that the multiplication W × A, ∀A, produces an erroneous output. By setting p(ε_W > 0) ≥ 0.80, we aim at generating an approximate multiplier with a high error rate (i.e., systematic error) in order to exploit Equation 4. Note that, in contrast to [29], we specify that the variance constraint must be satisfied ∀W. We consider three accuracy levels with V_i bounds 0, 8², and 20², respectively, where 0 refers to exact operation. Due to such tight variance constraints, the obtained solutions featured moderate energy reduction. Hence, we relax the variance constraint Var_ε(W) ≤ V_i and replace it by:

$$p\big( \mathrm{Var}_\epsilon(W) \leq V_i \big) \geq 0.90. \tag{12}$$

In other words, we specify that at least 90% of the weights have to satisfy the variance constraint. Then, at run-time, the 10% of the weights that violate the variance constraint may be mapped to exact operation. The variance constraints (8² and 20²) are selected after we evaluated all the Pareto-optimal multipliers in [30] and examined how their error variance impacts the final accuracy of different NNs (see Section IV). Additionally, they provide a good balance between error variance and energy gain (as shown later). The 8² constraint is very tight, delivering high accuracy, but also leaves limited room for approximations. The 20² constraint is more relaxed and thus, although it results in lower accuracy, it allows for more approximation and delivers higher energy gains. Finally, considering that the 8-bit multiplier is a small circuit, supporting more accuracy levels would limit the delivered energy savings, since the latter highly depend on the circuit characteristics (e.g., size and activity) and the number of induced switches [29].
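For illustration, the check behind Equations 11-12 can be expressed over a precomputed error matrix err, where err[w, a] holds ε(W, A) for all 256 × 256 signed 8-bit operand pairs; the matrix layout, and reading "erroneous output" as any nonzero error, are assumptions of this sketch.

```python
import numpy as np

def meets_level(err, v_bound):
    """Quality function for one accuracy level (Equations 11-12).
    err: 256 x 256 array; rows index the fixed weight W, columns the input A."""
    var_w = err.var(axis=1)                    # Var_eps(W), one value per weight
    rate_w = (err != 0).mean(axis=1)           # error rate p(eps_W > 0) per weight
    systematic = np.all((rate_w >= 0.80) | (rate_w == 0))  # Equation 11 term
    relaxed_var = np.mean(var_w <= v_bound) >= 0.90        # Equation 12 term
    return bool(systematic and relaxed_var)
```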
Using the proposed quality function and the 8-bit exact multiplier of [30] as our baseline circuit, we generate an approximate reconfigurable multiplier with three accuracy levels, named LVRM, that satisfies the aforementioned variance constraints. Our exhaustive design space exploration to generate LVRM operates as follows:
i) We synthesize the exact multiplier at its critical path delay using Synopsys DesignCompiler targeting the 7nm FinFET library [41] and obtain its gate-level netlist.
ii) Since we target low error variance, we extract from the netlist only the wires that affect the eight least significant bits of the multiplier.
iii) We generate all the possible configurations where each of the wires extracted in the previous step is either not modified (exact), or replaced by '0' or '1'.
iv) We simulate all the configurations of the previous step to calculate their output error. To speed up the error evaluation, we use a Verilog-to-C converter [29] to run all the simulations at C-level. All the possible input combinations (65,536 in total) are considered in the simulations.
v) For each accuracy level, we keep only the configurations that satisfy the respective error bound (variance bound) of the quality function (Equations 11-12).
vi) Using the obtained configurations of the previous step, we apply wire-by-switch replacement and synthesize the generated netlists at the critical path delay of the exact multiplier. After synthesis, we obtain the corresponding approximate reconfigurable multipliers. To obtain an efficient control circuitry, we follow the approach of [29] and we constrain the configurations selected for the first accuracy level to be a subset of the configurations selected for the second one.


Note that, by definition, the error bound of the first accuracy level is lower than the error bound of the second one. In step iv we performed an exhaustive exploration, evaluating the error of all the possible configurations. Hence, for each configuration, we have also calculated the error of all of its subsets.
vii) We perform a power analysis for all the generated circuits that do not violate the delay constraint (i.e., achieve the same frequency as the exact multiplier). To calculate the power consumption we assume equiprobable operation at all the accuracy levels and we run post-synthesis timing simulations with Mentor QuestaSim to obtain the switching activity. Synopsys PrimeTime is used for the power analysis.
viii) We select the approximate reconfigurable multiplier of the previous step that minimizes the power consumption.
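A condensed sketch of steps iii-v, under two assumptions: wires is the candidate-wire list from step ii, and simulate(config) is a hypothetical wrapper around the Verilog-to-C flow that returns the 256 × 256 error matrix of one configuration. For brevity, only the relaxed variance bound (Equation 12) is checked here.

```python
from itertools import product
import numpy as np

def explore(wires, simulate, v_bounds=(8**2, 20**2)):
    """Steps iii-v: each wire stays exact or is tied to '0' or '1'; simulate all
    65,536 operand pairs; keep configurations whose error variance satisfies a
    level's bound for at least 90% of the weights."""
    survivors = {v: [] for v in v_bounds}
    for choice in product(('exact', '0', '1'), repeat=len(wires)):
        config = dict(zip(wires, choice))
        var_w = simulate(config).var(axis=1)   # Var_eps(W) per weight row
        for v in v_bounds:
            if np.mean(var_w <= v) >= 0.90:    # relaxed bound (Equation 12)
                survivors[v].append(config)
    return survivors
```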
All the computations in each step are independent and thus, they are executed in parallel. In total, about 5 million configurations were extracted in step iii. Using a dual Xeon Gold 6138 server that features 80 threads, the design space exploration required about 6 hours to generate the LVRM multiplier for the examined variance constraints. Although an exhaustive design space exploration ensures power optimality, it is not efficient with respect to execution time. However, the 6 hours that are required are considered a reasonable amount of time. Proposing a fast automated synthesis framework for approximate reconfigurable circuits is out of the scope of our work. Finally, note that LVRM has a similar structure to the exact multiplier, since LVRM is generated by applying wire-by-switch replacement on the netlist of the exact multiplier. The only difference, compared to the exact multiplier, is that LVRM requires an additional 2-bit control signal. Hence, LVRM can seamlessly replace the exact multiplier in any NN accelerator (given that the control signal is provided, with respect to the accuracy requirement).

Fig. 4. The (a) error variance and (b) energy reduction of LVRM at the two approximate modes (8² and 20² variance constraints) with respect to the weight values (8-bit).

In Figure 4, we evaluate the accuracy and energy efficiency of LVRM. The energy reduction is reported with respect to the energy consumption of the exact multiplier [30]. For the two approximate levels of LVRM, Figure 4 illustrates Var_ε(W) and the achieved energy reduction with respect to the weight value. As aforementioned, the exact multiplier and LVRM are synthesized using DesignCompiler and mapped to the 7nm FinFET library [41]. QuestaSim is used to run gate-level timing simulations to obtain the outputs and the switching activity. Finally, PrimeTime is used to perform the power analysis. As shown in Figure 4, at accuracy level 1 (8² variance constraint), LVRM delivers up to 25% (18% on average) energy reduction, while at accuracy level 2 (20² variance constraint) the energy reduction goes up to 35% (22% on average). At accuracy level 1, the average variance is 27 and 94% (higher than 90%) of the weights satisfy the given variance constraint of 8². Similarly, at accuracy level 2, the average variance is 157 and 97% of the weights satisfy the given variance constraint of 20². For the rest of the article, LVRM0 corresponds to operation at the exact mode, LVRM1 corresponds to accuracy level 1 (8² variance constraint), and LVRM2 corresponds to accuracy level 2 (20² variance constraint).

Fig. 5. Weight-oriented mapping of approximation modes.

B. Weight-Oriented Mapping

The second part of our methodology focuses on mapping the different approximation modes based on the weight values of the NN. Specifically, given an accuracy drop threshold, we decide which approximation mode (e.g., LVRM0, LVRM1, or LVRM2) will be used for each weight value of each layer of the NN, such that the final accuracy of the NN during inference satisfies the error threshold and the energy consumption is minimized. This mapping problem is very challenging due to its high complexity. Modern NNs employ tens to hundreds of convolutional layers consisting of thousands to millions of different weight values. Taking also into consideration the different approximate modes of LVRM, an exhaustive exploration is infeasible. In an attempt to reduce the search time and space, previous approaches [23] utilize evolutionary algorithms and perform a layer-oriented mapping. However, such solutions try to solve the problem in a stochastic way, being very time consuming for satisfying a specific accuracy threshold.

In order to find an efficient weight-to-approximate-mode mapping and reduce the number of evaluated solutions, we employ a four-step methodology based on the concepts of layer significance and weight magnitude [42], [43] (Figure 5). The significance of a layer is determined based on how much accuracy drop we observe during inference if all the multiplications of that layer were executed with LVRM2 (the most aggressive approximate mode). The idea behind weight magnitude is that weights with small absolute value contribute little to the final result [44]. Thus, they can tolerate more error and can be mapped to LVRM2. This concept has also been used in weight pruning, where any value less than a threshold is set to zero.

Step 1 – Determine Layer Significance: The focus of this step is to extract and store the significance of each convolution layer. Initially, we map all weights of all the convolution layers to LVRM0. The NN is executed and the accuracy is recorded along with the number of multiplications performed in each convolution layer. The accuracy of the network, using LVRM0 for all the layers, is necessary in order to calculate the layer significance. Additionally, the number of multiplications per layer is useful in cases where the significance of multiple layers is the same. Moving forward, we map the weights of each convolutional layer separately (one at a time) to LVRM2, which is the most aggressive approximate mode and yields higher energy gains. We record the accuracy achieved while a whole layer (L) was approximated and we calculate the significance (S) of this layer, using the metric

$$S_L = \frac{ACC_{\text{all layers} \to LVRM0} - ACC_{L \to LVRM2}}{ACC_{\text{all layers} \to LVRM0}}. \tag{13}$$

Layers with a low significance value are not considered important, as they do not affect the accuracy of the NN. When the significance of all the convolution layers is extracted, we sort them based on the calculated values in ascending order. In case multiple layers demonstrate the same significance value, the tie is broken by considering the number of multiplications in the layer. Layers with fewer multiplications are considered more significant. As an example, Figure 6 shows the accuracy of each separate convolutional layer for ResNet-20 and ResNet-56 under CIFAR-10, while mapped under the LVRM1 and LVRM2 approximate modes. We can see that some layers are more significant than others (e.g., layer 7 of ResNet-20 and layer 21 of ResNet-56), remarkably affecting the accuracy of the NN. The last point in the x-axis corresponds to the case where all layers are approximated.

Fig. 6. Example on the ResNet-20 and ResNet-56 NNs of how much accuracy is affected when all the multiplications inside a layer are performed using LVRM1 or LVRM2.
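A minimal sketch of Step 1, assuming a hypothetical evaluate(lvrm2_layers) helper that runs the quantized inference with the listed layers fully mapped to LVRM2 and returns the achieved accuracy:

```python
def rank_layers(num_layers, mults_per_layer, evaluate):
    """Step 1: significance per Equation 13, sorted in ascending order; on ties,
    the layer with fewer multiplications is treated as more significant."""
    acc_exact = evaluate(lvrm2_layers=[])            # everything on LVRM0
    sig = {}
    for layer in range(num_layers):
        acc = evaluate(lvrm2_layers=[layer])         # only this layer on LVRM2
        sig[layer] = (acc_exact - acc) / acc_exact   # Equation 13
    # secondary key: more multiplications sorts earlier (= less significant)
    return sorted(sig, key=lambda l: (sig[l], -mults_per_layer[l]))
```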
Step 2 – Map Entire Convolution Layers to LVRM2: This step aims at mapping multiple layers at the same time to LVRM2. The reasoning behind this design choice is that layers of lower significance can potentially be configured entirely to the most approximate mode, thus yielding high energy gain without impeding high accuracy. Starting from the least significant layer, we map the next more significant one from our list to LVRM2. It is important to mention that at this point we update the biases of the filters in order to compensate the error induced by the employed approximate multiplications, as shown in Equation 4. After recording the achieved accuracy during inference, we check whether the current configuration satisfies the required threshold. If the threshold is met, the current layer is saved as the last layer that can be entirely mapped to LVRM2. In case that, for a convolution layer, the achieved accuracy fails to meet the required threshold, we stop the layer search, since it is expected that by adding more layers to the approximate configuration, accuracy will only be reduced. At the end of this step, we have extracted the most significant layer up to which we can configure entire layers to LVRM2, while satisfying the required threshold.
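Step 2 then amounts to a greedy scan over that ordering; a sketch under the same hypothetical evaluate helper (the bias update of Equation 4 is assumed to be applied inside it):

```python
def layers_fully_on_lvrm2(ordered_layers, evaluate, acc_exact, threshold):
    """Step 2: keep whole layers on LVRM2, least significant first; stop at the
    first layer whose addition violates the accuracy-drop threshold."""
    mapped = []
    for layer in ordered_layers:                 # ascending significance
        if acc_exact - evaluate(lvrm2_layers=mapped + [layer]) > threshold:
            break                                # adding more layers only hurts
        mapped.append(layer)
    return mapped
```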
Step 3 – Map Ranges of Weights per Convolution Layer to LVRM2: The goal of this step is to determine how the weights of each layer will be mapped to the various modes of LVRM, for the layers that have not been entirely mapped to LVRM2 in the previous step. Specifically, we determine which ranges of weights will be mapped to LVRM2 and which will be mapped to LVRM0. The range approach is an important aspect of the proposed methodology. The intuition behind mapping specific ranges of weights per layer to approximate multiplication derives from weight pruning based on magnitude [42], [43]. Weight pruning based on magnitude relies on the concept of removing neurons with weights of small magnitude, close to the value zero. The pruned NN results in a more compact representation without sacrificing significant levels of accuracy. The developed approach takes advantage of the fact that certain weights do not affect the overall accuracy, even if they are removed. Thus, we approximate the multiplication of weights with small magnitude, close to zero, in an attempt to achieve energy gains. In that way, even though approximate multiplications will insert error into the calculations, the overall accuracy will not be significantly impacted. Additionally, if we map a weight to an approximate mode, the more it appears in a layer, the higher the probability of deteriorating the overall accuracy of the NN. The ranges we use depend on the introduced error of LVRM1 and LVRM2. For LVRM2, the range of weights is more conservative, compared to LVRM1, due to the higher error. After the 8-bit quantization and based on the maximum accuracy drop threshold (2.0%) that we set for our experiments (see Section IV), we experimentally derived the weight value ranges to map to LVRM2 as range3 = 0 ± 10,¹ range2 = 0 ± 5, and range1 = 0.

We start exploring configurations using the wider weight range (range3) and we gradually move to narrower ranges (range2, range1) until the accuracy threshold is met. Using LVRM2 on weights outside range3 had a strong effect on the accuracy and for that reason they were omitted. Once the configurations that satisfy the accuracy threshold have been found, the weight mapping is performed and the biases are updated for the respective mapping, as shown in Equation 4. This correction has to be performed every time we update the weight mapping in a particular filter.

¹Initially, the weights had float values in the range [−1, 1]. Thus, the value of 0 depends on the applied quantization. For example, for 8-bit quantization in [−128, 127], 0 = 0₁₀, while for quantization in [0, 255], 0 = 128₁₀.

Step 4 – Map Ranges of Weights per Convolution Layer to LVRM1: The goal of this final step is to find which of the remaining weights, that are still assigned to LVRM0, can be mapped to LVRM1. Similarly to the previous step, we create ranges of weight values. Since LVRM1 introduces smaller error than LVRM2, the NN can tolerate more weights to be approximated per layer. Thus, the range of the weight values that we search in this step is greater. Specifically, given the maximum accuracy drop threshold (2.0%) that we set for our experiments (see Section IV), we have two additional ranges for LVRM1: range5 = 0 ± 30 and range4 = 0 ± 20. We start exploring configurations using a wide range (range5), and we gradually move to narrower ranges until the accuracy threshold is met. Although the ranges in Steps 3 and 4 are overlapping, if a weight is mapped in Step 3, then it is not considered in Step 4. Again, each time a weight mapping is performed, we update the biases based on Equation 4.
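Steps 3 and 4 can be summarized as a single range search, sketched below with the ranges from the text; evaluate(mapping) is again a hypothetical helper that applies a weight-to-mode mapping (with the bias update of Equation 4) and returns the inference accuracy.

```python
LVRM2_RADII = (10, 5, 0)    # range3 = 0 +/- 10, range2 = 0 +/- 5, range1 = 0
LVRM1_RADII = (30, 20)      # range5 = 0 +/- 30, range4 = 0 +/- 20

def widest_feasible_range(radii, mode, mapping, evaluate, acc_exact, threshold):
    """Try the widest weight range first and shrink until the threshold is met.
    mapping: dict (layer, weight) -> mode, pre-filled by the previous steps."""
    for r in radii:
        trial = dict(mapping)
        for (layer, w), m in mapping.items():
            if m == 'LVRM0' and abs(w) <= r:     # near-zero weights tolerate error
                trial[(layer, w)] = mode
        if acc_exact - evaluate(trial) <= threshold:
            return trial                          # widest range that satisfies it
    return mapping                                # otherwise, map nothing extra

# Step 3 calls this with (LVRM2_RADII, 'LVRM2'); Step 4 repeats it with
# (LVRM1_RADII, 'LVRM1') on whatever is still exact.
```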
Our experimental analysis showed that, for all the examined NNs (Section IV), the weights' values are distributed around 0. Hence, our range-based approach enables identifying a large number of weights to be assigned to an approximate mode and thus boosts the energy savings. Nevertheless, by just increasing the size of the examined ranges we can also cover cases where the weights are not concentrated around 0. Finally, note that our proposed framework (Steps 1-4) needs to be executed only once, at design time. After Step 4, for each weight at each layer the corresponding accuracy level of the approximate reconfigurable multiplier (e.g., LVRM0, LVRM1, or LVRM2) is extracted and the biases are updated according to Equation 4. Then, during run-time inference, the extracted accuracy level is selected for each approximate multiplier of the NN accelerator.

C. Run-Time Reconfiguration

Selecting the mode of operation of LVRM is seamlessly performed at run-time by setting its 2-bit input control signal accordingly. When the control signal is set to 2'b00, 2'b01, or 2'b11, LVRM0, LVRM1, or LVRM2 is selected, respectively. In this work, we follow a weight-oriented accuracy mode selection. This approach is based on the facts that (i) the NN weights are known a priori (after quantizing the NN) and (ii) in contrast to the activations, the weights remain constant in the MAC array for several cycles, thus we rarely need to reconfigure the selected approximate mode of the LVRM.

Setting the operating mode of each LVRM can be performed in two ways: (i) the control signal associated with each weight (as extracted by our framework) can be stored in memory along with the weight value, or (ii) the control signal can be generated in the accelerator as the weights are fetched from memory.

Fig. 7. Runtime generation of the control signal required to set the approximation mode of LVRM.

In the first approach, the model's size increases by 25%, since 10 bits instead of 8 are required to store each weight. Moreover, an additional 2-bit register is required in each MAC unit to store the control signal. Considering that each MAC in the systolic array requires two 8-bit registers for the weights and activations and a 32-bit register for the accumulated sum [3], [45], storing the control signal increases the number of required registers only by 4%. Moreover, the control signal rarely changes, since it might change only when a new filter is loaded in the MAC array (i.e., when new weights are loaded). Hence, the energy overhead of the additional registers is minimal. Although this approach increases the model's size, it does not require any modifications in the convolution's dataflow, since the control signal is loaded from memory along with the weight value.

In the second approach, a control unit uses the ranges extracted by our weight-mapping framework and determines the approximate mode for each weight. A 32-bit identifier is required, for each convolution layer, to generate the respective control signal for each weight. Assuming that the range extracted for LVRM2 is [L2S, L2E] and the range extracted for LVRM1 is [L1S, L1E], then the identifier will be in the form {L2E, L2S, L1E, L1S}. If, for example, an entire layer is mapped to LVRM2, then the range is [0, 255]. If a weight belongs in the LVRM2 range, the generated control signal will be 2'b11, while if it belongs in the LVRM1 range it will be 2'b01. If it does not belong in any of these ranges, the control signal will be 2'b00, i.e., exact mode. If a weight belongs in both ranges, LVRM2 is selected, since in our framework LVRM2 precedes LVRM1. The control unit that generates the control signal is illustrated in Figure 7. The control unit consists of four 8-bit comparators, two AND gates, and one OR gate. Similar to the first approach, for each MAC unit, an additional 2-bit register is required to store the control signal. The control units need to be placed between the memory and the accelerator. As the weights are fetched from the memory, they pass through the control units, and then they are loaded to the accelerator in the form of the 10-bit signal {control signal, weight}. In Figure 7, we present an example of integrating the required control units in the Google TPU microarchitecture [3]. For an M × M systolic MAC array, only M control units are required, since the weights are loaded one tile (i.e., M weights) per cycle. In TPUs, the weights are fetched from the off-chip memory and are stored in an on-chip FIFO. Then, the weights are loaded from the FIFO to the MAC array using double buffering to overlap the weight loading with the computations [3]. In such a microarchitecture, the control units are placed between the FIFO and the MAC array.

The TPU follows a weight-stationary approach, in which the weights remain constant in the MAC array for several (even thousands of) cycles [3]. Hence, the energy overhead of the control units is negligible, since they are activated only when new weights are loaded on the MAC array. The rest of the time they are clock-gated. The delay of the control units is lower than the delay of the MAC unit and thus, the control units do not affect the operating frequency and latency of the MAC array. This approach does not affect the model's size. The only modification with respect to the convolution's dataflow is that it requires transferring to the accelerator a 32-bit identifier at the beginning of each convolution layer.
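The identifier and the control unit's decision are easy to model in software; the sketch below follows the {L2E, L2S, L1E, L1S} format of the text, although the exact bit placement inside the 32-bit word is our assumption.

```python
def pack_identifier(l2s, l2e, l1s, l1e):
    """Pack the four 8-bit range bounds into the per-layer 32-bit identifier."""
    return (l2e << 24) | (l2s << 16) | (l1e << 8) | l1s

def control_signal(weight, ident):
    """Mimic the control unit of Figure 7 (four 8-bit comparators plus gating).
    weight: 8-bit quantized value in [0, 255]; LVRM2 precedes LVRM1."""
    l2e, l2s = (ident >> 24) & 0xFF, (ident >> 16) & 0xFF
    l1e, l1s = (ident >> 8) & 0xFF, ident & 0xFF
    if l2s <= weight <= l2e:
        return 0b11          # LVRM2
    if l1s <= weight <= l1e:
        return 0b01          # LVRM1
    return 0b00              # LVRM0, exact mode
```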
We implement in Verilog RTL the double-buffered 64 × 64 systolic MAC array of the Edge TPU [4], as well as its approximate counterpart that uses the proposed LVRM, 10-bit registers for the weights and control signals, and the 64 required control units. Both designs are synthesized using DesignCompiler, the compile_ultra command, and are mapped to the 7nm FinFET technology library [41]. Overall, the area overhead induced by our approach is only 3%.

To evaluate the area overhead of our approach, we considered the state-of-the-art Edge TPU. However, our work is independent of the NN accelerator microarchitecture. As described above, the proposed error correction through bias update is applied within the NN model itself. LVRM can replace any exact multiplier, since it has a similar regular structure and requires only an additional 2-bit input signal. Finally, our weight-oriented mapping framework considers only the weight's value to select the approximate mode. Therefore, our work can be seamlessly applied to any NN accelerator.

IV. EXPERIMENTAL EVALUATION

In this section we experimentally evaluate the accuracy and energy efficiency of the proposed approach. To this end, we consider seven NNs trained on four different datasets. Additionally, we provide a comparative study against state-of-the-art techniques that follow different approximation approaches, i.e., fixed approximation, layer-wise approximation, and/or MRE-based approximate multipliers.

A. Benchmarks

In our experimental analysis, we perform a thorough evaluation considering several NNs as well as state-of-the-art approximate methods. The NNs we examined are listed in Table I and their implementation was based on official repositories [46], [47]. Additionally, Table I lists the number of convolution layers as well as the number of multiplications in each NN. All the NNs are trained on four datasets: CIFAR-10 [24], CIFAR-100 [24], GTSRB [25], and LISA [26]. In total, 28 different NN models are used in our evaluation.

TABLE I: EXAMINED NEURAL NETWORKS

To demonstrate the efficiency of the developed LVRM and the effectiveness of the proposed fine-grain weight mapping, the following approximation methods are evaluated and compared:
(M1) Exact: This method is the baseline of our experiments, utilizing exact multipliers in all layers;
(M2) ALWANN [23]: This method utilizes the fixed approximate multipliers of the state-of-the-art library EvoApproxLib [30] and applies weight-tuning to minimize the MAE incurred by using approximate multipliers;
(M3) RETSINA [29]: This method follows a layer-wise approximation (i.e., the multiplications of each convolution layer are performed at the same accuracy level, but different layers might use a different accuracy level) utilizing MRE-based approximate reconfigurable multipliers with three modes: exact, 0.5% MRE, and 1.5% MRE;
(M4) LVRM layer-wise w/ bias: This method utilizes the presented LVRM approximate multiplier and applies the proposed error correction through bias update; however, it follows the layer-oriented weight mapping used in [29];
(M5) LVRM fine weight mapping w/o bias: This method utilizes LVRM and the proposed fine-grain weight mapping, without, however, the proposed bias update;
(M6) LVRM fine weight mapping w/ bias: This is our proposed method that utilizes LVRM, fine-grain weight mapping, and bias update.

Specifically, comparing M6 with M2 and M3 evaluates the efficiency of our approach against the state-of-the-art that applies fixed and layer-wise approximation, respectively. Comparison between M6 and M5 evaluates the benefits delivered by the proposed error correction approach. Finally, M4 and M3 follow the same mapping method used in [29] (i.e., layer-wise approximation). However, M4 uses our low-variance reconfigurable multiplier while M3 uses a low-MRE multiplier. Therefore, comparing M4 against M3 demonstrates the efficiency delivered by designing approximate multipliers with low variance when targeting NN inference.

Note that, since different layers feature different significance, in ALWANN [23] a heterogeneous architecture is proposed that comprises several fixed approximate multiplier types of [30]. In order to achieve reconfiguration at run-time, each convolution layer uses a specific multiplier type and the rest are power-gated. However, this approach leads to highly increased area and to high loss in throughput due to the underutilized (power-gated) hardware. Moreover, to achieve high accuracy, a different architecture (i.e., selection of approximate multiplier types) is generated for each considered NN. On the other hand, in this work, we target a generic homogeneous inference accelerator (such as [3] and [2]). This enables us to achieve high throughput [3] and to be NN independent, i.e., any convolutional NN can be executed on the proposed approximate reconfigurable accelerator. Therefore, for the fairness of the evaluation, we consider a homogeneous architecture for ALWANN [23] and thus, the same approximate multiplier type is used in all the convolution layers. Nevertheless, we consider all the 32 Pareto-optimal approximate multipliers of [30]. Finally, note that for our analysis, we also evaluated the novel alphabet multipliers proposed in [20]. However, they require NN retraining. For the fairness of the comparisons we did not perform retraining and thus alphabet [20] delivered poor accuracy results, below the examined thresholds. Hence, it is not included in our analysis.


Fig. 8. Energy savings comparison of different methods that utilize approximate multipliers for the CIFAR-10 dataset.

Fig. 9. Energy savings comparison of different methods that utilize approximate multipliers for the CIFAR-100 dataset.

B. Experimental Setup

1) Accuracy Evaluation: Initially, all the examined NNs were trained on the aforementioned datasets using the Tensorflow machine learning library. Then, the NNs were frozen and quantized to 8-bit. In order to evaluate the accuracy of the examined approximate methods, i.e., M2-M6, all the approximate multipliers were described in C. Note that all the evaluated approximate multipliers (LVRM, [29], [30]) apply functional approximation and thus, they can be seamlessly represented in C. Next, we overrode the convolution layers in Tensorflow and replaced the exact multiplications with the respective approximate ones [23]. Finally, for each quantized model and approximate method, we executed the inference phase and captured the delivered accuracy.

2) Energy Evaluation: In terms of energy savings, we evaluated the energy gains originating from the multiplication operations of the convolution layers. The energy required for the multiplications of a layer is estimated by the number of multiplications in that layer multiplied by the average energy of the used multiplier [23], [29]. As the baseline energy consumption, we consider the energy consumed when using the exact multiplier. For the approximate reconfigurable multipliers (i.e., LVRM and [29]), we estimate the total energy consumption by the sum of the average energy of each accuracy level multiplied by the number of multiplications performed in the respective accuracy level. To capture the average energy consumption of the examined approximate multipliers (i.e., LVRM, [29], [30]), the same procedure as in Section III-A is followed. All the multipliers are described in Verilog RTL, they are synthesized at the critical path delay of the exact multiplier [30] (i.e., same performance) using DesignCompiler and the compile_ultra command, and are mapped to the 7nm technology library [41]. Post-synthesis timing simulations are performed using QuestaSim and 500,000 randomly generated inputs. The obtained switching activity is fed to PrimeTime to perform the power analysis.
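This energy model reduces to a weighted sum over the accuracy levels; in the sketch below, the per-multiplication energies are placeholders scaled to the average reductions reported in Section III-A (18% for LVRM1, 22% for LVRM2), not measured values.

```python
def multiplication_energy(layers, avg_energy):
    """layers: per layer, a dict mode -> number of multiplications in that mode.
    Returns (total energy, savings vs. an all-exact baseline)."""
    total = sum(n * avg_energy[mode]
                for counts in layers for mode, n in counts.items())
    n_mults = sum(sum(counts.values()) for counts in layers)
    baseline = n_mults * avg_energy['LVRM0']
    return total, 1.0 - total / baseline

# placeholder per-multiplication energies, normalized to the exact multiplier
avg_energy = {'LVRM0': 1.00, 'LVRM1': 0.82, 'LVRM2': 0.78}
layers = [{'LVRM0': 120_000, 'LVRM1': 300_000, 'LVRM2': 580_000}]  # toy counts
print(multiplication_energy(layers, avg_energy))
```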
uated the energy gains originated by the multiplication oper- Figure 8 compares different configurations for all the
ations of the convolution layers. The energy required for selected NNs on the CIFAR-10 dataset. As an overall
the multiplications of a layer is estimated by the number observation, we see that the proposed methodology always
of multiplications in that layer multiplied by the average produces configurations with the highest energy gain, with
energy of the used multiplier [23], [29]. As the baseline an average gain of 17.7%, within the examined accuracy
energy consumption, we consider the energy consumed when loss margins. Specifically for the ResNet-20, the proposed
using the exact multiplier. For the approximate reconfigurable framework achieves up to 17.5% energy gain for an accuracy
multipliers (i.e., LVRM and [29]) we estimate the total energy loss of only 1.7%, while this gain increases to 18.6% for
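As a sanity check of this bookkeeping, the sketch below computes a layer's multiplication energy and the resulting savings exactly as defined above. The per-level energy values are made-up placeholders standing in for the measured PrimeTime numbers.

```python
# Hypothetical average energies per multiplication (pJ); in the paper these
# values come from the post-synthesis PrimeTime power analysis described above.
E_EXACT = 1.00
E_LEVEL = {"exact": 1.00, "low": 0.85, "mid": 0.70, "high": 0.55}

def layer_energy(mults_per_level: dict) -> float:
    # Reconfigurable multiplier: sum over accuracy levels of
    # (multiplication count in that level) x (average energy of that level).
    return sum(count * E_LEVEL[level] for level, count in mults_per_level.items())

def energy_savings(layers: list) -> float:
    # Baseline: every multiplication executed on the exact multiplier.
    approx = sum(layer_energy(layer) for layer in layers)
    total_mults = sum(sum(layer.values()) for layer in layers)
    return 1.0 - approx / (total_mults * E_EXACT)

# Example: one layer mapping 60% of its multiplications to the 'mid' level.
print(energy_savings([{"exact": 40_000, "mid": 60_000}]))  # -> 0.18
```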

Fig. 10. Energy savings comparison of different methods that utilize approximate multipliers for the GTSRB dataset.

Fig. 11. Energy savings comparison of different methods that utilize approximate multipliers for the LISA dataset.

C. Results

Figures 8-11 depict the results of our experimental evaluation in terms of energy savings for the multiplication operations for all the examined NN models and approximate methods. In our framework, we selected three values for the accuracy drop threshold {0.5%, 1.0%, 2.0%} and we compare our energy savings against the other methods whose solutions also satisfied the same thresholds. These values were selected since we target high inference accuracy. Note that for M2 [23], in each case, we present the highest energy reduction achieved by an approximate multiplier of [30]. Hence, the reported energy gains are not delivered using a single approximate multiplier.

Figure 8 compares different configurations for all the selected NNs on the CIFAR-10 dataset. As an overall observation, we see that the proposed methodology always produces configurations with the highest energy gain, with an average gain of 17.7%, within the examined accuracy loss margins. Specifically, for ResNet-20 the proposed framework achieves up to 17.5% energy gain for an accuracy loss of only 1.7%, while this gain increases to 18.6% for ResNet-32.
Similarly, for ResNet-44 and ResNet-56, the proposed approach finds solutions that achieve the highest energy gains (16.1% and 17% on average, respectively), while respecting the accuracy drop thresholds. For Mobilenet-v2 (Figure 8(e)), the proposed framework achieves up to 19.3% energy gain. It is noteworthy that, for this NN on CIFAR-10, all methods that utilize the LVRM approximate multipliers have increased energy savings. Finally, for VGG-11 and VGG-13 the corresponding energy savings are 17.4% and 18.1%. The configuration with the closest average energy gain, for all the examined accuracy loss thresholds and NNs, comes from the M4 method with an average gain of 14.5%. This shows that our error correction mechanism utilizing the bias lets us map more weights to approximate multipliers. Interestingly, M2 [23] and M3 [29] have the smallest energy gains. M3 follows a layer-wise approach and uses a reconfigurable multiplier optimized for MRE. Thus, only a small number of layers could be mapped to an approximate mode without violating the desired accuracy loss threshold. Similarly, M2 applies fixed approximation, and thus, only the multipliers of [30] that feature almost negligible MRE satisfy the accuracy threshold, limiting the energy gains. These results are consistent with our evaluation in Figure 1.

Figure 9 compares different configurations for all the selected NNs on the CIFAR-100 dataset. As an overall observation, we see again that the proposed methodology always produces configurations with the highest energy gain, with an average gain of 17.7%, within the examined accuracy loss margins. Specifically, the maximum energy gain achieved by our method (M6) was 19.1%, recorded for Mobilenet-v2, while respecting the accuracy drop thresholds. Additionally, the behavior of all NNs is similar to that in the previous dataset. Specifically, all methods that utilize the LVRM approximate multipliers have increased energy savings, with M4 being the second best, with an average energy gain of 13.9% across all NNs and accuracy thresholds. Interestingly, M5 again cannot achieve significant energy savings for ResNet-56 under the 0.5% accuracy threshold (Figure 9(d)). This happens because, even though the fine-grain weight mapping allows for better exploration of approximate modes, the bias-based error correction that we propose is important for such a small accuracy drop threshold. In this dataset as well, M2 [23] and M3 [29] again have the smallest energy gains, verifying that our choice to follow a fine-grain weight mapping combined with a low-variance approximate reconfigurable multiplier further reduces the energy consumption.

Figure 10 compares different configurations for all the selected NNs on the GTSRB dataset. The proposed methodology still remains the best, with an average energy gain of 16.6%. For this dataset, M4 is again the second best choice, with an average gain for all NNs of 15.6%. This behavior verifies our strategy for error correction using the bias. Again, M2 [23] and M3 [29] have the smallest energy gains.

Last, Figure 11 presents the results for the LISA dataset. The behavior of all approaches is similar to the other three datasets, with our method achieving the highest energy gains of 20.2% on average. Interestingly, M5 does not perform very well for ResNet-56 for the 0.5% and 1.0% accuracy drop thresholds (Figure 11(d)). This happens because the accuracy drop threshold is small and, since M5 does not utilize the proposed error correction with bias method, the weight-mapping solution is limited in how many weights can be mapped to approximate modes, even though a fine-grain approach is followed. This is also verified by the behavior of M4, which remains the second best with an average gain, for all NNs, of 19%.


In Figures 8-11 we evaluate our method (M6) against 7 NNs trained for 4 different datasets. Although the obtained NN models exhibit varying characteristics, Figures 8-11 demonstrate that our method features a consistent behavior with respect to delivering high energy gains while satisfying tight accuracy constraints, always outperforming the methods M2-M5. As an overall comparison of the 5 methods, we provide Table II. This table depicts the average energy gain for each method, across all the examined NNs in the experiments on all the datasets. We provide the average energy gain, per methodology, for the 0.5% and 2.0% accuracy loss thresholds.

TABLE II
AVERAGE ENERGY GAIN COMPARISON

Overall, based on the performed evaluations, we reach the following conclusions: (1) even though both M3 [29] and M4 are layer-oriented methods, the latter delivers better solutions due to the utilization of LVRM (low variance) and the error correction through the bias update; (2) the proposed method (M6) is better than M4 ("LVRM layer-wise w/ Bias") as the layer-oriented approaches are very coarse-grain and lose optimality; (3) the proposed method is better than M5 ("LVRM fine weight mapping w/o Bias") as it was able to identify different configurations that satisfy the accuracy thresholds with lower energy by taking advantage of the bias correction method, thus allowing more approximate multiplications; and (4) even for deep NNs, where the effect of approximate multiplications is difficult to quantify, the proposed method achieved considerable energy gains, for example 17.2% on average for the ResNet-56 NN. The proposed fine-grained weight-oriented approach is not affected by the NN size and efficiently identifies the proper approximations (weight-to-approximate-mode mapping). The latter is also verified by the fact that, for the tight accuracy loss thresholds examined, it delivers more consistent results than the other methods, as it features the highest energy gains along with the lowest energy reduction variance.
D. A MAC Array Use Case

Considering that millions of multiplications are performed by an NN accelerator (Table I), in Figures 8-11 we evaluated the energy savings that originate from the multiplication operations. Hereafter, we investigate how the energy gains from the multiplications translate to energy gains with respect to the energy consumption of the entire systolic MAC array. For our analysis, we consider the MAC array we designed and synthesized in Section III-C to evaluate the area overhead of our approach, i.e., a double-buffered 64 × 64 MAC array as in the Edge TPU [4]. Our approximate MAC array comprises the necessary control units that generate the required control signals (see Section III-C). Hence, our method does not modify the model's size and consequently does not affect the memory energy consumption. Therefore, in our analysis, we evaluate the energy savings of the MAC array operation. Targeting high accuracy, we simulate only the approximate configurations that result in 0.5% accuracy loss.

We run post-synthesis timing simulations of the entire 64 × 64 MAC array to accurately obtain the switching activity and perform a precise power analysis. For the simulation, we set the operating frequency to 500MHz to achieve the same raw throughput as the Edge TPU. However, performing timing simulations with QuestaSim for the entire MAC array is a very time-consuming procedure. For example, running a post-synthesis timing simulation for ResNet-20 (i.e., the smallest NN examined) with only one 32 × 32 input image required about 20 hours. We used SCALE-Sim, a cycle-accurate systolic CNN accelerator simulator from ARM [48], to estimate the simulation cycles required for each of the examined NNs. We configured SCALE-Sim with 8MB SRAM and a 64 × 64 MAC array (as in the Edge TPU). For the inference of only one 32 × 32 pixels image, SCALE-Sim reported 40K cycles for ResNet-20 and 122K cycles for ResNet-56. The VGGs required more than 1M cycles. As a result, considering that the ResNet-20 simulation for one image required 20 hours, performing post-synthesis timing simulations on the entire test datasets (10K images for CIFAR-10 and CIFAR-100, 12K images for GTSRB, and 1.2K images for LISA) is infeasible. For the same reason, we evaluate the energy reduction, at MAC array level, only for the ResNets and MobileNet but not for the VGGs.
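The scale of that infeasibility is easy to reproduce from the numbers just quoted; the short sketch below extrapolates the measured 20 hours per ResNet-20 image to the full test sets (a rough linear extrapolation that ignores per-network cycle differences).

```python
SIM_HOURS_PER_IMAGE = 20  # measured post-synthesis simulation time for one ResNet-20 image
TEST_SET_SIZES = {"CIFAR-10": 10_000, "CIFAR-100": 10_000,
                  "GTSRB": 12_000, "LISA": 1_200}

for dataset, num_images in TEST_SET_SIZES.items():
    years = SIM_HOURS_PER_IMAGE * num_images / (24 * 365)
    print(f"{dataset}: ~{years:.1f} years of simulation")  # e.g., CIFAR-10: ~22.8 years
```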
For the examined NNs and all the datasets, the average energy reduction delivered by our method (M6) for one inference is 11%. Similarly, the average energy reduction delivered by M3 [29] is 2%. Note that, for M2 [23], a single approximate multiplier that satisfies the 0.5% accuracy loss threshold does not exist for all the examined NNs. For this reason, M2 [23] is not considered in this analysis. This further highlights the need for adaptive approximation, since supporting also exact operations ensures that the accuracy loss constraints will always be satisfied. As expected, the percentage energy reduction at MAC array level is lower compared to the percentage energy reduction at the multiplication level. The latter is due to the existence of the registers and adders in the MAC array, which also consume energy. However, since the multiplier is the most energy-consuming component of the MAC array and since millions of multiplications are required, the delivered energy reduction at MAC array level remains high.
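A first-order model makes this dilution explicit: if the multiplier draws a fraction f of the MAC unit's energy and the multiplications alone save a fraction s, the array-level saving is roughly f · s. The multiplier energy fraction used below is an assumption for illustration, not a measured value.

```python
def array_level_saving(mult_saving: float, mult_energy_fraction: float) -> float:
    # Registers and adders (the remaining 1 - f of the energy) are unaffected,
    # so only the multiplier's share of the MAC energy scales down.
    return mult_energy_fraction * mult_saving

# Assuming the multiplier accounts for ~60% of the MAC unit's energy,
# an 18% saving on the multiplications shrinks to about 11% at array
# level, in line with the averages reported above:
print(f"{array_level_saving(0.18, 0.60):.1%}")  # -> 10.8%
```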

Fig. 12. Execution time for the proposed methodology.

E. Execution Time Discussion

In this subsection, we evaluate the execution time of the proposed methodology. As aforementioned, one of our main focuses is to provide a time-efficient framework that does not require NN re-training. In Figure 12, we present the average execution time for the NNs we used on the CIFAR-100 dataset. The experiments were conducted on a dual Xeon Gold 6138 processor, which runs at a 2.00 GHz base frequency and features 40 cores and 80 threads. Additionally, the system memory is 128 GB. The proposed methodology can be efficiently parallelized on multicore systems by dividing the steps of the exploration among multiple threads. Each thread can explore configurations for a fraction of the total layers. Furthermore, the significance metric is determined only once per NN-dataset combination, irrespective of the number of thresholds that we want to satisfy. The exploration for each accuracy threshold that we need to meet can take place concurrently on multicore systems, thus yielding very quick results.
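A minimal sketch of this parallelization pattern is shown below, using Python's standard concurrent.futures; the per-layer exploration body and its arguments are hypothetical placeholders, not the framework's actual code.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def explore_layer(args):
    """Placeholder for the per-layer step: search the weight-to-mode
    mappings of one layer under one accuracy drop threshold (elided)."""
    layer_id, threshold = args
    return {"layer": layer_id, "threshold": threshold, "mapping": None}

def run_exploration(num_layers=56, thresholds=(0.005, 0.01, 0.02)):
    # Layers and accuracy thresholds are independent work items, so the
    # whole exploration farms out across the 80 available hardware threads.
    jobs = list(product(range(num_layers), thresholds))
    with ProcessPoolExecutor(max_workers=80) as pool:
        return list(pool.map(explore_layer, jobs))

if __name__ == "__main__":
    results = run_exploration()
    print(len(results), "layer/threshold explorations completed")
```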

As shown in Figure 12, the proposed framework required up to 2 hours (for ResNet-56). Execution time of this scale can be considered negligible compared to the longer time required by other methods. As reported in [23], ALWANN required 7.5 days (for ResNet-50), 90× more than our method. Additionally, regarding ResNet-56, RETSINA [29] required 4.5 hours and the retraining procedure as in [20], [21] took 11 hours, thus making our method 2.25× and 5.5× faster, respectively.

V. CONCLUSION

In this article, we presented a bottom-up design methodology for enabling adaptive and run-time approximation for NN accelerators (e.g., MAC arrays) during inference. The proposed framework utilizes NN-oriented low-variance approximate multipliers, which support multiple approximation modes, in order to employ a weight-oriented fine-grain weight mapping. As a result, we are able to significantly reduce energy consumption while satisfying tight accuracy requirements during inference. Experimental results on multiple NNs and different datasets show that the proposed weight-oriented approximation method, exploiting our reconfigurable multipliers, always achieves the lowest energy consumption compared to other state-of-the-art approaches that apply fixed or layer-wise approximation.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[2] J. Song et al., "An 11.5TOPS/W 1024-MAC butterfly structure dual-core sparsity-aware neural processing unit in 8nm flagship mobile SoC," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 130-132.
[3] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. Annu. Int. Symp. Comput. Archit., 2017, pp. 1-12.
[4] S. Cass, "Taking AI to the edge: Google's TPU now comes in a maker-friendly package," IEEE Spectr., vol. 56, no. 5, pp. 16-17, May 2019.
[5] J. Han and M. Orshansky, "Approximate computing: An emerging paradigm for energy-efficient design," in Proc. 18th IEEE Eur. Test Symp. (ETS), May 2013, pp. 1-6.
[6] F. Ebrahimi-Azandaryani, O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Block-based carry speculative approximate adder for energy-efficient applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 1, pp. 137-141, Jan. 2020.
[7] D. Esposito, D. De Caro, E. Napoli, N. Petra, and A. G. M. Strollo, "Variable latency speculative Han-Carlson adder," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 5, pp. 1353-1361, May 2015.
[8] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximate adders," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124-137, Jan. 2013.
[9] H. Jiang, C. Liu, L. Liu, F. Lombardi, and J. Han, "A review, classification, and comparative evaluation of approximate arithmetic circuits," ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 4, pp. 1-34, Aug. 2017.
[10] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, "Design-efficient approximate multiplication circuits through partial product perforation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 10, pp. 3105-3117, Oct. 2016.
[11] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, "Approximate multipliers based on new approximate compressors," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4169-4182, Dec. 2018.
[12] W. Liu, J. Xu, D. Wang, C. Wang, P. Montuschi, and F. Lombardi, "Design and evaluation of approximate logarithmic multipliers for low power error-tolerant applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 9, pp. 2856-2868, Sep. 2018.
[13] M. Pashaeifar, M. Kamal, A. Afzali-Kusha, and M. Pedram, "A theoretical framework for quality estimation and optimization of DSP applications using low-power approximate adders," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 1, pp. 327-340, Jan. 2019.
[14] L. B. Soares, M. M. A. da Rosa, C. M. Diniz, E. A. C. da Costa, and S. Bampi, "Design methodology to explore hybrid approximate adders for energy-efficient image and video processing accelerators," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 6, pp. 2137-2150, Jun. 2019.
[15] A. Raha and V. Raghunathan, "Towards full-system energy-accuracy tradeoffs: A case study of an approximate smart camera system," in Proc. 54th Annu. Design Autom. Conf., Jun. 2017, pp. 1-6.
[16] G. Zervakis, K. Koliogeorgi, D. Anagnostos, N. Zompakis, and K. Siozios, "VADER: Voltage-driven netlist pruning for cross-layer approximate arithmetic circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 6, pp. 1460-1464, Jun. 2019.
[17] I. Scarabottolo, G. Ansaloni, G. A. Constantinides, and L. Pozzi, "Partition and propagate: An error derivation algorithm for the design of approximate circuits," in Proc. 56th Annu. Design Autom. Conf., Jun. 2019, pp. 1-6.
[18] J. Schlachter, V. Camus, K. V. Palem, and C. Enz, "Design and applications of approximate circuits by gate-level pruning," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 5, pp. 1694-1702, May 2017.
[19] G. Zervakis, S. Xydis, D. Soudris, and K. Pekmestzi, "Multi-level approximate accelerator synthesis under voltage island constraints," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 66, no. 4, pp. 607-611, Apr. 2019.
[20] S. S. Sarwar, S. Venkataramani, A. Ankit, A. Raghunathan, and K. Roy, "Energy-efficient neural computing with approximate multipliers," ACM J. Emerg. Technol. Comput. Syst., vol. 14, no. 2, pp. 1-23, Jul. 2018.
[21] V. Mrazek, S. S. Sarwar, L. Sekanina, Z. Vasicek, and K. Roy, "Design of power-efficient approximate multipliers for approximate artificial neural networks," in Proc. 35th Int. Conf. Comput.-Aided Design, Nov. 2016, pp. 1-7.
[22] M. A. Hanif, R. Hafiz, and M. Shafique, "Error resilience analysis for systematically employing approximate computing in convolutional neural networks," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2018, pp. 913-916.
[23] V. Mrazek, Z. Vasicek, L. Sekanina, M. A. Hanif, and M. Shafique, "ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2019, pp. 1-8.
[24] A. Krizhevsky, "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, ON, Canada, Tech. Rep. 4, 2009.
[25] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition," Neural Netw., vol. 32, pp. 323-332, Aug. 2012.
[26] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund, "Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 4, pp. 1484-1497, Dec. 2012.
[27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510-4520.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[29] G. Zervakis, H. Amrouch, and J. Henkel, "Design automation of approximate circuits with runtime reconfigurable accuracy," IEEE Access, vol. 8, pp. 53522-53538, 2020.


[30] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, "EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 258-261.
[31] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu, "On reconfiguration-oriented approximate adder design and its application," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2013, pp. 48-54.
[32] S. Jain, S. Venkataramani, and A. Raghunathan, "Approximation through logic isolation for the design of quality configurable circuits," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), 2016, pp. 612-617.
[33] X. Jiao, V. Akhlaghi, Y. Jiang, and R. K. Gupta, "Energy-efficient neural networks using approximate computation reuse," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2018, pp. 1223-1228.
[34] A. Raha, H. Jayakumar, and V. Raghunathan, "Input-based dynamic reconfiguration of approximate arithmetic units for video encoding," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 3, pp. 846-857, Mar. 2016.
[35] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "RAP-CLA: A reconfigurable approximate carry look-ahead adder," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 8, pp. 1089-1093, Aug. 2018.
[36] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Dual-quality 4:2 compressors for utilizing in dynamic accuracy configurable multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 4, pp. 1352-1361, Apr. 2017.
[37] V. Mrazek, Z. Vasicek, and L. Sekanina, "Design of quality-configurable approximate multipliers suitable for dynamic environment," in Proc. NASA/ESA Conf. Adapt. Hardw. Syst. (AHS), Aug. 2018, pp. 264-271.
[38] B. Boroujerdian, H. Amrouch, J. Henkel, and A. Gerstlauer, "Trading off temperature guardbands via adaptive approximations," in Proc. IEEE 36th Int. Conf. Comput. Design (ICCD), Oct. 2018, pp. 202-209.
[39] M. A. Hanif, F. Khalid, and M. Shafique, "CANN: Curable approximations for high-performance deep neural network accelerators," in Proc. 56th Annu. Design Autom. Conf., Jun. 2019, pp. 1-6.
[40] C. Li, W. Luo, S. S. Sapatnekar, and J. Hu, "Joint precision optimization and high level synthesis for approximate computing," in Proc. 52nd Annu. Design Autom. Conf. (DAC), 2015, pp. 1-6.
[41] L. T. Clark et al., "ASAP7: A 7-nm finFET predictive process design kit," Microelectron. J., vol. 53, pp. 105-115, Jul. 2016.
[42] A. Renda, J. Frankle, and M. Carbin, "Comparing rewinding and fine-tuning in neural network pruning," 2020, arXiv:2003.02389. [Online]. Available: http://arxiv.org/abs/2003.02389
[43] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1135-1143.
[44] M. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," 2017, arXiv:1710.01878. [Online]. Available: http://arxiv.org/abs/1710.01878
[45] P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi, "Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 11, pp. 5784-5789, Nov. 2018.
[46] Keras. Keras CIFAR-10 ResNet. Accessed: Jul. 24, 2020. [Online]. Available: https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py
[47] Pytorch. VGG-NETS. Accessed: Jul. 24, 2020. [Online]. Available: https://pytorch.org/hub/pytorch_vision_vgg/
[48] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "SCALE-sim: Systolic CNN accelerator simulator," 2018, arXiv:1811.02883. [Online]. Available: http://arxiv.org/abs/1811.02883

Zois-Gerasimos Tasoulas received the Diploma in electrical and computer engineering from the National Technical University of Athens, Greece, in 2016. He is currently pursuing the Ph.D. degree with the Electrical, Computer, and Biomedical Engineering Department, Southern Illinois University Carbondale, IL, USA. His research interests include concurrent data-structures, performance of many-core systems, system reliability, and GPGPU resource management. He was awarded a Doctoral Graduate Fellowship from Southern Illinois University Carbondale for the 2019-2020 academic year.

Georgios Zervakis received the Diploma and Ph.D. degrees from the Department of Electrical and Computer Engineering (ECE), National Technical University of Athens (NTUA), Greece, in 2012 and 2018, respectively. He worked as a primary Researcher in several EU-funded projects as a member of the Institute of Communication and Computer Systems (ICCS), Athens, Greece. He is currently a Research Group Leader of the Chair for Embedded Systems (CES) with the Karlsruhe Institute of Technology (KIT), Germany. His research interests include approximate computing, low-power design, design automation, and integration of hardware acceleration in cloud.

Iraklis Anagnostopoulos (Member, IEEE) received the Ph.D. degree from the Microprocessors and Digital Systems Laboratory, National Technical University of Athens. He is currently an Assistant Professor with the Electrical and Computer Engineering Department, Southern Illinois University Carbondale. He is the Director of the Embedded Systems Software Laboratory, which works on run-time resource management of modern and heterogeneous embedded many-core architectures, and he is also affiliated with the Center for Embedded Systems. His research interests lie in the area of constrained application mapping for many-core systems, design and exploration of heterogeneous platforms, resource contention minimization, and power-aware design of embedded systems.

Hussam Amrouch (Member, IEEE) received the Ph.D. degree (summa cum laude) from KIT in 2015. He is currently a Junior Professor heading the Chair of Semiconductor Test and Reliability (STAR) within the Computer Science, Electrical Engineering Faculty, University of Stuttgart, as well as a Research Group Leader with the Karlsruhe Institute of Technology (KIT), Germany. His main research interests are design for reliability and testing from device physics to systems, machine learning, security, approximate computing, and emerging technologies with a special focus on ferroelectric devices. He holds seven HiPEAC Paper Awards and three best paper nominations at top EDA conferences: DAC'16, DAC'17, and DATE'17 for his work on reliability. He also serves as an Associate Editor for Integration, the VLSI Journal. He has served in the Technical Program Committees of many major EDA conferences, such as DAC, ASP-DAC, and ICCAD, and as a reviewer in many top journals, such as T-ED, TCAS-I, TVLSI, TCAD, and TC. He has around 85 publications in multidisciplinary research areas across the entire computing stack, starting from semiconductor physics to circuit design all the way up to computer-aided design and computer architecture.

Jörg Henkel (Fellow, IEEE) received the Diploma degree and the Ph.D. degree (summa cum laude) from the Technical University of Braunschweig. He was a Research Staff Member with NEC Laboratories, Princeton, NJ, USA. He is currently the Chair Professor of embedded systems with the Karlsruhe Institute of Technology. His research work is focused on co-design for embedded hardware/software systems with respect to power, thermal, and reliability aspects. He has received six best paper awards throughout his career from, among others, ICCAD, ESWeek, and DATE. For two consecutive terms, he served as the Editor-in-Chief for the ACM Transactions on Embedded Computing Systems. He is also the Editor-in-Chief of IEEE Design&Test Magazine. He is/has been an Associate Editor of major ACM and IEEE TRANSACTIONS. He has led several conferences as a General Chair, including ICCAD and ESWeek, and serves as a Steering Committee Chair/Member of leading conferences and journals for embedded and cyber-physical systems. He coordinates the DFG program SPP 1500 Dependable Embedded Systems. He is a Site Coordinator of the DFG TR89 Collaborative Research Center on Invasive Computing. He is the Chairman of the IEEE Computer Society, Germany Chapter.
