
Received 2 September 2024, accepted 21 September 2024, date of publication 25 September 2024, date of current version 15 October 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3467375

Approximate Computing: Concepts, Architectures, Challenges, Applications, and Future Directions
AYAD M. DALLOO 1, AMJAD JALEEL HUMAIDI 2, AMMAR K. AL MHDAWI 3,
AND HAMED AL-RAWESHIDY 4 (Senior Member, IEEE)


1 Department of Communication Engineering, University of Technology, Baghdad 10066, Iraq
2 Department of Control and Systems Engineering, University of Technology, Baghdad 10066, Iraq
3 School of Engineering and Sustainable Development, De Montfort University, LE1 9BH Leicester, U.K.
4 Department of Electronic and Electrical Engineering, Brunel University London, UB8 3PH Uxbridge, U.K.

Corresponding author: Hamed Al-Raweshidy ([email protected])

ABSTRACT The unprecedented progress in computational technologies led to a substantial proliferation of


artificial intelligence applications, notably in the era of big data and IoT devices. In the face of exponential
data growth and complex computations, conventional computing encounters substantial obstacles pertaining
to energy efficiency, computational speed, and area. Due to the diminishing advantages of technology
scaling and increased demands from computing workloads, novel design techniques are required to increase
performance and decrease power consumption. Approximate computing, nowadays considered a promising
paradigm, achieves considerable improvements in overhead cost reduction (i.e., energy, area, and latency) at
the expense of a modest (i.e., still acceptable) deterioration in application accuracy. Therefore, approximate
computing at different levels (data, circuit, architecture, and software) has attracted the research
and industrial communities. This paper presents a comprehensive review of the major research areas of
different levels of approximate computing by exploring their underlying principles, potential benefits, and
associated trade-offs. This is a burgeoning field that seeks to balance computational efficiency with accept-
able accuracy. The paper highlights opportunities where these techniques can be effectively applied, such as
in applications where perfect accuracy is not a strict requirement. This paper presents assessments of applying
approximate computing techniques in various applications, especially machine learning (ML) algorithms
and IoT. Furthermore, this review underscores the challenges encountered in implementing approximate
computing techniques and highlights potential future research avenues. The anticipation is that this survey
will stimulate further discourse and underscore the necessity for continued research and development to fully
exploit the potential of approximate computing.

INDEX TERMS Approximate computing, approximate programming language, approximate memory, circuit-level, approximate machine learning, deep learning, approximate logic synthesis, statistical and neuromorphic computing, cross-layer and end-to-end approximate computing.

I. INTRODUCTION

Since 1974, Moore's law and Dennard scaling have projected that the transistor would become smaller and the transistor density would double, resulting in a 40% increase in clock rate while the power density remained constant with each generation [1]. As transistors shrink with technological advancements, it becomes more costly for designers and manufacturers to maintain transistors that behave deterministically, even under typical operating conditions. Verifying the correct operation of digital integrated circuits is becoming more and more costly as technology scales down. Both intrinsic (such as varying dopant concentrations) and extrinsic (such as temperature) factors are drastically increasing the variability of transistors and interconnects [2], [3], [4], as well as reducing the energy-delay advantages of CMOS scaling. This nondeterministic phenomenon impedes the constant development of technology. According to ITRS and Intel's technical data, at the 8 nm node, the area of dark silicon exceeds 50% of the chip's area [3], [5].

The associate editor coordinating the review of this manuscript and approving it for publication was Hang Shen.

2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License.
For more information, see https://creativecommons.org/licenses/by/4.0/ (IEEE Access, Volume 12, 2024)

FIGURE 1. 42 years of microprocessor trend data [6], [7].
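The idealized scaling arithmetic cited above — transistor density doubling with roughly a 40% clock-rate gain at constant power density — can be checked numerically. The sketch below assumes the textbook Dennard scale factor of S = 0.7 per generation; the value of S and the variable names are illustrative assumptions, not figures taken from this paper.

```python
# Idealized Dennard scaling: each generation shrinks linear dimensions by
# S = 0.7, so area scales by S^2, capacitance and voltage by S, and
# frequency by 1/S. Power density P/A = (C * V^2 * f) / A stays constant.
S = 0.7

density_gain = 1 / S**2                                 # transistors per unit area
freq_gain = 1 / S                                       # clock-rate improvement
power_density_gain = (S * S**2 * (1 / S)) / S**2        # (C * V^2 * f) / A

assert round(density_gain, 2) == 2.04    # ~2x density ("doubling")
assert round(freq_gain, 2) == 1.43       # ~40% faster clocks
assert abs(power_density_gain - 1.0) < 1e-12  # constant power density
```

The assertions confirm the paragraph's claim: under ideal Dennard scaling, density roughly doubles and frequency rises about 40% per generation while power density is unchanged — which is exactly the regime Figure 1 shows ending after 2005.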

In 2007, the rate of Dennard scaling slowed dramatically, and by 2012, it had almost stopped completely [6], [7], as shown in Figure 1. Therefore, the scaling of the threshold and supply voltages slowed down due to concerns with leakage currents resulting from increasing on-chip power density. To prevent the chip from overheating, the clock frequency could no longer be increased [1], [8], [9]. Due to these limitations and the requirements of future applications, novel design techniques are required to handle ever-increasing amounts of data at ever-increasing performance and ever-decreasing power consumption. This is propelling us towards the multi-core era [10]. Despite multi-core system-on-chips (SoCs) aiming to augment throughput while minimizing power consumption, this objective has only been partially achieved due to the inherent challenges in parallelizing certain sequential workloads and existing power constraints [1], [9]. As a consequence, the number of active cores is restricted (a phenomenon termed ''dark silicon''), resulting in only a gradual scaling-up of cores in contemporary SoCs. As a result, thermal design power (TDP) is a limiting factor for multicore CPUs [1], [11], as shown in Figure 2. In the ''dark silicon'' era that followed from the TDP constraint, overheating problems were addressed by reducing processor clock speeds and powering down unused cores [11].

The days of Dennard scaling are over, Amdahl's Law is nearing its end, and keeping up with Moore's Law is becoming difficult and expensive, particularly when the benefits in terms of power and performance begin to diminish [6], [11]. In many computer systems, especially mobile devices, clusters, and server farms, energy efficiency has become a primary design requirement. Saving energy on a mobile phone may lengthen battery life and improve mobility [12].

In the nanometer age, circuits become more sensitive to parameter variations and faults. Reducing the feature size of CMOS technology below 7 nm can lead to deteriorating reliability [13]. This is due to the increased difficulty in controlling and preventing parameter variations and faults at such advanced nanoscales. At these smaller dimensions, physical and quantum effects become more pronounced. All these challenges have changed the dynamics of designing and producing far faster, lower-power circuits, but they have not eliminated the possibilities for achieving them. For instance, manufacturers implement various techniques such as strained silicon, high-k/metal gates, and FinFET structures to tackle challenges like leakage currents, variability, and other reliability concerns. Furthermore, for many applications, error-resilient computations account for roughly 83% of runtime [14], as shown in Figure 3. These challenges compel both the industry and the academic communities to investigate feasible alternatives and strategies for sustaining the conventional scaling of performance and energy efficiency. In an era marked by the explosive growth of data and the increasing complexity of computations, the traditional methods of computing face significant challenges in terms of energy efficiency, computational speed, and resource utilization.

Approximate computing is one of the promising techniques in this trend and has attracted significant traction from both academic and industry communities [15], [16]. Major corporations such as IBM, Google, Intel, and ARM are actively engaged in pioneering research and the development of commercial offerings that incorporate approximate computing strategies. An illustrative case is Google's Tensor Processing Units (TPUs), which employ an approximate computing technique known as reduced precision to lower energy usage [17]. Parallel paradigms, such as stochastic computing, neuromorphic computing, and quantum computing, have also garnered considerable interest [18]. Table 1 shows a general comparison of these four paradigms, where




approximate computing can provide a good balance between latency, accuracy, power consumption, and reliability compared to the others. The table compares the different computing paradigms, highlighting their trade-offs. Approximate computing is fast but less accurate, while stochastic computing is power-efficient but may not be fast or accurate. Neuromorphic computing excels in power efficiency but might lack reliability, whereas quantum computing could offer speed and precision but is not yet fully developed. The ideal choice of computing paradigm depends on the application's requirements: relaxed accuracy needs might lead one to choose stochastic computing, whereas power constraints might favor neuromorphic computing. This comparison is critical when selecting a suitable computing approach for a given task.

FIGURE 2. An abstract illustration of the Dark Silicon phenomenon which prevents powering-on more cores due to high power density and thermal hotspots, where the white C represents the active cores and the black C represents the idle cores [11].

Approximate computing offers large power and performance improvements in digital systems by relaxing numerical equality for implementing error-tolerant applications [19]. In approximate computing, error metrics emerge as a novel design parameter that can be traded off to enhance performance or reduce power consumption. Although computational faults are never desirable, applications tolerant to errors confer additional advantages due to their inherent resistance to inaccuracies, attributable to several factors [19], [20]. Firstly, these algorithms handle real-world, noisy input and redundant data, typically output from diverse sensor types. Secondly, they exhibit a probabilistic nature, often evident in iterative algorithms. Lastly, a minor degree of imprecision in their results is generally acceptable, largely due to the limitations of human sensory capabilities.

FIGURE 3. Intrinsic application resilience [9], [14].

Typical paradigms of approximate computing applications range from big data to scientific applications, such as image processing, machine learning, and data mining domains. The multifaceted nature of approximate computing results in unique trade-offs. Techniques can be implemented at various levels, from transistor design to software; each approach impacts hardware integrity and output quality in different ways. For example, leveraging acceptable error margins, as high as 10%, in a typical error-resilient image processing algorithm can significantly enhance energy efficiency and computational performance [21]. Another example is that varying memory refresh rates or adopting different data storage and representation precisions are viable strategies to achieve such improvements. However, these techniques might not be suitable for critical applications like medical and military applications [22].

At the heart of approximate computing are four different levels: data, software, architecture, and circuit (hardware). One of its main issues is that the consequences of certain approximations are far-reaching on efficiency and accuracy for different applications; thus, there is no one-size-fits-all solution. This paper will delve into the evaluation of approximate computing at these four levels.

• Data level: The importance of these techniques cannot be underestimated in the quest for lower power consumption and improved performance. Sampling, quantization, and compression are some of the techniques that allow us to manage quality-versus-efficiency trade-offs through smaller or simpler data representations.

• Software level: Approximate approaches include code optimizations such as loop perforation, which transforms software code into optimized code, the use of approximate functions to construct approximate algorithms, and relaxed synchronization. These aim to deliver valuable efficiency gains with a slight degradation in output quality.

• Architecture level: The approaches at this level for increasing efficiency and saving power are more complicated because we need to rethink the design of specialized approximate processing units and memory systems. Furthermore, it is necessary to expand and enhance instruction sets and use different precisions to contribute to increasing efficiency.

• Circuit level: These approaches are the cornerstone of improved power efficiency; here we need to rethink approximate logic gates, optimize transistor behavior, and redesign arithmetic unit circuits, but they come with different degrees of inaccuracy.

This multi-level analysis opens up further exploration of how different approximations combine in real systems. Understanding interactions between levels will guide the

146024 VOLUME 12, 2024



TABLE 1. Comparison of features of the existing computing paradigms.
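Quantifying the accuracy side of these trade-offs typically relies on standard error metrics such as error rate, mean error distance (MED), and worst-case error. A minimal sketch, assuming a hypothetical approximate adder that simply truncates the two least-significant result bits (an illustrative design, not one proposed in this survey):

```python
# Hypothetical approximate adder: drop the two least-significant sum bits.
def approx_add(a, b):
    return (a + b) & ~0b11

# Exhaustive error characterization over all pairs of 4-bit operands.
pairs = [(a, b) for a in range(16) for b in range(16)]
errors = [abs((a + b) - approx_add(a, b)) for a, b in pairs]

error_rate = sum(e > 0 for e in errors) / len(errors)  # fraction of wrong sums
med = sum(errors) / len(errors)                        # mean error distance
wce = max(errors)                                      # worst-case error

assert wce == 3           # at most the two dropped bits (0b11)
assert error_rate == 0.75 # three of four residues mod 4 are nonzero
assert med == 1.5         # average of residues 0..3
```

Exhaustive characterization like this is feasible only for small operand widths; larger approximate units are usually characterized by sampling or analytical error models.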

development of robust, highly optimized hardware and software designs for approximate computing. However, cross-layer approaches to approximate computing have emerged as a powerful tool for intelligently combining approximation techniques across the hardware, architecture, software, and data levels. Through strategic coordination across these layers, researchers aim to maximize efficiency gains while adhering to user-specified quality constraints. These are critical elements for the field's progress. Despite its relative youth, this field demonstrates highly promising results [23], [24], [25]. These early successes underscore the potential of cross-layer techniques to push the boundaries of resource efficiency without sacrificing the functionality of computing systems.

This review tackles the fragmented nature of approximate computing with a comprehensive approach, covering techniques from the circuit to the architecture level. It explores how strategies like voltage scaling and selective precision optimize energy efficiency and speed while carefully balancing accuracy, making them ideal for domains like machine learning, where slight imprecision is acceptable. We also aim to address the future directions of this promising field, highlighting potential research avenues and emerging trends. This paper is intended to serve as a primer for researchers and practitioners interested in exploring approximate computing at different levels, providing insights into its potential and limitations.

The subsequent sections of this manuscript unfold in the following manner: Initially, Section II delves into prior surveys to identify the main gaps to be filled by the present survey. Section III presents the scope of this survey and the review methodology. Section IV offers an overview of the general framework for approximate computing. Section V elaborates on the techniques of approximate computing at the data level. Section VI delves into the methodologies employed in approximate computing within the software domain, focusing specifically on the nuances of programming languages designed for approximation. Section VII furnishes an in-depth examination of approximate computing at the architectural stratum, with a particular emphasis on approximate memory systems. Following this, Section VIII elucidates the methodologies of approximate computing at the circuit level, providing detailed insights into their implementation and applications. Section IX provides an overview of frameworks and approaches in approximate logic synthesis. Section X explores three emerging computing frameworks: cross-layer and comprehensive end-to-end methodologies, and statistical and neuromorphic computing. Section XI explores the impact of applying approximate computing strategies across diverse applications. Section XII provides the benchmarks, tools, and libraries. Section XIII discusses our perspectives on future directions. Section XIV presents the remaining challenges in approximate computing at the different levels, open research questions, and future research directions. Finally, Section XV concludes this review paper.

II. EXISTING AND CURRENT SURVEYS

In this section, we delve into an examination of extant literature specifically oriented towards the realm of approximate computing at different levels. As of the writing of this paper, a limited number of surveys probing the domain of approximate computing have been identified. Hence, we have compiled the most significant surveys on approximate computing up to the end of 2023, arranging them in chronological order according to their publication years in Table 2. Additionally, we directed attention to comprehensive surveys that delve into and concentrate on specific subjects within each broader topic, aiding readers in their exploration. Table 2 provides a comparative analysis of various surveys with regard to the topics addressed, namely approximate techniques, applications, hardware and software, and challenges. Furthermore, the table provides an overview of the extent to which each topic was addressed, indicating whether it was fully covered, only partially covered, or not covered at all. It also records each review paper's number of pages and references cited, as well as the range of years covered. Typically, a review paper should concentrate on the various general aspects pertaining to the implementation of approximate computing.

In the scholarly landscape, a modest number of surveys have embarked on the exploration of approximate computing [19], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35]. These studies, while valuable, primarily offer a cursory overview of the subject, often focusing on specific facets and, consequently, leaving certain aspects underexplored. The granularity of detail and comprehensive understanding of the topic that these surveys provide is, therefore, somewhat limited. Recognizing these gaps in the existing literature, the present survey endeavors to redress these shortcomings. For example, Zervakis et al. [29] concentrated on a limited range of approximate computing techniques. They presented a survey that covers approximate multipliers and




approximate high-level synthesis for implementing CNNs. Furthermore, they focused on reconfigurable approximation for neural network inference. Damsgaard and colleagues [33] presented a review paper that touched upon various AxC techniques at the architecture and circuit levels, albeit with brief explanations. Their work distinctively highlights the exploration of approximate wired and wireless networks-on-chip, an area frequently neglected in other reviews. Leon and colleagues [34], [35] presented a two-part review that offers valuable insights and a comprehensive overview of the field.

Recent literature on approximate computing (AxC) has enriched the field with key insights but often lacks the scope and depth our research intends to cover, particularly in exploring the nuances of AxC techniques. This observation underscores the necessity for continued research and discussion to fill these gaps and provide a more detailed exploration of AxC methodologies and applications. Our review, by contrast, delves deeply into the most prominent approximate techniques and applications, ranging from mobile to cloud computing. Our review expands upon Leon's work by offering a more nuanced analysis, breaking down topics into detailed subtopics, and updating the discourse with research from the last eight years. We also explore areas not covered by Leon, such as approximate elementary and activation functions and the impact on communication and security. Moreover, we provide a broader and more updated perspective in this area by presenting a comprehensive list of influential review papers in the field of approximate computing. We aim to enrich our understanding of this sophisticated domain and lay a robust groundwork for future research by filling the knowledge gaps left by previous surveys to ensure more inclusive coverage of the topic. This review paper covers the following key points: 1) the benefits; 2) techniques; 3) the cases used for each technique; 4) frameworks; 5) hardware circuits and accelerators; 6) programming languages; 7) tools, including compilers and logic synthesis; 8) security; and 9) challenges and a future roadmap.

This survey underlines the transformative potential of approximate computing in a variety of domains, particularly machine learning and IoT, and aims to enrich the research community by offering a valuable reference for researchers.

III. SURVEY METHODOLOGY

This literature review provides a solid foundation for understanding the evolution of this field. By tracing developments from early foundational works to the most influential recent publications (2017-2024), we identify key trends and breakthroughs. Publications from established publishers (IEEE, ACM, Elsevier, Nature, Springer, MDPI, etc.) were carefully considered, supplemented by insights from select arXiv preprints. Five hundred and two studies encompassing various techniques in approximate computing have been examined. The review includes 20 articles from 2024, 93 from 2023, 77 from 2022, 59 from 2021, 57 from 2020, 57 from 2019, 52 from 2018, 36 from 2017, and 61 from the preceding years. The focus of this research was primarily on contemporary literature within the field of Approximate Computing (AxC). An analytical review of selected papers was conducted with several objectives in mind: firstly, to catalog and elucidate the various AxC methodologies; secondly, to enumerate and describe notable AxC architectures that have been documented; thirdly, to discuss the hurdles associated with AxC while proposing feasible solutions; and lastly, to assess how AxC is applied in practice. Figure 4 presents a visual representation of the distribution of selected papers, categorized by their publication years and the publishers involved.

FIGURE 4. The distribution of selected papers, categorized by their publication years and the publishers involved.

IV. GENERAL FRAMEWORK OF APPROXIMATE COMPUTING

Approximate computing represents a paradigm in computational methodology that willingly sacrifices a degree of precision in exchange for enhanced performance and improved energy efficiency. This strategy proves particularly advantageous for applications that can accommodate a certain measure of inaccuracy without a significant impact on the overall outcome. As delineated in Figure 5, the framework for approximate computing encompasses a multitude of stages and components.

The overarching structure of the approximate computing framework is primarily composed of three integral components: the selection of error-tolerant applications, the implementation of approximate-aware design at compile-time (offline), and the execution of approximate tuning at run-time (online). The framework is segmented into finer




TABLE 2. Comparative analysis of approximate computing surveys across the full computing stack.

elements, each vital for the effective deployment of approximate computing. These subdivisions collectively contribute to enhancing the computational performance and energy efficiency of the overall system. The process begins by choosing one or more approximation levels (employing a cross-layer approach) at which to implement an application. To successfully leverage approximate computing, a critical first step is to identify the non-critical computation units of the application, which allow for relaxation in accuracy without degrading the overall output quality. This step is called ''non-critical unit identification'' and requires thoughtful analysis. Once these units are identified, the next step is ''approximate design'', which is performed both at compile-time (offline) and at runtime (online). Compile-time approximate design transforms these application units by strategically introducing approximate computations. This process can require specialized tools and compilers to optimize the trade-off between accuracy and efficiency. To maximize efficiency while maintaining accuracy, systems must reconfigure themselves at runtime to change the degree of approximation. This process involves the following steps: monitor conditions, readjust approximation levels, and continuously assess adherence to system goals. Intelligent runtime management is crucial for realizing the full potential of approximate computing. Developing systems that can autonomously and rapidly select the optimal degree of approximation in response to fluctuating requirements and conditions remains a complex and active area of research.

To achieve approximate computation, researchers and practitioners employ a toolbox of diverse techniques. These span hardware components (e.g., approximate adders), software frameworks, system-level strategies (e.g., sampling), and programming-language and logic-synthesis features. Error analysis and quality evaluation guide the selection and dynamic adjustment of these techniques. The approximate computing framework reflects a comprehensive strategy: the process begins by defining and identifying the candidate parts or units of the application to which a suitable approximate technique can be applied. After the approximated units or parts are integrated into the application's design, the compilation and error analysis phases begin. Another critical aspect of this framework is the runtime management of the application




through an ongoing assessment of the quality of the results. [44], [45]. Instead of analyzing the entire dataset, a repre-
This methodology facilitates more efficient computing, espe- sentative subset of the data with error bounds is selected for
cially in scenarios where some inaccuracy is acceptable. The applications such as database search, stream analysis, and
next subsequent sections will delve into each component in model training. This can reduce the computational complex-
detail. ity of the analysis and speed up the processing time. The
selection of data can be done based on various criteria such
V. DATA-LEVEL APPROXIMATIONS as random, systematic, adaptive, stratified, multistage (clus-
A. APPROXIMATE DATA TYPES AND STRUCTURES tering), reservoir Sampling, Sampling-Over-Joins, Bucketing
One straightforward way to incorporate approximation into Strategy, Coreset etc. The inclusion of sampling operators
hardware and software is to use approximate data types in leading database products (e.g., Oracle, Microsoft SQL
and structures. To save computing resources, data types and Server, IBM Db2) highlights their importance in extracting
structures allow for certain imprecision in storage and manip- insights from large datasets. This capability proves crucial in
ulation. For example, we know that precision scaling (e.g., many areas, including exploratory analysis, predictive mod-
using fixed-point) can accelerate computations and reduce eling, and hypothesis testing. The most basic approach to
storage. Likewise, approximate data structures (such as random sampling is known as uniform random sampling,
Bloom filters or Count-Min sketches) are also useful to save in which every item in the full data set (also referred to as the
more resources and time by providing probabilistic function- ‘‘population’’) has an equal chance of being selected. Despite
ality. Furthermore, approximate data representation focuses its simplicity, uniform random sampling can potentially result
on approximating the input data to allow for more efficient in significant variability in the resulting estimates.
computation. Unfortunately, the applications in image pro- One strategy for overcoming this obstacle is to provide the
cessing and neural networks offer a certain inherent level developer with abstractions to identify, reduce, and reshape
of error tolerance and this provide us the opportunities for resilient and best-effort computations to be more paralleliz-
concrete enhancements in both performance and energy effi- able or run on unstable hardware components [9], [46]. There
ciency. This section delves into the specifics of approximate data types and structures and discusses their implementation, benefits, and potential drawbacks.

1) APPROXIMATE DATA REPRESENTATION
Approximate data representation involves strategically employing techniques like sampling or simplified representations to reduce the complexity or volume of datasets. These methods see wide adoption in domains such as data analysis, machine learning, and other computationally demanding fields. Approximate data types prove advantageous in three key scenarios:

• Resource Constraints: When hardware limitations (memory, storage) are present, data-level approximations enable operation on datasets that would be otherwise infeasible, trading some precision for efficiency gains.
• Real-Time Processing: In streaming or sensor data scenarios, approximate techniques allow for rapid insight extraction and decision-making, prioritizing responsiveness over exhaustive analysis.
• Inherent Imprecision: Many real-world datasets (e.g., weather data, image data) contain natural variability. In these cases, absolute accuracy may be less critical, justifying the benefits of approximate representations.

This makes approximate methodologies suitable for effective data handling, as the natural variability and uncertainty in data sources make exact precision less critical.

a: DATA SAMPLING
One common technique for approximate data representation is data sampling [36], [37], [38], [39], [40], [41], [42], [43]. There are many frameworks that provide the developer with these abstractions and the capabilities of distributed computing and data processing, such as Hadoop MapReduce [47], ApproxHadoop [48], Apache Spark, Apache Flink, Apache Storm, Apache Tez, Apache Beam, etc. For example, Apache Beam, an open-source framework, simplifies batch and stream processing with its high-level API, compatible with various execution engines like Apache Flink, Spark, and Google Cloud Dataflow. Initiated by Google and developed with partners such as Cloudera and PayPal, it transitioned from Google Cloud Dataflow in 2014 to Apache Beam in 2016 under the Apache Software Foundation.

Data sampling is a technique used in various frameworks to improve the efficiency and speed of processing large datasets, especially in decision-making and analytical applications. In this paper, we provide an overview of how some frameworks utilize data sampling. Laptev [47] proposed enhancing Hadoop with statistics-based uniform sampling for efficient analysis of massive datasets, addressing time and resource limits. This extension, EARL on Hadoop, accelerates processing when preliminary results suffice, maintaining high accuracy with small samples and using bootstrapping for accuracy estimates. Goiri et al. [48] introduced an approximate Hadoop version using strategies like data sampling and task dropping for large datasets. This approach, allowing for both precise and approximate MapReduce operations, can significantly cut runtimes by up to 32 times with a tolerable error margin of 1% at 95% confidence. Hu et al. [49] explored sampling as a way to speed up decision-making queries on large data sets by introducing a sampling framework in Spark that allows for approximate computing with error estimates.
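The shared recipe behind EARL, ApproxHadoop, and ApproxSpark — answer a query from a small uniform sample and attach a statistical error bound — can be sketched in a few lines. This is an illustrative toy, not code from those systems; the sampling fraction and the normal-approximation confidence interval are assumptions of the sketch:

```python
import random, statistics

def approx_mean(data, fraction=0.001, z=1.96, seed=42):
    """Estimate the mean from a uniform random sample and attach a
    normal-approximation 95% confidence interval (illustrative)."""
    rng = random.Random(seed)
    n = max(2, int(len(data) * fraction))
    sample = rng.sample(data, n)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / n ** 0.5
    return mean, (mean - z * stderr, mean + z * stderr)

data = list(range(1_000_000))             # true mean is 499999.5
est, (lo, hi) = approx_mean(data)
print(f"estimate={est:.1f}, 95% CI=({lo:.1f}, {hi:.1f})")
```

Larger samples shrink the interval at roughly a 1/√n rate, which is why small samples often suffice when "preliminary results" are acceptable.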

146028 VOLUME 12, 2024


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

FIGURE 5. Overall framework of approximate computing.

ApproxSpark supports various sampling methods, such as partition versus data item sampling and stratified sampling, to provide fast results with estimated error bounds. The findings indicate that ApproxSpark can notably enhance speed while retaining accuracy to optimize for different applications.

Sampling techniques play a crucial role in addressing the challenges of stream analytics. Quoc et al. [50] developed StreamApprox, an approximate computing system for stream analytics that provides significant speedups and throughput gains (1.15x−3x) over native Spark Streaming and Flink. This is achieved through selective sampling, while still maintaining high accuracy levels. StreamApprox outperforms a competing Spark-based sampling system with comparable accuracy. Wen et al. [51] proposed a system called ApproxIoT that employed approximate analytics for high-throughput edge computing. The authors used online hierarchical stratified reservoir sampling to gather data in a decentralized manner, whereas the aforementioned systems [50] are designed to process input data streams in a centralized datacenter. The authors also employed an extended stratified reservoir sampling to select data from multiple sub-streams, ensuring no individual sub-stream is ignored. It generates approximate output with defined error bounds, making effective use of edge computing resources. ApproxIoT surpassed traditional sampling with 1.3x to 9.9x faster processing across 10%-80% sampling rates, showing slight accuracy decreases (0.07% at 10% sampling). In tests with NYC taxi data, it offered improved data throughput, balancing efficiency and quality. However, it is limited to linear queries and needs manual sampling adjustments.

Nguyen et al. [37] introduced S-VOILA, a stratified random sampling algorithm designed for efficient and representative data stream handling. The algorithm was evaluated using real-world datasets, including the OpenAQ dataset, and compared with other methods such as Reservoir, ASRS, and Senate sampling. It achieves a lower variance than ASRS and approximates the VOILA allocation. Empirical results on real-world data demonstrate its superiority over Neyman allocation. This makes S-VOILA valuable for reducing computational overhead in machine learning model training. Park et al. [52] developed BlinkML, a system that enables error-sample size trade-offs for machine learning training, efficiently estimating the needed sample size for desired accuracies. BlinkML outshone traditional methods by training 961 models in 30 minutes and finding the best model in 6 minutes, whereas the traditional methods failed to finish within an hour. It achieved up to 95% accuracy in various models, using only 0.16% to 15.96% of the usual training time, and employed uniform random sampling for large datasets with a memory-efficient approach. Anderson and Cafarella [53] focused on optimizing the time-consuming process of feature engineering in machine learning. They proposed a system, ZOMBIE, that treats feature evaluation as a query optimization problem, thus accelerating the feature evaluation loop. They employed a variation of active learning for data sampling.
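The stratified estimator underlying S-VOILA-style sampling weights each stratum's sample mean by the stratum's share of the population, which keeps rare but important strata represented. A minimal self-contained sketch (the dataset, strata, and sampling fraction are invented for illustration):

```python
import random, statistics
from collections import defaultdict

def stratified_mean(records, strata_of, frac=0.05, seed=3):
    """Stratified sampling: sample each stratum separately, then weight
    each stratum's sample mean by its share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[strata_of(r)].append(r)
    total = len(records)
    est = 0.0
    for group in strata.values():
        n = max(1, int(len(group) * frac))
        sample = rng.sample(group, n)
        est += (len(group) / total) * statistics.fmean(sample)
    return est

# Skewed toy population: a small stratum holds the large values.
records = [1.0] * 9000 + [100.0] * 1000   # true mean = 10.9
est = stratified_mean(records, strata_of=lambda r: r > 10)
print(est)
```

On skewed data, a plain uniform sample of the same total size can under-represent the small stratum; stratification removes that risk.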


The ZOMBIE system was tested using different learning tasks and index group creation methods, and the results showed that ZOMBIE significantly outperformed conventional methods, reaching the accuracy plateau for a task nearly eight times faster. The authors conclude that ZOMBIE can reduce engineer wait times from 8 to 5 hours in some settings.

Sampling frameworks offer compelling advantages when dealing with massive datasets. By intelligently reducing the volume of processed data, they lead to faster execution times and improved scalability. Current research is investigating improvements in sampling techniques to reduce errors and customize them for certain analytical purposes.

Researchers commonly approximate data at the software and hardware levels using three approaches: precision scaling, quantization, and relaxed precision. These techniques can reduce the complexity of computational applications.

b: RELAXING PRECISION
The design methodology of approximate computing involves sacrificing computational precision in exchange for enhanced power efficiency and performance. A prevalent approach is the relaxation of precision, which involves reducing the bit count employed in representing data or performing computations. The compromise, however, lies in the potential occurrence of errors or imprecisions in calculations. Error-tolerant algorithms, error compensation techniques, and error-aware design can be used to alleviate the deleterious effects of precision relaxation. The appropriateness of relaxing precision is contingent upon the application's capacity to accommodate errors, and a judicious evaluation is necessary to achieve equilibrium between the advantages of minimizing precision and the requisite degree of exactness for a particular application. In practice, this means reducing the precision of numerical calculations, such as using single-precision floating-point numbers instead of double-precision. The floating-point data type is a common target for approximation. By reducing the precision of floating-point numbers, computations can be performed more quickly and with less energy. This can significantly reduce the computational cost of the algorithm at the expense of reduced accuracy. Carmichael et al. [54] explore low-precision numeric formats (fixed-point, floating-point, and posit) at ≤8-bit precision for use in DNN accelerators. Static analysis tools [55] play a key role in enabling such precision reduction techniques.

c: QUANTIZATION
This technique refers to the process of reducing the precision of numerical data in a program by mapping the values to a smaller set of discrete values. This is typically done in machine learning models to reduce the memory requirements and computation costs of the model, which is especially important for deployment on edge devices with limited resources. As a result, the majority of recent studies on quantization have concentrated on inference [56]. One common method of quantization is fixed-point quantization, where each numerical value is represented as an integer or fixed-point number with a limited number of bits. Quantization can be done during the training or inference of a machine learning model. In post-training quantization, the weights and activations of a pre-trained model are quantized to a lower precision [57], while in quantization-aware training [56], [57], [58], [59], [60], the model is trained with the quantization process in mind, often with the use of special quantization-aware algorithms and techniques. The choice of quantization method and the level of precision to use depends on the specific requirements of the application and the trade-off between accuracy and resource usage. Quantization can save memory and computational resources, but it can also introduce some errors [61]. In the training workflow depicted in Figure 6, Novac et al. [59] employed floating-point quantization to strike a balance between computational efficiency and precision. Prior to performing computations within each neural network layer, the inputs, weights, and biases are quantized to lower precision while retaining their floating-point nature. Post-computation, the outputs are similarly quantized before they proceed to the subsequent layer. This approach ensures a consistent precision level throughout the network's forward pass. The precise methodology for quantization is outlined in [59]. Notably, during training, the system dynamically reassesses the value range and updates the scale factor before performing layer computations, whereas during inference the scale factor remains fixed. Also, Dai and Fan [62] tackled the challenge of deploying accurate crop disease recognition models onto resource-constrained hardware. Their multi-pronged approach combines pruning, knowledge distillation, and ActNN compression with INT8 quantization. Remarkably, this significantly reduced model size (by 88%) and inference time (by 72%) while achieving an impressive 94.24% accuracy. Their contribution demonstrates the feasibility of accurate real-world image analysis on smaller devices. Real-time ECG analysis at the edge is challenging due to device limitations. Mohammed's work [63] addresses this with a lightweight model that uses quantization and pruning to achieve up to 99.1% accuracy and a 95% F1-score for edge-based deployment.

Due to hardware improvements and privacy considerations, machine learning (ML) is moving towards edge devices. Federated learning (FL) shines here, improving privacy and network efficiency. To support this trend, Costa et al. [64] proposed L-SGD, a lightweight version of SGD optimized for microcontrollers (MCUs). Their implementation is 4.2x faster than standard SGD while consuming significantly less memory (2.8%). It offers both a floating-point and a quantized version for fine-tuning, showing promise for quick model updates and fairness fixes in FL scenarios.

d: PRECISION TUNING OR SCALING
This technique involves adjusting the numerical precision of calculations to improve both accuracy and efficiency.
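To make the accuracy cost of precision reduction concrete, the following stdlib-only toy emulates single-precision accumulation by rounding every intermediate result to binary32 via `struct`; it is a didactic sketch, not drawn from the surveyed papers:

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

exact = 10.0                         # 100,000 * 1e-4, mathematically
acc64, acc32 = 0.0, 0.0
step32 = to_fp32(1e-4)               # the increment as stored in binary32
for _ in range(100_000):
    acc64 += 1e-4                    # double-precision accumulation
    acc32 = to_fp32(acc32 + step32)  # every intermediate rounded to binary32

print(f"fp64 error={abs(acc64 - exact):.2e}, fp32 error={abs(acc32 - exact):.2e}")
```

The reduced-precision accumulator drifts measurably further from the exact result — the error side of the speed/energy trade-off discussed above.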


FIGURE 6. Quantization-Aware training architecture [59].

It entails fine-tuning data and computations to maximize efficiency and accuracy while using as few resources as possible. Precision scaling or feature scaling approaches (e.g., half-precision (16-bit) and mixed-precision training) are both techniques used in deep learning that aim to improve training efficiency and reduce computational resources while maintaining or even improving model performance [54], [65], [66]. These breakthroughs have become particularly relevant with the advent of powerful hardware accelerators such as GPUs and TPUs, which can effectively leverage the benefits of reduced-precision arithmetic. Nevertheless, achieving precision below half-precision has presented a considerable challenge that requires extensive fine-tuning. Numerous cutting-edge software-level approaches [67], [68] have been developed to tackle various challenges associated with precision scaling, including scaling degree, scaling automation, mixed precision, and dynamic scaling.

To handle the complexity and non-intuitive nature of round-off errors in floating-point arithmetic, Wei-Fan Chiang et al. [69] addressed this issue using formal analysis with FPTUNER, an automated tool that optimizes precision through symbolic expansions. FPTUNER efficiently manages precision modifications and was tested on various benchmarks, showing significant energy savings with mixed-precision code despite some compiler-related challenges. For a detailed study on these quantization techniques, the review paper [56] offers extensive insights.

The utilization of graphics processing units (GPUs) has become widespread in accelerating various emerging applications, including but not limited to big data processing and machine learning. Although GPUs have demonstrated their effectiveness, one prevalent approach to enhancing performance is approximate computing, which involves sacrificing accuracy in exchange for improved performance. The technique of approximating high-precision values into lower-precision values with precision scaling has become increasingly popular on GPUs, with support for half-precision at the hardware level. The issue with GPU-side kernel-level scaling is that the overall improvement in program performance is often limited due to the combination of data transfer, type conversion, and kernel execution. To address this issue, several solutions can be employed: optimizing data transfer, kernel fusion [67], adaptive precision techniques [70], memory hierarchy optimization, compiler and runtime support, and advanced code analysis and optimizations. By implementing these solutions, the performance of GPU-side kernel-level scaling can be significantly improved. Kotipalli et al. [70] addressed the limitations of precision selection for applications with strict accuracy requirements, the neglect of performance concerns in GPGPU accelerators, and insufficient optimization techniques in existing approaches. Their AMPT-GA provides a comprehensive solution that optimizes performance while satisfying accuracy requirements in high-performance computing applications. To face the scalability limitations of precision tuning techniques due to the wide search space, Guo and Rubio-González [71] presented a scalable hierarchical search algorithm for precision tuning, which was implemented in the tool HiFPTuner. The results showed the proposed algorithm reduces the search time by 59.6% compared to the state-of-the-art.

The concept of ''dynamic precision scaling'' pertains to the modification of numerical precision in real-time, which is contingent upon the particular demands of a given computation or system [68], [72].
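One illustrative (and deliberately simplified) reading of dynamic precision scaling is to redo a computation at increasing precision until successive results agree; the fixed-point widths and tolerance below are arbitrary choices of this sketch, not values from [68] or [72]:

```python
def quantize(x: float, frac_bits: int) -> float:
    """Round x to a fixed-point grid with 2**-frac_bits resolution."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def dot_dynamic(a, b, tol=1e-3):
    """Re-evaluate a dot product at increasing precision until two
    consecutive precision levels agree to within tol."""
    prev, cur, bits = None, 0.0, 0
    for bits in (4, 8, 12, 16, 20, 24):
        cur = sum(quantize(x, bits) * quantize(y, bits) for x, y in zip(a, b))
        if prev is not None and abs(cur - prev) <= tol:
            break
        prev = cur
    return cur, bits

a = [0.1 * i for i in range(100)]
b = [0.05 * i for i in range(100)]
value, bits_used = dot_dynamic(a, b)
print(value, bits_used)  # exact dot product is 1641.75
```

Real systems make the escalation decision from hardware error models or runtime monitors rather than by recomputation, but the control loop has the same shape.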


Deep neural networks demand extensive linear operations, impacting speed. Giamougiannis et al. [73] introduce a dynamic-mixed-precision inference scheme to address this problem. Their results show a significant execution time reduction (55%) for linear operations while maintaining model accuracy. Effective mixed-precision tuning demands tailored hardware and software. Fornaciari et al. [74] presented a roadmap for this co-design. Their roadmap, informed by recent advances, strives to maximize mixed-precision benefits (performance and energy efficiency) for diverse applications.

e: COMPRESSION
This technique involves reducing the size of data or files through various compression techniques. The goal is to store or transmit data in a more efficient way, thus reducing storage or bandwidth requirements and potentially improving performance and energy efficiency [53], [54]. There are two main categories of data compression: lossless and lossy. Lossless compression uniquely guarantees the ability to recover the exact original data from its compressed form. Examples of lossless compression techniques include Huffman coding, Run-Length Encoding, LZ77, ZIP, GZIP, and RAR, which are used to compress text, images, and other types of data [75]. There are also lossy compression algorithms such as JPEG, MP3, and MPEG, which eliminate unnecessary or less important information where a certain amount of data loss will not be detected by most users. These types of compression are used to compress multimedia files like images, audio, and video. For instance, JPEG, MP3, and MPEG-4 are used for images, audio, and video, respectively [75], [76], [77], [78], [79], [80]. Lossy compression formats, such as MP3 or MPEG-4, achieve smaller file sizes in comparison to lossless formats, albeit with a trade-off of reduced output fidelity. Data compression plays a dual role in machine learning and big data contexts. Lossy techniques (MP3, MPEG-4) are not the only way to reduce the size of a file. Dimensionality reduction techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) hold particular significance. These techniques streamline processing by extracting high-level features from vast datasets while potentially mitigating issues like the curse of dimensionality.

Classical dimensionality reduction such as PCA excels at finding linear structure in data but can have difficulties capturing the complex, nonlinear relationships that often exist in high-dimensional datasets. Beyond linear compression, autoencoders [81] and generative models (including Variational Autoencoders (VAEs) [82] and Generative Adversarial Networks (GANs) [83]) use deep neural networks inherently adept at nonlinear patterns. These can encode richer, more expressive representations of data. An Auto-Encoder (AE) [84] is a neural network architecture that specializes in encoding and decoding data. The encoder component compresses the input data x into a condensed representation known as the latent variable z, following the function qφ(z|x), as shown in Figure 7 (a). The decoder then attempts to reconstruct the original input from this latent variable, outputting x̂ as per the function pθ(x|z). The AE is generally trained without supervision to minimize the reconstruction error between x and x̂. Variations of AEs, including Variational Auto-Encoders (VAEs) and their derivatives, extend this basic framework to serve more complex purposes like data generation and denoising, adapting the architecture to a range of applications, as shown in Figure 7(b).

FIGURE 7. Architectures of (a) Autoencoder, and (b) Variational Autoencoders (VAEs) [84].

For example, Duan et al. [82] introduced a Quantization-aware ResNet VAE (QARV) for lossy image compression, combining a hierarchical VAE design with quantization optimizations for efficient entropy coding and fast decoding. QARV is characterized by using variable compression rates, which outperforms existing methods in rate-distortion metrics. However, choices like PCA's number of components or an autoencoder's bottleneck size directly influence information loss.

The choice of data compression technique depends on the specific requirements of the application, such as the need for lossless reconstruction, the acceptable level of data loss, and the computational resources available. Wiedemann et al. [85] introduced DeepCABAC, a novel neural network compression method based on the Context-based Adaptive Binary Arithmetic Coder (CABAC), achieving high compression rates without compromising accuracy. They demonstrate that DeepCABAC can compress the VGG16 ImageNet model by a factor of 63.6, reducing the network's memory footprint to a mere 9 MB without compromising its accuracy.

In conclusion, although compression methods provide notable advantages in minimizing data volume and enhancing storage and communication efficiency, they present a set of challenges that need to be addressed.
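Before turning to those challenges, the lossless/lossy distinction can be demonstrated with the standard library alone; the 8-bit quantization step below is a stand-in lossy stage, not a real codec:

```python
import zlib, array, math

# A smooth "sensor" signal stored as 64-bit floats (80,000 bytes).
signal = array.array('d', (math.sin(i / 10) for i in range(10_000)))
raw = signal.tobytes()

# Lossless: the original bytes are exactly recoverable after decompression.
lossless = zlib.compress(raw, level=9)
assert zlib.decompress(lossless) == raw

# Lossy: quantize each sample to 8 bits first; information is discarded,
# but the quantized stream is far smaller and compresses further.
quantized = bytes(int((s + 1) * 127.5) for s in signal)
lossy = zlib.compress(quantized, level=9)

print(len(raw), len(lossless), len(lossy))
```

The lossy path cannot reproduce the original doubles, which is acceptable exactly when the application can tolerate the quantization error.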


The low performance and high complexity of compression and decompression algorithms can offset the benefits, especially for low-power devices and real-time scenarios (e.g., video streaming). The data types and compression ratio also dictate the type of compression algorithm to be used. Therefore, care in selecting and implementing compression techniques is crucial.

2) APPROXIMATE DATA STRUCTURES
Data structures offer a strategic approach to data storage and retrieval, incorporating mechanisms for approximation or lossy compression to curtail memory and computational demands. This efficiency extends to supporting decrement operations and managing negative counts, further enhancing system performance. For example, in data analytics, approximate data structures such as Bloom filters and HyperLogLog can be used to estimate the cardinality of a set without storing all the elements of the set [86], [87]. Some examples of approximate data structures follow:

a: BLOOM FILTER
The Bloom filter's core strength lies in its space efficiency and fast membership queries. However, its probabilistic nature introduces the possibility of false positives (indicating an element is present when it isn't actually in the set) [88]. Despite this limitation, Bloom filters find wide adoption in scenarios where some inaccuracy is tolerable and space is a major constraint [89]. They are widely used in various domains such as IoT, networking, databases, and bioinformatics. Burton Bloom [90] introduced Bloom filters in 1970. There are many categories of Bloom filters based on practical measurements, namely, Standard, Counting, Dynamic, Hierarchical, Loglog, Spectral, Multidimensional, Fingerprint-based, Shifting, and Compressed Bloom Filters, among others. In general, designing Bloom filters presents several key challenges: a trade-off between false-positive rate and space, no false-negative control, optimal hash function choice, predefining size, and scalability. The predefined size of the Bloom filter, which cannot be changed later, poses challenges for large or growing datasets. The rate of false positives can be reduced by increasing the size of the Bloom filter or using more hash functions, and many approaches have been proposed to do so; however, both solutions require more computational resources. For applications where false positives are absolutely unacceptable within known data size constraints, EGH filters provide a valuable solution, as demonstrated by Sándor et al. [91]. This has potential implications for areas like network security and data validation. Variants such as counting Bloom filters can also handle the deletion of elements, thus providing control over false negatives.

The Bloom filter is a somewhat memory-intensive hashing method, and its compute cost comes from hash function computation and query judgment. Computation-intensive hash algorithms such as MD5 and SHA-1 are often needed for BFs, and perfect and locality-sensitive hashes are considerably harder to compute. Determining the ideal number of hash functions in a Bloom filter depends on several factors: the filter's size, the expected dataset size, and, most importantly, the relative cost of the hash function itself. Modern optimization balances these factors. Bloom filter variants typically exhibit either the capability to delete data while incurring supplementary memory usage, or the ability to expand while incurring a higher rate of false positives and a reduction in query speed. Therefore, Yuhan Wu [92] addressed and solved these two shortcomings, no deletion and no expansion, by proposing a new Bloom filter called the Elastic Bloom Filter.

The classic Bloom filter, while remarkably space-efficient, faces inherent trade-offs between accuracy, query speed, and memory usage. Recent work by Gebretsadik et al. [93] presented the enhanced Bloom filter (eBF), a novel design specifically tailored to the challenges of intrusion detection in IoT networks. Their experimental evaluation reveals the eBF as a significant step forward, demonstrating considerable memory savings (15.6x, 13x, and 8x) over standard Bloom filters, Cuckoo filters, and robust BFs, respectively, while maintaining fast and accurate performance. Seymen and Yalçın [94] proposed a lightweight Bloom filter for IoT applications and implemented it using the Murmur3 hash on a Nexys A7 FPGA board.

In summary, Bloom filters are celebrated for their compactness and proficiency in membership determination, despite their computational and memory demands [95], [96]. Future efforts will aim at refining these structures to lower false positives, enhance scalability, and conserve computational resources, thereby bolstering their effectiveness and efficiency for expansive datasets.

b: SKETCHING DATA STRUCTURES
Sketches are a family of data structures used to summarize large data sets in a small amount of space. They can be used for approximate query answering and data compression [97]. In particular, sketch-based data structures, such as the traditional Count Sketch (CS), Count-Min Sketch (CMS) [7], Count-Mean-Min Sketch (CMMS) [98], and many more, are a frequent technique for frequency estimation. Their efficiency and reasonable accuracy make sketch-based approaches compelling for network measurement. However, the ongoing need to balance accuracy with memory constraints creates an active research area. Developing more versatile sketches or techniques for dynamically adjusting sketch parameters is crucial. New sketch data structures, including the Count-Min-Log Sketch (CMLS) [99], Switch Sketch [100], Elastic Sketch [101], HBL (Heavy-Buffer-Light) Sketch [102], and Diamond Sketch [103], have emerged in recent years [104]. For example, faced with the challenge of counting item frequencies in huge datasets where exact storage is impossible, the Count-Min Sketch emerges as a powerful solution. With a probabilistic approach, it intelligently trades some accuracy for a significantly reduced memory footprint [105]. It is used in various applications like compressed sensing, networking, databases, NLP, security, machine learning, etc.
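Returning to the Bloom filter described above, its insert/membership logic is compact; in this sketch, salted BLAKE2b digests stand in for the k independent hash functions (real deployments typically use cheaper hashes such as Murmur3):

```python
import hashlib

class BloomFilter:
    """Bloom filter: k salted hashes set/check bits in an m-bit array.
    May report false positives, never false negatives."""
    def __init__(self, m_bits=8192, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, 'little')).digest()
            yield int.from_bytes(h[:8], 'little') % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

bf = BloomFilter()
for url in ("example.com", "brunel.ac.uk"):
    bf.add(url)
# Second query can, very rarely, be a false positive.
print("example.com" in bf, "never-added.org" in bf)
```

Every added item is always reported present; the false-positive rate grows as the bit array fills, which is exactly the sizing trade-off discussed above.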


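The Count-Min Sketch update and point-query scheme just outlined also fits in a few lines; again, salted BLAKE2b is only a stand-in for the pairwise-independent hash family used in practice:

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch: frequency estimates in O(width * depth) space.
    Estimates can only overcount (hash collisions), never undercount."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item: str):
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(), salt=row.to_bytes(8, 'little')).digest()
            yield row, int.from_bytes(h[:8], 'little') % self.width

    def add(self, item: str, count: int = 1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Taking the minimum over rows limits the collision-induced overcount.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for word in ["cat"] * 50 + ["dog"] * 10 + ["bird"]:
    cms.add(word)
print(cms.estimate("cat"))  # >= 50 (CMS can only overcount)
```

The one-sided error is why CMS pairs so well with heavy-hitter queries: a reported count below a threshold is a certain negative.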


One of the challenges of the Count-min sketch algorithm approximate estimation, and one can improve the accuracy of
is overestimation of the frequency of events due to hash HLL [112] by increasing the number of registers used or using
collisions [106], and to mitigate the issue of overestimation, high-quality hash functions. There are also improved versions
one could use a variant of the Count-Min Sketch known as of the algorithm, such as HyperLogLog++ [113], Hyper-
the Count-Mean-Min Sketch [98]. The accuracy of the CMS LogLogLog [114], or HLL-Tailcut [115], that offer better
depends on the quality of the hash functions used [107]. The accuracy and less memory usage. Unfortunately, Hyper-
quality of the hash functions can be improved by using inde- LogLog doesn’t support the deletion of elements; therefore,
pendently universal hash families [108]. Khan A. et al. [107] the sliding HyperLogLog algorithm [116] was proposed to
introduced an enhanced approach to sketch-based hashing, support deletions. HyperLogLog sketches [117] are proposed
diverging from direct full-key hashing. Their methodology to extend the HyperLogLog algorithm to support estimating
involves the use of multiple independent hash functions, the cardinalities of union, intersection, or relative com-
each targeting different segments and combinations of a plements of two sets. Another issue is that understanding
key, thereby establishing a composite hashing framework privacy-related attributes of datasets, such as re-identifiability
for improved accuracy. The fact that the accuracy of CMS and joinability, is crucial for data governance. However, large
improves with more space (i.e., more hash functions and datasets and organizations require more efficient strategies,
larger arrays) is another challenge leading to a trade-off as brute force methods are inefficient due to their mas-
between the two. In addition, it should be noted that CMS sive systems and data volume. Pern et al. [87] introduced an
lacks support for decrement operations and negative counts. extension of the HyperLogLog algorithm, KHyperLogLog
Count-Sketch is a viable alternative to Count-Min-Sketch for (KHLL), an algorithm based on approximate counting tech-
accommodating negative counts. CMS can provide frequency niques for estimating re-identifiability and joinability risks in
estimates, and a combination of data structures could be large databases. KHLL’s joinability analysis helps distinguish
used to support exact queries; for example, one could use a between pseudonymous and identified datasets. This leads
hash map for exact queries [109] and Count-Min Sketch for to reduce reliance on expert judgment and manual reviews.
frequency estimation. Another challenge is that the CMS data It uses less memory and linear runtime.
structure cannot be resized once it’s created. This issue was
addressed by Zhu et al. [110] by proposing a dynamic variant d: MINHASH
of Count-Min Sketch that allows for resizing, called Dynamic
MinHash is a probabilistic data structure used to estimate
Count-Min Sketch. The Count-Min Sketch is widely used in
the similarity between two sets. It can return approximate
data stream analysis, network monitoring, database size esti-
answers with high probability [88]. The utilization of Min-
mation, and other areas where processing massive amounts
Hash and HyperLogLog sketching algorithms has become
of data is required.
an essential practice in the realm of big data applications
In 2018, a team from Tsinghua University and Microsoft
for the purpose of set summarization. HyperLogLog is a
Research [91] proposed Elastic Sketch algorithm, which
technique that enables the counting of distinct elements using
would consume less memory and provide a more precise estimation of item frequencies. It is considered a solution for network-wide measurements, which is a critical function for network management and security. It is designed to adapt to different traffic distributions and measurement tasks. Elastic Sketch outperforms contemporary benchmarks with a speed increase of 44.6 to 45.2-fold and a reduction in error rates ranging from 2.0 to 273.7 times. This algorithm was enhanced by Keyan [102] into the Heavy-Buffer-Light (HBL) sketch. Compared with its predecessor, the elastic sketch, and other conventional methods, HBL manages to decrease the average relative error rate by 55% to 93% under identical memory constraints.

c: HYPERLOGLOG (HLL)
HLL is a powerful probabilistic data structure used for estimating the cardinality of a set. It is particularly useful when dealing with large datasets because it provides acceptably accurate estimation with significantly less memory [80], [87]. It is used in various applications like network monitoring, web analytics, data analysis, and databases. HLL is a probabilistic algorithm that provides cardinality estimates using only a small fraction of storage space. On the other hand, MinHash is a method that is well-suited for rapid set comparison, as it permits the estimation of Jaccard similarity and other related measures. Therefore, Ertl [119] introduced a novel data structure named SetSketch, which effectively bridges the gap between the two aforementioned use cases. In numerous instances, it exhibits superior performance compared to the corresponding state-of-the-art estimators. Also, Yu and Weber [120] introduced a novel compressed sketch known as HyperMinHash, which is based on the HyperLogLog framework and can serve as a seamless substitute for MinHash. The HyperMinHash algorithm preserves the fundamental characteristics of MinHash, including the ability to perform streaming updates, unions, and cardinality estimation.

e: T-DIGEST
T-digest is an algorithm used for real-time operations and for constructing concise representations of data that can approximate rank-based statistics with a high degree of accuracy, especially in the vicinity of the distribution's extremities [121]. It was introduced by Ted Dunning in 2013. This novel form of sketch exhibits resilience in the face of non-normal distributions, multiple iterations of sampling, and ordered data sets. Independently computed sketches can be merged with minimal or negligible compromise in precision. The t-digest algorithm is extensively utilized within prominent corporations and is additionally incorporated into commonly used software applications such as Postgres, ElasticSearch, Apache Kylin, and Apache Druid. The t-digest has the property that the error is smaller near the extremes and larger around the median, which makes it particularly useful for applications that require accurate estimates of quantiles for skewed data [89].
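As a concrete illustration of how these probabilistic summaries trade a little accuracy for a lot of space, the following minimal MinHash sketch estimates Jaccard similarity from fixed-size signatures. This is an illustrative sketch with our own naming and parameter choices, not code from the cited SetSketch or HyperMinHash works:

```python
import hashlib
import random

def stable_hash(x):
    # Deterministic 64-bit hash (the builtin hash() is salted per process).
    return int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=256, modulus=(1 << 61) - 1):
    # One minimum per hash function; a fixed seed makes signatures
    # computed for different sets comparable.
    rnd = random.Random(42)
    coeffs = [(rnd.randrange(1, modulus), rnd.randrange(modulus))
              for _ in range(num_hashes)]
    hashes = [stable_hash(x) for x in items]
    return [min((a * h + b) % modulus for h in hashes) for a, b in coeffs]

def estimate_jaccard(sig_a, sig_b):
    # P[two sets share a minimum] equals their Jaccard similarity, so the
    # fraction of matching signature positions is an unbiased estimate.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A = set(range(200))
B = set(range(100, 300))  # true Jaccard = 100 / 300, about 0.333
est = estimate_jaccard(minhash_signature(A), minhash_signature(B))
print(f"estimated Jaccard ~ {est:.2f}")
```

With k hash functions the standard error is roughly 1/sqrt(k), so the 256-hash signature above typically lands within a few percent of the true value while storing only 256 integers per set.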

146034 VOLUME 12, 2024


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

Overall, approximate data structures can be a useful tool for handling large amounts of data efficiently while sacrificing some level of accuracy. Each of these techniques offers unique advantages and can be applied in different scenarios depending on the specific requirements of the application. However, they also have their own challenges and limitations, such as ensuring that the introduced approximations do not significantly degrade output quality or lead to unacceptable errors. The choice among these techniques, therefore, requires a careful understanding of both the application's characteristics and the capabilities of the approximation technique.

VI. SOFTWARE-LEVEL APPROXIMATIONS
Approximate computing is a technique used in computer engineering to reduce the computational complexity and energy consumption of computing systems while relaxing the accuracy of the computations. This approach can be particularly useful for applications where accuracy is not critical or where the computations are too complex or time-consuming to be performed exactly. The complexity of these applications is ever-increasing, since they must constantly adapt to provide new services and process large amounts of data. The growing cost of developing such systems, including the target cost, power consumption, execution time, and memory space for software development, is directly proportional to the increasing complexity of systems. The idea behind approximate computing at the software level is to minimize processing complexity, represented by the number of processing operations and memory accesses, in order to reduce implementation costs. Therefore, many approximate techniques at the software level have been proposed in the literature to reduce the computation and execution time of a program by introducing inaccuracies or approximations in certain parts of the computation while producing results of acceptable accuracy. The task of identifying and selecting computations for approximation that have less influence on the quality of the results is one of the most difficult aspects of approximate computing. Software-level approximation techniques refer to the methods used to simplify the design and analysis of software platforms. These techniques aim to reduce the complexity of software systems while maintaining acceptable levels of performance and functionality. They can be applied at various stages of the software development process, including design, implementation, and testing.

A. CODE OPTIMIZATION-BASED APPROXIMATE METHODS
These methods focus on modifying the code to optimize for approximate computation while maintaining an acceptable level of accuracy [122]. They can be applied manually by the programmer or automatically by a compiler or another tool. Approximation-enabled compilers are an important avenue for software-level approximate computing. These compilers introduce approximations into programs automatically or semi-automatically. They analyze the source code to identify parts of the program where approximations can be introduced without significantly affecting the overall output quality. Techniques employed by these compilers include loop perforation (skipping some iterations of a loop), operator approximation (replacing exact operators with approximate ones), and task skipping (skipping some non-critical computations). These techniques modify the compiler to generate approximate code that trades off accuracy and performance. Examples of such techniques include autotuning, knowledge distillation, matrix approximation, numerical optimization, rounding, truncation, statistical sampling, Taylor series approximation, linearization, neural networks, and piecewise linear approximation. There are several different techniques for optimizing code using approximate methods.

1) COMPUTATION SKIPPING
Computation skipping is a technique used in computer programming to improve the performance and efficiency of code by reducing the number of computations that need to be performed [123]. This technique involves the exclusion of code blocks based on predetermined criteria such as acceptable levels of Quality-of-Service degradation, constraints established by the programmer, and/or predictions made regarding the accuracy of the output at runtime. It involves skipping unnecessary computations that would not change the outcome of the program. Skipping computations in Convolutional Neural Networks (CNNs) has been the subject of numerous studies. CNNs excel in many recognition tasks, but their computational complexity limits their use on power-constrained platforms. Therefore, Lin et al. [124] introduced PredictiveNet, a method for reducing the computational complexity of CNNs without significant accuracy loss. It predicts sparse outputs from non-linear layers, bypassing most computations. It skips many CNN convolutions during runtime without changing the CNN structure or needing additional branch networks. When tested, PredictiveNet reduced computational cost by a factor of 2.9 compared to a standard CNN, with minimal accuracy degradation. There are several different techniques for computation skipping, including:

a: LOOP PERFORATION (SKIPPING)
This technique involves selectively skipping iterations of a loop that are not critical to the output in a software program, providing performance and energy gains in exchange for QoS loss. There are several approaches to choosing the set of iterations to skip, based on different criteria, such as skipping every other iteration, skipping based on a condition, or skipping until a certain threshold is met.
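The effect of perforating a simple reduction loop can be sketched in a few lines; here every skip_factor-th iteration runs and the partial result is scaled to extrapolate the skipped work (an illustrative sketch of ours, not code from the cited perforation systems):

```python
def perforated_sum(data, skip_factor=4):
    # Execute only 1 of every skip_factor iterations, then compensate
    # by scaling; the QoS loss depends on how uniform the data is.
    partial = sum(data[::skip_factor])
    return partial * skip_factor

data = list(range(1_000_000))
exact = sum(data)
approx = perforated_sum(data, skip_factor=4)
rel_error = abs(exact - approx) / exact
print(f"relative error: {rel_error:.2e}")  # about 3e-6 for this smooth input
```

Roughly skip_factor-fold fewer iterations execute; for less regular data, the simple scaling step can be replaced by interpolating results from the executed iterations.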


Loop tiling involves breaking a loop into smaller sub-loops to improve the memory access pattern [125], [126], [127], [128]. Figure 8 shows a loop that iterates over a set of data. For each iteration, the loop checks a perforation condition. If the condition is true, the iteration is skipped. Otherwise, the iteration is executed. Loop perforation is a powerful technique that can be used to improve the performance and accuracy of loops. However, traditional loop perforation, which only considers the number of instructions to skip, overlooks the significant influence of differences between instructions and loop iterations on performance and accuracy. To address this issue, Li et al. [126] advanced loop perforation with their Sculptor system, introducing selective dynamic loop perforation to enhance performance and accuracy by skipping specific instructions within loop iterations. Despite challenges in instruction analysis and strategy optimization, they proposed compiler improvements for selective and adaptive perforation. Testing across eight applications showed this method outperforms traditional loop perforation, achieving speedups of 2.89x and 4.07x with 5% and 10% error tolerances, proving its effectiveness in boosting both speed and accuracy.

FIGURE 8. Flowchart illustrating the concept of loop perforation.

Graph algorithms are widely used in high-performance and mobile computing. The performance of these algorithms can vary due to input dependence, i.e., changes in the input graph. Omar et al. [128] proposed an input-aware loop perforation predictive model called GraphTuner, which allows graph algorithms to systematically trade off accuracy for performance and power benefits. In this approximate computing setting, they examine the consequences of input dependence on graph algorithms. This helps to identify the requirement for adapting inner and outer loop perforations depending on input graph features such as graph density or size. The outcomes indicate an average performance improvement of approximately 30% and a power utilization improvement of about 19% at a program accuracy loss limit of 10% on an NVIDIA® GPU.

The loop perforation technique has also been used in approximation frameworks for optimizing embedded GPU kernels. Maier and Juurlink [127] proposed a new memory-aware perforation approach for GPU kernels, optimized for embedded GPUs, and a framework for automatic loop nest approximation based on polyhedral compilation. This framework introduces new multidimensional perforation schemes and generalizes existing ones. To enhance result accuracy, a reconstruction technique is incorporated, and a pruning method is proposed to eliminate low-quality transformations in the large transformation space.

b: MEMORY ACCESS SKIPPING (MAS)
MAS is a new approach that tries to optimize storage and memory accesses by skipping unnecessary accesses to non-critical data. To apply this technique, the code must be statistically analyzed and profiled, both offline and at runtime, to identify unnecessary memory accesses. The main goals of this technique are to save energy on memory accesses, avoid ineffective utilization of bandwidth, and improve the overall performance of the system. MAS boosts performance mainly for memory-bound applications. However, the implementation of MAS faces two challenges: its complexity and the management of the overheads of accurate skip detection. Nevertheless, this area has significant potential for improving performance and power consumption.

Due to the growth of dataset sizes and multi-level cache hierarchies, memory performance in data mining applications is a significant problem being addressed by current research. Important methods in data mining applications are recursive partitioning methods such as decision trees and random forest learning. To address this issue, Kislal and Kandemir [129] introduced a framework to optimize performance in recursive partitioning applications while managing accuracy loss. Its key components include a data access skipping module (DASM) guided by user-defined strategies and a heuristic to predict the impact of skipping data accesses for accuracy preservation. This framework leverages the inherent flexibility in these applications to enhance performance with minimal accuracy losses. Experimental evaluations show that this method can enhance performance by up to 25% with minor accuracy losses of up to 8%. The authors also demonstrate the framework's scalability under different accuracy needs and its potential for memory performance improvement in NoC/SNUCA systems. Also, Raparti and Pasricha [130] introduced two innovative solutions for memory bottlenecks in many-core GPGPU (NoC) architectures. They introduced an approximate memory controller (AMC) to lower DRAM latency and optimize scheduling, and a low-power NoC (Dapper) to enhance communication efficiency. Experiments show the architectures boost NoC throughput by 21% and cut latency and power use by 45.5% and 38.3%, respectively.

Certain researchers have directed their attention towards the deliberate skipping of costly data accesses. Researchers must be aware of three critical questions. What is the upper limit of skipping data accesses while maintaining a specified level of inaccuracy? How significant is architectural awareness in discerning which data accesses to eliminate? Is it always the case that two executions, which both skip the same number of data accesses, will yield identical output quality? Karakoy et al. [131] attempt to answer these critical questions by proposing a program slicing-based approach that identifies the set of data accesses to skip.
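The flavor of data access skipping can be conveyed by a reduction that touches only a strided subset of a large array; real systems such as DASM or the slicing-based approach choose which accesses to drop using profiles or program analysis rather than a fixed stride (illustrative sketch only):

```python
import random

def sampled_mean(data, skip=8):
    # Touch only every skip-th element: roughly skip-fold fewer
    # memory accesses in exchange for a small statistical error.
    touched = data[::skip]
    return sum(touched) / len(touched)

random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(100_000)]
exact = sum(data) / len(data)
approx = sampled_mean(data, skip=8)
print(f"exact={exact:.3f} approx={approx:.3f}")
```

For this data the skipped version reads one eighth of the elements, and the expected error of the mean shrinks as the square root of the number of accesses actually performed.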


2) ITERATIVE REFINEMENT
Iterative refinement is a method used when dealing with ill-conditioned systems, where small changes in the input can lead to large changes in the output. The technique involves starting with an initial estimate of the solution and then iteratively refining the estimate until a desired level of accuracy is achieved. Iterative refinement can be used in various fields, including computer graphics, machine learning, and scientific computing. Recent research has shown that iterative refinement can be particularly effective in certain domains, such as optimization and machine learning [132], [133], [134], [135], [136]. Recently, Yang et al. [133] highlighted the effects of applying iterative refinement in machine learning. They introduced a compact deep neural network and applied learned gating criteria during the training phase to determine whether the weight-sharing cycle would work. This mechanism gives adaptive behavior to the model. However, iterative refinement may not always be best: it may converge slowly, or not at all, for severely ill-conditioned systems.

3) EARLY STOPPING
Early stopping is a technique used to improve performance and prevent poor generalization in machine learning models by stopping the training process before the maximum number of iterations or epochs is reached. The primary goals of early stopping are to prevent overfitting, reduce computational costs, and enhance the efficiency of the training process. Early stopping is achieved by monitoring various metrics during the training process and halting when certain criteria are met. Datasets are divided into three subsets: training, validation, and test subsets. The training dataset is utilized for modeling and assessing accuracy, whereas the validation subset measures model generalization. A threshold is set to decide the early stopping condition and the ideal number of epochs for training, based on when the error on the validation subset drifts from that on the training subset. As seen in Figure 9, during the early training of a model showing high bias and low complexity, both training and validation errors tend to drop. This corresponds to the underfitting region, where bias is high and the model has not been trained enough to recognize the patterns in the data. In the overfitting region, the variance, or error, increases as a result of the model being trained for too long. This is evident in the growing divergence between training and validation errors and the loss of generalization. There are several techniques for mitigating overfitting, such as early stopping, regularization techniques like L1/L2 regularization, data augmentation, and ensembling. However, the early stopping approach is simpler and easier than the others.

FIGURE 9. The importance of the early stopping approach in machine learning.

Recent research has shown that early stopping can be particularly effective in deep learning models, which are often computationally expensive and require large amounts of data for training [137], [138]. Early stopping can be applied to various types of algorithms, including search algorithms, optimization algorithms, and machine learning algorithms. Together, early stopping and the validation set help to find the ideal balance that establishes the optimal capacity for a model's training, as shown in Figure 9. There are several types of early stopping techniques that can be used in machine learning to stop the training process before it reaches the maximum number of iterations or epochs. Some common types are fixed early stopping [139], [140], adaptive early stopping [141], noisy early exit [140], [142], early stopping with patience [143], and gradual unfreezing [144], [145], [146].

Another recent study that uses early stopping is ''Early Stopping without a Validation Set'' by Mahsereci et al. [138]. The authors proposed a validation-free early stopping approach that depends on the statistics of locally accessible computed gradients. This method slightly increases computational complexity, delay, and memory usage, yet it achieved comparable or better performance than traditional methods that use a validation set.

4) FUNCTION APPROXIMATION
Function approximation is a technique used in mathematics and computer science to estimate an unknown function using a set of input-output pairs or data points. The purpose of this technique is to find a function that approximates the true underlying function as closely as possible. There are many methods for function approximation, including polynomial interpolation, the CORDIC algorithm, regression analysis, spline interpolation, and neural networks [147], [148], [149]. The implementation of approximate functions within complex systems is facilitated by the utilization of neural networks in software-hardware co-design. This approach involves converting traditional approximable codes into equivalent neural networks, resulting in improved execution-time performance at the expense of reduced output accuracy [150], [151]. These techniques are also used at the circuit (hardware) level, and we will discuss them in detail later.
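As a small worked example of the first of these methods, polynomial interpolation at Chebyshev-spaced nodes approximates sin(x) on [0, pi/2] with a small worst-case error (illustrative code; the node count and interval are arbitrary choices of ours):

```python
import math

def lagrange_interpolate(xs, ys, x):
    # Evaluate the unique degree-(n-1) polynomial through (xs, ys) at x.
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Six Chebyshev nodes mapped from [-1, 1] onto [0, pi/2].
n = 6
xs = [math.pi / 4 * (1 + math.cos((2 * i + 1) * math.pi / (2 * n))) for i in range(n)]
ys = [math.sin(x) for x in xs]

worst = max(abs(lagrange_interpolate(xs, ys, 0.01 * k) - math.sin(0.01 * k))
            for k in range(158))
print(f"worst-case error on [0, 1.57]: {worst:.1e}")
```

Six samples already give several correct digits, which is why a fixed low-degree polynomial (or a lookup table) is a common approximate substitute for a full-precision library routine.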


5) PRUNING
Pruning is a technique widely used in deep learning and machine learning models to make models smaller and simpler. The goal is to remove redundant or unnecessary parameters from the model, which can lead to better generalization performance and faster inference times. There are several types of pruning techniques that can be used depending on the specific application and model architecture [152], [153], [154], [155], [156]. The goal of pruning is to generate a more compact and efficient model that can be implemented on resource-constrained devices or used in real-time applications without reducing accuracy or performance. This can be done through a variety of methods, including magnitude-based pruning [157], where weights with small absolute values are removed, and iterative pruning [158], where weights are gradually removed over multiple iterations of training. Pruning methods can be broadly categorized into unstructured pruning [157], [158], [159], and structured pruning [79], [160], [161], [162], [163]. For example, in deep learning, neuron or weight pruning can be used to remove neurons or connections that do not contribute significantly to the final output. This can reduce the computational complexity of the model and speed up the training process [164]. As another example, in a convolutional neural network, filter pruning can be used to remove filters that have low activation values or are redundant, which can help to decrease the computational cost and memory requirements of the model [152], [163], [165], [166], [167]. For example, Luo et al. [168] proposed a filter pruning algorithm called ThiNet, which considers the interdependence of filters in a layer and prunes them in a way that preserves accuracy. The findings indicate that ThiNet achieves a significant reduction in computational resources for VGG-16, including over 3 times fewer FLOPs and over 16 times compression, with a minimal accuracy loss of 0.52%. Additionally, ThiNet cuts parameters and FLOPs by over half, with a slight accuracy decrease of about 1%.

6) SPARSITY
Sparsity is a crucial concept in modern data processing and machine learning for optimizing energy, memory, and computation in algorithms, all without significant loss of accuracy. Sparsity is a technique employed to ensure that a large proportion of the elements in a dataset or matrix are zero or have values that will not significantly impact a calculation. There are many techniques to achieve sparsity: pruning, regularization, dimensionality reduction techniques, and matrix factorization methods. We discussed pruning techniques in the previous subsection. We can use regularization techniques like L1 regularization to handle sparsity in neural network model parameters. To reduce the dimensionality of datasets and extract features, we can use dimensionality reduction techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), and matrix factorization methods like Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF). We discussed dimensionality reduction techniques in the compression subsection. These techniques and methods are parts of low-rank matrix factorization techniques [169], which are unsupervised learning methods used for data analysis tasks such as dimension reduction, feature extraction, blind source separation, data compression, and knowledge discovery.

Recent years have seen the success of artificial neural networks in solving real-world problems and the rapid increase in their complexity and parameters. Larger networks are more computationally and memory-intensive, making them difficult to use on embedded devices [170]. To address this, there is growing interest in sparsifying neural networks. Sparse neural networks can match the performance of fully connected networks while using less energy and memory, making them ideal for resource-limited devices [171]. NVIDIA [172] has developed a straightforward and widely applicable technique for generating sparse deep neural networks for inference by utilizing a particular form of sparsity structure known as the 2:4 pattern. For example, the NVIDIA Ampere architecture's third-generation Tensor Cores in A100 GPUs utilize fine-grained sparsity in their neural network weights, enhancing matrix multiplication speed in deep learning without losing accuracy. As another example, Lu et al. [173] aim to develop an FPGA accelerator for sparse CNNs, addressing inefficiencies in existing FPGA architectures designed for dense models. The proposed solution includes a weight-oriented dataflow for handling irregular connections in sparse convolutional layers, a tile look-up table to eliminate runtime indexing matches, and a weight layout with a channel multiplexer to prevent data access conflicts. Experiments show the accelerator achieves 223.4-309.0 GOP/s on a Xilinx ZCU102, offering a 3.6x-12.9x speedup over previous dense CNN FPGA accelerators. Also, Tragoudaras et al. [174] used a state-of-the-art HLS tool to implement a MobileNetV2 model by integrating design methodologies with sparsification techniques, including sparse matrix methods and two weight pruning approaches. The objective is to develop hardware accelerators that maintain error metrics comparable to state-of-the-art systems while significantly reducing inference latency and resource utilization.

In sum, sparsification techniques, such as sparse matrix methods and weight pruning, are essential for enhancing the efficiency of deep neural networks. By reducing the number of non-zero elements, these techniques lower memory usage and computational demands, enabling faster and more resource-efficient inference. Additionally, they help maintain model accuracy while optimizing hardware performance. Implementing sparse algorithms can be more complex than implementing their dense counterparts. However, sparsification is a crucial strategy for advancing the practicality of real-time AI applications.
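The basic pipeline, magnitude pruning followed by a compressed sparse storage format, can be sketched as follows (illustrative only; production frameworks use structured patterns such as the 2:4 sparsity mentioned above, together with tuned kernels):

```python
import random

def magnitude_prune(weights, sparsity=0.9):
    # Unstructured pruning: zero out the smallest-magnitude fraction.
    flat = sorted(abs(w) for row in weights for w in row)
    threshold = flat[int(len(flat) * sparsity)]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

def to_csr(dense):
    # Compressed Sparse Row: store only non-zeros plus index metadata.
    vals, cols, rowptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                vals.append(w)
                cols.append(j)
        rowptr.append(len(vals))
    return vals, cols, rowptr

def csr_matvec(csr, x):
    vals, cols, rowptr = csr
    return [sum(vals[k] * x[cols[k]] for k in range(rowptr[i], rowptr[i + 1]))
            for i in range(len(rowptr) - 1)]

random.seed(1)
W = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(32)]
Wp = magnitude_prune(W, sparsity=0.9)
nnz = sum(w != 0.0 for row in Wp for w in row)
y = csr_matvec(to_csr(Wp), [1.0] * 64)
print(f"non-zeros kept: {nnz} of {32 * 64}")  # ~10% survive
```

The matrix-vector product then iterates only over surviving weights, which is the source of the memory and compute savings discussed above.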


7) APPROXIMATE MEMOIZATION
Memoization is a technique employed to store the outcomes of computationally expensive operations for subsequent utilization in cases where identical operations and input data are encountered [175]. Different levels of accuracy can be used to compute many algorithms. Approximate computing exploits this to decrease execution time by determining the tradeoff between performance and accuracy. Approximate memoization extends this concept by providing approximate results for new input data that correlate with previously computed and stored data. This approach, which relies on software frameworks, compilers, and programmers' decisions, is particularly useful in optimizing computational efficiency. Real programs often contain redundant computations due to factors like repetitive inputs, pattern repetitions, repeated function calls, and poor programming practices [176]. There are many works that achieve function or task memoization either at compile time or at runtime. Although Large Language Models (LLMs) train on extensive datasets, they can potentially expose sensitive information. Data preprocessing and differential privacy techniques are designed to prevent data memorization but face the challenge of reliance on data structure assumptions that might lead to false privacy concerns.

Performance enhancement is a critical requirement in high-performance and embedded computing applications, often relying on the expertise of performance engineers to optimize efficiency by leveraging both manual work and numerous analysis and optimization tools. Pinto and Cardoso [175] introduced a methodology that automates code analysis and memoization to simplify the application of memoization. It aims to assist developers without optimization expertise and provide customizable analysis for performance engineers. This approach caches the results of computations for efficiency and is tailored for both novice developers and expert performance engineers. Also, Suresh et al. [177] introduced a compile-time technique for function memoization, extended its scope to user-defined functions, and enabled transparent application to dynamically linked functions.

High-performance and energy-efficient memoization approaches face drawbacks like high runtime overheads and limited applicability, while conventional hardware techniques use specialized caches that consume excessive area and energy. Zhang and Sanchez [178] introduced MCACHE, a hardware technique that utilizes data caches for memoization while sharing cache memory with regular program data. This method boosts performance by 21x and outperforms software memoization by 2.2x in runtime efficiency. There is also a need to improve computing efficiency by reducing function call overhead through approximate function memoization [179]. Therefore, Arundhati et al. [180] introduced a software approach to function memoization that bypasses the execution of functions implemented using approximate computing techniques. A decision-making rule utilizing the Bloom filter and Cantor's pairing function is proposed to determine whether to search the look-up table (LUT) or perform the actual computation. Additionally, a simple approximation technique is proposed to search the LUT for an approximate match. Evaluation conducted using benchmarks from the AxBench suite demonstrates the effectiveness of the proposed technique. To memoize a block of code, Liu [181] proposed a hardware-compiler co-design framework, AxMemo. The goal of AxMemo is to memoize code blocks with many inputs. In other words, AxMemo tries to replace long instruction sequences with a few hash and lookup operations. Brumar et al. [176] introduced Approximate Task Memoization (ATM), a novel approach to memoizing functions or tasks at runtime. Memoization of previously executed tasks enables predictions of future results without actual execution, preserving accuracy. The runtime system also incorporates task similarity measurement and correctness assessment to automatically determine the feasibility of task approximation. The method results in a 1.4x speed increase with memoization alone and a 2.5x boost when adding task approximation, with a negligible average accuracy drop of 0.7% (up to 3.2%).

Contrary to the aforementioned techniques, researchers from Microsoft, in collaboration with researchers from the Weizmann Institute [182], introduced a new training procedure for ReLU networks that utilizes complex recombination of neurons to achieve approximate memorization. This approach aims to address the shortcomings of previous constructions and achieve efficient memorization with an almost ideal number of neurons and weight magnitudes.

In relation to Large Language Models (LLMs), LLMs train on massive amounts of text, including sensitive information. An LLM can potentially expose this sensitive information, including personal information. Previous research concentrated on literally preventing data memorization using data preprocessing and differential privacy techniques. This process faces the challenge of reliance on data structure assumptions that might lead to false privacy concerns and impact the model's overall quality. Current research treats this issue of approximate memorization in LLMs by using reinforcement learning. For example, Kassem [183] proposed a novel framework that employs a reinforcement learning approach, specifically Proximal Policy Optimization (PPO). This framework uses a negative similarity score, such as BERTScore or SacreBLEU, to measure how close the LLM's output is to the memorized data. If it is too similar, that is a negative reward.
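The core software mechanism behind approximate memoization, returning a cached result for inputs that are similar rather than identical, can be sketched with a quantizing cache (our own illustrative decorator; AxMemo, ATM, and MCACHE realize the idea in compilers, runtimes, and hardware):

```python
import math
from functools import lru_cache

def approx_memoize(precision):
    # Quantize the argument so that nearby inputs share one cache
    # entry; precision sets the accuracy/hit-rate tradeoff.
    def decorator(fn):
        @lru_cache(maxsize=None)
        def cached(key):
            return fn(key * precision)
        def wrapper(x):
            return cached(round(x / precision))
        wrapper.cache_info = cached.cache_info
        return wrapper
    return decorator

@approx_memoize(precision=0.01)
def expensive(x):
    return math.exp(math.sin(x))  # stand-in for a costly kernel

for i in range(10_000):
    expensive(1.0 + (i % 100) * 1e-4)  # 10,000 calls clustered near 1.0

info = expensive.cache_info()
print(f"hits: {info.hits}, misses: {info.misses}")
```

Almost every call is served from the cache at the cost of a bounded input perturbation of at most precision/2; a production system would additionally bound the output error, as ATM does with its similarity and correctness checks.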


or configuration, of a neural network for a specific task, a process known as neural architecture search (NAS) [184]. There are several approaches to architecture search, including reinforcement learning, evolutionary algorithms, and Bayesian optimization. These methods can be used to explore the vast space of possible network architectures and identify those that are most likely to perform well on a given task.

9) KNOWLEDGE DISTILLATION
Knowledge distillation is a machine learning approach that transfers knowledge from a large, sophisticated teacher model to a simpler, faster, and smaller student model. The student model mimics the teacher's behavior efficiently using limited resources, which makes it suitable for deployment on resource-constrained devices and for real-time applications. For example, in natural language processing, a large language model can be distilled into a smaller and faster model that can be deployed on mobile devices [185], [186]. The soft targets produced by the teacher model can be seen as a compressed representation of the knowledge learned by the teacher, and by incorporating them into the student model's training process, the student can effectively learn from the teacher's knowledge. Knowledge distillation has been applied to a variety of tasks and has been shown to be effective in minimizing the scale and computational intricacy of deep neural networks without compromising their effectiveness [187], [188], [189], [190], [191], [192], [193], [194]. For example, Hongxu et al. [195] proposed a new method called DeepInversion. This technique inverts a trained network to create class-specific images from random noise, refining the input and using batch normalization statistics for regularization. Adaptive DeepInversion enhances image variety by leveraging differences between teacher and student network outputs. The method has been applied to network pruning, knowledge transfer, and continual learning without needing the original data. Existing knowledge distillation techniques used to train student networks typically rely on task-specific data. However, the availability of such data may be limited due to privacy or confidentiality considerations. Several techniques involve generating training samples from the teacher network. Nevertheless, the generated images often exhibit discrepancies when compared with authentic ones, thereby limiting the performance of the student network. Therefore, Tang et al. [188] proposed an approach for building training datasets based on web crawling (ICCD). They proposed a pseudo-classification strategy and frequency-domain supervision (PCFS) to enhance performance by reducing the divergence between the generated ICCD and the target dataset. The findings show the proposed PCFS surpasses existing data-free methods. The code is available online.

B. APPROXIMATE PARALLELISM AND RELAXED SYNCHRONIZATION
The relaxed synchronization technique removes synchronization points, which represent one of the major bottlenecks in parallel applications, as synchronization points can cause threads or processes to spend a lot of time waiting [196]. The efficient execution of concurrent applications on multicore systems necessitates synchronization mechanisms that consume significant amounts of time, either to enable access to shared data or to fulfill data dependencies. In general, developers use synchronization to prevent undesirable interactions, such as data races, when multiple parallel threads access shared data. However, standard synchronization mechanisms have several drawbacks: synchronization overhead (time and space costs), parallelism reduction (threads waiting), and failure propagation (unperformed synchronization operations can cause threads to hang indefinitely). Every synchronization point, acting as a serialization point, can potentially impede parallel scalability [197]. Therefore, researchers are exploring the concepts of relaxed synchronization and approximate parallelism to mitigate these issues by trading minor computational errors for enhanced performance and efficiency. Relaxing synchronization offers higher performance than mixed precision but produces more errors; in particular, synchronization errors are non-deterministic and therefore complex to handle. Loading data into local memory requires a synchronization point to ensure that all threads in a block have the same view of the local memory. To decrease the time lost during synchronization, SYprox was proposed by [198], which provides a synchronization elimination mechanism that defines a way to handle the number of synchronization points. Lee et al. [199] introduced a novel algorithm for solving large-scale quadratic programming problems in parallel computing systems. They proposed "lazy synchronization," which reduces the synchronization rate while improving processor utilization and convergence speed. Tested on Amazon's 40-node distributed system, the algorithm achieved a 160x speedup and reduced communication overhead by 99.65% using the relaxed synchronization technique compared to conventional methods. To convert inherently sequential code into parallel approximations, Stitt and Campbell [200] introduced PANDORA, an automatic parallelizing approximation-discovery framework based on symbolic regression machine learning. The findings show code accelerated by 2.3x to 81x while maintaining acceptable accuracy. The framework's capabilities are further demonstrated through FPGA experiments and by eliminating loops from the code. The authors identified some limitations of the PANDORA framework and suggested solutions; for example, PANDORA has difficulty handling complex problems due to its reliance on symbolic regression, and it consumes considerable time discovering approximations.

Bueno et al. [201] observed that the majority of these works eliminate all synchronization points in aggregate, without accounting for output quality variations caused by varying input data. Therefore, Bueno et al. [201] proposed a novel strategy that uses supervised learning methods to relax synchronization in parallel applications, allowing trade-offs between quality and
execution time. The authors also proposed relax factors to be applied to the input, application, and execution environments together. The results show the proposed technique enhanced the K-means algorithm by a gain factor of 3.5x for video processing while maintaining an acceptable quality rate.

Overall, applications like image processing and neural networks can tolerate some errors, offering potential for significant improvements in execution time and energy use. Key software approximation techniques include mixed precision, which uses lower-precision data representation; perforation, which skips instruction blocks, loop iterations, or data on the assumption that nearby values are similar; and relaxed synchronization, which removes synchronization points, a major bottleneck in parallel applications. These approaches vary in performance and error. Typically, perforation and synchronization elimination offer higher performance but produce more errors than mixed precision. Synchronization elimination also introduces complex, non-deterministic errors.

C. PROGRAMMING FRAMEWORKS AND TOOLS
1) PROGRAMMING FRAMEWORKS FOR AXC
Approximate programming frameworks are tools and mechanisms that help developers integrate approximations into their programs in a controlled way to manage the trade-off between accuracy and resource usage. Approximate programming languages are particularly advantageous in scenarios where computational efficiency is of paramount importance and minor inaccuracies in the final output do not significantly impact the overall result. Such scenarios are commonplace in domains such as machine learning, signal processing, and big data analytics, where computations can be intensive. Programming languages with approximate features offer novel constructs and abstractions that empower developers to clearly define the specific portions of a program where approximations are deemed acceptable. The compiler and runtime system use these specifications to enhance the program's performance, energy efficiency, or other measurable factors, while also guaranteeing that the final outcome falls within acceptable margins of error. Approximation-enabled compilers provide a powerful means of exploiting the error resilience of applications. By automatically introducing approximations, they can significantly improve performance and energy efficiency without requiring extensive manual intervention. However, they also face several challenges. One key challenge is ensuring that the approximations do not significantly degrade the quality of the program's output; this requires careful analysis of the program's behavior and of the impact of different approximation techniques. Another challenge is managing the trade-off between accuracy and performance, which can require sophisticated heuristics and tuning mechanisms. For example, the ACCEPT compiler developed by Sampson et al. [202] uses a combination of static and dynamic program analysis to automatically determine the approximable regions of a program. It then applies a variety of approximation techniques to these regions, such as loop perforation and task skipping. Approximate programming languages can be classified based on their approach to approximation:

a: LANGUAGE EXTENSIONS
These are conventional programming languages augmented with new syntax and semantics to support approximation [203]. Examples in this category include EnerJ, Rely [204], and Chisel. EnerJ [205] is a Java extension with a design applicable to languages where data types are explicitly declared by programmers. FlexJava [206] streamlines approximate programming by automating annotations, making energy-efficient coding simpler and safer. FlexJava matches EnerJ's energy savings while reducing the number of annotations by 2x to 17x and annotation time by up to 12x in user studies. Typically, the foundational elements of language extensions manifest in three primary stages:
• Introduction of Data Types: Extensions such as EnerJ in Java incorporate novel data types like approx int or approx float, which, though less precise in calculations, yield benefits in performance and energy efficiency.
• Modification of Overloaded Operators: Arithmetic operations such as addition, subtraction, multiplication, and division may undergo alterations for approximate data types, facilitating the management of error propagation or enabling more relaxed calculations.
• Implementation of Annotations: These serve as directives for the compiler, delineating the contexts in which approximations are viable and specifying the acceptable threshold for errors (e.g., @approx_tolerance(0.05) for a function).
Introducing language extensions for approximate computing faces key challenges: rigorous error tracking to control compounded inaccuracies, ensuring type safety to avoid mixing data types, and overcoming user resistance by offering clear benefits and easy integration to encourage widespread adoption.

b: PROBABILISTIC PROGRAMMING LANGUAGES
These languages incorporate uncertainty directly into the language and employ statistical methods to compute approximate results. Probabilistic programming languages (PPLs) are designed to express probabilistic models and perform inference over them [207]. They provide constructs to define random variables, specify dependencies between variables, and encode probabilistic algorithms. PPLs often incorporate advanced inference techniques like Markov chain Monte Carlo (MCMC) and variational inference. FACTORIE [208], FlexJava [206], Venture [209], BiiP [210], Stan [211], SlicStan [212], Gen [213], Hakaru10 [214], HackPPL [215], Anglican [216], Infergo [217], Aloe [218], PyMC3 (Python), and Pyro [219] are representative examples of this category. In addition, there are many studies [220], [221] developing operational semantics as a basis for

probabilistic programming languages such as Anglican, Venture, and Church. For example, Dylus et al. [222] introduced a library for probabilistic programming in the functional logic programming language Curry. As another example, Gen is a probabilistic programming language embedded in Julia, designed by Marco [213], which offers sufficient expressiveness and performance for general-purpose use. Gen automatically optimizes custom inference strategies for specific probabilistic models using static analysis. The findings indicate that Gen's prototype matches Stan's speed [211], is only about 1.4 times slower than a custom Julia sampler, and is roughly 7,500 times quicker than Venture, another probabilistic language allowing custom inference. FACTORIE [208] is a Scala toolkit for probabilistic models, developed by McCallum and his colleagues, providing tools for building factor graphs, parameter estimation, and inference. FACTORIE offers learning and optimization tools for classification and prediction, plus NLP features like segmentation and tokenization. UMass Amherst offers tutorials and downloads for more information [223].

c: STOCHASTIC PROGRAMMING LANGUAGES
These languages incorporate randomness directly into their computations. Statistical programming languages focus on expressing statistical models and performing data analysis. They provide a wide range of statistical functions and libraries for tasks such as data manipulation, regression analysis, hypothesis testing, and visualization. While they may not explicitly deal with uncertainty, they often support probability distributions and statistical techniques for uncertainty estimation. They are often used in simulations, optimization, and machine learning [224]. Examples include AMPL, GAMS, SimJulia, StochasticPrograms.jl, and SimPy.

d: BAYESIAN PROGRAMMING LANGUAGES
These languages combine probabilistic modeling with Bayesian inference, providing constructs for Bayesian modeling and inference [225], [226]. Bayesian programming aims to substitute traditional languages with a probabilistic approach that accounts for uncertainty and incompleteness. One popular example of a Bayesian programming language is JAGS (Just Another Gibbs Sampler), which is specifically designed for Bayesian analysis of complex statistical models. JAGS provides a high-level syntax for creating and manipulating probabilistic graphical models and supports a wide range of built-in probability distributions and statistical functions. Stan [211] is a probabilistic language optimized for Bayesian inference with Hamiltonian Monte Carlo methods, automating model specification and inference. Such languages are valuable in fields like machine learning and data analysis, where probabilistic reasoning is crucial. By providing a dedicated framework for Bayesian modeling, these languages make it easier for developers to build applications that incorporate sophisticated probabilistic reasoning [227].

2) APPROXIMATION COMPUTING FRAMEWORKS
a: APPROXIMATE COMPUTING FRAMEWORKS
Several frameworks and libraries have been developed to support software-level approximate computing, such as TensorRT, TVM, and FlexFlow. These frameworks provide tools and APIs for optimizing and deploying approximate computations on different hardware platforms, such as CPUs, GPUs, and FPGAs. They can also support different levels of approximation and error metrics, and they can automate the process of tuning and optimizing the approximate computation [228], [229], [230], [231], [232]. Optimizing applications is challenging because it requires intensive resources and flexibility in precision. For example, ApproxTuner [233] is an automatic framework that addresses this issue. ApproxTuner optimizes tensor-based applications for accuracy awareness, requiring just broad quality goals. It integrates approximations across the algorithmic, software, and hardware levels through a three-phase tuning method encompassing development, installation, and operation stages, ensuring adaptability across devices. The framework introduces predictive approximation-tuning for faster autotuning by estimating the accuracy effects of approximations analytically. Tested on 10 CNNs and a combined CNN and image-processing workload, it achieved up to 2.7x speedup on GPUs and 1.9x on CPUs with minimal accuracy loss. ApproxTuner's tuning method outpaced traditional tuning, offering similar advantages more efficiently. Liu et al. [234] introduced an adaptive program graph that allows for customizable quality at the user level, based on criteria set by developers. The Approxilyzer framework [235] uses both static and dynamic analysis methods to find opportunities for approximation in software applications at the binary level, ensuring that certain computations can be approximated without losing accuracy.

The rising power needs of DNN accelerators have led to the use of approximate multipliers in modern solutions. However, assessing the accuracy of these approximate DNNs is challenging because existing DNN frameworks provide insufficient support for approximate arithmetic. To mitigate this, Danopoulos et al. [236] proposed AdaPT, a rapid emulation framework that augments PyTorch to support both approximate inference and approximation-aware retraining. AdaPT, designed for seamless deployment, is compatible with the majority of DNNs. AdaPT notably enhanced error recovery and reduced inference time by up to 53.9x across different DNN models and applications compared to conventional approximations.

b: APPLICATION-AWARE FRAMEWORK
This framework involves identifying the computations that are critical to the functionality of the application and ensuring that these computations are not approximated. This requires analyzing the application and identifying the critical computations that must be executed with high accuracy to ensure the overall functionality of the application [229], [237]. It is also

called an approximate-aware design framework [238], [239]. ApproxHadoop is a framework for implementing approximate computing in big data applications. It provides a way to automatically identify opportunities for approximation and selectively apply them to reduce the computational cost of data processing [48]. Hanif et al. [240] introduced a framework to systematically analyze the error resilience of deep CNNs and identify parameters for applying approximate computing techniques.

c: DYNAMIC APPROXIMATION FRAMEWORK
This framework dynamically identifies the computations that can be approximated based on the input data and the current state of the application. This involves monitoring the application and identifying the computations that can be approximated given its current state [241], [242], [243], [244]. For example, Wang et al. [245] introduced the Runtime Machine Learning-based Identification Model (RMLIM) to highlight noncritical segments within a software program's data flow graph. Trained offline with a designated dataset, RMLIM is subsequently applied at runtime to individual inputs. This simplifies the identification process and enhances its applicability to real-time scenarios. Preliminary results indicate that RMLIM retains energy efficiency and accuracy comparable to prevailing runtime AC techniques while notably reducing execution time by 40 to 61 percent.

Recently, Soni et al. [246] introduced "As-Is," an innovative Anytime Speculative Interruptible System, designed to promote the adoption of approximate computing by addressing the lack of hardware support and real-time accuracy guarantees. The system leverages approximate computing to deliver early outputs that improve over time, ensuring eventual full accuracy. It merges approximate and speculative computing to repurpose existing architectures for efficient approximation, offering a solution that adapts to real-time needs and allows users to choose between immediate results and waiting for complete accuracy.

d: DATA/INPUT-AWARE APPROXIMATE FRAMEWORK
This framework aims to identify the data that can be reasonably approximated without causing substantial disruption to the system's output. This is achieved by introducing intentional faults into variables and then analyzing the resulting impact on output quality [247], [248]. The approximation-based programming approach is well suited to error-tolerant applications on resource-constrained devices, as it allows for efficient computation and storage of program data. This is especially important for devices like smartphones and tablets, where battery life is crucial. However, implementing this paradigm requires source code annotations and type qualifiers, which can be problematic for large, real-world applications with limited access to source code. Pooja et al. [248] and Bernard et al. [247] present an innovative sensitivity analysis framework that facilitates the generation of annotations for programs designed for approximate computing. The framework supports the extraction of information about output sensitivity, enabling the identification of the crucial subset of data that requires precise computation and storage, while the remaining data can be approximated.

e: PROFILING FRAMEWORK FOR APPROXIMATE COMPUTING
This is a tool designed to analyze and measure the performance and accuracy of algorithms that use approximation techniques. Such algorithms are often used in modern applications that require rapid processing of large data sets, trading result accuracy for faster execution or lower memory use. A profiling framework such as AXPROF [249] provides developers with the support needed to implement these algorithms effectively and automatically. Based on the accuracy specification provided by developers, the framework generates code for statistical analysis along with models for analyzing accuracy, memory use, and timing. For verification and assessment, the framework conducts suitable statistical tests to ensure the implementation meets the specification. This type of framework is crucial for identifying bugs and performance optimizations in implementations of approximate algorithms. AXPROF profiled 15 applications across data analytics, numerical linear algebra, and approximate computing, effectively detecting bugs and providing various performance optimizations. Tutorials and examples for this framework are available online.

VII. ARCHITECTURAL-LEVEL APPROXIMATE COMPUTING TECHNIQUES
A. APPROXIMATE MEMORY TECHNIQUES
The constant communication between processors and off-chip memory makes memory subsystems the largest consumers of time and energy in modern computer architectures, from servers to mobile devices. The escalating disparity in speed between the CPU and external memory, known as the "Memory Wall," is a significant bottleneck in computer system performance. To overcome the memory wall and narrow the gap between processors and memories, designers have experimented with a wide variety of circuit and architectural advances, including 3D integration [250], bigger on-chip caches, memory-level parallelism [251], [252], faster off-chip interconnects [253], new memory hierarchies, near-memory processing or in-memory computing [254], [255], and more [256]. Figure 10 shows the classification of computing systems based on where they process data [257]. These advances are still insufficient to reduce memory energy consumption for many emerging algorithms, such as machine learning, that place increasing demands on the memory chip. Consequently, there is a need for the development of novel methodologies to enhance both energy efficiency and performance.
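The fault-injection style of sensitivity analysis used by the data/input-aware frameworks described above (Section d) can be sketched as follows. This is a minimal illustration rather than any cited framework's actual tooling; the toy application, the XOR bit-flip fault model, and all constants are assumptions made for the example:

```python
import random

def toy_app(pixels, coeffs):
    # Stand-in "application": a weighted sum, e.g., one stage of a filter.
    return sum(p * c for p, c in zip(pixels, coeffs))

def inject_fault(value, nbits):
    # Corrupt a value by XOR-ing random noise into its nbits low-order bits.
    return value ^ random.getrandbits(nbits)

def avg_output_error(pixels, coeffs, nbits, trials=500):
    # Mean relative output error when one randomly chosen input
    # is corrupted per trial.
    exact = toy_app(pixels, coeffs)
    total = 0.0
    for _ in range(trials):
        noisy = list(pixels)
        i = random.randrange(len(noisy))
        noisy[i] = inject_fault(noisy[i], nbits)
        total += abs(toy_app(noisy, coeffs) - exact) / abs(exact)
    return total / trials

random.seed(42)
pixels = [120, 64, 200, 33, 90, 180, 75, 140]
coeffs = [1, 2, 1, 2, 1, 2, 1, 2]
err_lsb = avg_output_error(pixels, coeffs, nbits=2)  # faults in 2 LSBs
err_msb = avg_output_error(pixels, coeffs, nbits=7)  # faults in 7 low bits
```

A large gap between the two error estimates suggests the low-order bits barely affect output quality and are therefore candidates for approximation, which is exactly the kind of annotation such frameworks aim to derive automatically.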

A common trait among the majority of emerging applications that heighten memory consumption is the ability to tolerate approximations in the underlying computations or data. Despite this, these applications continue to generate outputs that meet an acceptable level of quality. Approximate computing improves energy and performance by leveraging this inherent resilience through techniques applied to memories. A key element of this methodology is the use of approximate memories. These are intentionally designed memory circuits known to exhibit imperfect data retention, a characteristic attributable either to the inherent tendency of these circuits to slowly lose data over time or to errors that occur during read/write operations [9], [19]. Typical techniques proposed for developing approximate memory circuits include voltage scaling in SRAMs [258], lowering the refresh rate below the nominal value in DRAMs [259], and compressing or encoding the data [256], [260]. To lay the groundwork for understanding these concepts, the rest of this section offers a brief overview of dynamic and static random-access memories, the two essential memory types most relevant to approximate memories.

Dynamic random-access memories (DRAMs) have long been the cornerstone of memory storage in embedded systems. Due to its high capacity, durability, and affordability, DRAM remains the major choice for primary memory in numerous embedded systems. A DRAM is organized into channels, modules, ranks, chips, banks, subarrays, rows, and columns, as shown in Figure 11(a) [261]. Manufactured in various capacities and featuring data bus widths ranging from 4 to 16 pins, DRAM chips exhibit a degree of diversity [262]. To create a wider data bus, numerous DRAM chips are typically combined into a single module, forming what is referred to as a rank. A closer examination of each DRAM chip reveals a composition of numerous banks. Each bank contains a series of two-dimensional arrays, or subarrays, composed of individual DRAM cells. DRAM operations can concurrently retrieve data from multiple chips within the same rank. Requests are directed towards a specific bank, row, and column location [262]. Four commands achieve the DRAM access operations: read (RD), write (WR), activation (ACT), and precharge (PRE). The activation (ACT) command opens a row, transferring its contents to the row buffer for read (RD) or write (WR) operations. This is followed by cell charging through the precharge (PRE) command. Figure 11(b) illustrates the DRAM commands, namely ACT, RD or WR, and PRE. Moreover, it delineates the associated timing parameters, namely the delay from row address to column address (tRCD), the active time of the row (tRAS), and the precharge time of the row (tRP) [262], [263], [264].

While traditional DRAMs have been instrumental in memory storage, there has been a growing interest in optimizing power consumption without significantly compromising performance. This leads to the concept of approximate dynamic random-access memories (AxDRAMs). Approximate DRAMs are the subset of DRAM systems in which power conservation methodologies have been instituted at the expense of an increased bit-cell error rate. They hold a critical position as fundamental components within the broader domain of approximate computing. By embracing a trade-off between power efficiency and accuracy, approximate DRAMs open new avenues for energy-conscious design in embedded systems and beyond [265].

Memory occupies a disproportionate amount of real estate on an on-chip computer's integrated circuit and system layout. The SRAM cell architecture is the most popular kind of memory architecture due to its speed and reliability [254]. While the popularity of SRAM is well established, optimizing its performance is an ongoing challenge. Various methodologies have been explored to enhance the efficiency of SRAM cells, including supply voltage scaling, which aims to reduce the power consumption of SRAM cells. However, when a substantial number of cells are in standby mode, leakage power across the entire semiconductor chip can increase [255], [266]. Moreover, a significant degradation in stability is observed when the supply voltage decreases, which increases the occurrence of read, write, and hold errors [256], [260]. To minimize the occurrence of failures, circuits must be designed with the necessary device capabilities in mind.

In sum, the pursuit of innovative techniques for designing approximate memories propels a new wave of research in hardware optimization and advances the development of energy-saving memory technologies that trade off power consumption against a tolerable error. For example, Enrico et al. [267] applied approximate computing (AxC) methods to analyze hardware accelerator components for deep neural networks, focusing on the computation, communication, and memory subsystems. Their work examines performance enhancement aspects including approximate multipliers, link voltage swing reduction, voltage over-scaling, and lossy compression methods for internal SRAM memory, aiming to improve computing systems' efficiency and effectiveness. Numerous methodologies and techniques have been explored and developed in the scientific community for the design and implementation of approximate memories, reflecting the complexity and multifaceted nature of this field of study. In the next subsections, these approaches are explained in detail, and some of their key considerations as well as benefits are highlighted.

1) APPROXIMATE MEMORY BASED ON REFRESH RATE
Periodic refreshes of DRAM are required, and these procedures may use up to half of the memory's entire power [259]. During refresh, the memory cannot serve any memory access, which increases memory access latency and consequently reduces total memory throughput.
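The trade-off behind refresh-rate approximation can be sketched with a toy model. The inverse-period energy scaling simply reflects that refresh energy tracks refresh frequency; the linear bit-error model and every constant here are purely illustrative assumptions, not measurements from the cited works:

```python
def refresh_energy(period_ms, energy_at_nominal=1.0, nominal_ms=64.0):
    # Refresh energy per unit time scales with refresh frequency,
    # i.e., inversely with the refresh period (64 ms is the usual nominal).
    return energy_at_nominal * (nominal_ms / period_ms)

def bit_error_rate(period_ms, ber_at_1s=1e-6):
    # Toy retention model (illustrative only): assume the fraction of weak
    # cells whose charge decays within the period grows linearly with it.
    return ber_at_1s * (period_ms / 1000.0)

for period in (64, 256, 1024):
    print(f"{period:5d} ms: refresh energy x{refresh_energy(period):.3f}, "
          f"BER {bit_error_rate(period):.2e}")
```

Quadrupling the refresh period cuts the modeled refresh energy to a quarter while the modeled bit-error rate quadruples, which is the quality-versus-power knob the techniques below exploit.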

FIGURE 10. Categorization of computing systems by where they process data: (a)-(c) earlier CPU-centric models move data to the core, (d) newer models use near-memory processing, and (e) computation-in-memory, using memories with built-in processing capabilities (e.g., phase-change memory, memristors) [257].

FIGURE 11. Illustration of dynamic random-access memories (DRAMs): (a) structural organization, (b) commands of DRAM access operations [261].
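The ACT, RD/WR, and PRE sequence of Figure 11(b) and its timing parameters (tRCD, tRAS, tRP) can be illustrated with a toy latency model; the nanosecond values and the tCCD column-to-column gap are assumed for the sketch, not taken from any datasheet:

```python
# Illustrative DDR4-like timing parameters in nanoseconds (assumed values).
tRCD = 13.75   # ACT to first RD/WR (row-to-column delay)
tRAS = 32.0    # minimum time the row must stay active before PRE
tRP  = 13.75   # precharge time
tCCD = 5.0     # gap between back-to-back column accesses (assumed)

def row_cycle_time(column_accesses):
    # Time to open a row, perform `column_accesses` reads/writes, and
    # precharge it again. The row may not be precharged before tRAS
    # has elapsed, hence the max().
    active = tRCD + column_accesses * tCCD
    return max(active, tRAS) + tRP

one = row_cycle_time(1)     # a single column access per activation
eight = row_cycle_time(8)   # eight accesses amortized over one ACT/PRE
```

Amortizing more column accesses over one activation lowers the per-access cost, one reason row-buffer locality matters so much for DRAM performance.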

Increasing the refresh period beyond the typical 64 milliseconds used by the majority of DRAM integrated circuits (ICs) today is an effective method for lowering DRAM power consumption [259].

Flikker [12] pioneered one of the initial methods in the domain of approximate memory, specifically targeting low-power mobile DRAM. This approach begins by partitioning an application into two distinct segments, critical and non-critical, as shown in Figure 12. By employing a suboptimal refresh rate, errors are intentionally injected into the non-critical portion, thereby achieving refresh power savings. Consequently, Flikker introduced a software technique that facilitates two refresh controls, allowing for the segregation of DRAM into accurate and approximate sections, a feature particularly applicable to LPDDR DRAM.

A hardware-based method has been developed for approximating DRAM that generates high and low refresh rates for the most and least significant bits of the operand, respectively [268]. This method allows DRAM pages to be partitioned into more than two parts, with the possibility of suboptimal refresh rates. Raha et al. [259] relied on different quality parameters for partitioning DRAM pages: error characteristics, frequency, critical data percentage, and location. In a specific study conducted by [269], extensive tests were performed on 8 chips of GC-eDRAMs. The results show this approach can save energy,

reaching 55% and 75% with acceptable error rates of 10^-3 and 10^-2, respectively. The authors used refresh intervals ranging from 11 to 24 ms.

For a more in-depth look at how DRAM defects affect error-tolerant applications, we recommend the works summarized in Table 3. The approach developed by EnerJ [205] allows developers to mark the parts of an application that may tolerate errors and have them moved into approximate static RAM (SRAM) or dynamic RAM (DRAM). This approach is very important for approximate memories. The concept of approximating non-critical data allows accuracy to be loosened in exchange for energy efficiency in these types of applications.

2) APPROXIMATE MEMORY BASED ON APPROXIMATE LSB OR COMPRESSION
The second strategy emerges from investigating the relationship between output quality and the bit error rate of the LSBs [265], [270], or the degree of compression of data in non-critical regions [256]. Using such techniques can reduce energy consumption. The output quality is little affected by dropping the LSBs of a data word and setting them to a constant value (i.e., 0). This method requires only simple circuits, and the corresponding bit cells can be powered down or removed to save substantial energy [270]. Within memory system technologies, selective ECC has been applied to SRAM and DRAM to reduce MSB errors. Many methods focus on enlarging the memory word size in traditional ECC memory configurations. For example, a technique has been proposed that extends a 32-bit memory word to 36 bits, incorporating a 4-bit ECC [271]. Reducing the number of '1' bits in data can lower power use in both DRAM and SRAM, as DRAM power is tied to the number of '1' bits and SRAM power to the switching probability and the square of the voltage [272]. Another approach to enhancing memory efficiency involves encoding/decoding [272] or compressing/decompressing techniques [256], [260], [273] applied to the data written to or read from the memory. These methods can be applied to error-tolerant applications such as machine learning and video and image processing, where slight imprecision is acceptable, to significantly improve memory usage and power requirements. This represents a strategic alignment with contemporary computational demands, offering a pathway to more sustainable and responsive memory management.

Machine learning algorithms often do not fit IoT devices such as sensors due to their complexity and high memory and energy needs. The growth of the IoT has given rise to a new subfield of machine learning known as ''tiny machine learning'' (TinyML). TinyML relieves these challenges and makes the deployment of these algorithms on IoT devices possible. For example, Raha et al. [274] introduced the foundational concepts of an approximate TinyML system, including input-adaptive approximations [275]. Among the remaining limitations are technology scaling and memory technologies, which are major challenges in the application of deep learning systems in IoT devices. As technology scales below 20 nm, DRAM cells have shorter retention times, increasing their refresh power. This is especially problematic in memory-intensive applications, where DRAM's refresh power significantly impacts total system power [276]. Innovative solutions are being explored to address this challenge. For instance, a novel approach to enhancing memory efficiency was introduced by Nguyen et al. [276]. They developed a zero-cycle bit-masking (ZEM) technique integrated with ECC within the controller, specifically targeting the asymmetry of retention failures in DRAM. By applying this method, they were able to eliminate the need for DRAM refresh across various applications. The approach was tested on Tiny DNN architectures like AlexNet, DCGAN, and RNN using the Gem5 simulator. The results were promising, with performance improvements of 10.4%, 11.27%, and 17.31%, and total energy reductions of 30.2%, 34.38%, and 43.03% for LPDDR3, DDR4, and HBM, respectively.

3) APPROXIMATE MEMORY BASED ON VOLTAGE SCALING
Another technique to lower energy usage, at the expense of decreased frequency, is voltage scaling. In the power management strategy known as dynamic voltage scaling, the voltage that is applied to a component may either be raised or lowered, depending on the prevailing conditions. The purpose of voltage scaling (reducing voltages) is to reduce energy consumption. SRAM is more sensitive and begins encountering errors at a lower operating voltage than logic parts. As voltage is scaled down, SRAMs become more susceptible to malfunction [97], [258]. For instance, a study conducted by Denkinger et al. [258] focused on evaluating the robustness of artificial intelligence (AI) methods, specifically convolutional neural networks (CNNs), to SRAM errors in edge devices. By operating at reduced voltages and employing quantization, they explored ways to enhance efficiency. Their findings revealed that quantization emerged as the most effective strategy, yielding energy savings as high as 61.3%. This was achieved with only a minimal accuracy loss of 7.1%. Further efficiency was gained through voltage scaling, leading to an additional reduction of up to 11.0%. However, these benefits were accompanied by a total accuracy loss of 13.6%.

In IoT nodes, the 6T Static Random-Access Memory (SRAM) cell, known for its compactness and minimal area, is widely used for data processing, as referenced in [266]. However, this cell suffers from several inherent limitations that need to be considered when using it. These include reliability issues at low voltage, potential conflicts during read/write processes, data disturbance during reads, a high data retention voltage, and half-select problems [277]. To address these challenges, several design strategies at the cell and architecture levels have been introduced. These strategies focus on reducing power usage in write, read, and leakage states, improving stable data retrieval and write efficiency, and addressing half-select


FIGURE 12. Memory allocation scheme: (a) the baseline DRAM module, (b) a module with data sorted and allocated in pages based on different refresh rates, (c) the Flikker approach [12], and (d) the QCA-DRAM approach [259].

problems [266]. One approach to enhancing the performance and stability of the 6T SRAM cell is to increase the number of transistors within the cell. This modification can lead to improved control and functionality, although it may also impact the cell's compactness [266], [277]. A novel approach to reducing read and hold power in SRAM architecture was proposed by Gupta et al. [29], utilizing a reconfigurable VDD scaling technique (R-VDD). This method significantly minimizes power consumption. To implement this R-VDD scaled architecture, they employed a ''data-dependent low-power 10T'' SRAM cell (D2LP10T).

4) APPROXIMATE MEMORY BASED ON APPROXIMATE READ/WRITE OPERATIONS
Emerging non-volatile memory based on memristor technology has been proposed as a solution for approximate computing that can balance performance and power consumption. When used for compute acceleration, approximation-augmented processing combines each processor with a tiny amount of controllable associative memory [278]. Emerging STT-MRAM (Spin Transfer Torque Magnetic Random Access Memory) memories, which offer higher density and lower static power consumption compared to SRAM, face challenges of high energy usage in read/write operations. QuARK [279] and Cast [280] are hardware and software approaches introduced for STT-MRAM caches. These approaches allow reliability to be traded for energy savings in the on-chip memory hierarchy of multi-core systems running approximate applications.

5) MEMORY REDUCTION BASED ON APPROXIMATE COMPUTING TECHNIQUES
AxC techniques for memory reduction are commonly implemented at the design stage, often in conjunction with specific memory handling methods [281]. While these strategies may be tailored to particular applications [282], they necessitate a comprehensive understanding of the data's computation and handling. This requirement can be time-consuming and often serves as a barrier to quick implementation. For instance, memory reduction can be achieved by reducing buffer sizes [282], memory reuse methodologies [283], and/or pruning and quantization approaches [284], [285], at the cost of sacrificing accuracy or throughput. In the context of Approximate Buffer (AxB) techniques, one innovative approach presented in [286] focuses on the reduction of buffer size. This method involves the concatenation of data into buffers using the fixed-point format with a chosen bit-width, where all data within an AxB adheres to the same format. The primary goal of this technique is to minimize the memory footprint, achieving reductions ranging from 27% to 68% in applications such as the full SKA SDP signal processing computing pipeline and wavelet transform. Remarkably, this reduction is accomplished without substantial degradation in output quality. However, it does come with the drawback of requiring manual and labor-intensive Design Space Exploration (DSE). To further simplify this process, an application DSE for buffer-sizing was proposed by [287]. This additional approach aims to reduce the memory footprint while ensuring that the output quality remains above a specified threshold.

6) EMERGING MEMORY DESIGN TECHNOLOGIES: PROCESSING-IN-MEMORY (PIM)
PIM is a computing paradigm that enhances data processing efficiency by integrating processing capabilities closer to storage units. Traditional architectures store data in memory and require CPUs to move it between components, leading to time-consuming transfers and performance bottlenecks. PIM integrates processing elements directly into memory cells or


controllers, allowing data to be processed in place without transferring it to a separate unit. This results in significant speedup and energy efficiency improvements, particularly for data-centric workloads. There are different approaches to implementing PIM: Processing in DRAM (P-DRAM), Processing in NAND Flash (P-NAND) [288], Processing in 3D Stacked Memory [289], [290], and Near-Memory Computing/Processing (NMC/NMP). Instead of integrating processing into the memory cells themselves, the last approach (NMC) places specialized processing units near the memory, reducing data movement overhead.

Processing-In-Memory (PIM) technology is widely used in image and neural network processing and comes in two main types: analog-based PIM and digital-based PIM. In analog-based PIM, arithmetic operations can be achieved using resistance networks. In digital-based PIM, additions and multiplications are executed through basic operations such as NOR, which require multiple clock cycles. Analog PIM is known for its high speed, but it encounters accuracy problems and demands a significant area footprint to accommodate the required analog-to-digital converter (ADC) and digital-to-analog converter (DAC) interface modules [291]. Byun et al. [291] proposed an analog processor-in-memory filter within a CNN setup, which features a 16 × 4 SRAM, 16 DACs, and 4 ADCs, as depicted in Figure 13. It includes a controller for SRAM, DAC, and ADC timing optimization and power reduction, alongside a main controller overseeing all operations. Inputs flow from the AI controller to the DAC controller, utilizing a charge-sharing method. Power efficiency is achieved by activating components only as required.

FIGURE 13. The architecture of the Convolutional Neural Network (CNN) implemented within an Analog Processor-In-Memory framework [291].

On the other hand, digital-based PIM excels in accuracy but experiences higher latency due to the multiple clock cycles required for computations, particularly multiplications. Through the strategic utilization of the intrinsic parallelism present within application algorithms, the acceleration of the computational process can be effectively achieved.

Memristor, also referred to as Resistive Random Access Memory (ReRAM), is a well-known technology in Processing-In-Memory (PIM) architectures due to its capability for analog computing, which speeds up the matrix-vector multiplications that are essential for these systems. Nonetheless, convolutional neural network training with a high-precision backward propagation phase presents difficulties on account of the poor resolution of these analog PIM accelerators. Hai et al. [292] addressed this challenge by introducing ReHy, a novel hybrid PIM accelerator for CNN training on ReRAM arrays. ReHy combines analog PIM (APIM) for performance in the feedforward propagation phase (FP) and digital PIM (DPIM) for accuracy in the backpropagation phase (BP), offering a comprehensive solution for CNN training. The study reveals that ReHy markedly improves CNN training efficiency, outpacing standard CPU/GPU architectures (baseline) and FloatPIM by 48.8 and 2.4 times, respectively, while also reducing energy usage by 35.1 and 2.33 times compared to each.

This paradigm involves integrating processing capabilities into memory storage to reduce data movement between the CPU and memory. In the fields of image processing and computer vision, convolutional neural networks (CNNs) have emerged as a prevalent tool. While Graphics Processing Units (GPUs) are commonly employed to accelerate CNNs, this approach is constrained by the substantial computational costs and memory demands associated with the convolution process. This limitation has led to a focus on approximate computing, a method explored in numerous studies to mitigate computational expenses [267]. The introduction of the Approximate Data Comparison processing-in-memory (ADC-PIM) solution by Choi et al. [289] marks a significant advancement in addressing the performance bottleneck caused by increased memory bandwidth intensity. Implemented in 3D-stacked memory, ADC-PIM strategically compares data for similarity before it is loaded onto the GPU, unlike conventional post-loading methods. This approach results in the transfer of only essential data to the GPU, reducing data movement and computational requirements. The application of ADC-PIM has led to a 43% boost in processing speed and a 32% reduction in energy use, with minimal accuracy loss below 1%.

The limitations inherent in processing-using-DRAM are primarily characterized by its support for only a limited range of basic operations, including logic functions and addition. Such constraints have impeded the complete exploitation of the capabilities inherent in processing-using-DRAM, thereby necessitating the investigation of strategies to enable the execution of more complex and user-specified operations. Addressing this challenge, Hajinazar et al. [293] proposed SIMDRAM, an extensive framework explicitly crafted to empower processing-using-DRAM that supports complex functions and efficiently handles sophisticated and


user-defined functions without hardware changes. They evaluated its performance, showing its superiority over traditional CPUs and GPUs in throughput and energy efficiency, especially with 16 DRAM banks. SIMDRAM performed well in real-world applications with minimal overhead. This marks a significant advancement in processing-using-DRAM technology.

Within the field of Processing-In-Memory (PIM), or in-memory computing (IMC), the predominant focus of research has been the optimization of energy efficiency within a limited voltage range. This concentration on a narrow voltage spectrum has consequently restricted the applicability of IMC in scenarios characterized by dynamic workloads, where optimization across a wide dynamic voltage range (WDVR) is necessitated. In response to this limitation, a recent innovation has been introduced by Zhang et al. [294]. They implemented a novel IMC-based Binary Neural Network (BNN) accelerator. This development addressed a previously unmet need within the IMC domain by supporting energy-efficient operations over a broad voltage range.

Processing-in-memory (PIM) represents a paradigm shift in computing architecture that takes advantage of the distinctive physical characteristics of emerging memory systems to boost data processing. These systems include resistive random-access memory (ReRAM), spin-transfer torque magneto-resistive random-access memory (STT-MRAM), and phase-change memory (PCM) [295]. The principal merits of PIM lie in its ability to minimize data movement and reduce latency. The inherent characteristics of PIM architectures significantly enhance performance and energy efficiency, especially in data-intensive tasks. However, the adoption of PIM is not devoid of challenges. These include heightened design complexity, the imperative of efficient thermal management, and the need to maintain data integrity. Ongoing research in the development of advanced PIM circuits and systems remains a key area of focus, with potential for continued innovation in the domain.

7) APPROXIMATE CONTENT-ADDRESSABLE MEMORIES
Content-Addressable Memory (CAM) is notable for enabling data retrieval by content instead of location, enhancing the parallel search capabilities essential for high-speed, memory-intensive systems. CAM's adaptability is evident in its application across network routing, digital signal processing, and microprocessor design. Recent advancements in CAM design have focused on improving efficiency in comparison-driven tasks. However, challenges remain in creating CAM systems that are cost-effective, energy-efficient, and capable of similarity searches. The use of approximate CAM is limited by factors like similarity, accuracy, speed, complexity, and cost. The exploration of approximate CAM in computer memory systems reveals significant benefits and drawbacks. Its rapid associative searching capabilities are advantageous for applications like network routing and data retrieval.

However, CAM faces challenges including high costs, power consumption, limited scalability, and issues with data integrity. Additionally, its read and write speeds may not align with conventional RAM, its design complexity demands careful implementation, and its static nature complicates data updates, potentially leading to higher latency during certain write operations [296]. Despite its limitations, CAM is valuable for rapid associative searches, but a thorough analysis of its trade-offs is crucial for its effective integration in computing systems. Yinjin et al. [297] introduced CARAM, a novel hybrid PCM and DRAM primary memory system, to address Phase-Change Memory's (PCM) limitations, such as slow memory write speed and limited robustness, despite its high read throughput and low standby power. CARAM, addressing the challenges of limited primary memory capacity in modern DRAM-PCM combinations, improves memory efficiency through deduplication, line sharing, and optimized memory use. It reduces write traffic and duplicate line writes, thereby enhancing PCM wear-leveling and expanding memory capacity. CARAM also maintains high data access performance, which is essential for memory system optimization. Experimental results demonstrate CARAM's effectiveness, showing a 15%–42% reduction in memory usage, a 13%–116% increase in I/O bandwidth, and 31%–38% energy savings compared to existing hybrid systems. In conclusion, CARAM marks notable progress in memory technology, addressing PCM challenges effectively through its innovative design and deduplication strategy, making it a promising area for future exploration.

In contemporary computational systems, Hardware Search Engines (HSEs), particularly Content Addressable Memory (CAM), represent a paradigm shift from traditional software search algorithms, offering enhanced data retrieval and association capabilities. However, CAM's high energy use, especially in cells and matchlines during searches, poses a challenge, notably in the energy-efficient multi-port CAM used in modern superscalar processors [298]. To overcome this, research has focused on low-energy alternatives like precharge-free CAM, which balances speed and power efficiency in associative memory [298]. Additionally, innovations include high-speed, energy-efficient single-port CAM designed for dual-port functionality, improving search performance and addressing multi-port CAM limitations [299].

8) FRAMEWORKS AND SIMULATORS FOR APPROXIMATE MEMORY
An important role for approximate memory may be found in error-tolerant applications, where sacrificing perfect accuracy in data processing in favor of saving energy is


acceptable. It is possible to introduce probabilistic errors into read/write accesses in approximate memory. In most cases, energy-saving circuitry or architectural changes (such as reduced refresh rates or reduced voltages) are to blame for these malfunctions. Since the degree of error that may be accepted varies from application to application, the capacity to simulate these systems is crucial [271], [300]. Through simulation, one may examine an application's behavior and test its robustness against real-world error rates, thereby identifying the optimal trade-off between reduced energy use and improved product quality. Menichelli et al. [271], Stazi et al. [300], and Yayla et al. [301] proposed emulators to reveal the effects of errors introduced by approximate memory circuits and architectures on the hardware platform and software. Yarmand et al. [302] introduced a methodology for identifying suitable approximation degrees for approximable memories within a memory hierarchy for executing error-tolerant applications.

FIGURE 14. Optimization of slack intervals for energy efficiency in real-time system operations [305]. (a) Preliminary computation, (b) DPM technique, (c) DVFS technique.

B. VOLTAGE-FREQUENCY-POWER MANAGEMENT TECHNIQUES
One of the main trade-offs in system-level approximate computing is between the accuracy of the computation and its performance (i.e., speed and energy consumption). In general, increasing the level of approximation can lead to faster and more energy-efficient computation, but it can also reduce the accuracy of the results. Reducing processing complexity in real-time systems provides more idle (slack) time. The slack time, as shown in Figure 14, is the period between the task's end and its deadline. Exploiting time slack refers to utilizing periods of idle time or low activity in a system to reduce power consumption without compromising performance [305]. Voltage-frequency-power management techniques are strategies used in electronic systems, particularly in processors, to optimize power consumption and performance. These techniques dynamically adjust the operating voltage and frequency of a system based on the workload, power budget, and thermal conditions. There are several approaches that can be used to exploit time slack and reduce power consumption: Dynamic Voltage and Frequency Scaling (DVFS), Thermal Design Power Management (TDP), Dynamic Memory Management (DMM), Dynamic Power Management (DPM), Task Migration, Adaptive Voltage Scaling (AVS), Frequency Scaling, Voltage Scaling, Clock Gating, Power Gating, Energy-Efficient Scheduling, Near-Threshold Voltage (NTV) Operation, Sub-Threshold Operation, etc. These techniques are energy-efficient approaches at the architecture or system level and have been widely adopted in the Internet of Things (IoT). We briefly discuss some of these techniques below:

1) DYNAMIC VOLTAGE AND FREQUENCY SCALING (DVFS)
DVFS is a technique in which the processor's voltage and frequency are dynamically altered according to the workload. When the workload is low, the voltage and frequency can be scaled down to reduce power consumption [306]. This technology is most effective in dynamic power environments and is widely supported by chip manufacturers, often referred to as ''turbo mode'' in some contexts.

2) VOLTAGE OVERSCALING (VOS)
VOS is a method that reduces the supply voltage of circuits to improve energy efficiency. This can lead to increased computation errors or failures due to insufficient voltage for the transistors to switch states robustly. To balance the energy gains with reliability, systems might need error management strategies. VOS is especially useful in energy-sensitive devices like battery-operated gadgets or IoT sensors, where longer battery life is crucial.

3) DYNAMIC POWER MANAGEMENT (DPM)
DPM is a technique that involves dynamically adjusting the power consumption of a system based on the workload. This can be done by selectively turning off or reducing the power to different components of the system [307]. In idle time, the system enters a deep sleep state; during this state, total energy can be dramatically reduced using power gating and clock gating.

4) DYNAMIC MEMORY MANAGEMENT (DMM)
DMM is a technique used to optimize memory usage by dynamically allocating and deallocating the memory of a system based on the workload. For example, a portion of the memory can be turned off when it is not being used to save power [308]. This technique is particularly useful in systems with varying memory requirements and limited memory resources, such as embedded systems and mobile devices.

5) DYNAMIC THERMAL MANAGEMENT (DTM)
DTM is a set of techniques used to manage the heat generation of a system [309]. They monitor the temperature of the system and dynamically adjust the voltage, frequency, or workload distribution to prevent overheating. This can include techniques like thermal throttling, where the system reduces its


TABLE 3. Comparative analysis of various approximate memory implementation strategies.

performance to decrease heat generation when it detects that it is getting too hot.

6) ADAPTIVE CLOCKING
This technique involves adjusting the clock frequency of a processor based on the workload [310]. For example, the clock frequency can be reduced during periods of low activity to save power [311]. Li et al. [310] introduced a rapid and power-saving SNN processor that supports online learning. The researchers used various techniques, such as adaptive clocking and event-driven operation, to reduce power consumption and accelerate computation.

7) NEAR-THRESHOLD VOLTAGE (NTV)
This technique offers significant energy efficiency improvements by operating processors close to the threshold voltage


of the CMOS transistors [312]. While this approach reduces energy consumption, it also presents challenges such as increased latency and sensitivity to transistor variability. Techniques like massive parallelism and temporary voltage boosts have been proposed to mitigate these issues, and advanced semiconductor technologies like FinFETs help reduce variability concerns. Operating near this threshold minimizes energy consumption while still maintaining a higher degree of reliability and lower error rates compared to VOS. NTV strikes a balance between energy efficiency and computational reliability. NTV computing requires careful design and technological choices to fully harness its energy-saving potential. NTV is particularly suited for IoT applications, such as wireless audio hearables, which require continuous operation but not necessarily full performance at all times [312].

8) TASK MIGRATION
Task migration involves moving tasks from high-power devices to low-power devices when the high-power device is not being fully utilized. For example, tasks that are not compute-intensive can be moved from a CPU to a low-power GPU [313].

By utilizing these approaches, systems can reduce power consumption during periods of idle time or low activity without impacting performance. D'Agostino et al. [314] provided encouraging insights into energy-efficient computing in hardware and software.

C. APPROXIMATE PROCESSORS
As computing tasks become increasingly complex, there is a rising demand for new paradigms, like approximate computing, that enhance efficiency. However, the majority of existing hardware-based approximation solutions have been tailored to specific applications or limited to smaller computing units, necessitating significant engineering work for full system integration [315]. Approximate processors and accelerators are integrated approximate computing units based on hardware-software co-design. They are designed to enhance computational efficiency by allowing controlled inaccuracies, which are particularly useful in error-tolerant applications.

Research interest is growing in ARM processors, which, due to their low-power architecture and support by various tools, are prevalent in mobile devices. Furthermore, there are open-source instruction set architectures (ISAs) for processors, represented by the open, royalty-free RISC-V architectures, supported by major tech firms [315], [316]. Aponte-Moreno et al. [316] proposed a fault tolerance approach to reduce the execution time by using approximate computing at the software level. The researchers used the ARM and RISC-V microprocessor architectures for testing the proposed approach. In another work, Baroughi and his colleagues [315] introduced AxE, the first general-purpose, heterogeneous RISC-V MPSoC platform that combines exact and approximate cores. This multiprocessor was supported by the capability of hardware approximation exploration across various applications through software instructions. The proposed task mapping method tested on AxE achieved a 32% speed-up and 21% energy savings while maintaining 99.3% accuracy across three mixed workloads. MPSoC architectures are becoming increasingly popular for demanding workloads in low-power devices like wearables and IoT sensors due to their high performance and exceptional QoS. Therefore, Ali et al. [307] introduced a comprehensive review of MPSoC architectures and found that scheduling approaches and voltage-frequency-power management techniques are the most commonly used to reduce power consumption in MPSoCs.

The growth of the IoT has led to increasing demand for low-cost, resource-constrained devices that operate within tight power budgets. To increase their capabilities, new approaches, such as approximate computing techniques, are needed to build a new generation of low-power IoT devices. Therefore, Taştan et al. [317] proposed an approximate IoT processor using the RISC-V ISA, designed specifically for machine learning tasks like classification and clustering. The proposed processor achieves up to 23% power savings in ASIC implementations, maintaining over 90% top-1 accuracy on trained models and test datasets. The integration of IoT in smart cities has necessitated advanced solutions for processing mixed workloads, combining real-time data with historical records for enhanced analytics. Jawarneh et al. [318] introduced SpatialSSJP, an adaptive system that efficiently manages stream-static joins, optimizing for Quality of Service (QoS) and geo-statistical accuracy. SpatialSSJP was implemented on Spark Structured Streaming and tested on large datasets. Consequently, SpatialSSJP showed significant performance improvements over existing methods and achieved high accuracy levels, with notable gains in optimal scenarios.

Deep learning tasks require optimized memory bandwidth due to their intense resource and memory requirements. These requirements make them suitable for parallel computing architectures like TPUs, which feature deeply pipelined networks of processing elements for efficient dataflow and high performance [17], [319]. Google's Tensor Processing Units (TPUs) are specialized ASICs that accelerate machine learning by using less precise formats like bfloat16 instead of the 32-bit floating-point format, significantly cutting computation time and memory use while preserving accuracy for many tasks [17]. TPUs have been used to implement NN applications (MLPs, CNNs, and LSTMs) in datacenters. Elbtity et al. [319] proposed an approximate tensor processing unit (APTPU) consisting of two key components: approximate processing elements (APEs) with low-precision multipliers and approximate adders, and pre-approximate units (PAUs) that pre-process operands for the APEs within the APTPU's systolic array. However, systolic-array DNN accelerators are known for their cost efficiency but struggle with high energy use, limiting their use in low-power

146052 VOLUME 12, 2024


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

devices. Approximate computing offers a solution at the expense of slight accuracy losses, which could, however, make DNNs more prone to disturbances such as permanent faults, already a concern in accurate DNNs, especially in critical applications such as autonomous driving, where reliability is paramount. Ensuring the reliability of DNN hardware often requires extensive fault injection testing [320], [321]. Siddique et al. [320] and Ahmadilivani et al. [321] addressed the challenge of exploring the approximation and fault resiliency of DNN accelerators. Siddique et al. [320] conducted a detailed analysis of fault resilience and energy consumption in various AxDNNs at the layer and bit level, using the Evoapprox8b signed multipliers. Their findings reveal that a single permanent fault in AxDNNs could result in as much as a 66% drop in accuracy, while the same fault might cause just a 9% accuracy reduction in a conventional DNN accelerator. In similar work, Ahmadilivani et al. [321] focused on enhancing DNN accelerators' fault resilience and approximation, using AxC arithmetic circuits for error emulation and a GPU-based framework for swift evaluation. Their work also delves into analyzing fault propagation and masking in networks.

TPUs enhance deep learning efficiency through optimized dataflow and low-precision support, but at the cost of potential accuracy drops and fault vulnerability. Advancements in approximate computing show promise in mitigating these issues, which is crucial for applications demanding high performance and reliability.

There are many interesting approximate processors and accelerators that have gained importance in different applications, such as energy-efficient IoT devices, real-time video processing, and machine learning inference tasks, where trade-offs between precision and performance can yield significant benefits. Some of these processors and accelerators are mentioned in the applications section.

VIII. CIRCUIT-LEVEL APPROXIMATIONS
The concept of approximating logical functionality is sufficiently generic that it is applicable to both software [148] and hardware [15]. Approximate computing began gaining acceptance in the early 1960s, when logarithm-based multiplication and division were first being developed [322]. The considerable research interest in designing approximate circuits has been propelled by the substantial potential for power consumption reduction. Approximate computing focuses primarily on arithmetic units, e.g., adders and multipliers, at the level of custom hardware, as these constitute the fundamental components of numerous error-tolerant applications and all computations. Current research in VLSI design focuses heavily on real-time DSP and machine learning for applications such as surveillance and wearable technology. These areas need quick, accurate data analysis for pattern recognition. IoT and edge processing emphasize immediate, local processing over cloud computing due to latency and connectivity issues. However, local processing requires solutions that are low-power, accurate, fast, and cost-effective. Many algorithms in this field use basic functions such as trigonometric and logarithmic functions. Calculating transcendental functions on computers typically involves software, leading to delays. Thus, hardware implementations have become vital due to their performance benefits over software. Numerous publications detail these hardware implementations for arithmetic units and elementary functions.

A. APPROXIMATE ADDERS
Approximate computing is an emerging paradigm that aims to optimize power consumption, area, and delay. This approach involves the strategic redesign of a system's logic circuit to allow controlled imprecision, i.e., some degree of inaccuracy in the results. Computing errors are generally undesirable, but some applications can tolerate imprecise computation.

A critical focus within this domain has been the design of arithmetic circuits, particularly adders. Adders represent a fundamental element of arithmetic units; they have received special attention from researchers and play an important role in error-tolerant applications. Accurate adders may suffer from high delays, complexity, or power consumption. A Ripple Carry Adder (RCA) works by adding the bits of the two numbers one by one, from the least significant bit (LSB) to the most significant bit (MSB), in a chain-like manner. The critical path of an adder is defined by its whole carry chain. Although the RCA is relatively slow, it is a simple and commonly used circuit for small addition operations. For larger additions, other types of adders, such as carry-lookahead adders or carry-select adders, are used; these have faster carry propagation but suffer from area overhead and higher power consumption. Approximate computing is becoming increasingly important as the demand for more efficient computing grows, as it allows the same task to be completed with fewer resources. For computationally intensive processes like machine learning, this speeds up and improves outcomes.

In digital circuit design, the approximate computing technique provides a potential solution for decreasing power, area, and latency. This is accomplished by redesigning the logic circuit using many different implementation approaches that permit a decrease in accuracy [15]. Approximate computing can be applied to circuits at different levels: the transistor level, the logic gate level, and the architecture level. In the literature, a wide variety of approximate adders [15], [16], [323], [324], [325], [326], [327], [328], [329] have been reported: segmented adders, where an n-bit adder is partitioned into k-bit subadders [15], [323], [324]; approximate full adders, in which a single full adder is approximated at the logic or transistor level [16], [325], [326]; carry-select adders, in which multi-stage subadders are utilized [327], [328]; and speculative adders, which reimagine traditional designs and optimize performance by bypassing the infrequently exercised critical path [329], [330].
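The carry chain that these approximation schemes attack can be made concrete with a bit-level model of the ripple-carry adder described above. The following Python sketch is illustrative only (it is not code from any of the surveyed papers):

```python
def full_adder(a, b, cin):
    """One-bit full adder: sum = a XOR b XOR cin; carry-out is the majority."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, n=8):
    """n-bit ripple-carry adder (RCA): the carry propagates bit by bit
    from LSB to MSB, so the critical path grows linearly with n."""
    result, carry = 0, 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result  # final carry-out dropped (addition modulo 2**n)
```

Breaking this chain at some bit position, as the segmented and lower-part designs discussed later do, is what shortens the critical path at the cost of occasional errors.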


A more in-depth look at the many kinds of approximate adders (and, more generally, approximate units) reveals that the techniques now in use adopt one of three methodologies towards inaccuracy [15], [331]: 1) insignificant and frequent errors, 2) significant and improbable errors, or 3) a combination of 1 and 2. In the first methodology, the designers approximate the less significant bits of the arithmetic unit to obtain small-magnitude errors that occur frequently. The quality of the application is not substantially diminished by these errors, as they are overshadowed by the system's intrinsic truncation and noise errors. For instance, the Optimized Feedback Lower-part Constant-OR Adder [15] and the Lower-part OR Adder (LOA) [325] follow the first methodology. In the second methodology, the designers engineer the more significant bits of the arithmetic unit so that errors are infrequent but large in magnitude. The thinking behind this is that applications are resilient enough to recover from occasional errors. Examples of this methodology are the Almost Correct Adder (ACA) [323], the Feedback Approximate Adder (FAA) [15], and the Generic Accuracy Configurable Adder (GeAr) [324]. In the third methodology, the designers combine the two previous methodologies to enhance existing approximate adders, for example, the enhanced [15] and hybrid [331] approximate adders. Combining the best features of both previous principles is the preferred, and sometimes necessary, choice for real-world applications. The primary purpose of this work is to survey the current research and development status of various approximation approaches.

1) APPROXIMATE FULL ADDER AT THE TRANSISTOR LEVEL
As the one-bit full adder (OBFA) is the primary circuit for implementing an n-bit adder, the fundamental arithmetic circuit of any digital system, it plays a crucial role in the calculation process [332]. A full adder is a combinational logic circuit designed to add three bits: two input bits and a carry bit from a previous stage. An approximate hybrid full adder, by contrast, is a modified version of a full adder that combines two or more logic styles to reduce power consumption and area overhead while maintaining reasonable accuracy [332].

There are seven different full adder cells based on static logic styles: Complementary Metal-Oxide Semiconductor (CMOS), Complementary Pass-transistor Logic (CPL), pass-transistor logic (PTL), single-rail pass transistor (LEAP), double pass transistor (DPL), pseudo-NMOS (ratioed logic), gate diffusion input (GDI), and hybrid full adders. As a result, a great deal of thought and care must be invested in selecting a particular OBFA topology at the transistor level and designing the associated circuit, as this affects the overall performance and energy of the system. We also recommend reading [332], [333].

The CMOS full adder is a widely used circuit for the binary addition of two 1-bit numbers, employing NMOS and PMOS transistors. This logic style is widely used in digital circuits due to its high noise immunity, low power consumption, and reliable operation at low voltages. Despite its benefits, it has drawbacks such as bulky PMOS transistors, an increased transistor count, high input impedance, and high delay. The CMOS full adder is usually designed using multiple stages of CMOS inverters and transmission gates [334], [335]. One example is a 14-transistor (14T) CMOS full adder cell, which boasts a 50% reduction in the threshold loss issue and an increase in the output voltage swing, but has significant delays.

Pass-Transistor Logic (PTL) is a digital logic design method that uses pass transistors to implement logic functions. It offers greater efficiency and energy savings compared with traditional static CMOS logic. PTL achieves smaller circuit sizes and lower production costs due to the smaller number of transistors in its gates, leading to less power use and reduced propagation delays. However, PTL can face signal degradation from parasitic capacitances, especially in complex circuits, which may impact performance.

Hybrid logic circuits offer a balance between speed and power consumption, attracting increasing attention due to the proliferation of hybrid-based topologies in recent years [332], [333], [336], [337], [338], [339], [340]. The purpose of this review is to provide the designer with a simple but effective method for discovering which topologies are optimal in terms of power consumption, throughput, or a mix of these metrics. This survey considers hybrid architectures and includes the most current topologies. Several requirements are traded off to attain distinct benefits in full adder designs. In this context, the number of transistors, delay time, power consumption, and output voltage swing are crucial [336]. In contrast to the typical CMOS full adder, which requires 24 transistors, the Mirror adder requires 28 transistors. Both provide precise output voltage levels, which results in a large area and significant energy use. To reduce the number of transistors required while maintaining the full output voltage swing, several different designs have been proposed. Unfortunately, many designs have reduced the number of transistors to achieve low power consumption at the expense of a diminished output voltage swing [337], [341]. For example, the proposal in [337] employs just 8 transistors, based on two XNOR gates, each with three transistors and an inverter, making it a simpler architecture in terms of transistor count. However, this lower-transistor-count circuit has an issue with threshold loss, which causes the logic voltages ''1'' and ''0'' to be slightly off from Vdd and 0, respectively. Many designs have implemented solutions to this problem, often by expanding the allowed range of the output voltage or by boosting the voltage at each of the outputs. When operating at low supply voltages (e.g., Vdd = 1.8 V), the deteriorated output might lead the circuit to give incorrect outputs for certain input combinations, making it all the more crucial to minimize threshold loss [336]. To reduce the threshold loss problem and increase the output voltage swing, Hassani et al. [336] proposed a 16-transistor accurate full adder (16T FA) design using a 10T CMOS FA. This design


serves as a foundational element for proposing a Lower-part-OR (LOA) approximate adder [325]. This design reduces power consumption by 53% compared with the LOA adder, at the cost of a 12% drop in accuracy and increased delays.

Diverse methodologies for designing XOR-XNOR circuits have been published in recent years; such circuits are used to prevent glitches at the full adder's output nodes. Kandpal et al. [340] proposed a 10-transistor XOR-XNOR circuit that provides full-swing outputs with improvements in power consumption and performance. The results show that the power delay product (PDP) is 7.5% higher than that of the XOR-XNOR modules available in 2020. They used this XOR-XNOR circuit to design a 1-bit full adder (OBFA) called the high-speed hybrid full adder design-4 (HSHFA-D4), using 20 transistors. The results show a 28.13% improvement in PDP compared with other architectures. Another hybrid full adder design, the scalable low-power hybrid full adder (SLPHFA), was proposed by Hasan et al. [339]. Without resorting to an intermediary propagate signal, the carry signal and summing operation are generated via a novel AND-OR module and two XOR modules, respectively, utilizing transmission gates and CPL logic styles. HSHFA-D4, SLPHFA, and HFA-22T [342] are all characterized by a lack of driving capability. Figure 15 shows different classical and hybrid FA adders based on various logic styles.

The Gate Diffusion Input (GDI) method is an efficient technique for designing full adders, reducing transistor count and power consumption while offering a compact design, but it faces limitations in voltage scaling and operating speed. The GDI-10T full adder was proposed by Nirmalraj et al. [343]; it consists of one 4T XOR gate and two 2:1 multiplexers. Combining the GDI and PTL logic styles produced a novel twist on the conventional full adder circuit; as a result, the design required only 10 transistors to perform addition.

The shrinking of MOSFETs leads to challenges such as increased leakage current and higher manufacturing costs. To address these, feature-size scaling in digital circuits is key to reducing the power-delay product (PDP) and power consumption. Carbon nanotube field-effect transistors (CNFETs), including p-type and n-type devices, are emerging as alternatives to MOSFETs, offering higher switching speeds and similar mobility at equivalent sizes [344]. There is a substantial amount of published material describing circuit implementation using CNFETs. For example, Bhargav et al. [344] proposed 10T and 13T approximate adders based on 32 nm CNFET technology.

2) APPROXIMATE FULL ADDER AT GATE LEVEL
In an effort to lessen the critical path and hardware complexity of precise adders, a number of approximation approaches have been developed. Approximate adders are based on the idea that they can complete the addition faster than precise adders by breaking the carry propagation chain. This kind of approximate adder separates the adder into two segments: an exact adder is used for the higher significant segment, while approximate full adders are used for the less significant ones. This group has a basic truncation method [15], [16], [325], [345], [346]. The Lower-part OR Adder (LOA) [325], proposed in 2010, is the most well-known design in this class; it consists of two subadders: an accurate subadder and an approximate subadder. The higher significant (accurate) subadder achieves error-free calculation by using a conventional precise adder such as the ripple carry adder (RCA) or the carry-lookahead adder (CLA). The lower significant (approximate) subadder is constructed from OR gates only, to approximately obtain the LSB summations. Moreover, the accurate subadder's precision is enhanced by using the carry from the MSB input pair of the inaccurate subadder through an AND gate. However, the precise subadder size determines the LOA critical path delay, and LOA produces both positive and negative errors.

FIGURE 15. Architecture of Feedback approximate adder cell [15].

In 2012, Albicocco et al. [347] proposed LOAWA, a modified version of the LOA adder, by removing the AND gate that provides a carry from the inaccurate part to the accurate part; this design has only positive errors. A year later, Gupta et al. [326] proposed an approximate adder, APPROX5, in which the inaccurate part is composed of one of the input pairs. In 2018, Dalloo et al. [16] studied, analyzed, and systematically designed an approximate adder called the Optimized Constant Lower-part OR Adder (OLOCA), in which the inaccurate part is constructed from constant ones and OR gates. Dalloo et al. [15], [16] showed that the number of OR gates must not be less than two. In the same year, Dalloo [15] systematically designed an approximate adder segment (cell) called the Feedback Approximate Adder Cell (FAA), which constructs an accurate adder segment with a unique logic circuit, as shown in Figure 15. This cell has a smooth error-correction capability: it partly compensates for the errors produced by the inaccurate subadder by returning a carry feedback to it. The cell feeds the accurate subadder with the carry of the MSBs of the inaccurate subadder. The authors pointed out that the cell can be repeated and connected through an OR gate. Furthermore, Dalloo [15] modified OLOCA into OFLOCA, constructing the inaccurate subadder from constant ones, two OR gates, and two bits of the cell. The cell can be repeated to shorten the critical path. OFLOCA outperforms state-of-the-art architectures such as OLOCA, LOA, etc.
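To make the LOA construction concrete, the following is a bit-level Python sketch of an n-bit adder with a k-bit lower OR part and the carry-recovery AND gate. It is our illustrative model of the design in [325], not code from that paper:

```python
def loa_add(x, y, k=3):
    """Lower-part OR Adder (LOA) model: the k low result bits are x OR y,
    and the accurate upper subadder receives a carry-in equal to the AND
    of the most significant input-bit pair of the lower part."""
    mask = (1 << k) - 1
    low = (x | y) & mask
    carry_in = ((x >> (k - 1)) & 1) & ((y >> (k - 1)) & 1) if k else 0
    high = (x >> k) + (y >> k) + carry_in  # exact upper subadder
    return (high << k) | low
```

For example, loa_add(12, 4, k=2) returns the exact sum 16 (the lower parts are zero), while loa_add(3, 1, k=2) returns 3 instead of 4 and loa_add(2, 2, k=2) returns 6 instead of 4, showing the negative and positive errors noted above.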


FIGURE 16. Block schematics of some approximate gate-level adders, (a) LOA [325], (b) LOAWA [347], (c) APPROX5 [326], (d) OLOCA [16],
(e) OFLOCA [15], (f) HOERAA [348], (g) HERLOA [350], and (i) HOAANED [349].
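Designs such as those in Figure 16 are typically compared through exhaustive error metrics such as the error rate (ER) and mean error distance (MED). A minimal, hedged harness is sketched below; the LOAWA-style lower-part OR adder here is only a stand-in example, not any specific published design:

```python
from itertools import product

def error_metrics(approx_add, n=8):
    """Exhaustively compare an approximate n-bit adder with exact
    addition; returns (error_rate, mean_error_distance)."""
    distances = [abs(approx_add(x, y) - (x + y))
                 for x, y in product(range(2 ** n), repeat=2)]
    total = len(distances)
    return sum(1 for d in distances if d) / total, sum(distances) / total

def or_lower(x, y, k=3):
    """LOAWA-style adder: OR the k low bits, no carry into the upper part."""
    mask = (1 << k) - 1
    return (((x >> k) + (y >> k)) << k) | ((x | y) & mask)

er, med = error_metrics(or_lower)  # how often and how far the adder deviates
```

An exact adder scores (0.0, 0.0) on both metrics; published comparisons additionally normalize MED by the output range (NMED) and report hardware cost alongside it.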

In 2019, Balasubramanian and Maskell [348] proposed a modified OLOCA that uses a 2-to-1 multiplexer (MUX21) at the MSB of the inaccurate subadder; the carry of the MSB of the inaccurate subadder (denoted Ck) is used to control the multiplexer and to feed the accurate subadder. The multiplexer is fed the carries Ck and Ck−1 and generates the sum of the MSB of the inaccurate subadder. This adder is called the Hardware Optimized and Error Reduced Approximate Adder (HOERAA); it cannot correct the error that occurs when the carries Ck and Ck−1 are ''01'' and the inputs Xk and Yk are ''X0'' or ''0X''. To partly solve this issue, the same authors [349] modified HOERAA by adding an OR gate after the multiplexer, as shown in Figure 16. The modified design was proposed in 2021 as a hardware-optimized approximate adder with a near-normal error distribution (HOAANED).

In 2020, Seo et al. [350] proposed an approximate adder called the Hybrid Error Reduction LOA (HERLOA). This design is a modified LOA with a structure similar to OFLOCA, using carry feedback to the lower significant bits of the inaccurate subadder through OR gates and the two-bit feedback cell, but with modifications. Lee et al. [345], in 2021, proposed a new approximate adder called the Error Reduced Carry Prediction Approximate Adder (ERCPAA), which aims to reduce the error metrics at the cost of increased hardware metrics. Figure 16 shows the architectures of the aforementioned gate-level approximate adders, where n, k, and n-k refer to


the size of the approximate adder, the inaccurate subadder, and the accurate subadder, respectively. X and Y are the inputs of the approximate adder, and SUM refers to its output.

Furthermore, Parallel-Prefix Adders (PPAs) are among the fastest adders; their binary addition process is segmented into three distinct stages [351], [352]: pre-processing, prefix-processing, and post-processing. The pre-processing stage calculates the generate and propagate signals. The heart of the PPA, the prefix-processing stage, leverages the prefix operator for accelerated carry computation. The nature of the prefix operator allows the flexibility of executing individual operations in any sequence, which has led to the development of various parallel-prefix architectures. Lastly, in the post-processing stage, the sum bits are calculated by adding the previous carries and propagate signals. PPAs have received special attention as approximation targets; for example, Rosa et al. [344] recently proposed approximate parallel prefix adders (PPAs) using approximate prefix operators (AxPOs), which consist of carry operator nodes. They applied the approximate AxPOs to four well-known PPA architectures: Kogge-Stone, Brent-Kung, Ladner-Fischer, and Sklansky. Advanced digital design requires parallel prefix circuits such as adders and priority encoders, for which conventional design techniques often struggle to balance area and delay effectively. Therefore, the NVIDIA Applied Deep Learning Research group [352] proposed a reinforcement learning-based method with a specialized environment and representation for efficiently designing parallel prefix circuits. There are also interesting review papers [353], [354] on gate-level architectures of approximate adders.

To minimize the critical path and complexity of precise adders, numerous alternative approximation schemes have been proposed. These schemes include speculative adders, such as the Carry Cut-Back adder [355], the Reverse Carry Propagate adder [356], and the VASP adder [330], and segmented adders, such as the Feedback adder [15] and the low-latency generic accuracy configurable adder (GeAr) [324].

In conclusion, the characteristics and performance of segmented and speculative adders diverge significantly from those of approximate full adders. Segmented adders split the carry chain, leading to larger but infrequent errors. In contrast, speculative adders offer high accuracy but at the cost of complex circuits. This creates a trade-off: speculative adders are less favorable due to their complexity compared with segmented and approximate full adders. The design of adders thus requires balancing efficiency with precision.

FIGURE 17. The structure of 7 × 7 BAM multiplier [325].

B. APPROXIMATE MULTIPLIER
Multipliers exhibit high complexity, which tends to consume energy and cause increased delays in computational operations. Multipliers are essential to microprocessors, digital signal processors, and embedded systems. Their applications vary from fundamental filtering operations to advanced convolutional neural networks [357]. This is especially important in large-scale machine learning tasks because convolution operations depend heavily on multiplication-accumulation processes. Consequently, there has been a notable shift in research focus towards developing low-power, high-performance approximate multipliers. This development stems from the need to optimize energy efficiency and processing speed in such tasks, addressing the inherent limitations of multipliers in comparison with simpler, more energy-efficient adders. The operational structure of a multiplier comprises three stages: partial product generation, partial product reduction (accumulation), and final addition. Approximations can be introduced at any of these stages, but the accumulation stage in particular is a focus of research because of its significant power and delay consumption, highlighting the importance of designing low-power, low-delay approximate multipliers. The Wallace tree, Dadda tree, and carry-save adder array are the primary structures for partial product accumulation in multipliers. The Wallace tree uses parallel-operating full or half adders (FAs/HAs) without carry propagation, leading to a logarithmic delay (O(log(n))). Its FAs, acting as (3:2) compressors, can be replaced by other compressors, such as (4:2) compressors, to reduce delay.

The Dadda tree is similar but uses fewer adders. In contrast, the carry-save adder array passes carry and sum signals from one row of FAs/HAs to the next, operating in series and resulting in a linear delay (O(n)), which is longer than the Wallace tree's. For example, Sabetzadeh et al. [358] proposed a new approximate multiplier that produces the least significant half of the product using an approximate multiplier with error-compensation capability and the other half using an accurate multiplier. The proposed design enhances the energy-delay product by 77% over exact designs and by 54% compared with existing approximate designs, on average.

1) TRUNCATED MULTIPLIERS
In the quest for efficient computational operations, the design of approximate multipliers is a key area of focus, particularly for applications that demand a balance between accuracy and power consumption. Among the various strategies employed, truncated multiplication simplifies operations by discarding the least significant bits of the input operands or by removing partial products (AND gates) or full adder cells, thus reducing the silicon area and speeding up the multiplier,


but with a manageable loss in precision. By employing appropriate correction functions, the truncation error can be effectively minimized. For example, the Broken-Array Multiplier (BAM) [325] is a variant of truncated array multipliers. This design shares foundational similarities with the conventional array multiplier but introduces a distinctive modification: the strategic omission of Carry-Save Adder (CSA) cells in both horizontal and vertical orientations. This alteration, as depicted in Figure 17, is not arbitrary but is governed by two critical parameters: the Horizontal Break Level (HBL) and the Vertical Break Level (VBL). These parameters determine the specific cells to be omitted, as marked by hatching in the figure. The primary advantage of this design lies in its compact and expedited circuitry, achieved at the expense of precision. Farshchi et al. [359] later modified the BAM multiplier using Booth encoding. Roy and his colleagues addressed the needs of computational applications with real-time precision demands by designing approximate and reconfigurable circuits that align power consumption with computational accuracy. Roy et al. [360] proposed an accuracy-reconfigurable version of the approximate Broken-Array Booth multiplier. This design incorporates partial error correction by adding sign bits to a Broken-Array multiplier. This new reconfigurable multiplier design significantly reduces power consumption compared with traditional and modern multipliers.

2) COMPRESSOR-BASED MULTIPLIERS
Another approach is compressor-based design, which stands out for its ability to streamline the accumulation stage. These designs utilize various compressor configurations, such as 7:3, 5:2, 4:2, and 3:2 compressors [361], [362]. Among these, the 4:2 compressor is often favored for its structural regularity, particularly when implemented in cascading configurations. This preference is also reflected in its widespread application in the design of Dadda multipliers [361], [362]. For example, Edavoor and colleagues [362] introduced an innovative 4:2 compressor design. This approximation-based approach yields significant improvements: specifically, it achieves 56.80%, 57.20%, and 73.30% reductions in area, power consumption, and delay, respectively, in comparison with a conventional, accurate 4:2 compressor. However, this is balanced by an error rate of 25% and a maximum error distance of ±1. The 4:2 compressor efficiently executes four additions at once, enhancing parallelism, which in turn minimizes the critical path and dynamic power dissipation. Dornelles et al. [363] proposed two topologies based on CMOS+ gates to decrease the power, area, and delay of the 4:2 compressor.

3) BOOTH ENCODING MULTIPLIER
The ever-growing demand for efficient and compact digital circuits has fueled the development of approximate computing techniques. In the domain of multiplication, Booth multipliers represent a popular choice due to their versatility and ease of implementation. The use of modified Booth encoding significantly streamlines the multiplication of large numbers by reducing the partial products. The modified Booth encoding (MBE) can reduce the number of PPs by half [364]. Zhu et al. [365] introduced a novel approach to designing Approximate-Truncated Booth Multipliers (ATBMs). These ATBMs are crafted using a combination of Modified Radix-4 Booth Encoders (AMBEs), approximate 4:2 compressors (ACs), and a technique of gradually truncating partial products. A key feature of this design is its ability to adjust accuracy levels. This adjustability is achieved by varying the number of AMBEs and ACs incorporated into the system, thereby allowing a customizable balance between precision and computational efficiency.

However, traditional Booth multipliers suffer from high hardware complexity, limiting their applicability in resource-constrained scenarios. Haider and colleagues [366] addressed this challenge by introducing an innovative approximation approach to enhance the efficiency and reduce the hardware complexity of Booth multipliers while maintaining negligible error rates. The new approach requires only N/4 Booth decoders, reducing the Normalized Mean Error Deviation (NMED) and Power-Area-Product (PAP) of the 16-bit BD16.4 approximate Booth multiplier compared with existing advanced multipliers.

4) SEGMENTED (RECURSIVE) MULTIPLIERS
Approximate segmented (recursive) multipliers offer another way of dividing the multiplication process into smaller multiplier blocks. The simplest method in this category involves using smaller, approximate multipliers to build larger multipliers, leading to the generation of approximated partial products [367], [368]. The low-power approximate techniques are applied more aggressively to the segments dealing with the less significant bits. Research teams focus on developing approximate 2 × 2 or 4 × 4 multipliers, utilizing near-exact half (HA) and full (FA) adders, or alternatively employing approximate counters or compressors for this purpose. In 2011, Kulkarni et al. [367] proposed the under-designed approximate multiplier (UDM), based on a proposed approximate 2 × 2 multiplier block. The approximate 2 × 2 multiplier produces an erroneous output only when both inputs are ''11'', where the output is ''111'' instead of the accurate ''1001'', reducing the accumulation stage. In 2016, Rehman et al. [369] also proposed a 2 × 2 approximate multiplier block with a lower maximum error magnitude. A year later, Venkatachalam et al. [370] used statistical analysis to transform the partial products am,n and an,m into propagate pm,n and generate gm,n signals as follows:

pm,n = am,n + an,m
gm,n = am,n · an,m    (1)

In comparison, the chances of gm,n being one are substantially lower, at 1/16, unlike am,n, which has a higher

146058 VOLUME 12, 2024


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

probability of 1/4. On the other hand, the probability for pm,n to be one is 7/16, exceeding the likelihood for gm,n. Using this transform concept, Venkatachalam and his colleagues proposed approximate 4 × 4 multiplier blocks and a 4 × 2 compressor using the proposed approximate half and full adders. Furthermore, they used the proposed blocks and compressor to build higher approximate multipliers, which are characterized by a high error rate and cost. Waris et al. [368] applied a similar methodology to transform the partial products into propagate and generate signals, designing two NOR-based approximate half adders (NxHA) and one full adder (NxFA). The authors used these adder cells to build two 4 × 4 approximate multiplier blocks (more-approximated (MxA) and less-approximated (LxA) multipliers) and then larger multipliers. Waris and his colleagues designed the multiplier with a lower error rate (approximately half), lower cost, and higher performance than Venkatachalam's design.

5) FPGA BASED DESIGNED MULTIPLIER
FPGAs provide high-speed multipliers, which are characterized by their flexibility of reconfiguration and different precision formats to optimize performance for a specific task. An FPGA has a limited number of dedicated multipliers; therefore, this operation often needs to be designed using FPGA LUTs. Designers then face the challenge of design complexity, and creating efficient FPGA designs requires specialized knowledge. For example, Ullah et al. [371] introduced a new approximate multiplier architecture tailored for FPGA-based systems, offering a methodical design approach and an accessible online library. This innovation outperforms traditional ASIC-based approximations in terms of area, latency, energy efficiency, and accuracy. Specifically, it surpasses Xilinx Vivado's multiplier IP, showing up to 30% area, 53% latency, and 67% energy improvements with minimal accuracy compromise. The provided open-source library aims to spur further research within the FPGA community, marking a significant shift towards optimized reconfigurable computing.

6) LOGARITHMIC MULTIPLIER
Logarithmic multipliers (LMs), especially base-2 designs, offer a highly efficient approach for converting multiplication into addition and shifting operations. They significantly improve the hardware efficiency of error-tolerant applications [372], [373]. The implementation of these multipliers comes with accuracy and design complexity bottlenecks, and it requires a dedicated circuit to compensate for errors and improve both hardware cost and accuracy. For example, Pilipović et al. [374] proposed a two-stage approximate logarithmic multiplier that uses less area and energy. Makimoto et al. [375] proposed two-segment piecewise-linear compensation for Mitchell's logarithmic multiplier to improve its accuracy. Yu et al. [376] proposed an approximate LM, named HEALM, that integrates error compensation with mantissa truncation, using a lookup table to enhance accuracy and efficiency.

7) HYBRID MULTIPLIER
Lastly, hybrid techniques that combine two or more of these methods can offer a balanced solution, optimizing both accuracy and power consumption. For example, Ansari et al. [377] proposed and developed new 4 × 4 and higher approximate multipliers using a combination of Booth input encoding and a proposed approximate 4:2 compressor. The proposed design achieved a 52% reduction in the PDP-MRED product and outperformed other similar-accuracy approximate Booth multipliers. Choudhary et al. [378] introduced an automated method for generating approximate circuits with formal worst-case relative error (WCRE) guarantees using Look-Up Tables (LUTs) and SAT-based techniques. The proposed 8-bit approximate multiplier reduced power consumption and delay by 83.33% and 25.3%, respectively, with only a 1.2 dB SNR degradation in a Finite Impulse Response (FIR) filter.

C. APPROXIMATE DIVIDER
Many exact algorithms have been proposed for implementing division operations. Digit recurrence, a trusted and exact division algorithm, offers simple logic but faces latency and space inefficiencies, which limits its use in high-speed applications [379]. Digit recurrence is an iterative approach comprising Restoring, Non-Restoring, and SRT dividers (the latter a sub-branch of non-restoring). For example, Patankar et al. [379] introduced the exact USP-Awadhoot divider, a digit recurrence design that can be adapted as restoring or non-restoring and optimized for space-efficient electronic applications.
Designing an efficient divider necessitates using inexact computation to address the inherent issues of high latency, large area, and significant power consumption in typical traditional division circuits [380]. Piso et al. [381] showed that a 1% improvement in a division circuit block can boost system performance by up to 20%. An approximate divider is a computational unit designed to perform division with a trade-off between accuracy and efficiency, used in various error-tolerant applications, including image processing, machine learning, wearable electronics, etc. Recent research on approximate dividers focuses on finding effective trade-offs by reducing complexity, such as employing approximate subtraction or reciprocals [382], [383], [384], [385], logarithmic approximation [382], truncation [383], reducing the number of iterations [385], using lookup tables, or applying other approximate techniques. Furthermore, researchers focus on developing new methods for error analysis and management to minimize errors.

1) FLOATING, FIXED POINTS, AND FPGA DIVIDER
The floating-point divider is a complex component in arithmetic-heavy digital designs, categorized into combinational and sequential types, with a focus on the latter. Peter Malik [386] implemented three iterative division algorithms, including Newton-Raphson, Goldschmidt, and combined Goldschmidt and binomial divisions. The key principle of


these algorithms is that the implementation of the division operation depends on an inverse process of the multiplication operation, where the denominator is iteratively subtracted from the numerator. The accuracy depends on the number of iterations and the computational complexity of each iteration. Bureneva et al. [387] proposed a fixed-point version of Newton-Raphson division. Recently, Ebrahimi et al. [388] proposed RAPID, tunable-accuracy multiplier and divider architectures customized for FPGAs.

2) TRUNCATED, APPROXIMATE RECIPROCAL AND DYNAMIC ITERATION STOPPING DIVIDERS
Approximate floating-point (FP) multipliers have been extensively explored in recent applications, which overshadows the study and development of approximate FP dividers, despite their significant utility [382]. We noticed that a number of significant research efforts have been dedicated to developing approximate dividers using different approximate computing techniques. For example, Oelund et al. [382] proposed an approximate floating-point divider using an approximate hardware-friendly reciprocal and an iterative logarithmic multiplier. The authors corrected the errors by storing them in a lookup table. The accuracy of this design can be configured in real time. Vahdat et al. [383] also used the approximate reciprocal multiplied by the truncated value of the dividend for designing the approximate divider. Truncated dividers are a basic approach to approximate division that limits calculations to a certain number of bits, offering speed and simplicity at the cost of potential errors, which vary by application. Behroozi et al. [385] introduced SAADI, an approximate divider design that boosts energy efficiency in error-tolerant applications by allowing dynamic adjustments in accuracy for an energy-quality balance. SAADI can dynamically balance accuracy, speed, and energy in a division circuit by adjusting iteration counts for reciprocal approximation, diverging from traditional fixed-accuracy designs. It achieves 92.5% to 99.0% accuracy in divisions while providing flexibility in latency scaling, showcasing its potential in low-power signal processing. Wang et al. [389] introduced an approximate divider called ''HEADiv'' based on a truncated Taylor series, with the induced error compensated while carefully considering the associated hardware complexity.

3) APPROXIMATE SUBTRACTOR-BASED DIVIDER
Designing approximate dividers can be achieved by employing an approximate subtractor. This method is characterized by the ability to fine-tune error management through adjustable approximation levels in the subtractors, but it may influence the overall efficiency. The subtractor is a common unit in the class of division algorithms called digit recurrence algorithms. For example, Jha and Mekie [390] proposed inexact restoring-array dividers (IRADs) using four different proposed approximate subtractors based on CMOS technology. The authors also analyze in depth the designs based on PTL and CMOS technologies.

4) APPROXIMATE LOGARITHMIC DIVIDERS
Approximate logarithmic dividers operate on the principle of logarithmic computation to perform division, which is a fundamentally different approach from traditional division algorithms. When multiplication and division based on logarithms were first developed by Mitchell in the early 1960s, it marked the beginning of the acceptance of approximate computing [322]. Logarithmic dividers (LDs) introduce significant errors, which makes them unsuitable for applications where high precision is required. However, LDs are characterized by low complexity, low power consumption, and high speed. This makes them well-suited for use in error-tolerant applications such as digital signal processing, image and video processing, and machine learning algorithms, where they contribute to more energy-efficient designs [391], [392]. Liu et al. [391] addressed the issue of high errors by combining restoring-array and logarithmic dividers to design approximate hybrid dividers. Also, Wu et al. [392] introduced a low-power, high-performance approximate divider using logarithmic operation and piecewise constant approximation. The design was optimized using a heuristic algorithm to minimize errors.

5) APPROXIMATE HIGH-RADIX DIVIDERS
The importance of high-radix dividers lies in their ability to significantly improve computing speed and efficiency, but they also require careful consideration of hardware complexity, power consumption, and precise control. For example, Chen et al. [393] proposed a high-radix divider and analyzed and compared this design with other approximate dividers. They showed that the approximate radix-2 divider is particularly beneficial in constrained-resource applications, but the high-radix divider is useful for applications requiring high-speed computations. The decision to implement a high-radix divider should be based on a comprehensive analysis of these factors (computational efficiency, precision, latency, scalability, and circuit complexity) in the context of the intended application.

D. APPROXIMATE ELEMENTARY AND ACTIVATION FUNCTIONS
The importance of elementary and activation functions in various computational paradigms cannot be overstated. Elementary functions such as trigonometric, exponential, and logarithmic forms are the bedrock of numerous applications in science and engineering. However, the complexity of these functions often poses challenges in real-time or high-performance computing environments. The need for rapid calculations in applications like signal processing and control systems makes it imperative to find efficient ways to implement these functions, often leading to a trade-off between speed and accuracy.

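To make this speed-accuracy trade-off concrete, the sketch below replaces tanh with a handful of compare-multiply-add segments, the piecewise linear style of approximation common in elementary-function hardware. The breakpoints and chord coefficients are illustrative choices of ours, not taken from any cited design:

```python
import math

# Hypothetical 4-segment piecewise-linear tanh for |x| (odd symmetry handles
# negative inputs). Each entry is (upper_bound, slope, intercept): chords of
# tanh over [0, 0.5], [0.5, 1], [1, 2], then saturation. Illustrative only.
SEGMENTS = [
    (0.5, 0.9242, 0.0),
    (1.0, 0.5990, 0.1626),
    (2.0, 0.2024, 0.5592),
    (float("inf"), 0.0, 1.0),   # saturate for large inputs
]

def tanh_pla(x):
    """Piecewise-linear tanh: one compare chain, one multiply, one add."""
    ax = abs(x)
    for bound, slope, intercept in SEGMENTS:
        if ax <= bound:
            y = slope * ax + intercept
            break
    return math.copysign(min(y, 1.0), x)

# Worst-case absolute error of the sketch over a dense grid on [-5, 5].
max_err = max(abs(tanh_pla(i / 1000 - 5) - math.tanh(i / 1000 - 5))
              for i in range(10001))
```

Four chords already bound the absolute error near 0.04; hardware schemes tune the segment count and coefficients to hit a target error at minimal area, which is exactly the accuracy-versus-cost knob discussed above.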

On the other hand, activation functions are the linchpin of deep learning algorithms, particularly in the architecture of neural networks. These specialized functions introduce the necessary non-linearity that enables the network to learn from data and adapt to various complexities. They are crucial in applications that require pattern recognition, such as image and speech recognition, natural language processing, and even complex game theory problems. However, the choice of an activation function and its implementation can significantly impact the learning efficiency and operational complexity of a neural network. The challenges here are multifold, including but not limited to the vanishing and exploding gradient problems, computational cost, training convergence, and the risk of overfitting. Therefore, understanding the mathematical properties and computational complexities of these functions is crucial for both academic research and practical implementations.

These characteristics collectively ensure that activation functions can effectively support the diverse needs of neural network training. For instance, the Gaussian Error Linear Unit (GELU), the Rectified Linear Unit (ReLU), the Leaky ReLU, the Sigmoid, and the Hyperbolic Tangent (Tanh) are all examples of popular activation functions. GELU is known for its smoothness and is often used in transformer models like GPT-3, BERT, and most other Transformers [394]. For models like BERT and GPT-2 that employ the tanh approximation, utilizing this method to approximate the GELU activation function is advisable for model reproduction. However, it is worth noting that this approach generally yields less accurate results and can be slower for large input sizes compared to directly computing the accurate GELU function [395]. ReLU, characterized by its simplicity and computational efficiency, is widely used in convolutional neural networks but suffers from the ''dying ReLU'' problem, where neurons can sometimes become inactive. Leaky ReLU addresses this issue by allowing a small, non-zero gradient when the input is less than zero. Sigmoid and Tanh functions are among the earliest used activation functions and are particularly useful in scenarios where the output needs to be scaled between specific ranges; however, they are less popular in deep networks due to the vanishing gradient problem. Each of these activation functions has its own advantages and disadvantages, and the choice often depends on the specific requirements of the neural network architecture and the problem being solved. Figure 18 shows the most used activation functions over the last six years [396].

Typically, five prevalent computing techniques are used to implement these functions: the look-up table (LUT) approach [397], [398]; polynomial approximation [399], [400]; piecewise linear, nonlinear, and polynomial approximation; shift-and-add algorithms [401] like the coordinate rotation digital computer (CORDIC) algorithm [402], [403]; and hybrid approaches [404]. Approximate and stochastic computing approaches [405], [406], [407], [408], [409], [410] have also garnered considerable interest in recent years.

There are benefits and drawbacks to the strategies mentioned above [404]. The LUT method, known for its simplicity and speed in computing elementary functions, demands significant silicon space due to its memory-based accuracy. While it is computationally simple and fits stable functions, its accuracy hinges on memory size [411]. Recent advancements have focused on enhancing LUT methods through techniques like linear interpolation [412], range-addressable LUTs (RALUT) [413], table-lookup-and-addition methods like multipartite methods [414], input-aware quantized table lookup [415], and twofold lookup methods [416], though these also introduce challenges in terms of hardware complexity. Polynomial approximation, used for finer estimates, necessitates numerous multipliers, adders, and coefficient-storing LUTs, making it area-inefficient and slow [417]. The CORDIC algorithm is a cost-effective iterative method using adders, shift operations, and registers. However, it is limited by serial multiplier-like delays and a narrow input range, making it slower for exponential and hyperbolic functions. Despite this, enhancements over the past two decades show promise for efficient real-time computing solutions. Function approximation varies in complexity and suitability [418]. Piecewise Linear Approximation (PLA) is basic, offering low computational needs and ease of use, ideal for simple control systems and initial data analysis, but it struggles with complex functions. Piecewise Nonlinear Approximation (PNA) addresses this by using nonlinear functions for better complexity handling, useful in machine learning and financial models but with higher computational demands. Piecewise Polynomial Approximation (PPA) uses segmented polynomials for function approximation, common in signal processing and scientific computing, but can have boundary issues. Hybrid methods mix various techniques to enhance accuracy or efficiency [404]. Recently, there has been significant interest in approximate and stochastic computing, known for high speed, fault tolerance, and low cost. While stochastic computing offers low power usage, it faces challenges like reduced precision and longer latency [419]. Approximate computing, on the other hand, balances hardware cost and accuracy, showing potential for improving integrated system performance [410], [420].

For example, Dong et al. [41] introduced a piecewise linear approximation computation (PLAC) method for nonlinear unary functions, which includes an optimized segmenter and quantizer, enhancing the universal and error-flattened piecewise linear approximation approach. Then Wu et al. [42] developed PLAC without a multiplier, later optimized by Zhang et al. [43] to minimize segment count and reduce the maximum absolute error (MAE). For their circuit designs, all authors focused on the [0,1) interval. However, this approach requires the use of the exponential function's scaling property for processing inputs and outputs. Recently, Dalloo et al. [404] proposed a hybrid approach for


FIGURE 18. The most popular activation functions used over the last six years [396].

implementing exponential and hyperbolic functions with an input range of [-10, 10]. Hajduk and Dec [417] proposed a simple FPGA-based method for implementing the hyperbolic tangent function using ordinary or Chebyshev polynomial approximations. The authors examined different implementation configurations to show their effects on FPGA resources and calculation time. For more details, we recommend reference [404], which provides a valuable literature review of methods for implementing these functions. For designing Nth root and power operations, Changela et al. [421] proposed a low-complexity VLSI architecture using three classes of radix-4 CORDIC algorithms. They computed logarithm, division, and exponential operations using the radix-4 modified hyperbolic vectoring, linear vectoring, and modified scaling-free hyperbolic rotation CORDICs, respectively.

In our analysis, we found that although there have been significant improvements in these techniques, challenges persist in attaining balance among energy efficiency, latency, accuracy, and hardware complexity. Specifically, certain existing systems face constraints in terms of scalability, adaptability, and performance, especially when confronting the rigorous demands of real-time digital signal processing (DSP) and artificial intelligence tasks. We believe addressing these challenges demands innovative approaches that can effectively achieve the trade-offs of hardware designs in elementary and activation function computations.

IX. APPROXIMATE LOGIC SYNTHESIS AND FRAMEWORKS
Approximate logic synthesis (ALS) is an automated design approach for approximate digital circuits that can achieve a balance between accuracy and efficiency in terms of power, area, and performance. It automates combinational and sequential circuit and accelerator design to be adapted to various applications and to technological and user constraints [422]. In real-world applications, the current challenge facing ALS is how to adapt to different accuracy needs while managing power and delay variations. Accomplishing that requires the design of quality-configurable circuits that can adjust to varying accuracy levels in real time. Furthermore, ALS must not only address gate-based netlists or Boolean circuit representations but also inexact operators.

There are two main approaches to ALS: deterministic and stochastic. Deterministic ALS uses predictable techniques to design approximate digital circuits, making definitive and predictable modifications to enhance efficiency. Stochastic ALS uses probabilistic algorithms to randomly simplify or modify digital circuits, leading to varied outcomes in different iterations [423], [424]. Furthermore, ALS can be divided into four main categories, each of which has unique methodologies and applications: structural netlist transformation, Boolean rewriting, approximate high-level synthesis (AHLS), and evolutionary synthesis.

A. STRUCTURAL NETLIST TRANSFORMATION
Structural netlist transformation involves the optimization of a given logic circuit by transforming its netlist structure. This can be achieved through various techniques such as gate replacement, reordering, or removal [355], [422], [423], [425]. Several ALS methods function by manipulating the circuit netlist. For instance, Gate-Level Pruning (GLP) [355] and Circuit Carving [425] achieve this through the removal of gates from a circuit. Conversely, SASIMI [5] employs a different approach by altering the circuit's wiring. The primary goal is to reduce the overall complexity of the circuit without significantly affecting its functionality. The AxLS framework [426], an open-source tool dedicated to the exploration and testing of existing netlist transformation techniques, serves as a pivotal resource in the field of ALS. AxLS converts a Verilog netlist to a synthesized netlist (in XML format) based on a standard-cell library and then applies the approximation techniques under user and application constraints. Finally, AxLS uses external synthesis and simulation tools for analyzing and evaluating the approximate netlist. Figure 19 shows a simplified version of the AxLS framework.

In another work, Witschen [427] introduced an innovative methodology for ALS called MUSCAT, which generates valid approximate circuits by inserting cutpoints into the netlist. It utilizes formal verification engines to identify minimally unsatisfiable subsets, ensuring optimal cutpoint activation without violating quality constraints. MUSCAT outperformed state-of-the-art methods, including AIG-rewriting [428] and EvoApproxLib [429], achieving up to 80% higher savings in circuit area with lower computation times.

B. BOOLEAN REWRITING
Boolean rewriting focuses on the manipulation of Boolean functions to achieve a more efficient representation. This


FIGURE 19. The simplified version of the AxLS framework [426].

can involve the use of approximation techniques to simplify complex Boolean expressions. The main objective is to reduce the computational complexity of the function while maintaining an acceptable level of accuracy. For example, Hashemi et al. [430] introduced BLASYS, a novel paradigm that uses Boolean matrix factorization (BMF) to synthesize approximate circuits. This method allows for a balance between accuracy and circuit complexity, saving up to 63% of the power at a cost of 5% average relative error.

Recently, Rezaalipour et al. [431] proposed a new algorithm named XPAT, which is designed for creating approximate circuits through Boolean rewriting. The XPAT algorithm uses an SMT solver to customize circuits based on a sum-of-products template. It outperforms existing methods (MUSCAT and BLASYS) in reducing circuit areas by 9.85% on average, with up to 60.4% improvement in some cases. Figure 20 shows the test results, which indicate savings in area for different error thresholds (ET). However, XPAT has longer runtimes for larger benchmarks, potentially addressable by using multi-level templates or applying XPAT iteratively to circuit parts.

Ammes et al. [432] introduced a two-level approximate logic synthesis method using cube insertion and removal, demonstrating scalability for large circuits with high error thresholds. The method achieved literal number reductions ranging from 38% to 93% for error rates ranging from 1% to 5%. The authors provided the codes online. While the authors made significant strides with their two-level synthesis method, it is essential to recognize the broader landscape of research in this domain. Diverse methodologies have been introduced, each with its own unique approach and emphasis. Among these, the work of Wu et al. [433] stands out, offering a multilevel perspective on the problem by proposing ALFANS, an advanced multilevel approximate logic synthesis framework utilizing the Boolean network representation of circuits. Central to ALFANS is its capability for node simplification in Boolean networks. In another approach, Barbareschi et al. [434] introduced an open-source systematic approximate design approach tailored for combinational logic circuits. The authors minimized hardware resource needs by using non-trivial local rewriting of and-inverter graphs (AIGs) to reduce the AIG-node count. Through multi-objective optimization, the approach judiciously balances approximation with optimal error and hardware trade-offs and includes the synthesis of Pareto-optimal configurations to ascertain tangible benefits. Meng et al. [435] introduced ALSRAC, an open-source simulation-based ALS flow employing approximate re-substitution with an approximate care set. Utilizing logic simulation, the authors recommend approximating the care set in ALSRAC by identifying external don't-cares (EXDCs) through the maximum error distance constraint. They translated the proposed care patterns to internal nodes rather than primary inputs (PIs) to enhance scalability. They also noticed that in larger circuits the number of PIs increases exponentially with increasing EXDCs. Experimental outcomes indicate that the proposed approach results in an area reduction of 7%-18% in comparison to existing state-of-the-art methods.

FIGURE 20. Comparison of areas of multiply-add obtained by XPAT, MUSCAT, and BLASYS [431].

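The gate-removal flavor of ALS discussed above can be illustrated behaviorally. The toy below (a full-adder netlist; the greedy loop, error budget, and gate names are our own illustration, not any cited tool) ties gate outputs to constants while an exhaustive error-rate check stays within budget:

```python
from itertools import product

# Toy netlist: a full adder as gate -> (op, input1, input2), in topological order.
NETLIST = {
    "t1": ("XOR", "a", "b"), "s": ("XOR", "t1", "cin"),
    "t2": ("AND", "a", "b"), "t3": ("AND", "t1", "cin"),
    "cout": ("OR", "t2", "t3"),
}
OPS = {"XOR": lambda x, y: x ^ y, "AND": lambda x, y: x & y, "OR": lambda x, y: x | y}

def evaluate(const, inputs):
    """Evaluate the netlist; gates listed in `const` are tied to a constant."""
    vals = dict(inputs)
    for gate, (op, i1, i2) in NETLIST.items():
        vals[gate] = const.get(gate, OPS[op](vals[i1], vals[i2]))
    return vals["s"], vals["cout"]

def error_rate(const):
    """Fraction of input vectors where the pruned netlist differs from the exact one."""
    vecs = list(product([0, 1], repeat=3))
    wrong = sum(evaluate(const, dict(zip(("a", "b", "cin"), v))) !=
                evaluate({}, dict(zip(("a", "b", "cin"), v))) for v in vecs)
    return wrong / len(vecs)

def prune(threshold):
    """Greedily tie gate outputs to 0/1 while the error rate stays within budget."""
    const = {}
    for gate in NETLIST:
        for c in (0, 1):
            if error_rate({**const, gate: c}) <= threshold:
                const = {**const, gate: c}
                break
    return const
```

With a zero budget nothing is prunable; at a 25% error-rate budget the greedy pass ties the carry-generate gate t2 to 0, trading two wrong carry-out vectors for a removed gate. This is the same accuracy-for-area bargain the frameworks above automate at scale, with far more sophisticated error models.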

C. APPROXIMATE HIGH-LEVEL SYNTHESIS (AHLS)
In contrast to traditional approximate logic synthesis (ALS) techniques, which focus on gate-based netlists or Boolean circuit representations, approximate high-level synthesis (AxHLS) aims to utilize inexact operators. AxHLS is a strategy that aims to efficiently implement designs written in high-level languages such as behavioral Verilog or C. It focuses on the design and synthesis of hardware at a high abstraction level. The strategy involves transforming high-level descriptions, like C/C++ code, into hardware descriptions, like VHDL or Verilog. The main goal is to create hardware that meets specific performance, power, and area constraints while often sacrificing accuracy.

One of the early works in this field, Nepal et al. [436], introduced an advanced ABACUS methodology for the autonomous generation of approximate designs from behavioral RTL descriptions, expanding potential approximation avenues. The Automated Behavioral Approximate CircUit Synthesis (ABACUS) methodology is an approximate synthesis tool that transforms RTL descriptions into ASTs and applies various operators, such as data type simplifications, arithmetic operation approximations, and loop modifications. It utilizes a design space exploration technique for identifying optimal designs on the Pareto frontier, considering the accuracy-power balance. ABACUS focuses on optimizing critical paths post-synthesis for additional power savings through voltage scaling. This tool, featuring a recursive stochastic evolutionary algorithm, generates optimal approximate hardware variants from high-level Verilog inputs, with the codes accessible online.

Recently, Leipnitz and colleagues [437] developed an AHLS design framework for FPGAs capable of autonomously determining the most effective combinations of multiple approximation techniques. This approach could be suitable for specific applications and design constraints. The proposed method outperformed single-technique approaches in various benchmarks, reducing mean squared error by up to 30% and increasing accuracy by up to 6.5%. Additionally, Castro-Godínez et al. [438] developed a new approximate high-level synthesis framework for approximate accelerators based on a library of approximate functional units. Furthermore, this framework addresses the challenge of optimizing resources while meeting accuracy constraints. It features ''AxME,'' analytical models for resource estimation, and ''DSEwam,'' a methodological approach for design space exploration in applications that exhibit a tolerance for errors. These tools enable the automatic generation of optimal approximate accelerators from C language descriptions. The framework is released as open source to advance research in approximate accelerator generation.

D. EVOLUTIONARY SYNTHESIS
Evolutionary synthesis applies search operators such as selection, crossover, and mutation to explore the design space and find optimal or near-optimal solutions. Evolutionary algorithms are heuristic and metaheuristic search algorithms such as the genetic algorithm (GA), genetic programming (linear and Cartesian genetic programming), machine learning, deep learning, etc. For example, Ranjan et al. [439] proposed a novel approach that leverages state-of-the-art AI generative networks to synthesize constraint-aware arithmetic operator designs optimized specifically for FPGAs.

Despite the focus of most existing approximate logic synthesis methods primarily on ASIC designs, there are few ALS works for FPGA design. For example, Wu et al. [440] introduced a novel method specifically tailored for FPGA design. They used the adaptability of lookup tables and developed a technique that combines wire removal and local function alteration.

One of the evolutionary synthesizers was the development of a reinforcement learning-based logic synthesis framework known as AISYN by Pasandi et al. [441]. This study advocated for the incorporation of Artificial Intelligence (AI), particularly Reinforcement Learning (RL), into logic synthesis procedures. The hypothesis is that AI and RL can aid in increasing Quality of Results (QoR) by avoiding local minima, thereby transforming logic synthesis optimization into an AI-guided process. Experimental evaluations show AI-guided logic synthesis can significantly improve key metrics like area, delay, and power. An RL-aided rewriting algorithm improved total cell area by 69.3%, highlighting the transformative potential of AI and RL in enhancing logic synthesis efficiency. Furthermore, Pasandi et al. [442] developed Deep-PowerX, a framework combining deep learning, approximate computing, and low-power design for logic optimization at the synthesis level. It significantly reduces the dynamic power consumption and area of digital CMOS circuits with acceptable error rates. Compared to exact solutions, it achieves up to 1.47× and 1.43× reductions in power and area, respectively, and surpasses current approximate logic synthesis tools by 22% and 27%, with much lower run-times.

Within the field of Genetic Programming (GP), the absence of a Boolean function benchmark suite for logic synthesis (LS) has been recognized as a significant issue [443], [444]. Roman et al. [444] developed a benchmark suite for logic synthesis, encompassing various Boolean functions used in evaluating genetic programming systems. They presented baseline results from previous studies and their own experiments using Cartesian genetic programming (CGP). To automate the functional approximation of combinational circuits at many levels, including gate and register-transfer levels, Sekanina et al. [445] proposed a genetic programming-based approach.
E. ERROR ESTIMATION AND EVALUATION FRAMEWORKS
D. EVOLUTIONARY SYNTHESIS Error Estimation Frameworks offer structured methodologies
Evolutionary synthesis employs evolutionary algorithms, for evaluating potential inaccuracies in computational sys-
such as genetic algorithms, to optimize digital circuits. tems. Tailored for contexts employing approximations, these
It involves iterative processes of selection, crossover, and frameworks leverage sophisticated algorithms to balance

146064 VOLUME 12, 2024


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions
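Frameworks of this kind automate error measurement that can be sketched in a few lines of simulation. The following toy example is hypothetical and not taken from any of the surveyed tools: it uses Monte Carlo sampling to estimate the error rate and mean error distance of a lower-part-OR approximate adder (a common approximate-adder simplification).

```python
import random

def exact_add(a, b):
    return a + b

def approx_add(a, b, k=4):
    # Toy lower-part-OR adder: the k low bits are OR-ed instead of
    # added, and the carry from the low part is ignored.
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)
    high = ((a >> k) + (b >> k)) << k
    return high | low

def monte_carlo_error(trials=100_000, bits=16, k=4, seed=0):
    """Estimate error rate and mean error distance (MED) by sampling."""
    rng = random.Random(seed)
    errors, total_dist = 0, 0
    for _ in range(trials):
        a = rng.getrandbits(bits)
        b = rng.getrandbits(bits)
        diff = abs(exact_add(a, b) - approx_add(a, b, k))
        if diff:
            errors += 1
            total_dist += diff
    return errors / trials, total_dist / trials

er, med = monte_carlo_error()
print(f"error rate ~ {er:.3f}, mean error distance ~ {med:.2f}")
```

With these settings the sketch reports an error rate near 0.68 and a mean error distance near 3.75; shrinking k trades error for less adder simplification. As with any sampling-based QoR check, the estimate only covers the sampled input space.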

Balancing accuracy and efficiency trade-offs, they serve as benchmarks for assessing computational result reliability. In the ALS process, as depicted in Figure 21, integration with error modeling and Quality of Results (QoR) evaluation is fundamental [422]. The process commonly starts with an error-modeling phase, which evaluates the effect of removing individual gates on the circuit's accuracy and annotates them with the corresponding error percentage. This step can guide ALS methods in identifying the least error-prone transformation. It is then followed by a post-synthesis error estimation phase, or QoR evaluation, which is significant for ensuring that the resultant circuit complies with the specified demands [422].

FIGURE 21. Overview of the standard ALS process with the error modeling and quality of results (QoR) evaluation [422].

Monte Carlo sampling methods are commonly used in ALS approaches to determine the actual error introduced by approximations, making them a prevalent technique for error evaluation in approximate computing. This method is notably utilized in Approximate Logic Synthesis (ALS) methods, including BLASYS [430] and Vasicek [446] for Quality of Result (QoR) evaluation, as well as in Su's approach [447] for error modeling. However, a significant constraint inherent to Monte Carlo sampling is the absence of definitive guarantees, as its worst-case error only reflects the highest error within a limited sample space, thus providing a limited exploration scope. Among error estimation frameworks, VECBEE [448] is a key framework in Approximate Logic Synthesis (ALS), combining Monte Carlo simulation with signal propagation for error estimation. It is adaptable to various error metrics and circuit representations, balancing accuracy with efficiency, and it was integrated into open-source ALS methods, contributing significantly to optimizing circuit approximations.

Recently, Rezaalipour et al. [449] proposed a novel SMT- and SAT-solver-based algorithm for error evaluation in approximate computing, adaptable to any circuit and error metric. This approach significantly outperforms traditional methods, including AIG-rewriting [428] and [450], by efficiently and systematically navigating the error space, ensuring more accurate and reliable design validations.

In sum, Approximate Logic Synthesis (ALS) techniques, essential in digital circuit design, are categorized into four main types: structural netlist transformation, Boolean rewriting, approximate high-level synthesis (AHLS), and evolutionary synthesis. These categories utilize unique methods for introducing approximations in circuits to enhance efficiency and performance. Key insights into these techniques are also provided by review papers such as [451] and [452]. These works highlight ALS's significance in optimizing digital circuits, especially in applications that balance computational accuracy with efficiency.

X. EMERGING COMPUTING FRAMEWORKS
A. CROSS-LAYER AND END-TO-END AXC FRAMEWORKS
The cross-layer approximation approach, aimed at leveraging the error resilience of applications across various abstraction layers, is becoming popular. This approach combines several approximation techniques from the circuit level up to the application level. FPGA approximate computing frameworks harness the flexibility of FPGAs to optimize computational tasks by trading off accuracy for improved performance, energy efficiency, or reduced resource usage [453]. They support adaptable precision for diverse tasks, making them efficient for applications requiring high computational power.

Efforts to enhance program efficiency and error tolerance have led to SIMD use in FPGAs, which faces issues such as limited approximation operations, isolated kernel adjustments without thorough evaluation, and a lack of targeted optimization strategies. Furthermore, a comprehensive multi-level approach is essential across all layers from application to circuit [24]. To address these challenges, Ebrahimi et al. [24] proposed a cross-layer methodology for multi-kernel applications using toolchains across layers of abstraction. For designing hardware with runtime-adjustable accuracy, Alan et al. [23] introduced a unique cross-layer approach that reduces energy use with only a slight increase in area. Also, Hanif et al. [25] facilitated DNN implementation on resource-constrained devices by introducing a cross-layer approach using various optimization techniques across the computing stack's layers.

Several research studies have shown that applying the principles and techniques of approximate computing, initially designed for the ASIC platform, yields different advantages when implemented on FPGA platforms [454]. Another challenge is that it is time-consuming to explore numerous approximate accelerator variants due to the vast architecture space required even for simple applications like Gaussian filters. Therefore, Ullah et al. [371] developed a method to systematically create various effective approximate multiplier designs for FPGA platforms. Subsequently, the authors deployed a range of machine learning models to assess and choose configurations that meet the specific accuracy and performance requirements of the application. Furthermore, Prabakaran et al. [455] introduced a novel end-to-end automated framework named Xel-FPGAs, aimed at enhancing the efficiency of exploring FPGA-based approximate accelerators by integrating advanced statistical and machine learning methodologies. This approach is designed to significantly cut down on the traditionally lengthy exploration time, making the process more efficient and effective.
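A recurring primitive in these exploration flows, from ABACUS to Xel-FPGAs, is retaining only the accuracy/cost variants that lie on the Pareto frontier. A minimal sketch, using hypothetical variant numbers rather than results from any cited framework:

```python
def pareto_front(points):
    """Return the (error, power) points not dominated by any other point.
    A point dominates another if it is no worse in both metrics and
    differs from it (hence strictly better in at least one)."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical accelerator variants: (mean error %, power in mW)
variants = [(0.1, 90), (0.5, 70), (0.5, 80), (2.0, 55), (5.0, 40), (5.0, 60)]
print(pareto_front(variants))  # → [(0.1, 90), (0.5, 70), (2.0, 55), (5.0, 40)]
```

Real frameworks replace the hard-coded list with thousands of synthesized or ML-predicted (error, power) estimates, but the selection step is the same.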

The Xel-FPGAs framework reduced FPGA-based accelerator exploration time by 95%, boosting efficiency with minimal performance impact. It is open-source and available online.

Developed primarily by Xilinx Lab researchers [456], FINN is an open-source tool designed for building fast and flexible deep learning inference on FPGAs. Unlike general DNN accelerators, FINN provides an end-to-end flow that emphasizes co-design and exploration to optimize quantization and parallelization for specific resource and performance requirements. On a ZC706 FPGA platform consuming less than 25 W, FINN delivers record-breaking image classification speeds: up to 12.3 million classifications per second at 0.31 microseconds latency with 95.8% accuracy on MNIST, and 21,906 classifications per second at 283 microseconds latency with over 80% accuracy on the CIFAR-10 and SVHN datasets.

B. SHANNON-INSPIRED STATISTICAL COMPUTING
In 1948, Shannon established information as a statistical quantity and introduced a theory of communication over noisy channels. He defined channel capacity based on noise statistics and demonstrated that reliable communication is possible if the transmission rate is below this capacity. Shannon also showed that error control codes can approach this channel capacity. Shanbhag et al. [457] drew inspiration from Shannon's theory to design and develop principles and fundamental limits for realizing statistical information processing systems using stochastic components.

The Shannon-inspired statistical computing framework [457] leverages the statistical properties of both application data and nanoscale hardware to create robust, energy-efficient, and scalable computing systems. By integrating computation within memory (DIMA) and sensor arrays (DISA), and by employing statistical design techniques like Data-Driven Hardware Resilience (DDHR), Statistical Error Compensation (SEC), and Hyperdimensional Computing (HD), it ensures high reliability even in the presence of significant hardware noise and errors. This framework is particularly advantageous for data-centric applications, offering enhanced performance and energy savings by minimizing data transfer and adapting dynamically to errors. Furthermore, it allows circuits to operate at lower SNR levels, significantly saving energy. However, it comes with complexities in design and implementation, requiring sophisticated error-aware models and initial training overheads. This framework is used in advanced machine learning accelerators, low-power medical devices, and large-scale sensor networks, where traditional deterministic computing falls short due to increasing stochasticity at the nanoscale level. As another example, Kim et al. [458] introduced a maximum likelihood (ML)-based statistical error compensation (MLEC) technique to enhance the compute signal-to-noise ratio (SNR) in 6T SRAM-based analog in-memory computing (IMC) architectures. These architectures, known for their energy efficiency and compute densities in machine learning, are limited by device variations and noise. The proposed MLEC technique improves the accuracy of binary dot-products (DPs) in these systems.

By integrating Shannon-inspired statistical computing with approximate computing, we can significantly enhance the robustness, adaptability, and energy efficiency of approximate computing systems. Leveraging information-theoretic principles allows for precise error management, optimal design, and dynamic adaptation, making it possible to exploit the benefits of approximation without sacrificing reliability or performance. This hybrid approach provides a powerful framework for developing advanced computing systems capable of meeting the demands of modern data-centric applications and nanoscale technologies.

C. BRAIN-INSPIRED COMPUTING
Astounding progress in several tasks has been driven primarily by advances in deep learning, which forms the backbone of today's Artificial Intelligence (AI) developments. The rapid development of AI demands equally rapid development of domain-specific hardware designed for AI applications. Neuro-inspired computing (neuromorphic computing) chips integrate a range of features inspired by neurobiological systems and could provide an energy-efficient approach to AI computing workloads. Neuromorphic computing refers specifically to the design of hardware systems that emulate the neural architecture of the brain; it aims to create physical circuits and devices that operate similarly to biological neurons and synapses. Inspired by neuroscience, neuromorphic computing is key to next-generation AI and focuses on three levels: computing models, architecture, and learning algorithms [459]. Spiking Neural Networks (SNNs), with more realistic neuronal dynamics than Artificial Neural Networks (ANNs), serve as the computing model. Architecturally, SNNs enable efficient in-memory computing. Neuro-inspired learning paradigms, including online, learning-to-learn, and unsupervised learning, allow continuous adaptation and form the basis for low-power, accurate, and reliable neuromorphic systems. This includes designing and fabricating neuromorphic chips that replicate the brain's parallel processing capabilities. Neuromorphic systems often use spiking neural networks (SNNs), where information is processed as discrete spikes, similar to neural spikes in the brain. This leads to event-driven, as opposed to clock-driven, computation, which can be more energy-efficient [460]. Neuromorphic computing can be implemented by combining analog and digital circuits to replicate the analog nature of biological processes. This can involve using memristors, specialized transistors, and other nanoscale components to build artificial neurons and synapses [461].

Building on this, Sen et al. [462] introduced AxSNN, an approach applying approximate computing to enhance the efficiency of Spiking Neural Networks (SNNs) by selectively skipping low-impact neuron updates.
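The core idea behind such update skipping can be illustrated with a toy leaky integrate-and-fire neuron. This is a hypothetical sketch for intuition only, not the AxSNN implementation or its parameter scheme:

```python
def lif_step(v, i_in, leak=0.9, v_th=1.0):
    """One leaky integrate-and-fire update; returns (new_v, spiked)."""
    v = leak * v + i_in
    if v >= v_th:
        return 0.0, True      # fire and reset the membrane potential
    return v, False

def run(inputs, skip_eps=None):
    """Simulate one neuron; if skip_eps is set, updates whose input
    magnitude falls below it are skipped (the approximation)."""
    v, spikes, updates = 0.0, 0, 0
    for i_in in inputs:
        if skip_eps is not None and abs(i_in) < skip_eps:
            continue                    # low-impact update skipped
        updates += 1
        v, fired = lif_step(v, i_in)
        spikes += fired
    return spikes, updates

inputs = [0.02, 0.6, 0.01, 0.55, 0.03, 0.7, 0.0, 0.65]
print("exact:", run(inputs), "approx:", run(inputs, skip_eps=0.05))
# prints: exact: (2, 8) approx: (2, 4)
```

Here half of the membrane-potential updates are skipped while the spike output is unchanged; a real system would pick the skipping criterion from per-neuron statistics rather than a fixed threshold.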

AxSNN uses static and dynamic parameters to identify approximable neurons. The authors implemented AxSNN in both hardware (SNNAP, synthesized in 45nm technology) and software (on a 2.7 GHz Intel Xeon server) to achieve a 1.4–5.5× reduction in scalar operations across six image recognition benchmarks, demonstrating significant improvements in computational efficiency and energy savings with minimal quality loss.

XI. APPLICATIONS
A. APPROXIMATE INTERNET OF THINGS (IOT)
The widespread use of smart devices and sensors has led to a demand for intelligent computing to handle vast amounts of data. Conventional exact computing, commonly used in these devices, suffers from high power consumption and low performance [463]. To address these challenges, approximate IoT (AxIoT) has emerged as a promising paradigm that relies on approximate computing techniques to manage the intensive computational processing and analysis demands of IoT devices. Exact results are not always required, and this feature allows some errors to be accepted, so approximate computing offers significant benefits in resource-constrained IoT devices. The big challenges faced by a designer, however, are maintaining the quality of computations to meet specific application needs, reliability, and security concerns. To address these challenges, various strategies have been proposed; for example, the energy of IoT devices can be saved using Bloom filters [93], [94], 6T SRAM [266], voltage-frequency-power management techniques [307], [312], an approximate IoT processor [317], an in-memory computing (IMC)-based Binary Neural Network (BNN) accelerator [294], and DRAM refresh-rate reduction [12]. For example, Ghosh et al. [464] illustrated synergistic approximation using a smart camera system that performs DNN-based image classification and object detection, highlighting how the sensor, memory, compute, and communication subsystems can all be effectively approximated. Adaptive approximation levels, which allow for dynamic adjustment based on application needs, can manage the accuracy-efficiency trade-off. Hierarchical sampling algorithms, such as stratified reservoir sampling, offer rigorous error bounds while enhancing computation efficiency, as demonstrated in the APPROXIOT system [51], which was implemented on top of Apache Kafka. Fabjančič et al. [465] introduced ''Mobiprox,'' a framework for on-device deep learning with adjustable accuracy. It features tunable tensor operation approximations and runtime layer adjustment, utilizing a profiler and tuner for optimal configuration. The results show that Mobiprox's implementation on Android OS reduces energy use by up to 15% in mobile applications like activity recognition and keyword detection, with minimal accuracy loss. Mobile devices, reliant on battery power, face constraints that make battery life a crucial factor, and reducing computational energy could significantly benefit these systems. Therefore, exploring mobile approximate computing emerges as a promising research and development avenue for advancing approximate computing paradigms [464], [466]. Qaim et al. [467] surveyed energy-efficient solutions, including data compression and approximate computing techniques, for IoWT applications from 2010 to 2020. The survey categorizes these solutions, highlighting their pros, cons, and key performance parameters, discusses the trade-offs, and suggests future research directions to improve wearable device performance and address open challenges.

B. DEEP AND MACHINE LEARNING
Strategic approximate computing reduces precision in neural network computations on an as-needed basis, saving power and time, particularly for multiply-and-accumulate (MAC) operations, which dominate the energy use in DNNs [468]. Studies indicate that these operations are responsible for consuming up to 99% of the energy in Deep Neural Networks (DNNs) [469]. Concentrating on simplifying the multipliers within MAC units results in substantial energy savings with hardly any degradation in accuracy. These approximate designs can be customized to suit different accuracy specifications as they are error-configurable. Efficiency is further optimized through dynamic reconfigurability and temperature-aware methods that control chip temperatures while keeping computational speed, energy savings, and output quality balanced. Sarwar et al. [470] proposed an approximate multiplier and a Multiplier-less Artificial Neuron (MAN) to improve neural networks' energy efficiency by exploiting error tolerance and computation sharing. They also recommend retraining to offset accuracy losses. Evaluations show that MANs significantly reduce power and size with minimal accuracy impact, maintaining consistent speed. Peng et al. [471] proposed ''AXNet,'' a unified neural network that simplifies training and improves efficiency by integrating approximation and prediction tasks. The results show a 50.7% increase in safe approximation rates and significant reductions in training time; the codes are available online. Ashar et al. [472] proposed a novel quantization-enabled multiply-accumulate (MAC) unit with a right shift-and-add computation for runtime truncation without extra hardware. Applying this MAC to a LeNet DNN model reduced resources by 42% and delay by 27%, making it ideal for high-throughput edge-AI applications.

Balancing CNN accuracy, efficiency, and resource use presents challenges, exacerbated by high storage and computational needs and inefficient hardware deployment. Despite this, CNNs are crucial in computer vision, though at the expense of greater computational demand, as seen in models like VGG-16. Addressing these issues requires software-hardware co-optimization, including model compression techniques like pruning and parameter quantization, to enhance CNN efficiency on FPGA platforms [473], [474]. For example, Sui and colleagues proposed new CNN pruning [474] and quantization [473] methods aimed at reducing storage requirements and computational load and at enhancing hardware deployment efficiency.
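The energy argument for approximate MAC units can be made concrete with a small sketch. The truncation scheme and bit widths below are generic and hypothetical, not the specific shift-and-add design of any cited work:

```python
def truncated_mac(weights, activations, frac_bits=8, keep_bits=4):
    """Fixed-point MAC whose products drop the lowest
    (frac_bits - keep_bits) fractional bits before accumulation,
    a generic runtime-truncation approximation."""
    drop = frac_bits - keep_bits
    acc = 0
    for w, a in zip(weights, activations):
        wq = round(w * (1 << frac_bits))   # quantize to fixed point
        aq = round(a * (1 << frac_bits))
        acc += (wq * aq) >> (frac_bits + drop)   # truncate low bits
    return acc / (1 << keep_bits)                # back to real scale

w = [0.25, -0.5, 0.75, 0.125]
x = [0.5, 0.25, -0.125, 1.0]
exact = sum(wi * xi for wi, xi in zip(w, x))
print("exact:", exact, "truncated:", truncated_mac(w, x))
```

Dropping low-order product bits shrinks the adder tree that dominates MAC area and energy; the cost is a small, bounded output error, which is what makes such units attractive for error-tolerant DNN inference.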

The study [474] presented a new CNN pruning method (KRP) which, combined with GSNQ quantization [473], achieves a 27× reduction in model size and improves FPGA efficiency. There are many techniques to enhance deep CNNs and machine learning algorithms, including fixed-point arithmetic, zero-skipping, and weight pruning for enhancing CNNs on FPGAs [475]; compression through innovative use of parallel layer processing and pipelining [476]; compression using reversed-pruning, peak-pruning, and quantization [477]; a cross-layer approach using structure pruning and quantization of inputs and network parameters at the software level together with approximate arithmetic units at the hardware level [25]; quantization in PIM [478]; approximate accelerators [455]; approximate adders [420]; approximate logarithmic multipliers [29], [374]; computation skipping [124]; the ApproxTuner framework [233]; other frameworks [236], [240]; approximate memory based on voltage scaling [258]; an analog processor-in-memory [291]; a hybrid PIM accelerator [292]; DCT, quantization, and sparse matrix compression [273]; and other techniques mentioned in Table 3.

Approximate processors and accelerators, which embody the synergy of hardware and software co-design in approximate computing, are engineered to boost computational efficiency through permissible inaccuracies, making them well-suited for applications where error tolerance is acceptable. Many proposed approximate processors depend on approximate computing techniques to enhance overall system efficiency, including relaxed precision in TPUs for implementing NN applications (MLPs, CNNs, and LSTMs) in datacenters [17]; mixed precision in GPU Tensor Cores for Deep Neural Networks (DNNs) [479]; approximate processing elements (APEs) consisting of a low-precision multiplier and an approximate adder in a TPU [319]; and parallel analog convolution-in-pixel with low-precision quinary-weight neural networks [480]. Gharavi et al. [481] proposed enhancing multicore performance with configurable approximate arithmetic units. Their machine learning framework dynamically adjusted frequency and precision to optimize performance within TDP constraints. Experiments showed a 19% speed increase using a floating-point approximate ALU with three configurations per core, all within the same TDP limit.

Machine learning enhances IoT by analyzing vast data for actionable insights, crucial for applications like wearables and smart devices [463]. Embedded processing near the sensor is often preferred to cloud processing due to privacy, latency, and bandwidth constraints. Despite these advantages, sensor devices face significant challenges related to energy consumption, cost, throughput, and accuracy, and circuit designers are key to developing energy-efficient solutions for these tasks. For example, Younes and colleagues [482] published a couple of research papers applying algorithmic-level approximate computing techniques (AxCTs) to supervised machine learning algorithms, specifically K-Nearest Neighbor (KNN) and Support Vector Machine (SVM), for applications in touch modality and image classification. These techniques, including reduced sampling and precision as well as loop perforation, aim to decrease complexity and latency while maintaining a balance between speed and accuracy. Compared to traditional exact implementations, the results showed that the proposed method achieved a 49% reduction in power consumption and a 3.2× speedup. Furthermore, it used 40% fewer hardware resources and consumed 82% less energy for classifying touch inputs, all with a minimal accuracy loss of less than 5%. In addition, Mienye and Jere [483] addressed a gap in the literature by providing a comprehensive overview of decision tree-based methods in machine learning. They explored core concepts, algorithms, and applications, from early development to recent high-performing ensemble algorithms, and also discussed tree pruning to enhance model performance and reduce overfitting.

C. DATA MINING
Redundant computations and data are considered big challenges for algorithms in terms of speed, scalability, memory, and efficiency; examples include unnecessary computations, function calls, redundant iterations, and redundant memory accesses. These inefficiencies increase execution time, waste computational resources, and reduce the scalability of data mining analyses [484]. The core challenge of data mining algorithms is extracting hidden knowledge from large datasets while mitigating the use of redundant computations. Sampling, a data reduction strategy, addresses these volume-related challenges in environments running big data tasks like classification and clustering. Several papers have discussed the concept of approximate data mining. For example, stratified random sampling has been applied to streaming and stored data [36], [37], and graph sampling [38] is a very effective method to deal with scalability issues when analyzing large-scale graphs. There are other applicable techniques as well, such as memoization (storing and reusing previous computations), efficient data structure design, careful algorithm optimization, loop perforation, iteration skipping, memory access skipping, computation skipping, and function approximation; we discussed these techniques in previous sections.

Machine learning algorithms are common data mining algorithms. For example, approximate nearest neighbor search (ANNS) is considered a core solution in data mining and is widely used in different applications such as computer vision and information retrieval [485]. Approximate nearest neighbor search algorithms are used for fast retrieval of relevant information: instead of perfect matches, they find items that are ''close enough'' in high-dimensional data spaces, saving computational expense during large-scale searches [486]. For instance, numerous major corporations, including Google, employ this strategy [487].

D. SECURITY
We know that approximate computing promises significant advantages, but its security implications, particularly
in sensitive applications, require careful consideration and XII. TOOLS AND LIBRARIES OF APPROXIMATE CIRCUIT
further research. Approximate computing can complicate Approximate computing is an emerging paradigm that allows
reverse engineering efforts but could introduce new target trading off design accuracy and improvements in design
areas for hardware Trojans, particularly in circuits controlling metrics such as design area and power consumption. This
the level of approximation. Approximate circuits’ defense paradigm is widely used in applications across various
against passive side-channel attacks can differ with voltage- abstraction layers through managing and controlling the error.
frequency settings, making security assessments challenging. Numerous researchers share code or plan to release libraries
Approximate circuits, particularly at operational limits, may of approximate components for application use, or offer
be vulnerable to fault injection attacks, but the full effects and benchmark suites of diverse applications as open-source to
countermeasure effectiveness are still unclear [488], [489]. assist others in their research. Additionally, another group of
Also, Processing-In-Memory alters security models due to researchers makes available free software tools to support
factors like architecture changes, different programming scientific research. In Table 4, we present a selection of
models, side-channel risks, device reliability, and poten- libraries for approximate components and established bench-
tial hardware Trojans. To address these challenges, Yellu mark suites. At end of this table, we include two websites:
and Yu [490] proposed obfuscating the boundary between one offering access to published papers with accompanying
approximate and precise computations by blurring the entry codes, particularly in the DL/ML fields, and the other hosting
point and broadening the transition zone. The entry-blurring a comprehensive collection of open-source benchmark suites,
scheme uses a hidden quality metric correlated with approxi- along with tested and recommended platforms for their use.
mation errors to conceal the switch between modes, enhanc- In this review paper, we have noted at the conclusion of each
ing resilience to attacks. The boundary-broadening scheme discussed work that the authors have made the code available.
extends the transition zone with dual thresholds and ran- There are several free software programs provided the
dom AC module selection, further securing AC systems. teams of researchers that can be used for circuit or processor
Their methods significantly improve application quality (up design at the transistor and logic levels, for example BLASYS
to 168% over baseline) with minimal impact on latency, tool-chain framework which is used to design approximate
area, and power costs (increases limited to 6% and 8%, respectively). In another work, Islam [491] highlighted the security risks in synthesis flows for implementing approximate computing (AC). He showed how vulnerabilities could be exploited to insert malicious elements, such as Hardware Trojans, without affecting efficiency. He therefore proposed a defense mechanism using input vectors and path profiling to detect such threats, emphasized the necessity of incorporating security into AC systems to prevent exploitation, and suggested future enhancements to synthesis tools for improved security. The study [485] introduced a cloud-assisted LSH scheme for efficient Approximate Nearest Neighbor searches, tackling the high computational demands of traditional LSH, especially on devices with limited resources. This approach ensures data privacy and includes a method to verify the integrity of cloud-processed results. Experiments and analyses confirmed the scheme’s effectiveness, security, and practical applicability, offering a viable solution for resource-constrained environments. The research [492] introduced a multilevel approximate architecture for Ring-Learning-with-Errors (R-LWE), a quantum-resistant cryptographic scheme ideal for IoT due to its low area and memory requirements. The proposed AxRLWE approach, tailored for resource-constrained IoT devices, achieves substantial reductions in area and energy on FPGAs and ASICs, with some compromise on quantum security.

We conclude that while approximate computing offers significant advantages in certain applications, its security implications, particularly in sensitive applications, require careful consideration and further research.

circuits through a couple of free tools. Recent advancements have facilitated rapid DNN deployment on FPGAs through automation tools. FP-DNN [504] streamlines the conversion of TensorFlow DNN models into efficient FPGA implementations, supporting networks such as CNNs with improved performance and flexibility. The ARTLCNN compiler [505] streamlines FPGA hardware customization for CNN inference, significantly boosting performance by using an optimized RTL module library and a flexible system template; tested on Intel FPGAs with complex CNNs, it achieves over double the efficiency of current automated solutions. Another open-source tool, FINN, primarily developed by Xilinx researchers [456], builds fast and flexible deep learning inference on FPGAs. FINN provides an end-to-end flow and focuses on co-design and exploration to optimize quantization and parallelization tuning for specific resource and performance needs; it is not a general DNN accelerator [506]. The choice of tool is determined by many factors, including the amount of simulation and synthesis needed, the complexity of the circuit, and the designer’s expertise with the program.

XIII. CHALLENGES AND FUTURE DIRECTIONS
A. CHALLENGES IN PROCESSING-IN-MEMORY (PIM) IMPLEMENTATION
Research and development in the field of Processing-In-Memory (PIM) are ongoing, and PIM has the potential to play a crucial role in addressing the performance bottlenecks that traditional computing architectures face when dealing with massive amounts of data in modern computing scenarios.
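The cloud-assisted scheme in [485] builds on locality-sensitive hashing (LSH). The minimal random-hyperplane LSH sketch below is illustrative only — the dimensions, plane count, and noise level are arbitrary choices, not parameters of the cited scheme — but it shows why near-duplicate vectors collide in the hash space:

```python
import numpy as np

def lsh_signature(vectors, planes):
    # One bit per random hyperplane: which side of the plane the vector lies on.
    return (vectors @ planes.T) >= 0

rng = np.random.default_rng(0)
dim, n_planes = 64, 32
planes = rng.standard_normal((n_planes, dim))       # random hyperplanes
data = rng.standard_normal((1000, dim))             # database vectors
query = data[42] + 0.01 * rng.standard_normal(dim)  # near-duplicate of item 42

sigs = lsh_signature(data, planes)
qsig = lsh_signature(query[None, :], planes)[0]

# Candidate filtering: rank items by Hamming distance between signatures;
# the near-duplicate should agree with the query on almost every bit.
hamming = (sigs != qsig).sum(axis=1)
print(int(np.argmin(hamming)))
```

Comparing 32-bit signatures replaces 64-dimensional distance computations during candidate filtering, which is what makes LSH attractive on resource-constrained devices; a verification step, as in [485], is still needed when the hashing itself is outsourced to the cloud.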

VOLUME 12, 2024 146069


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

TABLE 4. Open-source libraries and benchmark suites of approximate computing techniques and applications.

In the previous section, we primarily discussed approximate memory techniques, focusing on strategies to overcome the ‘‘Memory Wall’’ through various circuit and architectural advancements, including approximate computing. That section detailed methods such as voltage scaling, lowered refresh rates, and data compression or encoding to enhance energy efficiency in memory systems, particularly for applications like machine learning that can tolerate some level of error. These strategies aim to balance power conservation against acceptable error rates, contributing to the broader field of energy-efficient and performance-optimized memory design.

As new memory technologies continue to evolve, future directions for PIM may involve the integration of processing logic with emerging memory technologies such as resistive RAM (ReRAM) or phase-change memory (PCM). The concept of Processing-In-Memory (PIM) holds immense potential for revolutionizing computing architectures by bringing processing capabilities closer to data storage, thereby reducing data movement overhead and improving system efficiency. PIM is a nascent technology with ongoing advancements in materials, devices, and circuit design; however, several challenges must be addressed before its full potential can be realized.

The integration of processing logic into memory cells or controllers increases hardware complexity and design effort. This necessitates careful design and efficient power management techniques (e.g., DVFS and DPM) to handle the otherwise unacceptable increases in power consumption and heat dissipation that come with added processing capabilities in memory.
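The criticality-aware refresh strategies recapped above (cf. Flikker [12]) can be illustrated with a toy bit-level simulation; the flip probability and the 4/4 critical/approximate split below are assumptions chosen for illustration, not measured DRAM behavior:

```python
import random

def read_after_low_refresh(bits, critical, flip_prob=0.01):
    """Toy model: bits in the fully refreshed (critical) partition are always
    intact; bits in the low-refresh partition may flip with probability
    flip_prob, trading a bounded error for refresh energy savings."""
    out = []
    for bit, is_critical in zip(bits, critical):
        if is_critical or random.random() > flip_prob:
            out.append(bit)
        else:
            out.append(bit ^ 1)  # retention error in the approximate partition
    return out

random.seed(1)
# An 8-bit pixel: protect the four high-order bits, approximate the rest,
# so the worst-case value error is bounded by the low nibble (15 out of 255).
pixel = [1, 0, 1, 1, 0, 1, 0, 0]
critical = [True] * 4 + [False] * 4
read_back = read_after_low_refresh(pixel, critical)
assert read_back[:4] == pixel[:4]  # high-order bits always survive
```

Partitioning by criticality is what keeps the error bounded: only the bits whose corruption the application can tolerate are ever stored under the relaxed refresh policy.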


As processing tasks move closer to the memory, ensuring data reliability and integrity during in-memory computation poses a critical challenge. Safeguarding data against errors and corruption during processing necessitates developing robust error correction techniques that maintain data integrity throughout the processing cycle. Furthermore, scalability issues arise as PIM is extended to larger memory sizes and higher bandwidths; these include maintaining performance efficiency and managing the complexity of larger PIM systems. To address scalability issues in PIM, approximate computing and data management strategies are promising, for example reducing precision, applying adaptive precision scaling or lossy compression, and using approximate memory access, including partitioning data by criticality and adjusting refresh rates accordingly.

For successful PIM implementation, developing appropriate programming models and software support is crucial. This involves creating new programming paradigms and tools that can efficiently leverage the PIM architecture to minimize data movement and maximize computational efficiency. As future directions, PIM shows promise in advancing artificial intelligence and machine learning workloads, big data analytics, and high-performance computing. The exploration of memory-centric architectures and emerging memory technologies, alongside considerations for security and privacy, can further enhance PIM’s capabilities [507]. As new memory technologies like ReRAM and PCM evolve, integrating them with processing capabilities poses challenges in terms of compatibility, performance optimization, and leveraging their unique properties for PIM. Ensuring the security and privacy of data processed within memory is also an important consideration; this includes addressing potential vulnerabilities and safeguarding against unauthorized access or tampering. The limited precision of analog PIM accelerators, particularly during the high-precision backward propagation phase in CNN training, presents challenges that necessitate innovative solutions such as hybrid PIM accelerators and Shannon-inspired statistical computing principles.

Instead of merely augmenting existing CPUs with PIM capabilities, future directions might involve the exploration of memory-centric architectures where the memory is at the center of computation and traditional CPUs are reimagined as accelerators. Additionally, the integration of approximate computing techniques with PIM holds promise for optimizing computation in memory-intensive tasks, further improving energy efficiency while providing satisfactory output quality for specific applications. For example, Jinyu et al. [478] recently introduced CIMQ, a quantization framework for improving neural network accelerator efficiency using Computing-in-Memory (CIM) architectures.

Embracing these challenges and future directions can pave the way for the widespread adoption of Processing-In-Memory and revolutionize modern computing paradigms. In conclusion, while challenges exist, ongoing research and development efforts in the field will unlock the full potential of Processing-In-Memory for next-generation computing systems.

B. ADDRESSING DESIGN AND VERIFICATION COMPLEXITIES IN APPROXIMATE COMPUTING CIRCUITS
In the pursuit of energy-efficient computing, a pivotal challenge emerges in meeting the real-time precision demands of various applications. Conventional circuits typically function at constant power levels without adjusting for the specific precision needs of individual tasks. This one-size-fits-all approach to power consumption, irrespective of the required accuracy for distinct operations, represents a significant hurdle in optimizing energy efficiency across varied computing applications. The optimal approach for energy-efficient computing involves designing circuits that are both approximate and reconfigurable, ensuring that power consumption is closely aligned with the required computational accuracy. Reconfigurable circuits adapt their configuration to the current computational needs, optimizing energy efficiency by alternating between high-precision and lower-precision modes as necessary. This combination offers a tailored balance between energy conservation and computational accuracy for various applications.

Designing approximate circuits with the aforementioned features while adhering to quality constraints significantly extends the design cycle. This complexity arises because designers must ensure that circuits not only meet functional and performance criteria but also operate within predefined error margins. To enhance the design of approximate circuits with effective error management, it is essential to employ advanced methods such as Approximate Logic Synthesis (ALS). ALS is geared towards meeting diverse accuracy requirements while also addressing power and delay variances. Integrating ALS, especially approximate high-level synthesis (AHLS), and automated design-space exploration tools, along with appropriate analytical or semi-analytical error models, underscores the necessity of designing quality-configurable circuits that adjust to different accuracy levels in real time. It is also important to extend approximation beyond traditional gate-based designs to include more complex functional units. Though initial research efforts, such as those by Lee [508] and Alan [509] and their colleagues, have started to address these challenges through proposals for approximate high-level synthesis in custom hardware circuit design and runtime accuracy-configurable circuits, respectively, this area is still in its nascent stages. The trend towards customized ML models requires new Auto-ML tools and co-design strategies that integrate algorithmic and hardware considerations for optimal use of approximate computing in advanced ML settings.

C. ADAPTIVE ERROR REDUCTION IN RECONFIGURABLE APPROXIMATE CIRCUITS
The current challenge is that each application requires a specific characteristic of approximation to maintain accuracy within an acceptable level. Different approximations have varying effects on the performance and accuracy of an application. Intuitively, there is no one-size-fits-all solution.
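This can be seen even between two common approximate adders. The sketch below contrasts a lower-part-OR adder (in the spirit of [16]) with simple truncation; for the same operands they produce different error magnitudes, so the better choice depends on the application’s input statistics:

```python
def loa_add(a, b, k):
    """Lower-part-OR adder: exact addition on the upper bits, bitwise OR on
    the k low bits (no carry is generated from the lower part)."""
    mask = (1 << k) - 1
    return (((a >> k) + (b >> k)) << k) | ((a & mask) | (b & mask))

def trunc_add(a, b, k):
    """Truncation adder: the k low bits of both operands are simply zeroed."""
    mask = ~((1 << k) - 1)
    return (a & mask) + (b & mask)

a, b, k = 0b10110111, 0b01101101, 4   # 183 + 109 = 292 exactly
print(292 - loa_add(a, b, k), 292 - trunc_add(a, b, k))  # → 5 20
```

When one operand’s low bits are all zero the two schemes coincide; otherwise their errors differ, which is exactly why a reconfigurable design that selects its approximation mode from the input data is attractive.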


The future direction is therefore a universal design that adapts itself to reduce error as much as possible. Addressing the challenge of reducing errors in input-data-reconfigurable approximate circuits, particularly with an approximate adder, involves a dynamic and adaptive approach. In such circuits, the configuration of each full adder can be adjusted based on the input data and the carry signal. This adaptability allows the circuit to modify how it performs approximations in real time, optimizing for accuracy in critical computations while still benefiting from the efficiency of approximation in less critical areas.

For instance, if the approximate adder detects that the input data or the carry signal leads to a potentially significant error, it can reconfigure itself to reduce or eliminate the approximation for that specific calculation. This self-adjusting capability ensures that the circuit maintains a balance between the desired efficiency of approximation techniques and the need for accuracy in the output, particularly for computations where precision is crucial. By dynamically adjusting the level of approximation based on the input data characteristics and the computational context, such reconfigurable approximate circuits can effectively minimize errors while still leveraging the benefits of approximate computing.

In recent years, much of the focus in approximate computing was on a single-layer approach, limiting approximation to specific modules. However, researchers are now aiming to maximize the advantages of approximate computing by integrating various techniques across different design levels, including hardware, software, or both, for a given application. This approach, known as cross-layer codesign, represents a significant and ongoing challenge in the field. One line of work, for example, explores approximation strategies in printed circuits for machine learning, enhancing efficiency and reducing complexity.

XIV. PERSPECTIVES ON FUTURE DIRECTIONS
In recent years, the field of approximate computing (AC) has witnessed significant advancements, positioning it as a potential mainstream computing approach in future systems. One primary reason for this shift is the diminishing returns on performance improvement through the scaling of CMOS technology. Additionally, the diversity of modern architectures, ranging from high-performance computing (HPC) to embedded systems like the Internet of Things (IoT) and autonomous vehicles, necessitates a balance between efficiency, in terms of memory, performance, and power consumption, and the quality of final outcomes. Approximate computing is one of the most promising techniques for many future applications, especially those related to human perception [15], [16]. Recent trends indicate its increasing adoption in various domains, including AI-based applications and services, supported by industry leaders like Google and IBM. Major corporations such as IBM, Google, Intel, and ARM are actively engaged in pioneering research and the development of commercial offerings that incorporate approximate computing strategies. For example, Google’s Tensor Processing Units (TPUs) employ an approximate computing technique known as reduced precision to lower energy usage [17]. Google also employs approximate computing strategies in its data centers to optimize energy usage without compromising quality of service. As another example, IBM has developed an AI accelerator chip capable of achieving high performance (in TOPS) by integrating multiple, multi-level approximate techniques [13].

These significant advancements achieved by AxC have motivated other communities, in electronic design automation (EDA) and software engineering, to develop tools and methodologies that facilitate approximate computing designs. Developing specialized hardware architectures optimized for AxC, coupled with corresponding software tools and programming models, will therefore be crucial for realizing its full potential. Effective error resilience techniques and error estimation frameworks to manage and mitigate errors introduced by approximation are essential for ensuring the reliability and robustness of AxC systems. To fully realize the potential of AxC, it is crucial to investigate its application across multiple layers of the computing stack, including hardware, architecture, software, and algorithms. Most current research concentrates on error-tolerant applications, but we believe the next research frontier is to demonstrate the effectiveness of these AxC techniques in safety-critical applications. A comprehensive approach to implementing AxC across different layers can therefore unlock new efficiencies and capabilities in modern computing systems.

One recent future direction is to build systems employing dynamic, adaptive approximation techniques that can adjust the level of approximation based on application requirements, input data characteristics, and available resources, ensuring optimal trade-offs between accuracy and performance. AxC is well-suited to machine learning and AI applications, where small losses in accuracy can be tolerated in exchange for significant performance gains; research will focus on developing approximate algorithms and hardware accelerators tailored for these applications. AxC is also expected to find applications in various emerging fields, such as IoT, edge computing, and embedded systems, where energy efficiency and real-time performance are critical.

Recent advancements in neuromorphic computing have addressed the power and latency issues of traditional digital systems. Researchers attempt to create more efficient and intelligent computer systems that mimic the human brain by constructing sophisticated hardware architectures and developing new theories and brain-inspired algorithms. Brain-inspired computing faces several significant challenges, holds promising future directions, and relates directly to emerging non-volatile memory (eNVM), which is attractive for implementing the synapses in neural networks [461]. Processing-In-Memory (PIM) is the most attractive architecture for designing brain-inspired computing models. The brain-inspired computing model is based on so-called Spiking Neural Networks (SNNs).
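The spiking model underlying SNNs can be reduced to a few lines. This minimal leaky integrate-and-fire (LIF) neuron — the threshold and leak values are arbitrary illustrative choices — shows the event-driven behavior that makes SNNs attractive for low-power hardware: output is produced only when a spike occurs:

```python
def lif_neuron(inputs, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire: the membrane potential accumulates input,
    decays by a leak factor each time step, and emits a binary spike
    (then resets) when it crosses the threshold."""
    v, spikes = 0.0, []
    for x in inputs:
        v = leak * v + x
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # reset after firing
        else:
            spikes.append(0)
    return spikes

print(lif_neuron([0.5, 0.5, 0.5, 0.0, 0.2]))  # → [0, 0, 1, 0, 0]
```

Because information is carried by sparse binary events rather than dense multiply-accumulate operations, such neurons map naturally onto eNVM synapses and PIM-style architectures.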


Recent research highlights the potential of hybrid neural networks (HNNs) in various applications. The emerging trend of designing HNNs by combining spiking and artificial neural networks leverages the strengths of both. Zhao et al. [510] therefore proposed a framework using hybrid units (HUs) to link and integrate multiple neural network structures, especially the integration of spiking neural networks (SNNs) within HNNs. Overall, the future of brain-inspired computing lies in continuing to refine these hybrid models and exploring new materials and architectures to bridge the gap between biological and artificial neural systems. By integrating approximate computing techniques, HNNs can achieve better performance and energy efficiency, making them more viable for large-scale, real-time applications. Researchers and designers should exploit this opportunity to align with the broader goal of creating scalable and sustainable AI systems that can handle increasingly complex tasks with minimal resources.

One of the primary concerns for the community is ensuring the security and privacy of data, especially when the data is processed using AC techniques; crucial future research areas focus on developing secure and privacy-preserving AC methodologies. The potential of approximate computing also extends beyond traditional AI and signal processing applications. Emerging areas such as hardware security, cryptocurrency mining, and lattice-based post-quantum cryptography are poised to benefit from the efficiency gains offered by approximate computing: these applications require significant computational resources and can tolerate a degree of error, making them ideal candidates for approximate computing techniques. We also believe that one of the challenges facing approximate computing is the lack of a systematic and theoretical foundation for AC, including formal models for error analysis, performance optimization, and algorithm design. Establishing a strong theoretical foundation will therefore guide future research and development in this field.

While challenges remain, ongoing research and development in approximate computing are paving the way for its widespread adoption. By leveraging approximate computing techniques, future systems can achieve significant improvements in energy efficiency and performance, particularly in error-tolerant applications. The growing demand for approximate computing will necessitate diverse contributions from various stakeholders, including hardware designers, system developers, test engineers, and researchers. Collaborative efforts across these disciplines will be essential to advance approximate computing as a mainstream paradigm. This interdisciplinary approach will help address the challenges associated with approximate computing, such as error management, reliability, and user acceptance. As the field continues to evolve, we expect to see more innovative applications and a growing integration of approximate computing into mainstream computing paradigms.

XV. CONCLUSION
In this paper, we explored the state of the art of approximate computing, focusing on its application to data, software, hardware, and architecture, and highlighting its benefits across various fields. The paper reviews recent progress and challenges in approximate computing, with a detailed examination of its significant impact, particularly in machine learning and IoT. The survey emphasizes the transformative potential of approximate computing in these areas and aims to enrich the research community, offering a valuable reference for researchers. We explored and discussed the state of the art in data-level approximation, focusing on the data sampling algorithms used in various frameworks to improve the efficiency and speed of processing large datasets. We discussed programming models and software frameworks used for processing large datasets across clusters of computers or on the cloud (e.g., ApproxHadoop), as well as systems such as BlinkDB, which is specifically designed for approximate queries on large datasets. We also reviewed state-of-the-art approximate data structures, which are no less important, as they offer efficiency in data storage and computation; for example, Bloom filters are widely used in IoT and wearable electronics, where battery life is a major concern.

In this review paper, we expanded our focus on approximations beyond the data level. We performed an extensive analysis of code optimization using approximate computing techniques and discussed and categorized the most important types of approximate programming languages. Regarding the architecture level, we discussed the state of the art in approximate memories and emphasized significant innovations in approximate Processing-In-Memory and Content-Addressable Memory (CAM), and we expanded our focus beyond memories to the latest innovative works on processors, especially in the AI domain. At the circuit level, we presented and discussed the state of the art of arithmetic units and of elementary and activation functions, with particular emphasis on approximate logic synthesis.

Regarding the application level, we focused on emerging IoT, DL/ML, and data mining applications and discussed software, hardware, cross-layer, and end-to-end approximations. We highlighted the traditional use of approximate computing techniques focusing on a single subsystem. The core argument is that, to realize the full benefits of approximate computing, we need to move beyond these siloed approaches and adopt a full-system approach that applies approximations strategically across different system layers. There are great advantages to using cross-layer and end-to-end approximations (a full-system approach): they can lead to significant improvements in speed, energy consumption, and overall optimization, and they can control holistic errors, since understanding the propagation of errors throughout the entire system enables better management of overall accuracy.
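The holistic error control argued for above can be made concrete with a first-order composition rule: if each layer contributes a bounded relative error, the end-to-end bound is the product of the per-stage factors. This is a standard worst-case estimate, sketched here as an illustration rather than taken from any specific system in the survey:

```python
def end_to_end_rel_error(stage_errors):
    """Worst-case relative error of a pipeline whose stage i introduces
    relative error eps_i: prod(1 + eps_i) - 1, which is approximately
    sum(eps_i) when every eps_i is small."""
    total = 1.0
    for eps in stage_errors:
        total *= 1.0 + eps
    return total - 1.0

# e.g., approximate sensor quantization (1%), an approximate multiplier (2%),
# and reduced-precision accumulation (0.5%):
print(round(end_to_end_rel_error([0.01, 0.02, 0.005]), 6))  # → 0.035351
```

Budgeting against such a bound lets a designer decide which layer can absorb more approximation before the application-level quality constraint is violated — the essence of the full-system view.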


Finally, they can tailor solutions: a full-system view allows custom-designed approximations that match the specific tolerance and performance requirements of individual applications. We reported well-established libraries and benchmark suites for evaluating the quality of service of approximate designs, presented some open-source tools and logic synthesis frameworks, and discussed the security concerns of using approximate computing. Despite advances in approximate computing, there is a critical need for continued innovation to unlock its full potential in complex system designs. Our survey concludes with a discussion of these challenges and future research directions, including establishing standardized benchmarks and error metrics for approximate computing; this will enable researchers and designers to compare different approaches and help users select the most appropriate solution for their use case.

REFERENCES
[1] A. G. M. Strollo and D. Esposito, ‘‘Approximate computing in the nanoscale era,’’ in Proc. Int. Conf. IC Design Technol. (ICICDT), Jun. 2018, pp. 21–24, doi: 10.1109/ICICDT.2018.8399746.
[2] A. Najafi, ‘‘Systematic design of low-power processing elements using stochastic and approximate computing techniques,’’ Ph.D. thesis, Dept. Phys. Electron. Eng., Univ. Bremen, 2021, doi: 10.26092/elib/460.
[3] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, ‘‘Dark silicon and the end of multicore scaling,’’ in Proc. 38th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2011, pp. 365–376.
[4] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, ‘‘Toward dark silicon in servers,’’ IEEE Micro, vol. 31, no. 4, pp. 6–15, Jul. 2011, doi: 10.1109/MM.2011.77.
[5] M. Shafique, S. Garg, J. Henkel, and D. Marculescu, ‘‘The EDA challenges in the dark silicon era,’’ in Proc. 51st ACM/EDAC/IEEE Design Autom. Conf. (DAC), Jun. 2014, pp. 1–6, doi: 10.1145/2593069.2593229.
[6] J. L. Hennessy and D. A. Patterson, ‘‘A new golden age for computer architecture,’’ Commun. ACM, vol. 62, no. 2, pp. 48–60, Jan. 2019, doi: 10.1145/3282307.
[7] K. Rupp. (2024). Karlrupp/Microprocessor-Trend-Data. Accessed: Sep. 2, 2024. [Online]. Available: https://github.com/karlrupp/microprocessor-trend-data
[8] S. Dutt, S. Nandi, and G. Trivedi, ‘‘A comparative survey of approximate adders,’’ in Proc. 26th Int. Conf. Radioelektronika (RADIOELEKTRONIKA), Apr. 2016, pp. 61–65, doi: 10.1109/RADIOELEK.2016.7477392.
[9] S. Venkataramani, S. T. Chakradhar, K. Roy, and A. Raghunathan, ‘‘Approximate computing and the quest for computing efficiency,’’ in Proc. 52nd ACM/EDAC/IEEE Design Autom. Conf. (DAC), Jun. 2015, pp. 1–6, doi: 10.1145/2744769.2744904.
[10] M. D. Hill and M. R. Marty, ‘‘Amdahl’s law in the multicore era,’’ Computer, vol. 41, no. 7, pp. 33–38, Jul. 2008, doi: 10.1109/mc.2008.209.
[11] M. Shafique and S. Garg, ‘‘Computing in the dark silicon era: Current trends and research challenges,’’ IEEE Des. Test. Comput., vol. 34, no. 2, pp. 8–23, Apr. 2017, doi: 10.1109/MDAT.2016.2633408.
[12] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn, ‘‘Flikker: Saving DRAM refresh-power through critical data partitioning,’’ ACM SIGPLAN Notices, vol. 46, no. 3, pp. 213–224, Mar. 2011, doi: 10.1145/1961296.1950391.
[13] W. Liu, F. Lombardi, and M. Shulte, ‘‘A retrospective and prospective view of approximate computing [point of view],’’ Proc. IEEE, vol. 108, no. 3, pp. 394–399, Mar. 2020, doi: 10.1109/JPROC.2020.2975695.
[14] V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, ‘‘Analysis and characterization of inherent application resilience for approximate computing,’’ in Proc. 50th ACM/EDAC/IEEE Design Autom. Conf. (DAC). New York, NY, USA: ACM, May 2013, pp. 1–9, doi: 10.1145/2463209.2488873.
[15] A. Dalloo, ‘‘Enhance the segmentation principle in approximate computing,’’ in Proc. Int. Conf. Circuits Syst. Digit. Enterprise Technol. (ICCSDET), Dec. 2018, pp. 1–7, doi: 10.1109/ICCSDET.2018.8821112.
[16] A. Dalloo, A. Najafi, and A. Garcia-Ortiz, ‘‘Systematic design of an approximate adder: The optimized lower part constant-OR adder,’’ IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 8, pp. 1595–1599, Aug. 2018, doi: 10.1109/TVLSI.2018.2822278.
[17] N. P. Jouppi et al., ‘‘In-datacenter performance analysis of a tensor processing unit,’’ in Proc. ACM/IEEE 44th Annu. Int. Symp. Comput. Archit. (ISCA). New York, NY, USA: ACM, Jun. 2017, pp. 1–12, doi: 10.1145/3079856.3080246.
[18] N. Enright Jerger and J. San Miguel, ‘‘Approximate computing,’’ IEEE Micro, vol. 38, no. 4, pp. 8–10, Jul. 2018, doi: 10.1109/MM.2018.043191120.
[19] Q. Xu, T. Mytkowicz, and N. S. Kim, ‘‘Approximate computing: A survey,’’ IEEE Des. Test. Comput., vol. 33, no. 1, pp. 8–22, Feb. 2016, doi: 10.1109/MDAT.2015.2505723.
[20] D. Mohapatra, V. K. Chippa, A. Raghunathan, and K. Roy, ‘‘Design of voltage-scalable meta-functions for approximate computing,’’ in Proc. Design, Autom. Test Eur., Mar. 2011, pp. 1–6, doi: 10.1109/DATE.2011.5763154.
[21] A. Rahimi, A. Ghofrani, K.-T. Cheng, L. Benini, and R. K. Gupta, ‘‘Approximate associative memristive memory for energy-efficient GPUs,’’ in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2015, pp. 1497–1502, doi: 10.7873/DATE.2015.0579.
[22] G. Rodrigues, F. Lima Kastensmidt, and A. Bosio, ‘‘Survey on approximate computing and its intrinsic fault tolerance,’’ Electronics, vol. 9, no. 4, p. 557, Mar. 2020, doi: 10.3390/electronics9040557.
[23] T. Alan, A. Gerstlauer, and J. Henkel, ‘‘Cross-layer approximate hardware synthesis for runtime configurable accuracy,’’ IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 6, pp. 1231–1243, Apr. 2021, doi: 10.1109/TVLSI.2021.3068312.
[24] Z. Ebrahimi, D. Klar, M. A. Ekhtiyar, and A. Kumar, ‘‘Plasticine: A cross-layer approximation methodology for multi-kernel applications through minimally biased, high-throughput, and energy-efficient SIMD soft multiplier-divider,’’ ACM Trans. Design Autom. Electron. Syst., vol. 27, no. 2, pp. 1–33, Nov. 2021, doi: 10.1145/3486616.
[25] M. A. Hanif and M. Shafique, ‘‘A cross-layer approach towards developing efficient embedded deep learning systems,’’ Microprocessors Microsyst., vol. 88, Feb. 2022, Art. no. 103609, doi: 10.1016/j.micpro.2020.103609.
[26] S. Mittal, ‘‘A survey of techniques for approximate computing,’’ ACM Comput. Surv., vol. 48, no. 4, pp. 1–33, Mar. 2016, doi: 10.1145/2893356.
[27] H. Jiang, C. Liu, L. Liu, F. Lombardi, and J. Han, ‘‘A review, classification, and comparative evaluation of approximate arithmetic circuits,’’ ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 4, pp. 1–34, Aug. 2017, doi: 10.1145/3094124.
[28] F. Betzel, K. Khatamifard, H. Suresh, D. J. Lilja, J. Sartori, and U. Karpuzcu, ‘‘Approximate communication: Techniques for reducing communication bottlenecks in large-scale parallel systems,’’ ACM Comput. Surv., vol. 51, no. 1, pp. 1–32, Jan. 2018, doi: 10.1145/3145812.
[29] G. Zervakis, H. Saadat, H. Amrouch, A. Gerstlauer, S. Parameswaran, and J. Henkel, ‘‘Approximate computing for ML: State-of-the-art, challenges and visions,’’ in Proc. 26th Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2021, pp. 189–196.
[30] J. Henkel, H. Li, A. Raghunathan, M. B. Tahoori, S. Venkataramani, X. Yang, and G. Zervakis, ‘‘Approximate computing and the efficient machine learning expedition,’’ in Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD). New York, NY, USA: ACM, Oct. 2022, pp. 1–9.
[31] J. Lee, L. Mukhanov, A. S. Molahosseini, U. Minhas, Y. Hua, J. M. del Rincon, K. Dichev, C.-H. Hong, and H. Vandierendonck, ‘‘Resource-efficient convolutional networks: A survey on model-, arithmetic-, and implementation-level techniques,’’ ACM Comput. Surv., vol. 55, no. 13s, pp. 1–36, Dec. 2023, doi: 10.1145/3587095.
[32] H.-H. Que, Y. Jin, T. Wang, M.-K. Liu, X.-H. Yang, and F. Qiao, ‘‘A survey of approximate computing: From arithmetic units design to high-level applications,’’ J. Comput. Sci. Technol., vol. 38, no. 2, pp. 251–272, Apr. 2023, doi: 10.1007/s11390-023-2537-y.
[33] H. J. Damsgaard, A. Ometov, and J. Nurmi, ‘‘Approximation opportunities in edge computing hardware: A systematic literature review,’’ ACM Comput. Surv., vol. 55, no. 12, pp. 1–49, Mar. 2023, doi: 10.1145/3572772.


[34] V. Leon, M. Abdullah Hanif, G. Armeniakos, X. Jiao, M. Shafique, K. Pekmestzi, and D. Soudris, ‘‘Approximate computing survey, Part I: Terminology and software & hardware approximation techniques,’’ 2023, arXiv:2307.11124.
[35] V. Leon, M. Abdullah Hanif, G. Armeniakos, X. Jiao, M. Shafique, K. Pekmestzi, and D. Soudris, ‘‘Approximate computing survey, Part II: Application-specific & architectural approximation techniques and applications,’’ 2023, arXiv:2307.11128.
[36] K. K. Pandey and D. Shukla, ‘‘Stratified sampling-based data reduction and categorization model for big data mining,’’ in Communication and Intelligent Systems (Lecture Notes in Networks and Systems), J. C. Bansal, M. K. Gupta, H. Sharma, and B. Agarwal, Eds., Singapore: Springer, 2020, pp. 107–122, doi: 10.1007/978-981-15-3325-9_9.
[37] T. D. Nguyen, M.-H. Shih, D. Srivastava, S. Tirthapura, and B. Xu, ‘‘Stratified random sampling from streaming and stored data,’’ Distrib. Parallel Databases, vol. 39, no. 3, pp. 665–710, Sep. 2021, doi: 10.1007/s10619-020-07315-w.
[38] J. Zhang, H. Chen, D. Yu, Y. Pei, and Y. Deng, ‘‘Cluster-preserving sampling algorithm for large-scale graphs,’’ Sci. China Inf. Sci., vol. 66, no. 1, Nov. 2022, Art. no. 112103, doi: 10.1007/s11432-021-3370-4.
[39] S. Shankar and A. G. Parameswaran, ‘‘Towards observability for production machine learning pipelines,’’ Proc. VLDB Endowment, vol. 15, no. 13, pp. 4015–4022, Sep. 2022, doi: 10.14778/3565838.3565853.
[40] B. G. Galuzzi, L. Milazzo, and C. Damiani, ‘‘Best practices in flux sampling of constrained-based models,’’ in Machine Learning, Optimization,
[52] Y. Park, J. Qing, X. Shen, and B. Mozafari, ‘‘BlinkML: Efficient maximum likelihood estimation with probabilistic guarantees,’’ in Proc. Int. Conf. Manage. Data, Jun. 2019, pp. 1135–1152, doi: 10.1145/3299869.3300077.
[53] M. R. Anderson and M. Cafarella, ‘‘Input selection for fast feature engineering,’’ in Proc. IEEE 32nd Int. Conf. Data Eng. (ICDE), May 2016, pp. 577–588, doi: 10.1109/ICDE.2016.7498272.
[54] Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson, and D. Kudithipudi, ‘‘Performance-efficiency trade-off of low-precision numerical formats in deep neural networks,’’ 2019, arXiv:1903.10584.
[55] S. Cherubin and G. Agosta, ‘‘Tools for reduced precision computation: A survey,’’ ACM Comput. Surv., vol. 53, no. 2, pp. 1–35, Apr. 2020, doi: 10.1145/3381039.
[56] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, ‘‘A survey of quantization methods for efficient neural network inference,’’ 2021, arXiv:2103.13630.
[57] T. Zebin, P. J. Scully, N. Peek, A. J. Casson, and K. B. Ozanyan, ‘‘Design and implementation of a convolutional neural network on an edge computing smartphone for human activity recognition,’’ IEEE Access, vol. 7, pp. 133509–133520, 2019, doi: 10.1109/ACCESS.2019.2941836.
[58] Quantization Aware Training | TensorFlow Model Optimization. Accessed: Mar. 15, 2023. [Online]. Available: https://www.tensorflow.org/model_optimization/guide/quantization/training
[59] P.-E. Novac, G. Boukli Hacene, A. Pegatoquet, B. Miramond, and
and Data Science (Lecture Notes in Computer Science), G. Nicosia, V. Gripon, ‘‘Quantization and deployment of deep neural networks on
V. Ojha, E. La Malfa, G. La Malfa, P. Pardalos, G. Di Fatta, G. Giuffrida, microcontrollers,’’ Sensors, vol. 21, no. 9, p. 2984, Apr. 2021, doi:
and R. Umeton, Eds., Cham, Switzerland: Springer, 2023, pp. 234–248, 10.3390/s21092984.
doi: 10.1007/978-3-031-25891-6_18. [60] J. Zhai, B. Li, S. Lv, and Q. Zhou, ‘‘FPGA-based vehicle detection and
[41] N. Sobhani and S. J. Delany, ‘‘Identity term sampling for measuring tracking accelerator,’’ Sensors, vol. 23, no. 4, p. 2208, Feb. 2023, doi:
gender bias in training data,’’ in Artificial Intelligence and Cognitive 10.3390/s23042208.
Science (Communications in Computer and Information Science). Cham, [61] P. Colangelo, N. Nasiri, E. Nurvitadhi, A. Mishra, M. Margala, and
Switzerland: Springer, 2023, pp. 226–238, doi: 10.1007/978-3-031- K. Nealis, ‘‘Exploration of low numeric precision deep learning inference
26438-2_18. using Intel FPGAs,’’ in Proc. IEEE 26th Annu. Int. Symp. Field-
[42] H. Wu, H. Xu, X. Tian, W. Zhang, and C. Lu, ‘‘Multistage sampling Program. Custom Comput. Mach. (FCCM), Apr. 2018, pp. 73–80, doi:
and optimization for forest volume inventory based on spatial auto- 10.1109/FCCM.2018.00020.
correlation analysis,’’ Forests, vol. 14, no. 2, p. 250, Jan. 2023, doi: [62] G. Dai and J. Fan, ‘‘An industrial-grade solution for crop
10.3390/f14020250. disease image detection tasks,’’ Frontiers Plant Sci., vol. 13,
[43] B. Zhang, Y. Du, H. Huang, Y.-E. Sun, G. Gao, X. Wang, and S. Chen, pp. 1–12, Jun. 2022. [Online]. Available: https://www.
‘‘Multi-layer adaptive sampling for per-flow spread measurement,’’ in frontiersin.org/articles/10.3389/fpls.2022.921057
Algorithms and Architectures for Parallel Processing (Lecture Notes in [63] M. M. Farag, ‘‘A self-contained STFT CNN for ECG classifica-
Computer Science), Y. Lai, T. Wang, M. Jiang, G. Xu, W. Liang, and tion and arrhythmia detection at the edge,’’ IEEE Access, vol. 10,
A. Castiglione, Eds., Cham, Switzerland: Springer, 2022, pp. 743–758, pp. 94469–94486, 2022, doi: 10.1109/ACCESS.2022.3204703.
doi: 10.1007/978-3-030-95384-3_46. [64] D. Costa, M. Costa, and S. Pinto, ‘‘Train me if you can: Decentralized
[44] S. Moshtaghi Largani and S. Lee, ‘‘Efficient sampling for big learning on the deep edge,’’ Appl. Sci., vol. 12, no. 9, p. 4653, May 2022,
provenance,’’ in Proc. Companion ACM Web Conf. New York, doi: 10.3390/app12094653.
NY, USA: ACM, Apr. 2023, pp. 1508–1511, doi: 10.1145/3543873. [65] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia,
3587556. B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, ‘‘Mixed
[45] V. Sanca and A. Ailamaki, ‘‘Sampling-based AQP in modern analytical precision training,’’ 2017, arXiv:1710.03740.
engines,’’ in Data Management on New Hardware. New York, NY, USA: [66] Train With Mixed Precision. Accessed: Jun. 8, 2023. [Online]. Available:
ACM, Jun. 2022, pp. 1–8, doi: 10.1145/3533737.3535095. https://docs.nvidia.com/deeplearning/performance/mixed-precision-
[46] E. A. Deiana, V. St-Amour, P. A. Dinda, N. Hardavellas, and training/index.html
S. Campanoni, ‘‘Unconventional parallelization of nondeterministic [67] S. Kang, K. Choi, and Y. Park, ‘‘PreScaler: An efficient system-
applications,’’ in Proc. 23rd Int. Conf. Architectural Support Pro- aware precision scaling framework on heterogeneous systems,’’ in
gram. Lang. Operating Syst. New York, NY, USA: ACM, Mar. 2018, Proc. 18th ACM/IEEE Int. Symp. Code Gener. Optim., vol. 48. New
pp. 432–447, doi: 10.1145/3173162.3173181. York, NY, USA: ACM, Feb. 2020, pp. 280–292, doi: 10.1145/3368826.
[47] N. Laptev, K. Zeng, and C. Zaniolo, ‘‘Early accurate results for advanced 3377917.
analytics on MapReduce,’’ 2012, arXiv:1207.0142. [68] S. Yesil, I. Akturk, and U. R. Karpuzcu, ‘‘Toward dynamic preci-
[48] I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen, ‘‘Approx- sion scaling,’’ IEEE Micro, vol. 38, no. 4, pp. 30–39, Jul. 2018, doi:
Hadoop: Bringing approximations to MapReduce frameworks,’’ ACM 10.1109/MM.2018.043191123.
SIGPLAN Notices, vol. 50, no. 4, pp. 383–397, Mar. 2015, doi: [69] W.-F. Chiang, M. Baranowski, I. Briggs, A. Solovyev, G. Gopalakrishnan,
10.1145/2775054.2694351. and Z. Rakamarić, ‘‘Rigorous floating-point mixed-precision tuning,’’
[49] G. Hu, D. Zhang, S. Rigo, and T. D. Nguyen, ‘‘Approximation with error ACM SIGPLAN Notices, vol. 52, no. 1, pp. 300–315, Jan. 2017, doi:
bounds in spark,’’ 2018, arXiv:1812.01823. 10.1145/3093333.3009846.
[50] D. L. Quoc, R. Chen, P. Bhatotia, C. Fetzer, V. Hilt, and T. Strufe, [70] P. V. Kotipalli, R. Singh, P. Wood, I. Laguna, and S. Bagchi, ‘‘AMPT-GA:
‘‘StreamApprox: Approximate computing for stream analytics,’’ Automatic mixed precision floating point tuning for GPU applications,’’
in Proc. 18th ACM/IFIP/USENIX Middleware Conf., Dec. 2017, in Proc. ACM Int. Conf. Supercomputing. New York, NY, USA: ACM,
pp. 185–197, doi: 10.1145/3135974.3135989. Jun. 2019, pp. 160–170, doi: 10.1145/3330345.3330360.
[51] Z. Wen, D. L. Quoc, P. Bhatotia, R. Chen, and M. Lee, ‘‘Approx- [71] H. Guo and C. Rubio-González, ‘‘Exploiting community structure
IoT: Approximate analytics for edge computing,’’ in Proc. IEEE 38th for floating-point precision tuning,’’ in Proc. 27th ACM SIGSOFT
Int. Conf. Distrib. Comput. Syst. (ICDCS), Jul. 2018, pp. 411–421, doi: Int. Symp. Softw. Test. Anal. New York, NY, USA: ACM, Jul. 2018,
10.1109/ICDCS.2018.00048. pp. 333–343, doi: 10.1145/3213846.3213862.

VOLUME 12, 2024 146075


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

[72] S. Garg, J. Lou, A. Jain, Z. Guo, B. J. Shastri, and M. Nahmias, [89] R. Patgiri, A. Biswas, and S. Nayak, ‘‘DeepBF: Malicious URL detec-
‘‘Dynamic precision analog computing for neural networks,’’ IEEE tion using learned Bloom filter and evolutionary deep learning,’’ 2021,
J. Sel. Topics Quantum Electron., vol. 29, no. 2, pp. 1–12, Mar. 2023, arXiv:2103.12544.
doi: 10.1109/JSTQE.2022.3218019. [90] B. H. Bloom, ‘‘Space/time trade-offs in hash coding with allowable
[73] G. Giamougiannis, A. Tsakyridis, M. Moralis-Pegios, C. Pappas, errors,’’ Commun. ACM, vol. 13, no. 7, pp. 422–426, Jul. 1970, doi:
M. Kirtas, N. Passalis, D. Lazovsky, A. Tefas, and N. Pleros, ‘‘Ana- 10.1145/362686.362692.
log nanophotonic computing going practical: Silicon photonic deep [91] S. Z. Kiss, É. Hosszu, J. Tapolcai, L. Rónyai, and O. Rottenstreich,
learning engines for tiled optical matrix multiplication with dynamic ‘‘Bloom filter with a false positive free zone,’’ IEEE Trans. Netw. Service
precision,’’ Nanophotonics, vol. 12, no. 5, pp. 963–973, Mar. 2023, doi: Manage., vol. 18, no. 2, pp. 2334–2349, Jun. 2021, doi:
10.1515/nanoph-2022-0423. 10.1109/TNSM.2021.3059075.
[74] W. Fornaciari, G. Agosta, D. Cattaneo, L. Denisov, A. Galimberti, [92] Y. Wu, J. He, S. Yan, J. Wu, T. Yang, O. Ruas, G. Zhang, and B. Cui,
G. Magnani, and D. Zoni, ‘‘Hardware and software support for mixed ‘‘Elastic Bloom filter: Deletable and expandable filter using elastic fin-
precision computing: A roadmap for embedded and HPC systems,’’ in gerprints,’’ IEEE Trans. Comput., vol. 71, no. 4, pp. 984–991, Apr. 2022,
Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Apr. 2023, pp. 1–6, doi: 10.1109/TC.2021.3067713.
doi: 10.23919/date56975.2023.10137092. [93] F. G. Gebretsadik, S. Nayak, and R. Patgiri, ‘‘EBF: An enhanced Bloom
[75] S. Yamagiwa, W. Yang, and K. Wada, ‘‘Adaptive lossless image data filter for intrusion detection in IoT,’’ J. Big Data, vol. 10, no. 1, p. 102,
compression method inferring data entropy by applying deep neural Jun. 2023, doi: 10.1186/s40537-023-00790-9.
network,’’ Electronics, vol. 11, no. 4, p. 504, Feb. 2022, doi: 10.3390/elec- [94] H. A. Seymen and M. E. Yalçın, ‘‘Design and implementation of a
tronics11040504. lightweight Bloom filter accelerator for IoT applications,’’ in Proc. 14th
[76] H. M. Yasin and S. Y. Ameen, ‘‘Review and evaluation of end-to-end Int. Conf. Electr. Electron. Eng. (ELECO), Nov. 2023, pp. 1–5, doi:
video compression with deep-learning,’’ in Proc. Int. Conf. Mod. Trends 10.1109/eleco60389.2023.10415987.
Inf. Commun. Technol. Ind. (MTICTI), Dec. 2021, pp. 1–8, doi: [95] L. Luo, D. Guo, R. T. B. Ma, O. Rottenstreich, and X. Luo, ‘‘Opti-
10.1109/MTICTI53925.2021.9664790. mizing Bloom filter: Challenges, solutions, and comparisons,’’ 2018,
[77] C. Ma, D. Liu, X. Peng, L. Li, and F. Wu, ‘‘Convolutional neural arXiv:1804.04777.
network-based arithmetic coding for HEVC intra-predicted residues,’’ [96] A. Singh, S. Garg, R. Kaur, S. Batra, N. Kumar, and A. Y. Zomaya,
IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1901–1916, ‘‘Probabilistic data structures for big data analytics: A comprehensive
Jul. 2020, doi: 10.1109/TCSVT.2019.2927027. review,’’ Knowl.-Based Syst., vol. 188, Jan. 2020, Art. no. 104987, doi:
[78] S. Wiedemann, H. Kirchhoffer, S. Matlage, P. Haase, A. Marban, 10.1016/j.knosys.2019.104987.
T. Marinc, D. Neumann, A. Osman, D. Marpe, H. Schwarz, T. Wiegand, [97] P. Reviriego, P. Junsangsri, S. Liu, and F. Lombardi, ‘‘Error-tolerant
and W. Samek, ‘‘DeepCABAC: Context-adaptive binary arithmetic cod- data sketches using approximate nanoscale memories and voltage
ing for deep neural network compression,’’ 2019, arXiv:1905.08318. scaling,’’ IEEE Trans. Nanotechnol., vol. 21, pp. 16–22, 2022, doi:
10.1109/TNANO.2021.3139394.
[79] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, ‘‘AMC: AutoML for
[98] F. Deng and D. Rafiei, ‘‘New estimation algorithms for stream-
model compression and acceleration on mobile devices,’’ in Computer
ing data: Count-min can do more,’’ Webdocs.Cs.Ualberta.Ca, Univ.
Vision—ECCV 2018 (Lecture Notes in Computer Science), V. Ferrari,
Alberta, Edmonton, AB, Canada, 2007. [Online]. Available: http://www.
M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., Cham, Switzerland:
cs.ualberta.ca/~fandeng/paper/cmm.pdf
Springer, 2018, pp. 815–832, doi: 10.1007/978-3-030-01234-2_48.
[99] G. Pitel and G. Fouquier, ‘‘Count-min-log sketch: Approximately count-
[80] D. Dai, C. Dong, S. Xu, Q. Yan, Z. Li, C. Zhang, and N. Luo, ‘‘Ms RED:
ing with approximate counters,’’ 2015, arXiv:1502.04885.
A novel multi-scale residual encoding and decoding network for
[100] Z. Wei, Y. Tian, W. Chen, L. Gu, and X. Zhang, ‘‘DUNE: Improv-
skin lesion segmentation,’’ Med. Image Anal., vol. 75, Jan. 2022,
ing accuracy for sketch-INT network measurement systems,’’ 2022,
Art. no. 102293, doi: 10.1016/j.media.2021.102293.
arXiv:2212.04816.
[81] D. G. Cortés, E. Onieva, I. P. López, L. Trinchera, and J. Wu, [101] T. Yang, J. Jiang, P. Liu, Q. Huang, J. Gong, Y. Zhou, R. Miao,
‘‘Autoencoder-enhanced clustering: A dimensionality reduction approach X. Li, and S. Uhlig, ‘‘Elastic sketch: Adaptive and fast network-
to financial time series,’’ IEEE Access, vol. 12, pp. 16999–17009, 2024, wide measurements,’’ in Proc. Conf. ACM Special Interest Group Data
doi: 10.1109/ACCESS.2024.3359413. Commun. New York, NY, USA: ACM, Aug. 2018, pp. 561–575, doi:
[82] Z. Duan, M. Lu, J. Ma, Y. Huang, Z. Ma, and F. Zhu, ‘‘QARV: 10.1145/3230543.3230544.
Quantization-aware ResNet VAE for lossy image compression,’’ IEEE [102] K. Zhao, J. Wang, H. Qi, X. Xie, X. Zhou, and K. Li, ‘‘HBL-sketch:
Trans. Pattern Anal. Mach. Intell., vol. 46, no. 1, pp. 436–450, Jan. 2024, A new three-tier sketch for accurate network measurement,’’ in Algo-
doi: 10.1109/TPAMI.2023.3322904. rithms and Architectures for Parallel Processing (Lecture Notes in
[83] H. Zhang, Z. Hu, C. Luo, W. Zuo, and M. Wang, ‘‘Semantic image Computer Science), S. Wen, A. Zomaya, and L. T. Yang, Eds., Cham,
inpainting with progressive generative networks,’’ in Proc. 26th ACM Switzerland: Springer, 2020, pp. 48–59, doi: 10.1007/978-3-030-38991-
Int. Conf. Multimedia. New York, NY, USA: ACM, Oct. 2018, 8_4.
pp. 1939–1947, doi: 10.1145/3240508.3240625. [103] T. Yang, S. Gao, Z. Sun, Y. Wang, Y. Shen, and X. Li, ‘‘Diamond
[84] Y. Yang, K. Zheng, B. Wu, Y. Yang, and X. Wang, ‘‘Network intru- sketch: Accurate per-flow measurement for big streaming data,’’ IEEE
sion detection based on supervised adversarial variational auto-encoder Trans. Parallel Distrib. Syst., vol. 30, no. 12, pp. 2650–2662, Dec. 2019,
with regularization,’’ IEEE Access, vol. 8, pp. 42169–42184, 2020, doi: doi: 10.1109/TPDS.2019.2923772.
10.1109/ACCESS.2020.2977007. [104] J. Zhu, J. Jin, Z. Gao, and P. Reviriego, ‘‘Single event transient tol-
[85] S. Wiedemann, H. Kirchhoffer, S. Matlage, P. Haase, A. Marban, erant count min sketches,’’ Microelectron. Rel., vol. 129, Feb. 2022,
T. Marinc, D. Neumann, T. Nguyen, H. Schwarz, T. Wiegand, D. Marpe, Art. no. 114486, doi: 10.1016/j.microrel.2022.114486.
and W. Samek, ‘‘DeepCABAC: A universal compression algorithm for [105] G. Cormode and S. Muthukrishnan, ‘‘An improved data stream summary:
deep neural networks,’’ IEEE J. Sel. Topics Signal Process., vol. 14, no. 4, The count-min sketch and its applications,’’ J. Algorithms, vol. 55, no. 1,
pp. 700–714, May 2020, doi: 10.1109/JSTSP.2020.2969554. pp. 58–75, Apr. 2005, doi: 10.1016/j.jalgor.2003.12.001.
[86] X. Wang, Z. Liu, Y. Gao, X. Zheng, X. Chen, and C. Wu, ‘‘Near-optimal [106] A. Ebrahim, ‘‘High-level design optimizations for implementing data
data structure for approximate range emptiness problem in information- stream sketch frequency estimators on FPGAs,’’ Electronics, vol. 11,
centric Internet of Things,’’ IEEE Access, vol. 7, pp. 21857–21869, 2019, no. 15, p. 2399, Jul. 2022, doi: 10.3390/electronics11152399.
doi: 10.1109/ACCESS.2019.2897154. [107] A. Khan and S. Yan, ‘‘Composite hashing for data stream sketches,’’ 2018,
[87] P. H. Chia, D. Desfontaines, I. M. Perera, D. Simmons-Marengo, C. Li, arXiv:1808.06800.
W.-Y. Day, Q. Wang, and M. Guevara, ‘‘KHyperLogLog: Estimating [108] N. Seleznev, S. Kumar, and C. B. Bruss, ‘‘Double-hashing algorithm for
reidentifiability and joinability of large data at scale,’’ in Proc. IEEE frequency estimation in data streams,’’ 2022, arXiv:2204.00650.
Symp. Secur. Privacy (SP), May 2019, doi: 10.1109/SP.2019.00046. [109] P. Tyagi, M. C. Malta, and A. Dutta, ‘‘Hashing for cleaner
[88] X. Yang, A. Vernitski, and L. Carrea, ‘‘An approximate dynamic pro- reverse engineered queries for the entity comparison problem
gramming approach for improving accuracy of lossy data compression in RDF graphs,’’ in Proc. IEEE/WIC/ACM Int. Joint Conf. Web
by Bloom filters,’’ Eur. J. Oper. Res., vol. 252, no. 3, pp. 985–994, Intell. Intell. Agent Technol. (WI-IAT), Dec. 2020, pp. 177–186, doi:
Aug. 2016, doi: 10.1016/j.ejor.2016.01.042. 10.1109/WIIAT50758.2020.00028.

146076 VOLUME 12, 2024


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

[110] X. Zhu, G. Wu, H. Zhang, S. Wang, and B. Ma, ‘‘Dynamic count-min [131] M. Karakoy, O. Kislal, X. Tang, M. T. Kandemir, and M. Arunachalam,
sketch for analytical queries over continuous data streams,’’ in Proc. IEEE ‘‘Architecture-aware approximate computing,’’ Proc. ACM
25th Int. Conf. High Perform. Comput. (HiPC), Dec. 2018, pp. 225–234, Meas. Anal. Comput. Syst., vol. 3, no. 2, pp. 1–24, Jun. 2019, doi:
doi: 10.1109/HIPC.2018.00033. 10.1145/3341617.3326153.
[111] P. Jia, P. Wang, J. Zhao, J. Tao, Y. Yuan, and X. Guan, [132] S. Leroux, P. Molchanov, P. Simoens, B. Dhoedt, T. Breuel, and J. Kautz,
‘‘Erasable virtual HyperLogLog for approximating cumulative ‘‘IamNN: Iterative and adaptive mobile neural network for efficient image
distribution over data streams,’’ IEEE Trans. Knowl. Data Eng., classification,’’ 2018, arXiv:1804.10123.
vol. 34, no. 11, pp. 5336–5350, Nov. 2022. [Online]. Available: [133] J. Yang, Y. Bhalgat, S. Chang, F. Porikli, and N. Kwak, ‘‘Dynamic
https://ieeexplore.ieee.org/document/9328544 iterative refinement for efficient 3D hand pose estimation,’’ 2021,
[112] K. G. Paterson and M. Raynal, ‘‘HyperLogLog: Exponentially bad in arXiv:2111.06500.
adversarial settings,’’ in Proc. IEEE 7th Eur. Symp. Secur. Privacy, [134] Y. Yoo, D. Han, and S. Yun, ‘‘EXTD: Extremely tiny face detector via
Jun. 2022, pp. 154–170, doi: 10.1109/EuroSP53844.2022.00018. iterative filter reuse,’’ 2019, arXiv:1906.06579.
[113] S. Heule, M. Nunkesser, and A. Hall, ‘‘HyperLogLog in practice: Algo- [135] W. Jin, J. Wohlwend, R. Barzilay, and T. Jaakkola, ‘‘Iterative refinement
rithmic engineering of a state of the art cardinality estimation algorithm,’’ graph neural network for antibody sequence-structure co-design,’’ 2021,
in Proc. 16th Int. Conf. Extending Database Technol. New York, NY, arXiv:2110.04624.
USA: ACM, Mar. 2013, pp. 683–692, doi: 10.1145/2452376.2452456. [136] Y. Tian, Y. Zhang, and H. Zhang, ‘‘Recent advances in stochastic gradient
descent in deep learning,’’ Mathematics, vol. 11, no. 3, p. 682, Jan. 2023,
[114] M. Karppa and R. Pagh, ‘‘HyperLogLogLog: Cardinality estimation with
doi: 10.3390/math11030682.
one log more,’’ 2022, arXiv:2205.11327.
[137] C. Shieh, S. Ofner, and C. B. Draucker, ‘‘Reasons for and associated char-
[115] Q. Xiao, S. Chen, Y. Zhou, and J. Luo, ‘‘Estimating cardinality acteristics with early study termination: Analysis of ClinicalTrials.Gov
for arbitrarily large data stream with improved memory efficiency,’’ data on pregnancy topics,’’ Nursing Outlook, vol. 70, no. 2, pp. 271–279,
IEEE/ACM Trans. Netw., vol. 28, no. 2, pp. 433–446, Apr. 2020, doi: Mar. 2022, doi: 10.1016/j.outlook.2021.12.006.
10.1109/TNET.2020.2970860.
[138] M. Mahsereci, L. Balles, C. Lassner, and P. Hennig, ‘‘Early stopping
[116] J. Xu, ‘‘Cardinalities estimation under sliding time window by sharing without a validation set,’’ 2017, arXiv:1703.09580.
HyperLogLog counter,’’ 2018, arXiv:1810.13132. [139] M. V. Ferro, Y. D. Mosquera, F. J. R. Pena, and V. M. D. Bilbao,
[117] O. Ertl, ‘‘New cardinality estimation algorithms for HyperLogLog ‘‘Early stopping by correlating online indicators in neural
sketches,’’ 2017, arXiv:1702.01284. networks,’’ Neural Netw., vol. 159, pp. 109–124, Feb. 2023, doi:
[118] W. Li, Y. Zhang, Y. Sun, W. Wang, M. Li, W. Zhang, and X. Lin, ‘‘Approx- 10.1016/j.neunet.2022.11.035.
imate nearest neighbor search on high dimensional data—Experiments, [140] Y. Bai, E. Yang, B. Han, Y. Yang, J. Li, Y. Mao, G. Niu, and T. Liu,
analyses, and improvement,’’ IEEE Trans. Knowl. Data Eng., vol. 32, ‘‘Understanding and improving early stopping for learning with noisy
no. 8, pp. 1475–1488, Aug. 2020, doi: 10.1109/TKDE.2019.2909204. labels,’’ 2021, arXiv:2106.15853.
[119] O. Ertl, ‘‘SetSketch: Filling the gap between MinHash and Hyper- [141] Y.-W. Chen, C. Wang, A. Saied, and R. Zhuang, ‘‘ACE: Adaptive
LogLog,’’ Proc. VLDB Endowment, vol. 14, no. 11, pp. 2244–2257, constraint-aware early stopping in hyperparameter optimization,’’ 2022,
Jul. 2021, doi: 10.14778/3476249.3476276. arXiv:2208.02922.
[120] Y. William Yu and G. M. Weber, ‘‘HyperMinHash: MinHash in LogLog [142] Y. Matsubara, M. Levorato, and F. Restuccia, ‘‘Split computing and
space,’’ 2017, arXiv:1710.08436. early exiting for deep learning applications: Survey and research chal-
[121] T. Dunning, ‘‘The t-digest: Efficient estimates of distributions,’’ lenges,’’ ACM Comput. Surv., vol. 55, no. 5, pp. 1–30, Dec. 2022, doi:
Softw. Impacts, vol. 7, Feb. 2021, Art. no. 100049, doi: 10.1145/3527155.
10.1016/j.simpa.2020.100049. [143] S. Paguada, L. Batina, I. Buhan, and I. Armendariz, ‘‘Being patient and
[122] B. W. Ford. (2022). An Instruction Profiling Based Framework to Pro- persistent: Optimizing an early stopping strategy for deep learning in
mote Software Portability. Accessed: Mar. 6, 2023. [Online]. Available: profiled attacks,’’ 2021, arXiv:2111.14416.
https://digital.library.txstate.edu/handle/10877/15757 [144] E. Cetinic, T. Lipic, and S. Grgic, ‘‘Fine-tuning convolutional neu-
[123] A. Mercat, J. Bonnot, M. Pelcat, W. Hamidouche, and D. Menard, ral networks for fine art classification,’’ Expert Syst. Appl., vol. 114,
‘‘Exploiting computation skip to reduce energy consumption by approx- pp. 107–118, Dec. 2018, doi: 10.1016/j.eswa.2018.07.026.
imate computing, an HEVC encoder case study,’’ in Proc. Design, [145] E. Lattanzi, C. Contoli, and V. Freschi, ‘‘Do we need early exit networks in
Autom. Test Eur. Conf. Exhib. (DATE). Lausanne, Switzerland: IEEE, human activity recognition?’’ Eng. Appl. Artif. Intell., vol. 121, May 2023,
Mar. 2017, pp. 494–499, doi: 10.23919/DATE.2017.7927039. Art. no. 106035, doi: 10.1016/j.engappai.2023.106035.
[146] C. Yang and X. Ma, ‘‘Improving stability of fine-tuning pretrained
[124] Y. Lin, C. Sakr, Y. Kim, and N. Shanbhag, ‘‘PredictiveNet:
language models via component-wise gradient norm clipping,’’ 2022,
An energy-efficient convolutional neural network via zero prediction,’’
arXiv:2210.10325.
in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2017, pp. 1–4, doi:
[147] F. Liu, X. Huang, Y. Chen, and J. A. K. Suykens, ‘‘Random features
10.1109/ISCAS.2017.8050797.
for kernel approximation: A survey on algorithms, theory, and beyond,’’
[125] E. Eskandarnia, H. M. Al-Ammal, and R. Ksantini, ‘‘An embedded deep-
2020, arXiv:2004.11154.
clustering-based load profiling framework,’’ Sustain. Cities Soc., vol. 78,
[148] A. De Marchi, A. Dreves, M. Gerdts, S. Gottschalk, and S. Rogovs,
Mar. 2022, Art. no. 103618, doi: 10.1016/j.scs.2021.103618.
‘‘A function approximation approach for parametric optimization,’’
[126] S. Li, S. Park, and S. Mahlke, ‘‘Sculptor: Flexible approximation with J. Optim. Theory Appl., vol. 196, no. 1, pp. 56–77, Jan. 2023, doi:
selective dynamic loop perforation,’’ in Proc. Int. Conf. Supercomputing, 10.1007/s10957-022-02138-4.
vol. 156. New York, NY, USA: ACM, Jun. 2018, pp. 341–351, doi: [149] D. Dũng and V. K. Nguyen, ‘‘Deep ReLU neural networks in high-
10.1145/3205289.3205317. dimensional approximation,’’ Neural Netw., vol. 142, pp. 619–635,
[127] D. Maier and B. Juurlink, ‘‘Model-based loop perforation,’’ in Oct. 2021, doi: 10.1016/j.neunet.2021.07.027.
Proc. Eur. Conf. Parallel Process., Lisbon, Portugal. Berlin, [150] Z. Zainuddin and O. Pauline, ‘‘Function approximation using artificial
Germany: Springer, Aug. 2021, pp. 549–554, doi: 10.1007/978-3-031- neural networks,’’ WSEAS Trans. Math., vol. 7, no. 6, pp. 333–338,
06156-1_48. Jun. 2008.
[128] H. Omar, M. Ahmad, and O. Khan, ‘‘GraphTuner: An input dependence [151] T. De Ryck, S. Lanthaler, and S. Mishra, ‘‘On the approximation of
aware loop perforation scheme for efficient execution of approximated functions by tanh neural networks,’’ Neural Netw., vol. 143, pp. 732–750,
graph algorithms,’’ in Proc. IEEE Int. Conf. Comput. Design (ICCD), Nov. 2021, doi: 10.1016/j.neunet.2021.08.015.
Nov. 2017, pp. 201–208, doi: 10.1109/ICCD.2017.38. [152] S. S. Sawant, M. Wiedmann, S. Göb, N. Holzer, E. W. Lang, and T. Götz,
[129] O. Kislal and M. T. Kandemir, ‘‘Data access skipping for recursive ‘‘Compression of deep convolutional neural network using additional
partitioning methods,’’ Comput. Lang., Syst. Struct., vol. 53, pp. 143–162, importance-weight-based filter pruning approach,’’ Appl. Sci., vol. 12,
Sep. 2018, doi: 10.1016/j.cl.2018.03.003. no. 21, p. 11184, Nov. 2022, doi: 10.3390/app122111184.
[130] V. Y. Raparti and S. Pasricha, ‘‘Approximate NoC and memory [153] C.-T. Huang, J.-C. Chen, and J.-L. Wu, ‘‘Learning sparse neural net-
controller architectures for GPGPU accelerators,’’ IEEE Trans. Par- works through mixture-distributed regularization,’’ in Proc. IEEE/CVF
allel Distrib. Syst., vol. 31, no. 5, pp. 25–39, May 2020, doi: Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020,
10.1109/TPDS.2019.2958344. pp. 2968–2977, doi: 10.1109/CVPRW50498.2020.00355.

VOLUME 12, 2024 146077


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

[154] C. Louizos, M. Welling, and D. P. Kingma, ‘‘Learning sparse neural [174] A. Tragoudaras, P. Stoikos, K. Fanaras, A. Tziouvaras, G. Floros,
networks through L0 regularization,’’ 2017, arXiv:1712.01312. G. Dimitriou, K. Kolomvatsos, and G. Stamoulis, ‘‘Design space explo-
[155] J. Luo, Y. Gan, C.-M. Vong, C.-M. Wong, and C. Chen, ‘‘Scalable ration of a sparse MobileNetV2 using high-level synthesis and sparse
and memory-efficient sparse learning for classification with approximate matrix techniques on FPGAs,’’ Sensors, vol. 22, no. 12, p. 4318,
Bayesian regularization priors,’’ Neurocomputing, vol. 457, pp. 106–116, Jun. 2022, doi: 10.3390/s22124318.
Oct. 2021, doi: 10.1016/j.neucom.2021.06.025. [175] P. Pinto and J. M. P. Cardoso, ‘‘A methodology and framework for
[156] Z. Izzo, M. A. Smart, K. Chaudhuri, and J. Zou, ‘‘Approxi- software memoization of functions,’’ in Proc. 18th ACM Int. Conf. Com-
mate data deletion from machine learning models,’’ in Proc. 24th put. Frontiers, vol. 13. New York, NY, USA: ACM, May 2021,
Int. Conf. Artif. Intell. Statist., 2011, pp. 2008–2016. Accessed: pp. 93–101, doi: 10.1145/3457388.3458668.
Jun. 30, 2023. [176] I. Brumar, M. Casas, M. Moreto, M. Valero, and G. S. Sohi, ‘‘ATM:
[157] S. Park, J. Lee, S. Mo, and J. Shin, ‘‘Lookahead: A far-sighted alternative Approximate task memoization in the runtime system,’’ in Proc. IEEE
of magnitude-based pruning,’’ 2020, arXiv:2002.04809. Int. Parallel Distrib. Process. Symp. (IPDPS), May 2017, pp. 1140–1150,
[158] N. Lee, T. Ajanthan, and P. H. S. Torr, ‘‘SNIP: Single- doi: 10.1109/IPDPS.2017.49.
shot network pruning based on connection sensitivity,’’ 2018, [177] A. Suresh, E. Rohou, and A. Seznec, ‘‘Compile-time function memoiza-
arXiv:1810.02340. tion,’’ in Proc. 26th Int. Conf. Compiler Construct. New York, NY, USA:
[159] X. Xiao, Z. Wang, and S. Rajasekaran, ‘‘AutoPrune: Automatic network ACM, Feb. 2017, pp. 45–54, doi: 10.1145/3033019.3033024.
pruning by regularizing auxiliary parameters,’’ in Proc. Adv. Neural [178] G. Zhang and D. Sanchez, ‘‘Leveraging hardware caches for memoiza-
Inf. Process. Syst., Red Hook, NY, USA: Curran Associates, Inc., tion,’’ IEEE Comput. Archit. Lett., vol. 17, no. 1, pp. 59–63, Jan. 2018,
2019, pp. 1–17. Accessed: Mar. 19, 2023. [Online]. Available: doi: 10.1109/LCA.2017.2762308.
https://proceedings.neurips.cc/paper/2019/hash/4efc9e02abdab6b61 [179] G. Tziantzioulis, N. Hardavellas, and S. Campanoni, ‘‘Temporal approx-
66251918570a307-Abstract.html imate function memoization,’’ IEEE Micro, vol. 38, no. 4, pp. 60–70,
[160] Z. Huang and N. Wang, ‘‘Data-driven sparse structure selection for deep Jul. 2018, doi: 10.1109/MM.2018.043191126.
neural networks,’’ 2017, arXiv:1707.01213. [180] P. Arundhati, S. K. Jena, and S. K. Pani, ‘‘Approximate function memo-
[161] S. Yu, Z. Yao, A. Gholami, Z. Dong, S. Kim, M. W. Mahoney, and ization,’’ Concurrency Comput., Pract. Exper., vol. 34, no. 23, p. e7204,
K. Keutzer, ‘‘Hessian-aware pruning and optimal neural implant,’’ 2021, Oct. 2022, doi: 10.1002/cpe.7204.
arXiv:2101.08940. [181] Z. Liu, A. Yazdanbakhsh, D. K. Wang, H. Esmaeilzadeh, and N. S. Kim,
[162] Y. Fu, C. Liu, D. Li, Z. Zhong, X. Sun, J. Zeng, and Y. Yao, ‘‘Explor- ‘‘AxMemo: Hardware-compiler co-design for approximate code
ing structural sparsity of deep networks via inverse scale spaces,’’ memoization,’’ in Proc. ACM/IEEE 46th Annu. Int. Symp. Com-
IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 1749–1765, put. Archit. (ISCA). New York, NY, USA: ACM, Jun. 2019,
Feb. 2023, doi: 10.1109/TPAMI.2022.3168881. pp. 685–697.
[163] N. J. Kim and H. Kim, ‘‘AGT: Channel pruning using adaptive [182] S. Bubeck, R. Eldan, Y. T. Lee, and D. Mikulincer, ‘‘Network size
146078 VOLUME 12, 2024


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

[303] F. Ferdaus, B. M. S. B. Talukder, and M. T. Rahman, ‘‘Approximate [321] M. H. Ahmadilivani, M. Barbareschi, S. Barone, A. Bosio,
MRAM: High-performance and power-efficient computing with MRAM M. Daneshtalab, S. D. Torca, G. Gavarini, M. Jenihhin, J. Raik,
chips for error-tolerant applications,’’ IEEE Trans. Comput., vol. 72, no. 3, A. Ruospo, E. Sanchez, and M. Taheri, ‘‘Special session:
pp. 668–681, Mar. 2023, doi: 10.1109/TC.2022.3174584. Approximation and fault resiliency of DNN accelerators,’’ in
Proc. IEEE 41st VLSI Test Symp. (VTS), Apr. 2023, pp. 1–10, doi:
[304] K. Kim, S.-J. Jang, J. Park, E. Lee, and S.-S. Lee, ‘‘Lightweight and
10.1109/VTS56346.2023.10140043.
energy-efficient deep learning accelerator for real-time object detection
[322] J. N. Mitchell, ‘‘Computer multiplication and division using binary loga-
on edge devices,’’ Sensors, vol. 23, no. 3, p. 1185, Jan. 2023, doi:
rithms,’’ IRE Trans. Electron. Comput., vols. EC–11, no. 4, pp. 512–517,
10.3390/s23031185.
Aug. 1962, doi: 10.1109/TEC.1962.5219391.
[305] J. Bonnot, A. Mercat, E. Nogues, and D. Ménard, ‘‘Approximate com-
[323] A. B. Kahng and S. Kang, ‘‘Accuracy-configurable adder for approximate
puting at the algorithmic level,’’ in Approximate Computing Techniques:
arithmetic designs,’’ in Proc. DAC Design Autom. Conf., Jun. 2012,
From Component- to Application-Level. Cham, Switzerland: Springer,
pp. 820–825, doi: 10.1145/2228360.2228509.
2022, pp. 109–142, doi: 10.1007/978-3-030-94705-7_5.
[324] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, ‘‘A low latency generic
[306] J. Murray, P. Wettin, P. P. Pande, and B. Shirazi, ‘‘Dynamic voltage and
accuracy configurable adder,’’ in Proc. 52nd ACM/EDAC/IEEE Design
frequency scaling,’’ in Sustainable Wireless Network-on-Chip Architec-
Autom. Conf. (DAC), Jun. 2015, pp. 1–6, doi: 10.1145/2744769.2744778.
tures, J. Murray, P. Wettin, P. P. Pande, and B. Shirazi, Eds., Boston,
[325] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, ‘‘Bio-inspired
MA, USA: Morgan Kaufmann, 2016, pp. 79–105, doi: 10.1016/B978-0-
imprecise computational blocks for efficient VLSI implementation of
12-803625-9.00014-5.
soft-computing applications,’’ IEEE Trans. Circuits Syst. I, Reg. Papers,
[307] H. Ali, U. U. Tariq, J. Hardy, X. Zhai, L. Lu, Y. Zheng, F. Bensaali, vol. 57, no. 4, pp. 850–862, Apr. 2010, doi: 10.1109/TCSI.2009.2027626.
A. Amira, K. Fatema, and N. Antonopoulos, ‘‘A survey on system
[326] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, ‘‘Low-power dig-
level energy optimisation for MPSoCs in IoT and consumer elec-
ital signal processing using approximate adders,’’ IEEE Trans. Comput.-
tronics,’’ Comput. Sci. Rev., vol. 41, Aug. 2021, Art. no. 100416, doi:
Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013,
10.1016/j.cosrev.2021.100416.
doi: 10.1109/TCAD.2012.2217962.
[308] A. Garg and P. Kulkarni, ‘‘Dynamic memory management for GPU- [327] K. Du, P. Varman, and K. Mohanram, ‘‘High performance
based training of deep neural networks,’’ in Proc. IEEE Int. Par- reliable variable latency carry select addition,’’ in Proc. Design,
allel Distrib. Process. Symp. (IPDPS), May 2019, pp. 200–209, doi: Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2012, pp. 1257–1262, doi:
10.1109/IPDPS.2019.00030. 10.1109/DATE.2012.6176685.
[309] S. Pandey, L. Siddhu, and P. R. Panda, ‘‘NeuroCool: Dynamic thermal [328] S. Singh and Y. B. Shukla, ‘‘Low power carry select adder using FinFET
management of 3D DRAM for deep neural networks through customized technology,’’ in Proc. 6th Int. Conf. Devices, Circuits Syst. (ICDCS),
prefetching,’’ ACM Trans. Design Autom. Electron. Syst., vol. 29, no. 1, Apr. 2022, pp. 152–155, doi: 10.1109/ICDCS54290.2022.9780840.
pp. 1–35, Jan. 2024, doi: 10.1145/3630012. [329] C. R. N. Praneeth, Ch. U. Kumari, T. Padma, and N. A. Vignesh,
[310] S. Li, Z. Zhang, R. Mao, J. Xiao, L. Chang, and J. Zhou, ‘‘Low-Energy-Consumption design: 16 bit block based carry
‘‘A fast and energy-efficient SNN processor with adaptive clock/event- speculative approximate adder,’’ in Proc. IEEE 3rd Global
driven computation scheme and online learning,’’ IEEE Trans. Circuits Conf. for Advancement Technol. (GCAT), Oct. 2022, pp. 1–4, doi:
Syst. I, Reg. Papers, vol. 68, no. 4, pp. 1543–1552, Apr. 2021, doi: 10.1109/GCAT55367.2022.9972197.
10.1109/TCSI.2021.3052885. [330] H. Ghabeli, A. S. Molahosseini, A. A. E. Zarandi, and L. Sousa, ‘‘Variable
[311] R. N. Tadros and P. A. Beerel, ‘‘A robust and self-adaptive clocking latency carry speculative adders with input-based dynamic configura-
technique for SFQ circuits,’’ IEEE Trans. Appl. Supercond., vol. 28, no. 7, tion,’’ Comput. Electr. Eng., vol. 93, Jul. 2021, Art. no. 107247, doi:
pp. 1–11, Oct. 2018, doi: 10.1109/TASC.2018.2856836. 10.1016/j.compeleceng.2021.107247.
[312] S. Vangal, S. Paul, S. Hsu, A. Agarwal, R. Krishnamurthy, J. Tschanz, [331] A. Najafi, M. Weißbrich, G. Payá-Vayá, and A. Garcia-Ortiz, ‘‘Coherent
and V. De, ‘‘Near-threshold voltage design techniques for heterogenous design of hybrid approximate adders: Unified design framework and met-
manycore system-on-chips,’’ J. Low Power Electron. Appl., vol. 10, no. 2, rics,’’ IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 4, pp. 736–745,
p. 16, May 2020, doi: 10.3390/jlpea10020016. Dec. 2018, doi: 10.1109/JETCAS.2018.2833284.
[313] J. Baik, J. Lee, and K. Kang, ‘‘Task migration and scheduler for mixed- [332] G. Giustolisi and G. Palumbo, ‘‘Hybrid full adders: Optimized design,
criticality systems,’’ Sensors, vol. 22, no. 5, p. 1926, Mar. 2022, doi: critical review and comparison in the energy-delay space,’’ Electronics,
10.3390/s22051926. vol. 11, no. 19, p. 3220, Oct. 2022, doi: 10.3390/electronics11193220.

146082 VOLUME 12, 2024


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

[333] H.-R. basireddy, K. Challa, and T. Nikoubin, ‘‘Hybrid logical effort for [352] R. Roy, J. Raiman, N. Kant, I. Elkin, R. Kirby, M. Siu, S. Oberman,
hybrid logic style full adders in multistage structures,’’ IEEE Trans. Very S. Godil, and B. Catanzaro, ‘‘PrefixRL: Optimization of parallel
Large Scale Integr. (VLSI) Syst., vol. 27, no. 5, pp. 1138–1147, May 2019, prefix circuits using deep reinforcement learning,’’ in Proc. 58th
doi: 10.1109/TVLSI.2018.2889833. ACM/IEEE Design Autom. Conf. (DAC). Piscataway, NJ, USA:
[334] A. Nandal and M. Kumar, ‘‘Design and implementation of CMOS IEEE Press, Dec. 2021, pp. 853–858, doi: 10.1109/DAC18074.2021.
full adder circuit with ECRL and sleepy keeper technique,’’ in 9586094.
Proc. Int. Conf. Adv. Comput., Commun. Control Netw. (ICACCCN), [353] P. Balasubramanian, R. Nayar, and D. L. Maskell, ‘‘Gate-level static
Oct. 2018, pp. 733–738, doi: 10.1109/ICACCCN.2018.8748336. approximate adders: A comparative analysis,’’ Electronics, vol. 10,
[335] M. Agarwal, N. Agrawal, and Md. A. Alam, ‘‘A new design of no. 23, p. 2917, Nov. 2021, doi: 10.3390/electronics10232917.
low power high speed hybrid CMOS full adder,’’ in Proc. Int. [354] S. Xu and B. C. Schafer, ‘‘Exposing approximate computing optimiza-
Conf. Signal Process. Integr. Netw. (SPIN), Feb. 2014, pp. 448–452, doi: tions at different levels: From behavioral to gate-level,’’ IEEE Trans. Very
10.1109/SPIN.2014.6776995. Large Scale Integr. (VLSI) Syst., vol. 25, no. 11, pp. 3077–3088,
[336] A. M. Hassani, M. Rezaalipour, and M. Dehyadegari, ‘‘A novel ultra Nov. 2017, doi: 10.1109/TVLSI.2017.2735299.
low power accuracy configurable adder at transistor level,’’ in Proc. 8th [355] V. Camus, M. Cacciotti, J. Schlachter, and C. Enz, ‘‘Design of
Int. Conf. Comput. Knowl. Eng. (ICCKE), Oct. 2018, pp. 165–170, doi: approximate circuits by fabrication of false timing paths: The
10.1109/ICCKE.2018.8566643. carry cut-back adder,’’ IEEE J. Emerg. Sel. Topics Circuits Syst.,
[337] M. Kumar, S. K. Arya, and S. Pandey, ‘‘Single bit full adder design using 8 vol. 8, no. 4, pp. 746–757, Dec. 2018, doi: 10.1109/JETCAS.2018.
transistors with novel 3 transistors XNOR gate,’’ 2012, arXiv:1201.1966. 2851749.
[338] P. Kumar and R. K. Sharma, ‘‘Low voltage high performance hybrid full [356] M. Pashaeifar, M. Kamal, A. Afzali-Kusha, and M. Pedram,
adder,’’ Eng. Sci. Technol., Int. J., vol. 19, no. 1, pp. 559–565, Mar. 2016, ‘‘Approximate reverse carry propagate adder for energy-efficient
doi: 10.1016/j.jestch.2015.10.001. DSP applications,’’ IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,
[339] M. Hasan, Md. J. Hossein, M. Hossain, H. U. Zaman, and S. Islam, vol. 26, no. 11, pp. 2530–2541, Nov. 2018, doi: 10.1109/TVLSI.2018.
‘‘Design of a scalable low-power 1-Bit hybrid full adder for fast 2859939.
computation,’’ IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, [357] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra,
no. 8, pp. 1464–1468, Aug. 2020, doi: 10.1109/TCSII.2019. ‘‘Approximate multipliers based on new approximate compressors,’’
2940558. IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4169–4182,
[340] J. Kandpal, A. Tomar, M. Agarwal, and K. K. Sharma, ‘‘High-speed Dec. 2018, doi: 10.1109/TCSI.2018.2839266.
hybrid-logic full adder using high-performance 10-T XOR–XNOR cell,’’ [358] F. Sabetzadeh, M. H. Moaiyeri, and M. Ahmadinejad, ‘‘An ultra-
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 6, efficient approximate multiplier with error compensation for error-
pp. 1413–1422, Jun. 2020, doi: 10.1109/TVLSI.2020.2983850. resilient applications,’’ IEEE Trans. Circuits Syst. II, Exp. Briefs,
[341] V. Gupta, D. Mohapatra, S. P. Park, A. Raghunathan, and K. Roy, vol. 70, no. 2, pp. 776–780, Feb. 2023, doi: 10.1109/TCSII.2022.
‘‘IMPACT: IMPrecise adders for low-power approximate computing,’’ in 3215065.
Proc. IEEE/ACM Int. Symp. Low Power Electron. Design, Aug. 2011, [359] F. Farshchi, M. S. Abrishami, and S. M. Fakhraie, ‘‘New approximate
pp. 409–414, doi: 10.1109/ISLPED.2011.5993675. multiplier for low power digital signal processing,’’ in Proc. 17th CSI
[342] H. Naseri and S. Timarchi, ‘‘Low-power and fast full adder by explor- Int. Symp. Comput. Archit. Digit. Syst. (CADS), Oct. 2013, pp. 25–30,
ing new XOR and XNOR gates,’’ IEEE Trans. Very Large Scale doi: 10.1109/CADS.2013.6714233.
Integr. (VLSI) Syst., vol. 26, no. 8, pp. 1481–1493, Aug. 2018, doi: [360] A. S. Roy, H. Agrawal, and A. S. Dhar, ‘‘ACBAM-accuracy-
10.1109/TVLSI.2018.2820999. configurable sign inclusive broken array booth multiplier design,’’ IEEE
[343] T. Nirmalraj, S. K. Pandiyan, R. K. Karan, R. Sivaraman, and Trans. Emerg. Topics Comput., vol. 10, no. 4, pp. 2072–2078, Oct. 2022,
R. Amirtharajan, ‘‘Design of low-power 10-transistor full adder using doi: 10.1109/TETC.2021.3107509.
GDI technique for energy-efficient arithmetic applications,’’ Circuits, [361] Z. Wang, G. A. Jullien, and W. C. Miller, ‘‘A new design technique for
Syst., Signal Process., vol. 42, no. 6, pp. 3649–3667, Jan. 2023, doi: column compression multipliers,’’ IEEE Trans. Comput., vol. 44, no. 8,
10.1007/s00034-022-02287-x. pp. 962–970, Aug. 1995, doi: 10.1109/12.403712.
[344] A. Bhargav and P. Huynh, ‘‘Design and analysis of low-power and high [362] P. J. Edavoor, S. Raveendran, and A. D. Rahulkar, ‘‘Approximate
speed approximate adders using CNFETs,’’ Sensors, vol. 21, no. 24, multiplier design using novel dual-stage 4:2 compressors,’’ IEEE
p. 8203, Dec. 2021, doi: 10.3390/s21248203. Access, vol. 8, pp. 48337–48351, 2020, doi: 10.1109/ACCESS.2020.
[345] J. Lee, H. Seo, H. Seok, and Y. Kim, ‘‘A novel approximate 2978773.
adder design using error reduced carry prediction and constant [363] R. Dornelles, G. Paim, B. Silveira, M. Fonseca, E. Costa, and S. Bampi,
truncation,’’ IEEE Access, vol. 9, pp. 119939–119953, 2021, doi: ‘‘A power-efficient 4-2 adder compressor topology,’’ in Proc. 15th IEEE
10.1109/ACCESS.2021.3108443. Int. New Circuits Syst. Conf. (NEWCAS), Jun. 2017, pp. 281–284, doi:
[346] K. Chen, W. Liu, J. Han, and F. Lombardi, ‘‘Profile-based output error 10.1109/NEWCAS.2017.8010160.
compensation for approximate arithmetic circuits,’’ IEEE Trans. Circuits [364] W.-C. Yeh and C.-W. Jen, ‘‘High-speed booth encoded parallel multiplier
Syst. I, Reg. Papers, vol. 67, no. 12, pp. 4707–4718, Dec. 2020, doi: design,’’ IEEE Trans. Comput., vol. 49, no. 7, pp. 692–701, Jul. 2000, doi:
10.1109/TCSI.2020.2996567. 10.1109/12.863039.
[347] P. Albicocco, G. C. Cardarilli, A. Nannarelli, M. Petricca, and [365] Y. Zhu, W. Liu, P. Yin, T. Cao, J. Han, and F. Lombardi, ‘‘Design,
M. Re, ‘‘Imprecise arithmetic for low power image processing,’’ evaluation and application of approximate-truncated booth multipliers,’’
in Proc. Conf. Rec. 46th Asilomar Conf. Signals, Syst. Com- IET Circuits, Devices Syst., vol. 14, no. 8, pp. 1305–1317, Nov. 2020, doi:
put. (ASILOMAR), Nov. 2012, pp. 983–987, doi: 10.1109/ACSSC.2012. 10.1049/iet-cds.2019.0398.
6489164. [366] M. H. Haider, H. Zhang, and S.-B. Ko, ‘‘Decoder reduction approxima-
[348] P. Balasubramanian and D. L. Maskell, ‘‘Hardware optimized and tion scheme for booth multipliers,’’ IEEE Trans. Comput., vol. 73, no. 3,
error reduced approximate adder,’’ Electronics, vol. 8, no. 11, p. 1212, pp. 735–746, Mar. 2024, doi: 10.1109/tc.2023.3343093.
Oct. 2019, doi: 10.3390/electronics8111212. [367] P. Kulkarni, P. Gupta, and M. Ercegovac, ‘‘Trading accuracy
[349] P. Balasubramanian, R. Nayar, D. L. Maskell, and N. E. Mastorakis, for power with an underdesigned multiplier architecture,’’ in
‘‘An approximate adder with a near-normal error distribution: Design, Proc. 24th Int. Conf. VLSI Design, Jan. 2011, pp. 346–351, doi:
error analysis and practical application,’’ IEEE Access, vol. 9, 10.1109/VLSID.2011.51.
pp. 4518–4530, 2021, doi: 10.1109/ACCESS.2020.3047651. [368] H. Waris, C. Wang, W. Liu, J. Han, and F. Lombardi, ‘‘Hybrid par-
[350] H. Seo, Y. S. Yang, and Y. Kim, ‘‘Design and analysis of an approximate tial product-based high-performance approximate recursive multipliers,’’
adder with hybrid error reduction,’’ Electronics, vol. 9, no. 3, p. 471, IEEE Trans. Emerg. Topics Comput., vol. 10, no. 1, pp. 507–513,
Mar. 2020, doi: 10.3390/electronics9030471. Jan. 2022, doi: 10.1109/TETC.2020.3013977.
[351] M. M. A. D. Rosa, G. Paim, P. U. L. D. Costa, E. A. C. D. Costa, [369] S. Rehman, W. El-Harouni, M. Shafique, A. Kumar, J. Henkel, and
R. I. Soares, and S. Bampi, ‘‘AxPPA: Approximate parallel prefix J. Henkel, ‘‘Architectural-space exploration of approximate multipli-
adders,’’ IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 31, no. 1, ers,’’ in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD),
pp. 17–28, Jan. 2023, doi: 10.1109/TVLSI.2022.3218021. Nov. 2016, pp. 1–8, doi: 10.1145/2966986.2967005.

VOLUME 12, 2024 146083


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

[370] S. Venkatachalam and S.-B. Ko, ‘‘Design of power and area [387] O. I. Bureneva and O. U. Kaidanovich, ‘‘FPGA-based hardware
efficient approximate multipliers,’’ IEEE Trans. Very Large Scale implementation of fixed-point division using Newton–Raphson
Integr. (VLSI) Syst., vol. 25, no. 5, pp. 1782–1786, May 2017, doi: method,’’ in Proc. IV Int. Conf. Neural Netw. Neurotechnologies
10.1109/TVLSI.2016.2643639. (NeuroNT), Jun. 2023, pp. 45–47, doi: 10.1109/NeuroNT58640.2023.
[371] S. Ullah, S. S. Sahoo, N. Ahmed, D. Chaudhury, and A. Kumar, 10175844.
‘‘AppAxO: Designing application-specific approximate operators for [388] Z. Ebrahimi, M. Zaid, M. Wijtvliet, and A. Kumar, ‘‘RAPID:
FPGA-based embedded systems,’’ ACM Trans. Embedded Com- Approximate pipelined soft multipliers and dividers for high through-
put. Syst., vol. 21, no. 3, pp. 1–31, May 2022, doi: 10.1145/ put and energy efficiency,’’ IEEE Trans. Comput.-Aided Design
3513262. Integr. Circuits Syst., vol. 42, no. 3, pp. 712–725, Mar. 2023, doi:
[372] T. Zhang, Z. Niu, and J. Han, ‘‘A brief review of logarithmic multiplier 10.1109/TCAD.2022.3184928.
designs,’’ in Proc. IEEE 23rd Latin Amer. Test Symp. (LATS), Sep. 2022, [389] H. Wang, K. Chen, B. Wu, C. Wang, W. Liu, and F. Lombardi, ‘‘HEADiv:
pp. 1–4, doi: 10.1109/LATS57337.2022.9936921. A high-accuracy energy-efficient approximate divider with error com-
[373] Y. Wu, C. Chen, W. Xiao, X. Wang, C. Wen, J. Han, X. Yin, pensation,’’ in Proc. 17th ACM Int. Symp. Nanosc. Architectures. New
W. Qian, and C. Zhuo, ‘‘A survey on approximate multiplier designs York, NY, USA: ACM, May 2023, pp. 1–6, doi: 10.1145/3565478.
for energy efficiency: From algorithms to circuits,’’ ACM Trans. Design 3572324.
Autom. Electron. Syst., vol. 29, no. 1, pp. 1–37, Jan. 2024, doi: 10.1145/ [390] C. Jha and J. Mekie, ‘‘Design of novel CMOS based inexact subtractors
3610291. and dividers for approximate computing: An in-depth comparison with
[374] R. Pilipovic, P. Bulic, and U. Lotric, ‘‘A two-stage operand trimming PTL based designs,’’ in Proc. 22nd Euromicro Conf. Digit. Syst. Design
approximate logarithmic multiplier,’’ IEEE Trans. Circuits Syst. I, (DSD), Aug. 2019, pp. 174–181, doi: 10.1109/DSD.2019.00034.
Reg. Papers, vol. 68, no. 6, pp. 2535–2545, Jun. 2021, doi: [391] W. Liu, T. Xu, J. Li, C. Wang, P. Montuschi, and F. Lombardi,
10.1109/TCSI.2021.3069168. ‘‘Design of unsigned approximate hybrid dividers based on restoring
[375] R. Makimoto, T. Imagawa, and H. Ochi, ‘‘Approximate logarith- array and logarithmic dividers,’’ IEEE Trans. Emerg. Topics Com-
mic multipliers using half compensation with two line segments,’’ in put., vol. 10, no. 1, pp. 339–350, Jan. 2022, doi: 10.1109/TETC.2020.
Proc. IEEE 36th Int. Syst.-Chip Conf. (SOCC), Sep. 2023, pp. 1–6, doi: 3022290.
10.1109/socc58585.2023.10256796. [392] Y. Wu, H. Jiang, Z. Ma, P. Gou, Y. Lu, J. Han, S. Yin, S. Wei, and
[376] S. Yu, M. Tasnim, and S. X.-D. Tan, ‘‘HEALM: Hardware- L. Liu, ‘‘An energy-efficient approximate divider based on logarithmic
efficient approximate logarithmic multiplier with reduced error,’’ conversion and piecewise constant approximation,’’ IEEE Trans. Cir-
in Proc. 27th Asia South Pacific Design Autom. Conf. (ASP- cuits Syst. I, Reg. Papers, vol. 69, no. 7, pp. 2655–2668, Jul. 2022, doi:
DAC), Jan. 2022, pp. 37–42, doi: 10.1109/ASP-DAC52403. 10.1109/TCSI.2022.3167894.
2022.9712543. [393] L. Chen, J. Han, W. Liu, P. Montuschi, and F. Lombardi, ‘‘Design,
[377] M. S. Ansari, H. Jiang, B. F. Cockburn, and J. Han, ‘‘Low-power evaluation and application of approximate high-radix dividers,’’ IEEE
approximate multipliers using encoded partial products and approx- Trans. Multi-Scale Comput. Syst., vol. 4, no. 3, pp. 299–312, Jul. 2018,
imate compressors,’’ IEEE J. Emerg. Sel. Topics Circuits Syst., doi: 10.1109/TMSCS.2018.2817608.
vol. 8, no. 3, pp. 404–416, Sep. 2018, doi: 10.1109/JETCAS.2018. [394] D. Hendrycks and K. Gimpel, ‘‘Gaussian error linear units (GELUs),’’
2832204. 2016, arXiv:1606.08415.
[378] P. Choudhary, L. Bhargava, M. Fujita, and V. Singh, ‘‘LUT-based arith- [395] Apply Gaussian Error Linear Unit (GELU) Activation—MATLAB
metic circuit approximation with formal guarantee on worst case relative Gelu. Accessed: Aug. 30, 2023. [Online]. Available:
error,’’ in Proc. IEEE 24th Latin Amer. Test Symp. (LATS), Mar. 2023, https://www.mathworks.com/help/deeplearning/ref/dlarray.gelu.html
pp. 1–2, doi: 10.1109/LATS58125.2023.10154494. [396] Papers With Code—GELU Explained. Accessed: Aug. 29, 2023. [Online].
[379] U. S. Patankar, M. E. Flores, and A. Koel, ‘‘Novel data dependent Available: https://paperswithcode.com/method/gelu
divider circuit block implementation for complex division and area crit- [397] P.-T.-P. Tang, ‘‘Table-driven implementation of the exponential
ical applications,’’ Sci. Rep., vol. 13, no. 1, Feb. 2023, Art. no. 1, doi: function in IEEE floating-point arithmetic,’’ ACM Trans.
10.1038/s41598-023-28343-3. Math. Softw., vol. 15, no. 2, pp. 144–157, Jun. 1989, doi: 10.1145/
[380] T. Deepa, P. Kaumudi, K. Sonali, and P. Saraswat, ‘‘Design and imple- 63522.214389.
mentation of approximate divider for error-resilient image processing [398] H. de Lassus Saint-Geniès, D. Defour, and G. Revy, ‘‘Exact lookup
applications,’’ in Proc. 2nd Int. Conf. Electr., Electron., Inf. Com- tables for the evaluation of trigonometric and hyperbolic functions,’’
mun. Technol. (ICEEICT), Apr. 2023, pp. 1–5, doi: 10.1109/ICEE- IEEE Trans. Comput., vol. 66, no. 12, pp. 2058–2071, Dec. 2017, doi:
ICT56924.2023.10157050. 10.1109/TC.2017.2703870.
[381] D. Piso, J. A. Pineiro, and J. D. Bruguera, ‘‘Analysis of the impact [399] A. G. M. Strollo, D. De Caro, and N. Petra, ‘‘Elementary func-
of different methods for division/square root computation in the tions hardware implementation using constrained piecewise-polynomial
performance of a superscalar microprocessor,’’ in Proc. Euromicro approximations,’’ IEEE Trans. Comput., vol. 60, no. 3, pp. 418–432,
Symp. Digit. Syst. Design. Archit., Methods Tools, vol. 39, Oct. 2002, Mar. 2011, doi: 10.1109/TC.2010.127.
pp. 218–225, doi: 10.1109/DSD.2002.1115372. [400] P. Nilsson, A. U. R. Shaik, R. Gangarajaiah, and E. Hertz,
[382] J. Oelund and S. Kim, ‘‘ILAFD: Accuracy-configurable floating-point ‘‘Hardware implementation of the exponential function using
divider using an approximate reciprocal and an iterative logarithmic Taylor series,’’ in Proc. NORCHIP, Oct. 2014, pp. 1–4, doi:
multiplier,’’ in Proc. Great Lakes Symp. VLSI. New York, NY, USA: 10.1109/NORCHIP.2014.7004740.
ACM, Jun. 2023, pp. 639–644, doi: 10.1145/3583781.3590262. [401] B. Xiong, Y. Sui, Z. Jia, S. Li, and Y. Chang, ‘‘Utilize the shift- and-add
[383] S. Vahdat, M. Kamal, A. Afzali-Kusha, M. Pedram, and Z. Navabi, architecture to approximate multiple nonlinear functions,’’ Int. J. Cir-
‘‘TruncApp: A truncation-based approximate divider for energy cuit Theory Appl., vol. 49, no. 7, pp. 2290–2297, Jul. 2021, doi:
efficient DSP applications,’’ in Proc. Design, Autom. Test 10.1002/cta.2994.
Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 1635–1638, doi: [402] B. Lakshmi and A. S. Dhar, ‘‘CORDIC architectures: A survey,’’ VLSI
10.23919/DATE.2017.7927254. Design, vol. 2010, pp. 1–19, Mar. 2010, doi: 10.1155/2010/794891.
[384] A. Shriram, A. Tiwari, U. A. Kumar, B. R. T. Karri, S. Veera- [403] J. E. Volder, ‘‘The CORDIC trigonometric computing technique,’’ IRE
machaneni, and S. E. Ahmed, ‘‘Power efficient approximate divider Trans. Electron. Comput., vol. EC-8, no. 3, pp. 330–334, Sep. 1959, doi:
architecture for error resilient applications,’’ in Proc. IEEE 6th 10.1109/TEC.1959.5222693.
Conf. Inf. Commun. Technol. (CICT), Nov. 2022, pp. 1–6, doi: [404] A. M. Dalloo, A. J. Humaidi, A. K. A. Mhdawi, and H. Al-Raweshidy,
10.1109/CICT56698.2022.9997960. ‘‘Low-power and low-latency hardware implementation of approx-
[385] S. Behroozi, J. Li, J. Melchert, and Y. Kim, ‘‘SAADI: A scalable accuracy imate hyperbolic and exponential functions for embedded system
approximate divider for dynamic energy-quality scaling,’’ in Proc. 24th applications,’’ IEEE Access, vol. 12, pp. 24151–24163, 2024, doi:
Asia South Pacific Design Autom. Conf. New York, NY, USA: ACM, 10.1109/access.2024.3364361.
Jan. 2019, pp. 481–486, doi: 10.1145/3287624.3287668. [405] Y. Liu and K. K. Parhi, ‘‘Computing hyperbolic tangent and
[386] P. Malík, ‘‘High throughput floating point exponential function imple- sigmoid functions using stochastic logic,’’ in Proc. 50th Asilomar
mented in FPGA,’’ in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Conf. Signals, Syst. Comput., Nov. 2016, pp. 1580–1585, doi:
Jul. 2015, pp. 97–100, doi: 10.1109/ISVLSI.2015.61. 10.1109/ACSSC.2016.7869645.

146084 VOLUME 12, 2024


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

[406] K. K. Parhi and Y. Liu, ‘‘Computing arithmetic functions using [423] G. Liu and Z. Zhang, ‘‘Statistically certified approximate logic
stochastic logic by series expansion,’’ IEEE Trans. Emerg. Topics Com- synthesis,’’ in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design
put., vol. 7, no. 1, pp. 44–59, Jan. 2019, doi: 10.1109/TETC.2016. (ICCAD), Nov. 2017, pp. 344–351, doi: 10.1109/ICCAD.2017.
2618750. 8203798.
[407] L. Huai, P. Li, G. E. Sobelman, and D. J. Lilja, ‘‘Stochastic com- [424] W. Zeng, A. Davoodi, and R. O. Topaloglu, ‘‘Sampling-based approx-
puting implementation of trigonometric and hyperbolic functions,’’ in imate logic synthesis: An explainable machine learning approach,’’
Proc. IEEE 12th Int. Conf. ASIC (ASICON), Oct. 2017, pp. 553–556, doi: in Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD),
10.1109/ASICON.2017.8252535. vol. 30. Piscataway, NJ, USA: IEEE Press, Nov. 2021, pp. 1–9, doi:
[408] L. Chen, F. Lombardi, J. Han, and W. Liu, ‘‘A fully parallel approximate 10.1109/ICCAD51958.2021.9643484.
CORDIC design,’’ in Proc. IEEE/ACM Int. Symp. Nanosc. Architec- [425] I. Scarabottolo, G. Ansaloni, and L. Pozzi, ‘‘Circuit carving: A method-
tures (NANOARCH), Jul. 2016, pp. 197–202, doi: 10.1145/2950067. ology for the design of approximate hardware,’’ in Proc. Design,
2950076. Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2018, pp. 545–550, doi:
[409] S. Rai and R. Srivastava, ‘‘FPGA realization of scale-free CORDIC 10.23919/DATE.2018.8342067.
algorithm-based window functions,’’ in Recent Trends in Commu- [426] J. Castro-Godínez, H. Barrantes-García, M. Shafique, and J. Henkel,
nication, Computing, and Electronics (Lecture Notes in Electrical ‘‘AxLS: A framework for approximate logic synthesis based on
Engineering), A. Khare, U. S. Tiwary, I. K. Sethi, and N. Singh, netlist transformations,’’ IEEE Trans. Circuits Syst. II, Exp. Briefs,
Eds., Singapore: Springer, 2019, pp. 245–257, doi: 10.1007/978-981-13- vol. 68, no. 8, pp. 2845–2849, Aug. 2021, doi: 10.1109/TCSII.2021.
2685-1_24. 3068757.
[410] L. Chen, J. Han, W. Liu, and F. Lombardi, ‘‘Algorithm and [427] L. Witschen, T. Wiersema, M. Artmann, and M. Platzner, ‘‘MUSCAT:
design of a fully parallel approximate coordinate rotation digital MUS-based circuit approximation technique,’’ in Proc. Design,
computer (CORDIC),’’ IEEE Trans. Multi-Scale Comput. Syst., Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2022, pp. 172–177, doi:
vol. 3, no. 3, pp. 139–151, Jul. 2017, doi: 10.1109/TMSCS. 10.23919/DATE54114.2022.9774604.
2017.2696003. [428] A. Chandrasekharan, M. Soeken, D. Große, and R. Drechsler,
[411] R. K. Yousif, I. A. Hashim, and B. H. Abd, ‘‘Low area FPGA implemen- ‘‘Approximation-aware rewriting of AIGs for error tolerant
tation of hyperbolic tangent function,’’ in Proc. 6th Int. Conf. Eng. Tech- applications,’’ in Proc. IEEE/ACM Int. Conf. Comput.-Aided
nol. Appl. (IICETA), vol. 3, Jul. 2023, pp. 596–602, doi: 10.1109/ Design (ICCAD), Nov. 2016, pp. 1–8, doi: 10.1145/2966986.
iiceta57613.2023.10351345. 2967003.
[412] F. Ortega-Zamorano, J. M. Jerez, G. Juárez, J. O. Pérez, and L. Franco, [429] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, ‘‘EvoAp-
‘‘High precision FPGA implementation of neural network activation prox8b: Library of approximate adders and multipliers for circuit
functions,’’ in Proc. IEEE Symp. Intell. Embedded Syst. (IES), Dec. 2014, design and benchmarking of approximation methods,’’ in Proc. Design,
pp. 55–60, doi: 10.1109/INTELES.2014.7008986. Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 258–261, doi:
[413] S. Sorayassa, M. Ahmadi, S. Sorayassa, and M. Ahmadi, ‘‘A mem- 10.23919/DATE.2017.7926993.
ory based approach for digital implementation of tanh using LUT [430] S. Hashemi, H. Tann, and S. Reda, ‘‘BLASYS: Approximate
and RALUT,’’ Comput. Sci. Inf. Technol., vol. 12, no. 22, Dec. 2022, logic synthesis using Boolean matrix factorization,’’ in Proc. 55th
Art. no. 22, doi: 10.5121/csit.2022.122204. ACM/ESDA/IEEE Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6, doi:
[414] F. de Dinechin and A. Tisserand, ‘‘Multipartite table methods,’’ 10.1109/DAC.2018.8465702.
IEEE Trans. Comput., vol. 54, no. 3, pp. 319–330, Mar. 2005, doi: 10.1109/TC.2005.54.
[415] A. Raha and V. Raghunathan, ‘‘qLUT: Input-aware quantized table lookup for energy-efficient approximate accelerators,’’ ACM Trans. Embedded Comput. Syst., vol. 16, no. 5s, pp. 1–23, Sep. 2017, doi: 10.1145/3126531.
[416] Y. Xie, A. N. Joseph Raj, Z. Hu, S. Huang, Z. Fan, and M. Joler, ‘‘A twofold lookup table architecture for efficient approximation of activation functions,’’ IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 12, pp. 2540–2550, Dec. 2020, doi: 10.1109/TVLSI.2020.3015391.
[417] Z. Hajduk and G. R. Dec, ‘‘Very high accuracy hyperbolic tangent function implementation in FPGAs,’’ IEEE Access, vol. 11, pp. 23701–23713, 2023, doi: 10.1109/ACCESS.2023.3253668.
[418] R. Yousif, I. Hashim, and B. Abd, ‘‘Implementation of hyperbolic sine and cosine functions based on FPGA using different approaches,’’ Eng. Technol. J., vol. 41, no. 8, pp. 1–16, Aug. 2023, doi: 10.30684/etj.2023.139756.1440.
[419] T.-K. Luong, V.-T. Nguyen, A.-T. Nguyen, and E. Popovici, ‘‘Efficient architectures and implementation of arithmetic functions approximation based stochastic computing,’’ in Proc. IEEE 30th Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2019, pp. 281–287, doi: 10.1109/ASAP.2019.00018.
[420] M. Osta, A. Ibrahim, and M. Valle, ‘‘FPGA implementation of approximate CORDIC circuits for energy efficient applications,’’ in Proc. 26th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Nov. 2019, pp. 127–128, doi: 10.1109/ICECS46596.2019.8964758.
[421] A. Changela, Y. Kumar, M. Woźniak, J. Shafi, and M. F. Ijaz, ‘‘Radix-4 CORDIC algorithm based low-latency and hardware efficient VLSI architecture for Nth root and Nth power computations,’’ Sci. Rep., vol. 13, no. 1, p. 20918, Nov. 2023, doi: 10.1038/s41598-023-47890-3.
[422] I. Scarabottolo, G. Ansaloni, G. A. Constantinides, L. Pozzi, and S. Reda, ‘‘Approximate logic synthesis: A survey,’’ Proc. IEEE, vol. 108, no. 12, pp. 2195–2213, Dec. 2020, doi: 10.1109/JPROC.2020.3014430.
[431] M. Rezaalipour, M. Biasion, I. Scarabottolo, G. A. Constantinides, and L. Pozzi, ‘‘A parametrizable template for approximate logic synthesis,’’ in Proc. 53rd Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. Workshops (DSN-W), Jun. 2023, pp. 175–178, doi: 10.1109/DSN-W58399.2023.00049.
[432] G. Ammes, W. L. Neto, P. Butzen, P.-E. Gaillardon, and R. P. Ribas, ‘‘A two-level approximate logic synthesis combining cube insertion and removal,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 11, pp. 5126–5130, Nov. 2022, doi: 10.1109/TCAD.2022.3143489.
[433] Y. Wu and W. Qian, ‘‘ALFANS: Multilevel approximate logic synthesis framework by approximate node simplification,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 7, pp. 1470–1483, Jul. 2020, doi: 10.1109/TCAD.2019.2915328.
[434] M. Barbareschi, S. Barone, N. Mazzocca, and A. Moriconi, ‘‘A catalog-based AIG-rewriting approach to the design of approximate components,’’ IEEE Trans. Emerg. Topics Comput., vol. 11, no. 1, pp. 70–81, Jan. 2023, doi: 10.1109/TETC.2022.3170502.
[435] C. Meng, W. Qian, and A. Mishchenko, ‘‘ALSRAC: Approximate logic synthesis by resubstitution with approximate care set,’’ in Proc. 57th ACM/IEEE Design Autom. Conf. (DAC), Jul. 2020, pp. 1–6, doi: 10.1109/DAC18072.2020.9218627.
[436] K. Nepal, S. Hashemi, H. Tann, R. I. Bahar, and S. Reda, ‘‘Automated high-level generation of low-power approximate computing circuits,’’ IEEE Trans. Emerg. Topics Comput., vol. 7, no. 1, pp. 18–30, Jan. 2019, doi: 10.1109/TETC.2016.2598283.
[437] M. T. Leipnitz and G. L. Nazar, ‘‘Constraint-aware multi-technique approximate high-level synthesis for FPGAs,’’ ACM Trans. Reconfigurable Technol. Syst., vol. 16, no. 4, pp. 1–28, Oct. 2023, doi: 10.1145/3624481.
[438] J. Castro-Godínez, J. Mateus-Vargas, M. Shafique, and J. Henkel, ‘‘AxHLS: Design space exploration and high-level synthesis of approximate accelerators using approximate functional units and analytical models,’’ in Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD). New York, NY, USA: ACM, Nov. 2020, pp. 1–9.

VOLUME 12, 2024 146085


A. M. Dalloo et al.: AC: Concepts, Architectures, Challenges, Applications, and Future Directions

[439] R. Ranjan, S. Ullah, S. S. Sahoo, and A. Kumar, ‘‘SyFAxO-GeN: Synthesizing FPGA-based approximate operators with generative networks,’’ in Proc. 28th Asia South Pacific Design Autom. Conf. (ASP-DAC). New York, NY, USA: ACM, Jan. 2023, pp. 402–409, doi: 10.1145/3566097.3567891.
[440] Y. Wu, C. Shen, Y. Jia, and W. Qian, ‘‘Approximate logic synthesis for FPGA by wire removal and local function change,’’ in Proc. 22nd Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2017, pp. 163–169, doi: 10.1109/ASPDAC.2017.7858314.
[441] G. Pasandi, S. Pratty, and J. Forsyth, ‘‘AISYN: AI-driven reinforcement learning-based logic synthesis framework,’’ 2023, arXiv:2302.06415.
[442] G. Pasandi, M. Peterson, M. Herrera, S. Nazarian, and M. Pedram, ‘‘Deep-PowerX: A deep learning-based framework for low-power approximate logic synthesis,’’ in Proc. ACM/IEEE Int. Symp. Low Power Electron. Design. New York, NY, USA: ACM, Aug. 2020, pp. 73–78, doi: 10.1145/3370748.3406555.
[443] C.-T. Lee, Y.-T. Li, Y.-C. Chen, and C.-Y. Wang, ‘‘Approximate logic synthesis by genetic algorithm with an error rate guarantee,’’ in Proc. 28th Asia South Pacific Design Autom. Conf. (ASP-DAC). New York, NY, USA: ACM, Jan. 2023, pp. 146–151, doi: 10.1145/3566097.3567890.
[444] R. Kalkreuth, Z. Vašíček, J. Husa, D. Vermetten, F. Ye, and T. Bäck, ‘‘Towards a general Boolean function benchmark suite,’’ in Proc. Companion Conf. Genetic Evol. Comput., vol. 2. New York, NY, USA: ACM, Jul. 2023, pp. 591–594, doi: 10.1145/3583133.3590685.
[445] L. Sekanina, Z. Vasicek, and V. Mrazek, ‘‘Automated search-based functional approximation for digital circuits,’’ in Approximate Circuits: Methodologies and CAD, S. Reda and M. Shafique, Eds., Cham, Switzerland: Springer, 2019, pp. 175–203, doi: 10.1007/978-3-319-99322-5_9.
[446] Z. Vasicek and L. Sekanina, ‘‘Evolutionary approach to approximate digital circuits design,’’ IEEE Trans. Evol. Comput., vol. 19, no. 3, pp. 432–444, Jun. 2015, doi: 10.1109/TEVC.2014.2336175.
[447] S. Su, Y. Wu, and W. Qian, ‘‘Efficient batch statistical error estimation for iterative multi-level approximate logic synthesis,’’ in Proc. 55th ACM/ESDA/IEEE Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6, doi: 10.1109/DAC.2018.8465838.
[448] S. Su, C. Meng, F. Yang, X. Shen, L. Ni, W. Wu, Z. Wu, J. Zhao, and W. Qian, ‘‘VECBEE: A versatile efficiency–accuracy configurable batch error estimation method for greedy approximate logic synthesis,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 11, pp. 5085–5099, Nov. 2022, doi: 10.1109/TCAD.2022.3149717.
[449] M. Rezaalipour, L. Ferretti, I. Scarabottolo, G. A. Constantinides, and L. Pozzi, ‘‘Multi-metric SMT-based evaluation of worst-case-error for approximate circuits,’’ in Proc. 53rd Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. Workshops (DSN-W), Jun. 2023, pp. 199–202, doi: 10.1109/DSN-W58399.2023.00055.
[450] Z. Vasicek, ‘‘Formal methods for exact analysis of approximate circuits,’’ IEEE Access, vol. 7, pp. 177309–177331, 2019, doi: 10.1109/ACCESS.2019.2958605.
[451] G. Ammes, P. F. Butzen, A. I. Reis, and R. Ribas, ‘‘Two-level and multilevel approximate logic synthesis,’’ J. Integr. Circuits Syst., vol. 17, no. 3, pp. 1–14, Dec. 2022, doi: 10.29292/jics.v17i3.661.
[452] P. Choudhary, L. Bhargava, V. Singh, and A. Kumar Suhag, ‘‘Approximate computing: Evolutionary methods for functional approximation of digital circuits,’’ Mater. Today, Proc., vol. 66, pp. 3487–3492, Jan. 2022, doi: 10.1016/j.matpr.2022.06.386.
[453] A. Raha and V. Raghunathan, ‘‘Approximating beyond the processor: Exploring full-system energy-accuracy tradeoffs in a smart camera system,’’ IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 12, pp. 2884–2897, Dec. 2018, doi: 10.1109/TVLSI.2018.2864269.
[454] S. Hashemi, H. Tann, F. Buttafuoco, and S. Reda, ‘‘Approximate computing for biometric security systems: A case study on iris scanning,’’ in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2018, pp. 319–324, doi: 10.23919/DATE.2018.8342029.
[455] B. Srinivas Prabakaran, V. Mrazek, Z. Vasicek, L. Sekanina, and M. Shafique, ‘‘Xel-FPGAs: An end-to-end automated exploration framework for approximate accelerators in FPGA-based systems,’’ 2023, arXiv:2303.04734.
[456] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, ‘‘FINN: A framework for fast, scalable binarized neural network inference,’’ in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays. New York, NY, USA: ACM, Feb. 2017, pp. 65–74, doi: 10.1145/3020078.3021744.
[457] N. R. Shanbhag, N. Verma, Y. Kim, A. D. Patil, and L. R. Varshney, ‘‘Shannon-inspired statistical computing for the nanoscale era,’’ Proc. IEEE, vol. 107, no. 1, pp. 90–107, Jan. 2019, doi: 10.1109/JPROC.2018.2869867.
[458] H. Kim and N. R. Shanbhag, ‘‘Enhancing the accuracy of 6T SRAM-based in-memory architecture via maximum likelihood detection,’’ IEEE Trans. Signal Process., vol. 72, pp. 2799–2811, 2024, doi: 10.1109/TSP.2024.3394656.
[459] A. Pantazi, B. Rajendran, O. Simeone, and E. Neftci, ‘‘Editorial: Neuro-inspired computing for next-gen AI: Computing model, architectures and learning algorithms,’’ Frontiers Neurosci., vol. 16, Jul. 2022, Art. no. 974627, doi: 10.3389/fnins.2022.974627.
[460] A. Mehonic and A. J. Kenyon, ‘‘Brain-inspired computing needs a master plan,’’ Nature, vol. 604, no. 7905, pp. 255–260, Apr. 2022, doi: 10.1038/s41586-021-04362-w.
[461] S. Yu, ‘‘Neuro-inspired computing with emerging nonvolatile memorys,’’ Proc. IEEE, vol. 106, no. 2, pp. 260–285, Feb. 2018, doi: 10.1109/JPROC.2018.2790840.
[462] S. Sen, S. Venkataramani, and A. Raghunathan, ‘‘Approximate computing for spiking neural networks,’’ in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 193–198, doi: 10.23919/DATE.2017.7926981.
[463] A. R. Nasser, A. M. Hasan, A. J. Humaidi, A. Alkhayyat, L. Alzubaidi, M. A. Fadhel, J. Santamaría, and Y. Duan, ‘‘IoT and cloud computing in health-care: A new wearable device and cloud-based deep learning algorithm for monitoring of diabetes,’’ Electronics, vol. 10, no. 21, p. 2719, Nov. 2021, doi: 10.3390/electronics10212719.
[464] S. K. Ghosh, A. Raha, and V. Raghunathan, ‘‘Energy-efficient approximate edge inference systems,’’ ACM Trans. Embedded Comput. Syst., vol. 22, no. 4, pp. 1–50, Jul. 2023, doi: 10.1145/3589766.
[465] M. Fabjančič, O. Machidon, H. Sharif, Y. Zhao, S. Misailović, and V. Pejović, ‘‘Mobiprox: Supporting dynamic approximate computing on mobiles,’’ 2023, arXiv:2303.11291.
[466] V. Pejović, ‘‘Towards approximate mobile computing,’’ GetMobile, Mobile Comput. Commun., vol. 22, no. 4, pp. 9–12, May 2019, doi: 10.1145/3325867.3325871.
[467] W. B. Qaim, A. Ometov, A. Molinaro, I. Lener, C. Campolo, E. S. Lohan, and J. Nurmi, ‘‘Towards energy efficiency in the Internet of Wearable Things: A systematic review,’’ IEEE Access, vol. 8, pp. 175412–175435, 2020, doi: 10.1109/ACCESS.2020.3025270.
[468] A. Das, S. K. Ghosh, A. Raha, and V. Raghunathan, ‘‘Toward energy-efficient collaborative inference using multisystem approximations,’’ IEEE Internet Things J., vol. 11, no. 10, pp. 17989–18004, May 2024, doi: 10.1109/JIOT.2024.3365306.
[469] S. Jain, S. Venkataramani, V. Srinivasan, J. Choi, P. Chuang, and L. Chang, ‘‘Compensated-DNN: Energy efficient low-precision deep neural networks by compensating quantization errors,’’ in Proc. 55th ACM/ESDA/IEEE Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6, doi: 10.1109/DAC.2018.8465893.
[470] S. S. Sarwar, S. Venkataramani, A. Ankit, A. Raghunathan, and K. Roy, ‘‘Energy-efficient neural computing with approximate multipliers,’’ ACM J. Emerg. Technol. Comput. Syst., vol. 14, no. 2, pp. 1–23, Jul. 2018, doi: 10.1145/3097264.
[471] Z. Peng, X. Chen, C. Xu, N. Jing, X. Liang, C. Lu, and L. Jiang, ‘‘AXNet: ApproXimate computing using an end-to-end trainable neural network,’’ in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD). New York, NY, USA: ACM, Nov. 2018, pp. 1–8.
[472] N. Ashar, G. Raut, V. Trivedi, S. K. Vishvakarma, and A. Kumar, ‘‘QuantMAC: Enhancing hardware performance in DNNs with quantize enabled multiply-accumulate unit,’’ IEEE Access, vol. 12, pp. 43600–43614, 2024, doi: 10.1109/ACCESS.2024.3379906.
[473] X. Sui, Q. Lv, Y. Bai, B. Zhu, L. Zhi, Y. Yang, and Z. Tan, ‘‘A hardware-friendly low-bit power-of-two quantization method for CNNs and its FPGA implementation,’’ Sensors, vol. 22, no. 17, p. 6618, Sep. 2022, doi: 10.3390/s22176618.

[474] X. Sui, Q. Lv, L. Zhi, B. Zhu, Y. Yang, Y. Zhang, and Z. Tan, ‘‘A hardware-friendly high-precision CNN pruning method and its FPGA implementation,’’ Sensors, vol. 23, no. 2, p. 824, Jan. 2023, doi: 10.3390/s23020824.
[475] M. P. Véstias, R. P. Duarte, J. T. de Sousa, and H. C. Neto, ‘‘Fast convolutional neural networks in low density FPGAs using zero-skipping and weight pruning,’’ Electronics, vol. 8, no. 11, p. 1321, Nov. 2019, doi: 10.3390/electronics8111321.
[476] S. Jang, W. Liu, and Y. Cho, ‘‘Convolutional neural network model compression method for software–hardware co-design,’’ Information, vol. 13, no. 10, p. 451, Sep. 2022, doi: 10.3390/info13100451.
[477] M. Zhang, L. Li, H. Wang, Y. Liu, H. Qin, and W. Zhao, ‘‘Optimized compression for implementing convolutional neural networks on FPGA,’’ Electronics, vol. 8, no. 3, p. 295, Mar. 2019, doi: 10.3390/electronics8030295.
[478] J. Bai, S. Sun, W. Zhao, and W. Kang, ‘‘CIMQ: A hardware-efficient quantization framework for computing-in-memory-based neural network accelerators,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 43, no. 1, pp. 189–202, Jan. 2024, doi: 10.1109/TCAD.2023.3298705.
[479] M. Hafezan and E. Atoofian, ‘‘Mixed-precision architecture for GPU tensor cores,’’ in Proc. IEEE Smart World Congr. (SWC), Aug. 2023, pp. 1–8, doi: 10.1109/SWC57546.2023.10448789.
[480] S. Tabrizchi, A. Nezhadi, S. Angizi, and A. Roohi, ‘‘AppCiP: Energy-efficient approximate convolution-in-pixel scheme for neural network acceleration,’’ IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 13, no. 1, pp. 225–236, Mar. 2023, doi: 10.1109/JETCAS.2023.3242167.
[481] S. A. K. Gharavi and S. Safari, ‘‘Performance improvement of processor through configurable approximate arithmetic units in multi-core systems,’’ IEEE Access, vol. 12, pp. 43907–43917, 2024, doi: 10.1109/ACCESS.2024.3380912.
[482] H. Younes, A. Ibrahim, M. Rizk, and M. Valle, ‘‘Algorithmic-level approximate tensorial SVM using high-level synthesis on FPGA,’’ Electronics, vol. 10, no. 2, p. 205, Jan. 2021, doi: 10.3390/electronics10020205.
[483] I. D. Mienye and N. Jere, ‘‘A survey of decision trees: Concepts, algorithms, and applications,’’ IEEE Access, vol. 12, pp. 86716–86727, 2024, doi: 10.1109/ACCESS.2024.3416838.
[484] K. K. Pandey and D. Shukla, ‘‘Stratification to improve systematic sampling for big data mining using approximate clustering,’’ in Machine Intelligence and Smart Systems, S. Agrawal, K. Kumar Gupta, J. H. Chan, J. Agrawal, and M. Gupta, Eds., Singapore: Springer, 2021, pp. 337–351, doi: 10.1007/978-981-33-4893-6_30.
[485] J. Liu, W. Yinchai, F. Wei, Q. Han, Y. Tao, L. Zhao, X. Li, and H. Sun, ‘‘Secure cloud-aided approximate nearest neighbor search on high-dimensional data,’’ IEEE Access, vol. 11, pp. 109027–109037, 2023, doi: 10.1109/ACCESS.2023.3321457.
[486] S. Khan, S. Singh, H. V. Simhadri, and J. Vedurada, ‘‘BANG: Billion-scale approximate nearest neighbor search using a single GPU,’’ 2024, arXiv:2401.11324.
[487] D. Vanderkam, R. Schonberger, H. Rowley, and S. Kumar, ‘‘Nearest neighbor search in Google correlate,’’ Google, Inc., Mountain View, CA, USA, Tech. Rep., 2013. [Online]. Available: http://research.google.com/pubs/archive/41694.pdf
[488] F. Regazzoni, C. Alippi, and I. Polian, ‘‘Security: The dark side of approximate computing?’’ in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2018, pp. 1–6, doi: 10.1145/3240765.3243497.
[489] P. Yellu, L. Buell, M. Mark, M. A. Kinsy, D. Xu, and Q. Yu, ‘‘Security threat analyses and attack models for approximate computing systems: From hardware and micro-architecture perspectives,’’ ACM Trans. Design Autom. Electron. Syst., vol. 26, no. 4, pp. 1–31, Apr. 2021, doi: 10.1145/3442380.
[490] P. Yellu and Q. Yu, ‘‘Securing approximate computing systems via obfuscating approximate-precise boundary,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 42, no. 1, pp. 27–40, Jan. 2023, doi: 10.1109/TCAD.2022.3168261.
[491] S. Ariful Islam, ‘‘On the (in)security of approximate computing synthesis,’’ 2019, arXiv:1912.01209.
[492] D.-E.-S. Kundi, A. Khalid, S. Bian, C. Wang, M. O’Neill, and W. Liu, ‘‘AxRLWE: A multilevel approximate ring-LWE co-processor for lightweight IoT applications,’’ IEEE Internet Things J., vol. 9, no. 13, pp. 10492–10501, Jul. 2022, doi: 10.1109/JIOT.2021.3122276.
[493] B. Miller and R. Pozo. Java SciMark 2.0. Accessed: Mar. 13, 2024. [Online]. Available: https://math.nist.gov/scimark2/
[494] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, ‘‘Rodinia: A benchmark suite for heterogeneous computing,’’ in Proc. IEEE Int. Symp. Workload Characterization (IISWC), Oct. 2009, pp. 44–54, doi: 10.1109/IISWC.2009.5306797.
[495] A. Yazdanbakhsh, D. Mahajan, H. Esmaeilzadeh, and P. Lotfi-Kamran, ‘‘AxBench: A multiplatform benchmark suite for approximate computing,’’ IEEE Des. Test, vol. 34, no. 2, pp. 60–68, Apr. 2017, doi: 10.1109/MDAT.2016.2630270.
[496] S. Ullah, S. S. Murthy, and A. Kumar, ‘‘SMApproxLib: Library of FPGA-based approximate multipliers,’’ in Proc. 55th ACM/ESDA/IEEE Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6, doi: 10.1109/DAC.2018.8465845.
[497] L. Witschen, M. Awais, H. Ghasemzadeh Mohammadi, T. Wiersema, and M. Platzner, ‘‘CIRCA: Towards a modular and extensible framework for approximate circuit generation,’’ Microelectron. Rel., vol. 99, pp. 277–290, Aug. 2019, doi: 10.1016/j.microrel.2019.04.003.
[498] M. V. Bordin, D. Griebler, G. Mencagli, C. F. R. Geyer, and L. G. L. Fernandes, ‘‘DSPBench: A suite of benchmark applications for distributed data stream processing systems,’’ IEEE Access, vol. 8, pp. 222900–222917, 2020, doi: 10.1109/ACCESS.2020.3043948.
[499] M. Bakhshalipour, M. Likhachev, and P. B. Gibbons, ‘‘RTRBench: A benchmark suite for real-time robotics,’’ in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), May 2022, pp. 175–186, doi: 10.1109/ISPASS55109.2022.00024.
[500] H. C. Prashanth and M. Rao, ‘‘SOMALib: Library of exact and approximate activation functions for hardware-efficient neural network accelerators,’’ in Proc. IEEE 40th Int. Conf. Comput. Design (ICCD), Oct. 2022, pp. 746–753, doi: 10.1109/ICCD56317.2022.00114.
[501] M. Item, J. Gómez-Luna, Y. Guo, G. F. Oliveira, M. Sadrosadati, and O. Mutlu, ‘‘TransPimLib: A library for efficient transcendental functions on processing-in-memory systems,’’ 2023, arXiv:2304.01951.
[502] OpenBenchmarking.org—Cross-Platform, Open-Source Automated Benchmarking Platform. Accessed: Mar. 13, 2024. [Online]. Available: https://openbenchmarking.org/
[503] Papers With Code—The Latest in Machine Learning. Accessed: Mar. 13, 2024. [Online]. Available: https://paperswithcode.com/
[504] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong, ‘‘FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates,’’ in Proc. IEEE 25th Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), Apr. 2017, pp. 152–159, doi: 10.1109/FCCM.2017.25.
[505] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, ‘‘An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks,’’ in Proc. 27th Int. Conf. Field Program. Log. Appl. (FPL), Sep. 2017, pp. 1–8, doi: 10.23919/FPL.2017.8056824.
[506] FINN. Accessed: May 15, 2024. [Online]. Available: https://xilinx.github.io/finn/
[507] M. T. Arafin and Z. Lu, ‘‘Security challenges of processing-in-memory systems,’’ in Proc. Great Lakes Symp. VLSI. New York, NY, USA: ACM, Sep. 2020, pp. 229–234, doi: 10.1145/3386263.3411365.
[508] S. Lee and A. Gerstlauer, ‘‘Approximate high-level synthesis of custom hardware,’’ in Approximate Circuits: Methodologies and CAD, S. Reda and M. Shafique, Eds., Cham, Switzerland: Springer, 2019, pp. 205–223, doi: 10.1007/978-3-319-99322-5_10.
[509] T. Alan, A. Gerstlauer, and J. Henkel, ‘‘Runtime accuracy-configurable approximate hardware synthesis using logic gating and relaxation,’’ in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2020, pp. 1578–1581, doi: 10.23919/DATE48585.2020.9116272.
[510] R. Zhao, Z. Yang, H. Zheng, Y. Wu, F. Liu, Z. Wu, L. Li, F. Chen, S. Song, J. Zhu, W. Zhang, H. Huang, M. Xu, K. Sheng, Q. Yin, J. Pei, G. Li, Y. Zhang, M. Zhao, and L. Shi, ‘‘A framework for the general design and computation of hybrid neural networks,’’ Nature Commun., vol. 13, no. 1, p. 3427, Jun. 2022, doi: 10.1038/s41467-022-30964-7.


AYAD M. DALLOO received the B.Sc. and M.Sc. degrees in electronic and communication engineering. He is currently pursuing the Ph.D. degree with the Electrical Engineering Department, University of Technology, Iraq. He is also a Faculty Member of the Communication Engineering Department, University of Technology. His research interests include approximate computing and machine learning.

AMJAD JALEEL HUMAIDI received the B.Sc. and M.Sc. degrees in control engineering from the Al-Rasheed College of Engineering and Science, in 1992 and 1997, respectively, and the Ph.D. degree, with specialization in control and automation, in 2006. He is currently a Professor with the Engineering College, University of Technology, Iraq. His research interests include adaptive, nonlinear, and intelligent control; optimization; and real-time image processing.

AMMAR K. AL MHDAWI received the Ph.D. degree in electronic and electrical engineering from Brunel University London and completed postdoctoral research at Newcastle University, U.K. He is currently a Lecturer with the School of Engineering and Sustainable Development, De Montfort University, U.K. He is also a freelance consultant engineer with more than 15 years of experience in control engineering and robotics. His research interests include control systems, robotic systems (AUV, AGV, and UAV), intelligent and automatic control, and interconnected systems. Moreover, he is a Guest Editor of a special session of Actuators (MDPI).

HAMED AL-RAWESHIDY (Senior Member, IEEE) is currently a Professor in communications engineering, with qualifications from the University of Technology, Baghdad, and advanced qualifications from the University of Glasgow, U.K., and Strathclyde University, U.K. His career includes roles with the Space and Astronomy Research Centre, Iraq; PerkinElmer, USA; Carl Zeiss, Germany; British Telecom, U.K.; and various universities, such as Oxford University, Manchester Metropolitan University, and Kent University. He currently directs the Wireless Networks and Communications Centre and Postgraduate Studies in Electronic and Computer Engineering, Brunel University London. He has published over 370 articles and edited the first book Radio Over Fiber Technologies for Mobile Communications Networks. He is also a Consultant for global telecom companies and a Principal Investigator for significant research projects. His current research interests include advanced technologies in communications engineering, including 5G and 6G developments, quantum computing, AI, and IoT applications.
